WO2023109748A1 - Neural network adjustment method and corresponding apparatus - Google Patents

Neural network adjustment method and corresponding apparatus

Info

Publication number
WO2023109748A1
Authority
WO
WIPO (PCT)
Prior art keywords
layer
scale
network
operator
gradient
Prior art date
Application number
PCT/CN2022/138377
Other languages
English (en)
French (fr)
Inventor
陈官富
陈敏琪
黄泽毅
唐少华
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2023109748A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent

Definitions

  • the present application relates to the field of computer technology, in particular to a neural network adjustment method and a corresponding device.
  • The application of mixed precision in the training of neural networks refers to mixing two or more operations with different precisions in the training process, for example half-precision floating point (FP16) and single-precision floating point (FP32).
  • Although FP16 can speed up the training process, its expression range is narrow. A neural network generates a large number of gradients during training, and these gradients are distributed over a wide range; as the neural network becomes more complex, it produces more and more small gradients. Many of these small gradients fall below the lower limit of FP16's expression range, so they underflow, and the resulting large gradient underflow rate affects the accuracy of neural network training.
  • If FP32 is used for training instead, its higher precision means that reading and writing data is much slower than with FP16, which limits the computing power of the chip. In addition, data in FP32 format occupies a large amount of on-chip storage space, which also works against the trend of chip miniaturization.
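  • As a concrete illustration of the underflow problem described above, the following minimal numpy sketch (the gradient value and scaling factor are hypothetical) shows a gradient that FP32 preserves but FP16 flushes to zero, and how scaling it up first keeps it representable:

```python
import numpy as np

# A gradient that is well inside FP32's range but below FP16's smallest
# subnormal (~6e-8): it silently flushes to zero when stored as FP16.
grad = np.float32(3e-9)           # hypothetical small gradient value
print(np.float16(grad))           # 0.0   -> the gradient underflows and is lost
print(np.float32(grad))           # 3e-09 -> preserved in single precision

# Scaling the gradient up before the FP16 computation keeps it representable;
# per-layer loss scaling chooses such a factor for each scale layer.
scale = np.float32(2.0 ** 12)     # hypothetical scaling scale
print(np.float16(grad * scale))   # ~1.2e-05 -> no longer underflows
```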
  • the present application provides a method for adjusting a neural network, which is used to reduce the gradient underflow rate during the training process of the neural network and improve the training efficiency of the neural network.
  • the present application also provides a corresponding device, a computer-readable storage medium, a computer program product, a chip system, and the like.
  • The first aspect of the present application provides a method for adjusting a neural network, including: obtaining a first neural network that uses mixed-precision operations, the first neural network including multiple scale layers, where each scale layer has a scaling scale, and the scaling scale of each scale layer refers to the scale used to amplify or reduce the gradients associated with that scale layer in the backpropagation direction when training the first neural network.
  • The mixed-precision operations include a first precision operation. The training samples input to the first neural network are forward-propagated to obtain the value of the loss function. In the backpropagation direction, a scaling operation is performed on the first gradient of the first operator in a target scale layer according to the scaling scale of the target scale layer to obtain the second gradient of the first operator, where the target scale layer is any one of the multiple scale layers, the first gradient of the first operator comes from the value of the loss function, the scaling operation is an enlargement operation or a reduction operation, and the second gradient of the first operator is used to determine the weight gradient of each operator in the target scale layer. The scaling scale of the target scale layer is then adjusted according to the performance of the weight gradient of each operator in the target scale layer within the expression range of the first precision operation.
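  • The following is a minimal numpy sketch of one backpropagation step through a single scale layer containing one linear operator, illustrating the method of the first aspect; the tensor shapes, the SGD update and the concrete adjustment factors are illustrative assumptions rather than the reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# One "scale layer" containing a single linear operator (illustrative sizes).
W = rng.standard_normal((4, 8)).astype(np.float32)     # master weights kept in FP32
scale = np.float32(2.0 ** 10)                           # this layer's scaling scale
lr = np.float32(0.01)

x = rng.standard_normal((16, 8)).astype(np.float16)                            # FP16 activations from the forward pass
grad_in = rng.standard_normal((16, 4)).astype(np.float16) * np.float16(1e-6)   # incoming (first) gradient

# 1) Scale the first operator's input gradient by the layer's scaling scale.
g = grad_in * np.float16(scale)                 # second gradient of the first operator

# 2) Compute the weight gradient and the outgoing gradient with the FP16 operator.
grad_W = g.T @ x                                # scaled weight gradient
grad_out = g @ W.astype(np.float16)             # scaled gradient leaving this scale layer

# 3) Undo the scaling on the output gradient before passing it to the next scale layer.
grad_out = (grad_out.astype(np.float32) / scale).astype(np.float16)

# 4) Adjust the layer's scale from how the weight gradient behaves in FP16.
if np.all(np.isfinite(grad_W)):
    W -= lr * (grad_W.astype(np.float32) / scale)       # unscale in FP32, then update
    scale = scale * np.float32(2.0 ** (1.0 / 1000.0))    # representable: try a larger scale
else:
    scale = scale / np.float32(2.0)                      # INF/NaN: shrink the scale, skip the update

print(scale, np.abs(W).mean())
```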
  • the adjustment method of the neural network provided in this application can be applied in the training process of the neural network.
  • the adjustment method of the application can be used to adjust the scaling scale of each layer in the neural network.
  • The solution provided by this application can be stored on a network (for example, the cloud) in the form of a software package or plug-in, which the user can download and install on a computer device to execute the process of this application. The solution can also be provided to the user in the form of a cloud service: the user uploads the neural network to be trained to the cloud, and the cloud trains the neural network using the solution provided by this application. It is also possible to configure the solution of this application in a chip, so that a computer device used for model training can execute the process of this application by installing the chip, or to configure the solution directly in the computer device, which then executes the process of this application when training a neural network. It should be understood that, apart from the above-mentioned deployment manners, this application does not limit the specific form in which the provided technical solution is used.
  • mixed-precision operations refer to the mixed application of two or more operations with different precisions, such as in the training process of neural networks.
  • the first-precision operation may be a half-precision floating-point FP16 operation
  • the second-precision operation may be a single-precision floating-point FP32 operation.
  • the first-precision operation and the second-precision operation may also be other types of precision operations.
  • the first-precision operation can be single-precision floating-point FP32 operation
  • The second-precision operation can be a double-precision floating-point FP64 operation, as long as the precision of the second-precision operation is higher than that of the first-precision operation, that is, the expression range of the second-precision operation is larger than that of the first-precision operation.
  • the first neural network refers to a neural network that has been scaled and layered, and the first neural network may be obtained through automatic scaling or manual scaling.
  • The scale layer can be understood as a layer obtained after scale layering. Each scale layer has a scaling scale, and the scaling scale of each scale layer is usually different; of course, this application does not impose a limitation on this, and the scaling scales of different scale layers can also be the same.
  • Scaling refers to the scale by which the gradients associated with each scale layer in the backpropagation direction are enlarged or reduced.
  • forward propagation refers to the process of processing the training samples input to the neural network until the value of the loss function (error loss) is obtained.
  • Backpropagation refers to using the value of the loss function generated by forward propagation to update the parameters in the neural network. In this process the weight gradient of each layer's operators is determined from the value of the loss function, and the operator weights are then updated so that the error loss converges.
  • the backpropagation algorithm is a backpropagation process dominated by the value of the loss, aiming to obtain the optimal parameters of the neural network, such as the weight matrix.
  • the gradient related to each scale layer includes the input gradient to the scale layer, the weight gradient used to update the weight, and the output gradient to be output from the scale layer.
  • The target scale layer includes at least the first operator mentioned above. If there are multiple operators in the target scale layer, the operators have a logical order, and the output of one operator may be used as the input of the next operator.
  • the first operator refers to the operator that ranks first in the logical order among multiple operators.
  • Each operator in the above target scale layer means all operators in the target scale layer; if there is only one operator in the target scale layer, then each operator in the target scale layer is just the first operator above.
  • The scaling operation and the inverse scaling operation are two opposite operations: if the scaling operation is an enlargement operation, the inverse scaling operation is a reduction operation, and if the scaling operation is a reduction operation, the inverse scaling operation is an enlargement operation.
  • the first gradient can be understood as an input gradient
  • the second gradient is a gradient obtained after a scaling operation is performed according to the scaling scale of the target scale layer.
  • the weight gradient of each operator in the target scale layer can be determined through the second gradient.
  • the value of the first gradient of the first operator derived from the loss function may include a value directly derived from the loss function and a value indirectly derived from the loss function.
  • If the target scale layer is the first scale layer in the backpropagation direction, the first gradient of the first operator in the target scale layer can be obtained by taking the derivative of the value of the loss function; this can be understood as the first gradient coming directly from the value of the loss function. If the target scale layer is not the first scale layer in the backpropagation direction, the first gradient of its first operator is obtained from the gradient output by the previous scale layer, and so on layer by layer back to the first scale layer in the backpropagation direction, whose first gradient is related to the value of the loss function; this can be understood as the first gradient of the first operator in the target scale layer coming indirectly from the value of the loss function.
  • using the first precision operation may include using the first precision type for calculation and/or storage, such as FP16 operation can be understood as using FP16 calculation and/or storage
  • Using the second precision operation may include using the second precision type for calculation and/or storage; for example, FP32 operation can be understood as using FP32 for calculation and/or storage.
  • The gradient underflow rate refers to the ratio of the number of gradients that fall outside (below) the expression range of a given precision operation to the total number of gradients.
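  • Under this definition, the underflow rate of a set of gradients can be measured directly; a small sketch (the gradient values are hypothetical, and the threshold is FP16's smallest positive subnormal):

```python
import numpy as np

FP16_TINY = 2.0 ** -24   # smallest positive (subnormal) value representable in FP16

def underflow_rate(grads: np.ndarray) -> float:
    """Fraction of nonzero gradients whose magnitude is too small for FP16."""
    nonzero = np.abs(grads[grads != 0])
    return float(np.mean(nonzero < FP16_TINY)) if nonzero.size else 0.0

grads = np.array([1e-3, 4e-9, -2e-10, 0.5], dtype=np.float32)  # hypothetical gradients
print(underflow_rate(grads))   # 0.5: two of the four gradients would flush to zero in FP16
```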
  • In this way, the first gradient of the first operator can be scaled according to the scaling scale of the scale layer, the weight gradient of each operator in the scale layer can then be calculated, and the performance of those weight gradients within the expression range of the first precision operation can be observed to adjust the scaling scale of the corresponding scale layer, so that the gradient underflow rate of the first precision operation is effectively reduced with only a small amount of extra calculation.
  • As a result, mixed-precision training can be well applied to the training of the neural network: while maintaining high training accuracy, training efficiency is improved, and data computed at low precision only needs to occupy a small amount of storage space and running memory, which is also conducive to the chip reading and writing low-precision data quickly, saving computing resources and costs when training neural networks.
  • The technical solution provided by this application can, when training a neural network with a mix of low-precision and high-precision hardware resources, reach the accuracy obtained when training with only high-precision hardware resources while maintaining considerable training efficiency; alternatively, when training a neural network with the same mixed-precision hardware resources, the training efficiency can be increased by a factor of 1 to 3.
  • In short, using the technical solution provided by this application to train a neural network can save computing resources or shorten training time while ensuring training accuracy, giving a better overall effect.
  • In a possible implementation, the step of obtaining the first neural network using mixed-precision operations includes: receiving the initial neural network to be trained; marking the first-type operators in the initial neural network as using the first precision operation to obtain a network using mixed-precision operations, where the second-type operators in that network adopt the second precision operation; and performing scale layering on the network using mixed-precision operations to obtain the first neural network.
  • the initial neural network may be a neural network constructed using single-precision operations, such as neural networks such as ResNet50 and MobileNet.
  • the initial neural network can be a neural network for performing classification/recognition, compression/decompression, denoising, segmentation, enhancement, transformation, feature extraction, etc. on various types of data such as image/video data, audio data, and text data , which is not limited in this application.
  • the first type of operator can be a convolution (Convolution, Conv) operator and/or a fully connected (Fully Connect, FC) operator
  • the second type of operator can be anything other than the first type of operator in the initial neural network All operators or some operators, such as: normalization operator.
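  • A minimal sketch of this marking step, assuming a toy operator list and name-based typing (the operator names and the dictionary representation are hypothetical, not the patent's data structures):

```python
# Hypothetical operator list of an initial single-precision network.
initial_network = ["conv1", "bn1", "relu1", "conv2", "bn2", "fc"]

def mark_precision(op_name: str) -> str:
    # First-type operators (convolution / fully connected) are marked as FP16;
    # every other operator keeps the higher-precision FP32 operation.
    first_type = op_name.startswith("conv") or op_name.startswith("fc")
    return "FP16" if first_type else "FP32"

mixed_precision_network = {op: mark_precision(op) for op in initial_network}
print(mixed_precision_network)
# {'conv1': 'FP16', 'bn1': 'FP32', 'relu1': 'FP32', 'conv2': 'FP16', 'bn2': 'FP32', 'fc': 'FP16'}
```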
  • Scale layering refers to layering the initial neural network and configuring the scaling scale of each resulting layer, either manually or automatically. It can be seen from this that converting a single-precision neural network into a mixed-precision neural network and assigning different scaling scales to different network layers can greatly reduce the gradient underflow rate, and makes it possible to stably train neural networks whose gradient distributions have a large dynamic range across network layers.
  • In a possible implementation, performing scale layering on the network using mixed-precision operations includes: obtaining the initial scale of each network layer in the network using mixed-precision operations; and merging the network layers with the same initial scale to obtain the above-mentioned first neural network.
  • the layer in the network before scale layering may be referred to as a network layer.
  • Each network layer can have an initial scale, which can be obtained through training or configuration.
  • the network layers with the same initial scale are combined, and the combined network can be called a layer-combined network, and the layer-combined network can be used as the first neural network.
  • This kind of merging of network layers with the same initial scale can further improve the efficiency of neural network training.
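  • A sketch of this merging step, assuming adjacent network layers with equal initial scales are grouped into one scale layer (the layer names and scale values are hypothetical):

```python
from itertools import groupby

# Hypothetical (network layer, initial scale) pairs after the initial scales are obtained.
layers = [("layer1", 1024), ("layer2", 1024), ("layer3", 4096),
          ("layer4", 4096), ("layer5", 256)]

# Adjacent network layers sharing an initial scale are merged into a single scale layer.
scale_layers = [
    {"layers": [name for name, _ in group], "scale": scale}
    for scale, group in groupby(layers, key=lambda item: item[1])
]
print(scale_layers)
# [{'layers': ['layer1', 'layer2'], 'scale': 1024},
#  {'layers': ['layer3', 'layer4'], 'scale': 4096},
#  {'layers': ['layer5'], 'scale': 256}]
```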
  • In a possible implementation, performing scale layering on the network using mixed-precision operations includes: obtaining the initial scale of each network layer in the network using mixed-precision operations; merging the network layers with the same initial scale to obtain a layer-merged network; and combining the first scaling operation at the output interface of a first network layer in the layer-merged network with the second scaling operation at the input interface of a second network layer to obtain the first neural network, where the first network layer and the second network layer are adjacent and the first network layer is the previous layer of the second network layer in the backpropagation direction.
  • In this implementation, the scaling operation and the inverse scaling operation of two adjacent network layers in the layer-merged network are further combined. For example, if the output interface of network layer 1 needs to perform an inverse scaling operation with a scaling scale of S and the input interface of network layer 2 needs to perform an enlargement operation with a scaling scale of M, the two operations can be combined into a single scaling operation of M/S. This is equivalent to reducing one scaling operation and one inverse scaling operation to a single scaling operation, and the combined scaling operation can be performed at the output interface of the previous scale layer or at the input interface of the latter scale layer.
  • This implementation combines the scaling operation and the inverse scaling operation of two adjacent layers, which reduces calculation steps and further improves the efficiency of neural network training.
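  • Numerically, the combination is just a single multiplicative factor; a tiny check with hypothetical values of S and M:

```python
S = 1024.0   # scale of the previous layer: its output gradient would be divided by S (inverse scaling)
M = 4096.0   # scale of the next layer: its input gradient would be multiplied by M

combined = M / S          # one combined scaling operation replaces the two separate ones
gradient = 3e-7
assert abs(gradient / S * M - gradient * combined) < 1e-18
print(combined)           # 4.0
```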
  • the above step: obtaining the initial scale of each network layer in the network using mixed precision computing includes: determining the initial scale of each layer in the network using mixed precision computing according to a preset underflow rate The initial scale of each network layer, or receive configuration information for configuring the initial scale of each network layer in the network using mixed precision computing, and determine each network layer in the network using mixed precision computing according to the configuration information the initial scale of .
  • the initial scale of each network layer can be automatically determined according to a preset underflow rate, which belongs to automatic scale layering. It is also possible to receive the configuration information of the initial scale of each network layer configured by the user, and then determine the initial scale of each network layer according to the configuration information, which belongs to manual scale layering. It can be seen that the present application provides a variety of scale layering methods, which improves the flexibility of scale layering.
  • In a possible implementation, determining the initial scale of each network layer in the network using mixed-precision operations according to a preset underflow rate includes: using the preset underflow rate as the target, calculating multiple sets of scale values for each network layer with different training samples, and then taking the average value as the initial scale of each network layer. This improves the accuracy of the initial scale.
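  • One way such an initial scale could be computed for a single layer, assuming a simple power-of-two search against the preset underflow rate (the search range, target rate and synthetic gradients are illustrative assumptions):

```python
import numpy as np

def initial_scale(layer_grads: np.ndarray, target_underflow: float = 0.01) -> float:
    """Smallest power-of-two scale whose scaled gradients keep the FP16
    underflow rate at or below the preset target (illustrative search)."""
    fp16_tiny, fp16_max = 2.0 ** -24, 65504.0
    for exp in range(0, 32):
        scale = 2.0 ** exp
        scaled = np.abs(layer_grads) * scale
        underflow = np.mean((scaled > 0) & (scaled < fp16_tiny))
        if underflow <= target_underflow and scaled.max() < fp16_max:
            return scale
    return 2.0 ** 31

grads = np.random.default_rng(0).normal(scale=1e-7, size=10_000).astype(np.float32)
print(initial_scale(grads))   # e.g. 64.0 for these synthetic gradients
```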
  • In a possible implementation, adjusting the scaling scale of the target scale layer according to the performance of the weight gradient of each operator within the expression range of the first precision operation includes: when the weight gradients of the operators in the target scale layer include an infinite value or an invalid number, reducing the scaling scale of the target scale layer; and when the weight gradient of each operator in the target scale layer is within the expression range of the first precision operation, increasing the scaling scale of the target scale layer.
  • An invalid number can be, for example, a fraction whose denominator is 0. Such a value means that the scaling scale of the current target scale layer is too large, so the scaling scale of the target scale layer needs to be reduced.
  • When the weight gradient of each operator is within the expression range of the first precision operation, it is also possible to try a larger scaling scale by increasing the scaling scale of the target scale layer. This helps find the optimal scaling scale for each scale layer, thereby improving the convergence speed of the neural network.
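  • The adjustment rule itself is small; a hedged sketch in which the shrink factor of 2 is an assumption and the growth factor 2^(1/1000) follows the example given later in this description:

```python
import numpy as np

def adjust_scale(scale, weight_grads):
    """Shrink the layer's scaling scale on INF/NaN weight gradients, otherwise grow it slowly."""
    if any(not np.all(np.isfinite(g)) for g in weight_grads):
        return scale / 2.0            # overflow or invalid number: reduce the scaling scale
    return scale * 2.0 ** (1 / 1000)  # all representable: probe a slightly larger scale
```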
  • In a possible implementation, when the weight gradient of each operator is within the expression range of the first precision operation, the method further includes: performing, according to the scaling scale of the target scale layer, the inverse scaling operation of the scaling operation on the weight gradient of each operator to obtain the weight gradient of each operator in the target scale layer after the inverse scaling operation; and updating the weights of each operator in the target scale layer according to the weight gradients after the inverse scaling operation.
  • Because the weight gradient of each operator in the target scale layer is calculated from the second gradient of the first operator, which was obtained through the scaling operation, the inverse scaling operation must first be performed on the weight gradient of each operator when updating the weights, and only then are the weights of each operator updated; this is more conducive to the convergence of the neural network.
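  • A sketch of the unscale-then-update step, assuming FP32 master weights and a plain SGD update (the optimizer and data layout are assumptions):

```python
import numpy as np

def update_weights(weights, weight_grads, scale, lr=0.01):
    """Undo the layer's scaling on each operator's weight gradient in FP32, then apply SGD."""
    for name, w in weights.items():
        true_grad = weight_grads[name].astype(np.float32) / np.float32(scale)
        weights[name] = w - lr * true_grad
    return weights
```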
  • the method further includes: determining the output gradient of the target scale layer according to the second gradient of the first operator; The inverse scaling operation of the scaling operation is performed on the output gradient to obtain the output gradient of the target scale layer.
  • The inverse scaling operation with the corresponding scaling scale should therefore be performed on the gradient to be output, to obtain an output gradient suitable for the calculation of the next scale layer.
  • In a possible implementation, the method further includes: when the output gradient of the target scale layer is an infinite value or an invalid number, correcting the output gradient of the target scale layer to a valid value within the expression range of the first precision operation, and transmitting the corrected output gradient of the target scale layer to the adjacent scale layer of the target scale layer.
  • When the output gradient of the target scale layer is an infinite value or an invalid number, it means that the output gradient is not suitable for updating the operator weights in the scale layers, so the weight-update step is skipped directly. However, in order not to affect the subsequent calculation, the output gradient can be corrected to a valid value within the expression range of the first precision operation and then transmitted to the next scale layer in the backpropagation direction for calculation, which helps improve the scale-update efficiency of the neural network.
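  • A possible correction step, assuming INF is clamped to the FP16 maximum and NaN is replaced by zero (both replacement choices are assumptions; any valid FP16 value would satisfy the description above):

```python
import numpy as np

def correct_output_gradient(grad_out):
    """Replace INF/NaN entries with valid FP16 values so the next scale layer can still run."""
    fp16_max = np.float32(65504.0)
    cleaned = np.nan_to_num(grad_out.astype(np.float32),
                            nan=0.0, posinf=fp16_max, neginf=-fp16_max)
    return cleaned.astype(np.float16)
```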
  • In a possible implementation, the method further includes: in the process of forward propagation, if the feature values of the target scale layer include an infinite value or an invalid number, skipping the update of that scale layer.
  • The feature values of the target scale layer are generated by each scale layer during forward propagation, and their behaviour is used to determine whether the operator weights in the scale layer need to be updated. If the forward features include infinite values or invalid numbers, there is no need to update the operator weights, which helps improve the training efficiency of the neural network.
  • the method further includes: when the training of the first neural network reaches a preset condition, re-scaling the first neural network to obtain the second neural network.
  • The preset condition can be that the number of training iterations reaches a certain threshold, for example 300 training cycles have been completed, or that the neural network has been trained to a certain extent, for example the difference between the scaling scales of the scale layers is less than a preset value; when the condition is met, scale layering can be performed on the first neural network again. The method of scale layering can be understood with reference to the previous description. A new second neural network is then obtained and trained. This way of dynamically updating the scale layers can improve the training efficiency of the neural network.
  • the second aspect of the present application provides a neural network adjustment device, and the neural network adjustment device has the function of realizing the method of the above-mentioned first aspect or any possible implementation manner of the first aspect.
  • This function may be implemented by hardware, or may be implemented by executing corresponding software on the hardware.
  • The hardware or software includes one or more modules corresponding to the above functions, for example an acquisition unit, a first processing unit, a second processing unit, and a third processing unit. These units may be implemented by one processing unit or by multiple processing units.
  • A third aspect of the present application provides a computer device, the computer device including at least one processor, a memory, an input/output (I/O) interface, and computer-executable instructions that are stored in the memory and can run on the processor; when the computer-executable instructions are executed by the processor, the processor performs the method according to the above first aspect or any possible implementation manner of the first aspect.
  • The fourth aspect of the present application provides a computer-readable storage medium that stores one or more computer-executable instructions. When the computer-executable instructions are executed by a processor, the processor performs the method according to the above first aspect or any possible implementation manner of the first aspect.
  • The fifth aspect of the present application provides a computer program product that stores one or more computer-executable instructions. When the computer-executable instructions are executed by one or more processors, the one or more processors perform the method according to the above first aspect or any possible implementation manner of the first aspect.
  • The sixth aspect of the present application provides a chip system, the chip system including at least one processor, where the at least one processor is used to support the neural network adjustment device in realizing the functions involved in the above first aspect or any possible implementation of the first aspect.
  • In a possible design, the chip system may also include a memory, and the memory is used to store the program instructions and data necessary for the neural network adjustment device.
  • the system-on-a-chip may consist of chips, or may include chips and other discrete devices.
  • Fig. 1 is a schematic structural diagram of a neural network processor provided by an embodiment of the present application
  • Fig. 2 is a schematic diagram of an adjustment architecture of the neural network provided by the embodiment of the present application.
  • Fig. 3 is a schematic diagram of the training process of the neural network provided by the embodiment of the present application.
  • Fig. 4 is a schematic diagram of an embodiment of a neural network adjustment method provided by an embodiment of the present application.
  • Fig. 5 is a structural schematic diagram of FP16 and FP32
  • Fig. 6 is a schematic diagram of another embodiment of the neural network adjustment method provided by the embodiment of the present application.
  • Fig. 7 is a schematic diagram of an embodiment of scale layering provided by the embodiment of the present application.
  • Fig. 8 is a schematic diagram of another embodiment of scale layering provided by the embodiment of the present application.
  • Fig. 9 is a schematic diagram of another embodiment of scale layering provided by the embodiment of the present application.
  • Fig. 10 is a schematic diagram of another embodiment of the neural network adjustment method provided by the embodiment of the present application.
  • Fig. 11 is a schematic diagram of another embodiment of the neural network adjustment method provided by the embodiment of the present application.
  • Fig. 12 is a schematic diagram of another embodiment of the neural network adjustment method provided by the embodiment of the present application.
  • Fig. 13 is a schematic diagram of another embodiment of the neural network adjustment method provided by the embodiment of the present application.
  • Fig. 14 is a schematic structural diagram of the neural network adjustment device provided by the embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • An embodiment of the present application provides a method for adjusting a neural network, which is used to reduce the underflow rate of a gradient of the neural network during training, and improve the training efficiency of the neural network.
  • Embodiments of the present application also provide corresponding devices, computer-readable storage media, computer program products, chip systems, and the like. Each will be described in detail below.
  • Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technique of computer science that attempts to understand the nature of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive subject that involves a wide range of fields, including both hardware-level technology and software-level technology.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes several major directions such as computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the neural network model is usually trained on the model owner's computer device or platform (such as: server, virtual machine (VM) or container (container)), and the trained model will be stored in the form of a model file .
  • The model user's device (for example a terminal device, a server, an edge device, a VM or a container) actively loads the model file of the model, or the model owner's device actively sends the model file to the model user's device, where it is installed; the model is then applied on the model user's device to perform the corresponding functions.
  • a server refers to a physical machine.
  • Terminal equipment also called user equipment (UE) is a device with wireless transceiver function, which can be deployed on land, including indoor or outdoor, handheld or vehicle-mounted; it can also be deployed on water (such as ships etc.); can also be deployed in the air (such as aircraft, balloons and satellites, etc.).
  • The terminal may be a mobile phone, a tablet computer (pad), a computer with a wireless transceiver function, a virtual reality (VR) terminal, an augmented reality (AR) terminal, a wireless terminal in industrial control, a wireless terminal in self driving, a wireless terminal in remote medical, a wireless terminal in smart grid, a wireless terminal in transportation safety, a wireless terminal in smart city, a wireless terminal in smart home, and so on.
  • Both the VM and the container may be virtualized devices that are divided by virtualization on the hardware resources of the physical machine.
  • a neural network model refers to a neural network built for one or more business objectives.
  • The neural network may be composed of neural units. A neural unit may refer to an operation unit that takes x_s and an intercept b as inputs, and the output of the operation unit may be: h_{W,b}(x) = f(sum over s of W_s * x_s + b), where W_s is the weight of x_s and b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next layer of neurons.
  • the activation function may be a sigmoid function.
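  • A minimal numeric illustration of such a neural unit with a sigmoid activation (the input, weight and bias values are arbitrary):

```python
import numpy as np

def neural_unit(x, w, b):
    """One neural unit: weighted sum of the inputs plus the bias, passed through a sigmoid."""
    weighted_sum = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-weighted_sum))

print(neural_unit(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, -0.2]), 0.3))  # ~0.389
```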
  • a neural network is a network formed by connecting many of the above-mentioned single neural units, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • the neural network in the embodiment of the present application may be a deep neural network (Deep Neural Network, DNN), or a convolutional neural network (Convolutional Neural Network, CNN), or other neural networks.
  • DNN Deep Neural Network
  • CNN convolutional Neural Network
  • a deep neural network can be understood as a neural network with many hidden layers. There is no special metric for "many” here. The essence of the often-called multi-layer neural network and deep neural network is the same. According to the position of different layers of DNN, the neural network inside DNN can be divided into three categories: input layer, hidden layer, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the layers in the middle are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer. Although DNN looks complicated, it is actually not complicated in terms of the work of each layer.
  • a convolutional neural network is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of a convolutional layer and a subsampling layer.
  • the feature extractor can be seen as a filter, and the convolution process can be seen as using a trainable filter to convolve with an input image or convolutional feature map.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • a neuron can only be connected to some adjacent neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units.
  • Neural units of the same feature plane share weights, and the shared weights here are convolution kernels.
  • Shared weights can be understood as a way to extract image information that is independent of location. The underlying principle is that the statistical information of a certain part of the image is the same as that of other parts. That means that the image information learned in one part can also be used in another part. So for all positions on the image, we can use the same learned image information.
  • multiple convolution kernels can be used to extract different image information. Generally, the more the number of convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • An AI chip, such as a neural network processing unit (NPU), is shown in Figure 1: the neural network processor 10 is mounted on the host CPU (Host CPU), and the Host CPU assigns tasks to it.
  • the core part of the neural network processor is the operation circuit 103, and the operation circuit 103 is controlled by the controller 104 to extract matrix data in the memory and perform multiplication operations.
  • the operation circuit 103 includes multiple processing units (Process Engine, PE).
  • arithmetic circuit 103 is a two-dimensional systolic array.
  • the arithmetic circuit 103 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition.
  • arithmetic circuit 103 is a general-purpose matrix processor.
  • the operation circuit fetches the data corresponding to the matrix B from the weight memory 102, and caches it in each PE in the operation circuit.
  • the operation circuit fetches the data of matrix A from the input memory 101 and performs matrix operation with matrix B, and the obtained partial results or final results of the matrix are stored in the accumulator 108 accumulator.
  • the unified memory 106 is used to store input data and output data.
  • The weight data is moved to the weight memory 102 through the direct memory access controller (Direct Memory Access Controller, DMAC) 105.
  • Input data is also transferred to unified memory 106 by DMAC.
  • The bus interface unit (Bus Interface Unit, BIU) 110 is used for interaction between the AXI bus and the DMAC and the instruction fetch buffer (Instruction Fetch Buffer) 109.
  • the bus interface unit 110 is used for the instruction fetch memory 109 to obtain instructions from the external memory, and for the storage unit access controller 101 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to move the input data in the external memory DDR to the unified memory 106 , to move the weight data to the weight memory 102 , or to move the input data to the input memory 101 .
  • The vector calculation unit 107 includes a plurality of calculation processing units, and if necessary, further processes the output of the operation circuit, for example vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and so on. It is mainly used for non-convolution/FC layer calculations in the neural network, such as pooling (Pooling), batch normalization (Batch Normalization), local response normalization (Local Response Normalization), and so on.
  • In some implementations, the vector calculation unit 107 can store the processed output vectors to the unified memory 106.
  • the vector calculation unit 107 may apply a non-linear function to the output of the operation circuit 103, such as a vector of accumulated values, to generate activation values.
  • vector computation unit 107 generates normalized values, binned values, or both.
  • the vector of processed outputs can be used as an activation input to the arithmetic circuit 103, for example for use in a subsequent layer in a neural network.
  • An instruction fetch buffer 109 connected to the controller 104 is used for storing instructions used by the controller 104 .
  • the unified memory 106, the input memory 101, the weight memory 102 and the fetch memory 109 are all On-Chip memories. External memory is private to the hardware architecture of the neural network processor.
  • the above introduces the neural network, deep neural network, convolutional neural network, and AI chips that may be used for neural network training.
  • the following describes the forward propagation and back propagation processes involved in the neural network training process in conjunction with Figure 2.
  • The neural network shown in Figure 2 comprises four layers (Figure 2 is only an example; in fact a neural network can comprise many layers), namely layer 1, layer 2, layer 3 and layer 4.
  • the structure of each layer and the relationship between layers can be understood by referring to the content of the lower example in FIG. 2 .
  • the content of the lower example in FIG. 2 is just an example and is not limited to this inter-layer relationship.
  • The process of forward propagation refers to the layer-by-layer processing of the training samples input to the neural network: the samples are first processed by layer 1, then layer 2 and layer 3, until layer 4 outputs the value of the loss function; the value of the loss function can also be called the error loss.
  • Backpropagation refers to using the value of the loss function generated by forward propagation to update the parameters in the neural network. This process can determine the weight gradient of each layer operator through the value of the loss function. Then the operator weight is updated, so that the error loss converges.
  • backpropagation is the process from layer 4 to layer 3, then to layer 2, and then to layer 1.
  • the backpropagation algorithm is a backpropagation process dominated by the loss function, aiming to obtain the optimal parameters of the neural network, such as the weight matrix.
  • The neural network adjustment method provided in the embodiment of the present application can be applied to the training process of the neural network shown in Figure 3: the computer device uses training samples to train the initial neural network, the target neural network is obtained through multiple rounds of training, and the trained target neural network can then be applied on corresponding terminal devices for business applications.
  • In the embodiment of the present application, the first neural network may be obtained through scale layering, and the scaling scale of each scale layer in the first neural network may then be adjusted.
  • The solution provided by this application can be stored on a network (for example, the cloud) in the form of a software package or plug-in, which the user can download and install on a computer device to execute the process of this application. The solution can also be provided to the user in the form of a cloud service: the user uploads the neural network to be trained to the cloud, and the cloud trains the neural network using the solution provided by this application. It is also possible to configure the solution of this application in a chip, so that a computer device used for model training can execute the process of this application by installing the chip, or to configure the solution directly in the computer device, which then executes the process of this application when training a neural network. It should be understood that, apart from the above-mentioned manners of deploying the technical solution provided by the present application, the present application does not limit the specific form of using the technical solution provided by the present application.
  • the adjustment method of the neural network provided by the embodiment of the present application is introduced below with reference to FIG. 4 .
  • an embodiment of the neural network adjustment method provided by the embodiment of the present application includes:
  • a computer device obtains a first neural network employing mixed-precision operations, the first neural network comprising a plurality of scale layers, wherein each scale layer has a scaling scale.
  • the scaling scale of each scale layer refers to the scale used to amplify or reduce the gradient associated with each scale layer in the backpropagation direction when training the first neural network.
  • Mixed-precision computing refers to the mixed application of two or more operations with different precisions, such as in the training process of neural networks.
  • the mixed-precision operation includes first-precision operation and second-precision operation, and the expression range of the second-precision operation is larger than that of the first-precision operation.
  • the first-precision operation can be a half-precision floating-point FP16 operation
  • the second-precision operation can be a single-precision floating-point FP32 operation.
  • the first-precision operation and the second-precision operation can also be other types of precision operations, for example,
  • the first-precision operation can be a single-precision floating-point FP32 operation
  • The second-precision operation can be a double-precision floating-point FP64 operation, as long as the precision of the second-precision operation is higher than that of the first-precision operation, that is, the expression range of the second-precision operation is larger than that of the first-precision operation.
  • the first precision may also be a precision lower than FP16, such as FP8, INT8 and so on.
  • Using the first precision operation may include using the first precision type for calculation and/or storage, such as FP16 operation can be understood as using FP16 calculation and/or storage
  • Using the second precision operation may include using the second precision type for calculation and/or storage; for example, FP32 operation can be understood as using FP32 for calculation and/or storage.
  • FP16 is a half-precision floating-point format: 1 bit represents the sign (bit 15 of FP16 in Figure 5), 5 bits represent the exponent (bits 10 to 14 of FP16 in Figure 5), and 10 bits represent the fraction (bits 0 to 9 of FP16 in Figure 5). FP32 is a single-precision floating-point format: 1 bit represents the sign (bit 31 of FP32 in Figure 5), 8 bits represent the exponent (bits 23 to 30 of FP32 in Figure 5), and 23 bits represent the fraction (bits 0 to 22 of FP32 in Figure 5).
  • The data range of FP16 is approximately 6x10^-8 to 65504, and the data range of FP32 is approximately 1.4x10^-45 to 3.4x10^38. Because the ranges represented by FP32 and FP16 are so different, FP16 carries a risk of overflow in large-scale calculations.
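  • These ranges can be checked directly with numpy's float metadata (shown only as a quick sanity check of the figures quoted above):

```python
import numpy as np

for dtype in (np.float16, np.float32):
    info = np.finfo(dtype)
    # smallest positive normal value and largest finite value of each format
    print(dtype.__name__, info.tiny, info.max)
# FP16: smallest normal ~6.1e-5 (subnormals reach down to ~6e-8), largest finite value 65504
# FP32: smallest normal ~1.2e-38 (subnormals reach ~1.4e-45), largest finite value ~3.4e38
```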
  • The gradient underflow rate refers to the ratio of the number of gradients that fall outside (below) the expression range of a given precision operation to the total number of gradients.
  • the first neural network refers to a neural network that has undergone scale layering, and the first neural network may be obtained through automatic scale layering or manual scale layering.
  • the scale layer can be understood as a layer obtained through scale layering.
  • Each scale layer has a scaling scale.
  • the scaling scale of each scale layer is usually different. Of course, this application does not make a limitation on this.
  • the scaling scale of the scale layer can also be the same. Scaling refers to the scale by which the gradients associated with each scale layer in the backpropagation direction are enlarged or reduced.
  • the gradients related to each scale layer include the input gradient to the scale layer, the weight gradient used to update the weight, and the output gradient to output the scale layer, etc.
  • the computer device performs forward propagation processing on the training samples input to the first neural network to obtain the value of the loss function.
  • an applicable loss function can be selected.
  • the loss function adopted by the first neural network may include but not limited to the following types: 0-1 loss, hinge loss (hinge loss), softmax loss, logistic-loss (Logistic-loss), cross entropy (cross entropy), softmax cross entropy (softmax cross entropy), triplet loss (triplet loss), mean squared error (mean squared error, MSE), mean absolute error (mean absolute error, MAE), smooth L1 loss, L1 loss, L2 loss, Center loss (center loss), etc.
  • the computer device performs a scaling operation on the first gradient of the first operator in the target scale layer according to the scaling scale of the target scale layer, so as to obtain the second gradient of the first operator.
  • the target scale layer is any one of the multiple scale layers.
  • the first gradient of the first operator comes from the value of the loss function.
  • the scaling operation is an enlargement or reduction operation.
  • The second gradient of the first operator is used to determine the weight gradient of each operator in the target scale layer.
  • There may be one or more operators in the target scale layer. If there are multiple operators, they have a logical order, and the output of one operator may be used as the input of the next operator. The first operator refers to the operator that ranks first in the logical order among the multiple operators.
  • the first gradient can be understood as the input gradient
  • the second gradient is the gradient after the scaling operation is performed according to the scaling scale of the target scale layer.
  • the weight gradient of each operator in the target scale layer can be determined through the second gradient.
  • The first gradient of the first operator may come from the value of the loss function either directly or indirectly. If the target scale layer is the first scale layer in the backpropagation direction, the first gradient of the first operator in the target scale layer can be obtained by taking the derivative of the value of the loss function, which can be understood as the first gradient coming directly from the value of the loss function; if the target scale layer is not the first scale layer in the backpropagation direction, the first gradient of the first operator in the target scale layer is obtained from the gradient output by the previous scale layer.
  • the gradient output by the previous scale layer will be related to the first gradient of the first operator in the first scale layer in the backpropagation direction. This situation can be understood as the first gradient of the first operator in the target scale layer indirectly comes from the value of the loss function.
  • the computer device adjusts the zoom scale of the target scale layer according to the performance of the weight gradient of each operator in the target scale layer within the expression range of the first precision operation.
  • In this way, the first gradient of the first operator can be scaled according to the scaling scale of the scale layer, the weight gradient of each operator in the scale layer can then be calculated, and the performance of those weight gradients within the expression range of the first precision operation can be observed to adjust the scaling scale of the corresponding scale layer, so that the gradient underflow rate of the first precision operation is effectively reduced with only a small amount of extra calculation.
  • As a result, mixed-precision training can be well applied to the training of the neural network: while maintaining high training accuracy, training efficiency is improved, and data computed at low precision only needs to occupy a small amount of on-chip storage space, which is also conducive to the chip reading and writing low-precision data quickly.
  • Scale layering is performed on the network using mixed precision operations to obtain the first neural network.
  • In addition to scale adjustment, the scale layers can be corrected during forward propagation and backpropagation.
  • The adjustment method of the above-mentioned neural network may include the following parts: 1. performing scale layering on the initial neural network to obtain the first neural network; 2. adjusting the scaling scale of each scale layer through trial and error within the layer; 3. applying inter-layer corrections to optimize the adjustment process.
  • After the computer device receives the initial neural network to be trained, it can mark the first-type operators in the initial neural network as using the first precision operation, while the second-type operators keep the second precision operation by default, thereby obtaining a network using mixed-precision operations; scale layering is then performed on the network using mixed-precision operations to obtain the first neural network.
  • the initial neural network may be a neural network constructed using single-precision operations, such as neural networks such as ResNet50 and MobileNet.
  • the first type of operator can be a convolution (Convolution, Conv) operator and/or a fully connected (Fully Connect, FC) operator
  • the second type of operator can be anything other than the first type of operator in the initial neural network All operators or some operators, such as: normalization operator.
  • the scheme of scale layering may include automatic scale layering and manual scale layering, which will be introduced respectively below.
  • step 301 is just a way to calculate the initial scale of the network layer, and the initial scale of the network layer can also be determined in other ways.
  • A training sample in the embodiment of the present application refers to a batch of samples rather than a single sample; as shown in Figure 7, training sample 1, training sample 2, ..., training sample n come from different batches.
  • a set of scale values will be obtained with the preset underflow rate as the target.
  • This set of scale values includes one scale value for each network layer. For example, if training sample 1 is input and the network using mixed-precision operations includes m network layers, then group 1 is obtained, and group 1 includes scale value 11 of network layer 1, scale value 21 of network layer 2, ..., scale value m1 of network layer m. Similarly, if training sample 2 is input, group 2 is obtained, which includes scale value 12 of network layer 1, scale value 22 of network layer 2, ..., scale value m2 of network layer m.
  • By analogy, when training sample n is input, group n is obtained, which includes scale value 1n of network layer 1, scale value 2n of network layer 2, ..., scale value mn of network layer m.
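  • Averaging the groups per network layer then gives the initial scales; a small sketch with hypothetical values (3 network layers, 4 batches):

```python
import numpy as np

# Each row is one group obtained from one training sample (batch);
# each column holds the scale values of one network layer.
groups = np.array([
    [1024.0, 256.0, 4096.0],   # group 1
    [2048.0, 256.0, 4096.0],   # group 2
    [1024.0, 512.0, 8192.0],   # group 3
    [1024.0, 256.0, 4096.0],   # group 4
])

initial_scales = groups.mean(axis=0)   # average over batches, per network layer
print(initial_scales)                  # [1280.  320. 5120.]
```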
  • the layer-merging network may be directly used as the first neural network, or step 303 may be performed on the basis of the layer-merging network.
  • the merged network can be called layer merging network
  • the layer-merged network can be directly used as the first neural network
  • the first neural network including (m-1) scale layers on the right side of Figure 8 is obtained.
  • The scaling scale of scale layer 1 is a, the scaling scale of scale layer 2 is b, ..., and the scaling scale of scale layer (m-1) is f.
  • the dotted arrow indicates backpropagation
  • the solid arrow indicates forward propagation.
  • the process from the layer merging network to the first neural network can be understood by referring to FIG. 9.
  • In FIG. 9, building on the merging process shown on the left side of the figure, the scaling scale of scale layer 1 is still a; the scaling scale of scale layer 2 is still b, but only one b/a scaling operation needs to be performed, either at the output interface of scale layer 1 or at the input interface of scale layer 2; and so on, until the scaling scale of scale layer (m-1), which is still f, but only one f/e scaling operation needs to be performed, either at the output interface of scale layer (m-2) or at the input interface of scale layer (m-1).
  • e is the initial scale of the previous layer of the network layer m.
  • the dotted arrows indicate backpropagation
  • the solid arrows indicate forward propagation.
  • a scaling operation may be performed at the output interface of the previous scale layer, or may be performed at the input interface of the subsequent scale layer.
  • The difference is that in automatic scale layering, the initial scale of each network layer in the network using mixed-precision operations is determined according to the preset underflow rate, whereas in manual scale layering, configuration information for configuring the initial scale of each network layer in the network using mixed-precision operations is received, and the initial scale of each network layer is determined according to that configuration information.
  • the layer merging of the other network layers and the merging of scaling operations are the same as described in the automatic scale layering section and can be understood with reference to that section; details are not repeated here.
  • converting a single-precision neural network into a mixed-precision neural network and assigning different scaling scales to different network layers can greatly reduce the underflow rate of the gradient, and makes it possible to stably train neural networks in which the gradient distributions of the individual network layers have a large dynamic range.
  • in the backpropagation direction, the process of performing scale adjustment is slightly different depending on whether the first neural network was obtained in step 302 or in step 303; the two cases are introduced separately below.
  • case 1: the layer-merging network of step 302 is used as the first neural network.
  • as shown in Figure 10, in this first neural network the scale layers in the backpropagation direction are scale layer 1, scale layer 2, ..., scale layer (m-1).
  • scale layer 1 includes three operators, namely operator 1, operator 2 and operator 3; the logical relationship between the three operators is shown in Figure 10: the output of operator 1 is the input of operator 2, and the output of operator 2 is the input of operator 3, where operator 1 is the first operator of scale layer 1.
  • when the target scale layer is scale layer 1, the first gradient of operator 1 can be obtained by differentiating the value of the loss function, and the scaling operation is then performed on the first gradient according to the scaling scale a of scale layer 1; in the embodiment of the present application, taking the scaling operation being an enlargement operation as an example, the first gradient is enlarged by a factor of a to obtain the second gradient of operator 1.
  • the second gradient is then used to calculate the weight gradient S1 of operator 1 and the output gradient O1 of operator 1; the output gradient O1 of operator 1 is output to operator 2, the weight gradient S2 of operator 2 is calculated from the output gradient O1, and the output gradient O2 of operator 2 is also calculated; the output gradient O2 of operator 2 is output to operator 3, and the weight gradient S3 of operator 3 and the output gradient O3 of operator 3 are calculated from the output gradient O2 of operator 2.
  • when the weight gradient of each operator in the target scale layer includes infinite values (infinite, INF) or invalid numbers (Not a Number, NAN), the scaling scale of the target scale layer is reduced, for example by a factor of 2^(1/1000) on the basis of scaling scale a, so that the adjusted scaling scale is a×2^(-1/1000).
  • when the weight gradient of each operator is within the expression range of the first precision operation, the scaling scale of the target scale layer is increased, for example enlarged by 2^(1/1000) on the basis of scaling scale a, so that the adjusted scaling scale is a×2^(1/1000).
  • it should be noted that obtaining the adjusted scaling scale as a×2^(-1/1000) or a×2^(1/1000) as listed here is only one way; the scaling scale can also be reduced or increased in other ways, for example by subtracting a value from the current scaling scale or adding a value to it. Other ways may also be used: as long as the scaling scale can be reduced or increased, the approach is applicable to the scaling-scale adjustment of the present application.
  • when the weight gradient of each operator includes infinite values or invalid numbers (an invalid number may be, for example, a fraction whose denominator is 0), it means that the scaling scale of the current target scale layer is too large and the scaling scale of the target scale layer needs to be reduced.
  • when the weight gradient of each operator is within the expression range of the first precision operation, it is also possible to try to further expand the scaling scale of the target scale layer by increasing it; this is conducive to finding the optimal scaling scale for each scale layer and thereby improving the convergence speed of the neural network.
  • when the weight gradient of each operator in the target scale layer is within the expression range of the first precision operation, the inverse scaling operation of the scaling operation is performed on the weight gradient of each operator in the target scale layer according to the scaling scale of the target scale layer, to obtain the weight gradient of each operator in the target scale layer after the inverse scaling operation; the weight of each operator in the target scale layer is then updated according to the weight gradient of each operator after the inverse scaling operation.
  • with reference to Figure 10, in scale layer 1, when the weight gradient S1 of operator 1, the weight gradient S2 of operator 2, and the weight gradient S3 of operator 3 are all within the expression range of the first precision operation, the inverse scaling operation is performed on the weight gradient S1, weight gradient S2, and weight gradient S3 according to the scaling scale a of scale layer 1, that is, the weight gradients are reduced by a factor of a, to obtain the corresponding weight gradient U1, weight gradient U2, and weight gradient U3 after the inverse scaling operation; further, the weight of operator 1 may be updated according to the weight gradient U1, the weight of operator 2 according to the weight gradient U2, and the weight of operator 3 according to the weight gradient U3.
  • in addition, the gradient to be output by the target scale layer can be determined according to the second gradient of the first operator; the inverse scaling operation of the scaling operation is then performed on the gradient to be output according to the scaling scale of the target scale layer, to obtain the output gradient of the target scale layer.
  • with reference to Figure 10, the output gradient O3 of operator 3 can be used as the gradient to be output; the output gradient O3 is then reduced by a factor of a to obtain the output gradient to be output to scale layer 2, which is transmitted to scale layer 2. A sketch of the processing within one scale layer is given below.
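For illustration only, the processing within one scale layer in the backpropagation direction (case 1, Figure 10) may be sketched as follows; the operator interface op.backward and op.apply_weight_update is an assumption of this sketch, not an API defined by the application.

```python
import numpy as np

def backprop_scale_layer(first_grad, ops, scale):
    """One scale layer in the backpropagation direction (case 1, Figure 10).

    first_grad : first gradient entering the first operator of the layer
    ops        : operators in logical order; op.backward(g) -> (weight_grad, out_grad)
                 and op.apply_weight_update(w_grad) are assumed helpers
    scale      : scaling scale of this layer (a for scale layer 1)
    """
    g = (first_grad * scale).astype(np.float16)        # second gradient of the first operator
    weight_grads = []
    for op in ops:                                      # operator 1 -> operator 2 -> operator 3
        w_grad, g = op.backward(g)                      # S_i and O_i, both still scaled by `scale`
        weight_grads.append(w_grad)

    ok = all(np.all(np.isfinite(w)) for w in weight_grads)   # any INF or NAN?
    if ok:
        for op, w_grad in zip(ops, weight_grads):
            # inverse scaling operation before the weight update
            op.apply_weight_update(w_grad.astype(np.float32) / scale)
    out_grad = g.astype(np.float32) / scale             # inverse scaling of the gradient to be output
    return out_grad, ok
```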
  • the execution process of the solution of this embodiment in scale layer 2 is basically the same as that in scale layer 1; the difference is that the first gradient of the first operator in scale layer 2 is the received output gradient output by scale layer 1, rather than the first gradient being calculated directly from the value of the loss function as it is for operator 1.
  • the process of adjusting the scaling scale within a scale layer can also be understood with reference to FIG. 11; as shown in FIG. 11, the process includes:
  • 401. initialize the abnormal-state flag in the first neural network, recorded as 0;
  • 402. traverse the operators in each scale layer, layer by layer, and determine the weight gradient of each operator, starting from scale layer i (initially i = 1);
  • 403. check whether the weight gradient of any traversed operator contains NAN or INF;
  • 404. when the weight gradient of a traversed operator contains NAN or INF, set the abnormal-state flag to 1 and reduce the scaling scale of that scale layer by a factor of 2^(1/1000);
  • 405. when the weight gradients of all operators of that scale layer have been traversed without NAN or INF appearing, enlarge the scaling scale of that scale layer by 2^(1/1000);
  • 406. let i = i + 1, traverse the next scale layer, and repeat 402 to 406 until all scale layers of the first neural network have been traversed;
  • 407. when no abnormal state has occurred in any layer during the traversal, update the weights of the operators.
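A minimal sketch of this layer-by-layer trial-and-error is given below, assuming each scale layer exposes its scaling scale and the weight gradients computed for the current batch; the 2^(1/1000) step follows the example above, and the object interface is hypothetical.

```python
import numpy as np

STEP = 2.0 ** (1.0 / 1000.0)          # adjustment factor used in the example above

def adjust_scales_once(scale_layers):
    """One traversal of the scale layers, as in the FIG. 11 process (steps 401-407)."""
    abnormal = False                                   # abnormal-state flag, initialised to 0
    for layer in scale_layers:                         # scale layer i = 1 .. (m-1)
        grads = layer.weight_gradients()               # weight gradients of the layer's operators
        if any(not np.all(np.isfinite(g)) for g in grads):
            abnormal = True
            layer.scale /= STEP                        # NAN/INF found: reduce the scaling scale
        else:
            layer.scale *= STEP                        # all representable: enlarge the scaling scale
    return not abnormal                                # weights are updated only if no layer was abnormal
```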
  • case 2: the network obtained by merging the scaling operations in step 303 is used as the first neural network.
  • as shown in Figure 12, in this first neural network the scale layers in the backpropagation direction are scale layer 1, scale layer 2, ..., scale layer (m-1); the scaling scale of scale layer 1 is a, the scaling scale of scale layer 2 is b, ..., and the scaling scale of scale layer (m-1) is f, but only one b/a scaling operation needs to be performed at the output interface of scale layer 1 or the input interface of scale layer 2, and so on: only one f/e scaling operation needs to be performed at the output interface of scale layer (m-2) or the input interface of scale layer (m-1).
  • in the embodiment of the present application, the case where the b/a scaling operation is performed at the input interface of scale layer 2 is taken as an example for illustration.
  • the execution process within scale layer 1 can be understood with reference to the Figure 10 part above; the difference is that the gradient to be output is not reduced by a factor of a; instead, at the input interface of scale layer 2 it is directly enlarged by a factor of b/a, rather than the output gradient of scale layer 1 being enlarged again by a factor of b in scale layer 2 as in Figure 10.
  • the execution process within scale layer 2 is shown in Figure 12. Scale layer 2 includes three operators, namely operator 4, operator 5 and operator 6; the logical relationship between the three operators is shown in Figure 12: the output of operator 4 is the input of operator 5, and the output of operator 5 is the input of operator 6, where operator 4 is the first operator of scale layer 2.
  • when the target scale layer is scale layer 2, the output gradient of scale layer 1 is the first gradient of operator 4, and the scaling operation is performed on the first gradient of operator 4 according to the scaling scale b/a of scale layer 2; taking the scaling operation being an enlargement operation as an example, the first gradient is enlarged by a factor of b/a to obtain the second gradient of operator 4.
  • the second gradient is then used to calculate the weight gradient S4 of operator 4 and the output gradient O4 of operator 4; the output gradient O4 of operator 4 is output to operator 5, the weight gradient S5 of operator 5 is calculated from the output gradient O4, and the output gradient O5 of operator 5 is also calculated; the output gradient O5 of operator 5 is output to operator 6, and the weight gradient S6 of operator 6 and the output gradient O6 of operator 6 are calculated from the output gradient O5 of operator 5.
  • when the weight gradient of each operator in scale layer 2 includes INF or NAN, the scaling scale of the target scale layer is reduced, for example by a factor of 2^(1/1000) on the basis of scaling scale b, so that the adjusted scaling scale is b×2^(-1/1000); when the weight gradient of each operator is within the expression range of the first precision operation, the scaling scale of the target scale layer is enlarged, for example by 2^(1/1000) on the basis of scaling scale b, so that the adjusted scaling scale is b×2^(1/1000).
  • when the weight gradient of each operator is within the expression range of the first precision operation, the inverse scaling operation of the scaling operation is performed on the weight gradient of each operator according to the scaling scale of the target scale layer, to obtain the weight gradient of each operator after the inverse scaling operation; the weight of each operator is then updated according to the weight gradient of each operator after the inverse scaling operation.
  • with reference to Figure 12, when the weight gradient S4 of operator 4, the weight gradient S5 of operator 5, and the weight gradient S6 of operator 6 are all within the expression range of the first precision operation, the inverse scaling operation is performed on the weight gradient S4, weight gradient S5, and weight gradient S6 according to the scaling scale b of scale layer 2, that is, the weight gradients are reduced by a factor of b, to obtain the corresponding weight gradient U4, weight gradient U5, and weight gradient U6 after the inverse scaling operation; further, the weight of operator 4 may be updated according to the weight gradient U4, the weight of operator 5 according to the weight gradient U5, and the weight of operator 6 according to the weight gradient U6.
  • in Figure 12, the gradient to be output by scale layer 2 is not further processed and is directly output to the next scale layer; the processing in the next scale layer can be understood with reference to the processing in scale layer 2, except that the scaling scale of the next scale layer may differ. A sketch of this boundary handling is given below.
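For case 2, the backpropagation across scale-layer boundaries can be sketched as follows; layers are represented as (scale, backward_fn) pairs, which is an assumption of this illustration. Weight gradients inside each layer are still unscaled by that layer's own scaling scale before the weight update, as described above.

```python
def backprop_merged_boundaries(loss_grad, layers):
    """Back-propagation when boundary scaling operations are merged (case 2, Figure 12).

    layers: list of (scale, backward_fn) pairs in backpropagation order, where
            backward_fn(g) returns the gradient that the layer passes on;
            this interface is assumed only for illustration.
    """
    scale_1, backward_1 = layers[0]
    g = backward_1(loss_grad * scale_1)            # scale layer 1: enlarge by a once
    prev_scale = scale_1
    for scale, backward_fn in layers[1:]:
        g = backward_fn(g * (scale / prev_scale))  # one b/a, ..., f/e operation per boundary
        prev_scale = scale
    return g
```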
  • the inter-layer correction can be performed in the backward propagation direction, and the inter-layer correction can also be performed in the forward propagation direction, which will be introduced respectively below.
  • backpropagation inter-layer correction: when the output gradient of the target scale layer is an infinite value or an invalid number, the output gradient of the target scale layer is corrected to a valid value, and the corrected output gradient of the target scale layer is transmitted to the adjacent scale layer of the target scale layer.
  • as shown in Figure 13, when the target scale layer is scale layer 1 and the output gradient matrix of scale layer 1 contains two inf entries, the output gradient needs to be corrected: the inf entries in the matrix can be corrected to 0, and the corrected output gradient is then output.
  • if the output gradient of the target scale layer is an infinite value or an invalid number, it means that the output gradient is not suitable for updating the weights of the operators in each scale layer, and the weight-update step is skipped directly; however, so as not to affect the subsequent calculation, the output gradient can be corrected to a valid value within the expression range of the first precision operation and then transmitted to the next layer for calculation, which is conducive to improving the training efficiency of the neural network. A sketch of this correction is given below.
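A minimal NumPy sketch of this backward inter-layer correction is given below; correcting the offending entries to 0 follows the Figure 13 example, and the boolean return value indicating whether the gradient may be used for weight updates is an assumption of the sketch.

```python
import numpy as np

def correct_output_gradient(out_grad):
    """Inter-layer correction in the backpropagation direction (Figure 13)."""
    bad = ~np.isfinite(out_grad)                  # INF or NAN entries
    if not bad.any():
        return out_grad, True                     # usable for weight updates
    corrected = np.where(bad, 0.0, out_grad)      # e.g. the two inf entries become 0
    return corrected, False                       # weight update is skipped for this pass
```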
  • forward-propagation inter-layer correction: during forward propagation, if the feature values of the target scale layer include infinite values or invalid numbers, the update of the target scale layer is skipped.
  • the feature values of the target scale layer are the feature values generated by each scale layer during forward propagation; whether to update the operator weights in a scale layer can therefore also be determined from whether the feature values fall within the expression range of the first precision operation. If the forward features include infinite values or invalid numbers, the operator weights do not need to be updated, which is conducive to improving the training efficiency of the neural network.
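The corresponding forward-direction check can be sketched in one helper, assuming the feature values produced by a scale layer are available as arrays:

```python
import numpy as np

def should_update_layer(forward_features):
    """Skip the update of a scale layer whose forward features contain INF or NAN."""
    return all(np.all(np.isfinite(f)) for f in forward_features)
```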
  • on the basis of any of the above embodiments, the following process can also be performed: when the training of the first neural network reaches a preset condition, scale layering is performed on the first neural network again to obtain a second neural network.
  • the preset condition can be that the number of training iterations reaches a certain threshold, for example 300 training cycles, or that the neural network has been trained to a certain extent, for example the differences between the scaling scales of the scale layers are smaller than a preset value; the first neural network can then be scale-layered again (the scale layering can be understood with reference to the earlier description) to obtain a new second neural network, which is then trained. This way of dynamically updating the scale layers can improve the training efficiency of the neural network.
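A small sketch of such a preset condition is given below; the concrete thresholds (300 cycles, the scale-gap value) are illustrative assumptions consistent with the examples above.

```python
def should_relayer(epoch, layer_scales, max_epochs=300, scale_gap=1e-3):
    """Decide whether to perform scale layering again to obtain the second neural network."""
    trained_long_enough = epoch >= max_epochs                        # e.g. 300 training cycles
    scales_converged = (max(layer_scales) - min(layer_scales)) < scale_gap
    return trained_long_enough or scales_converged
```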
  • as shown in FIG. 14, an embodiment of the neural network adjustment device 50 provided by the embodiment of the present application includes:
  • the acquiring unit 501 is configured to acquire a first neural network using mixed-precision operations.
  • the first neural network includes a plurality of scale layers, each scale layer having a scaling scale; the scaling scale of each scale layer refers to the scale by which, when training the first neural network, the gradients associated with that scale layer in the backpropagation direction are enlarged or reduced, and the mixed-precision operations include a first precision operation.
  • the acquiring unit 501 may execute step 201 in the above method embodiment corresponding to FIG. 4 .
  • the first processing unit 502 is configured to perform forward-propagation processing on the training samples input to the first neural network acquired by the acquisition unit 501, to obtain the value of the loss function; the first processing unit 502 may execute step 202 in the above method embodiment corresponding to FIG. 4.
  • the second processing unit 503 is configured to perform, in the backpropagation direction, a scaling operation on the first gradient of the first operator in the target scale layer according to the scaling scale of the target scale layer, to obtain the second gradient of the first operator; the target scale layer is any one of the multiple scale layers, the first gradient of the first operator comes from the value of the loss function obtained by the first processing unit 502, the scaling operation is an enlargement operation or a reduction operation, and the second gradient of the first operator is used to determine the weight gradient of each operator in the target scale layer. The second processing unit 503 may execute step 203 in the above method embodiment corresponding to FIG. 4.
  • the third processing unit 504 is configured to adjust the scaling scale of the target scale layer according to the performance of the weight gradient of each operator in the target scale layer obtained by the second processing unit 503 within the expression range of the first precision operation.
  • the third processing unit 504 may execute step 204 in the above method embodiment corresponding to FIG. 4 .
  • in the embodiment of the present application, during training of the neural network, the first gradient of the first operator can be scaled according to the scaling scale of the scale layer, the weight gradient of each operator in the scale layer can then be calculated, and the performance of the weight gradient of each operator within the expression range of the first precision operation can be observed in order to adjust the scaling scale of the corresponding scale layer; in this way, the underflow rate of the gradient in the first precision operation can be effectively reduced with a small amount of calculation.
  • as a result, mixed-precision training can be well applied to the training of the neural network: while a high training accuracy is maintained, the training efficiency is improved; moreover, data using low-precision operations only needs to occupy a small storage space on the chip, which is also conducive to the chip's fast reading and writing of such data.
  • the mixed-precision operation further includes a second precision operation whose expression range is greater than that of the first precision operation. The acquisition unit 501 is configured to: receive an initial neural network to be trained; mark the first type of operator in the initial neural network as using the first precision operation, to obtain a network using mixed-precision operations, in which the second type of operator uses the second precision operation; and perform scale layering on the network using mixed-precision operations to obtain the first neural network.
  • the obtaining unit 501 is configured to: obtain the initial scale of each network layer in the network adopting mixed precision operation; combine the network layers with the same initial scale to obtain the first neural network.
  • the acquisition unit 501 is configured to: acquire the initial scale of each network layer in the network using mixed-precision operations; combine the network layers with the same initial scale to obtain a layer-merged network; and merge the first scaling operation of the output interface of the first network layer in the layer-merged network with the second scaling operation of the input interface of the second network layer, to obtain the first neural network, where the first network layer and the second network layer are adjacent and, in the backpropagation direction, the first network layer is the layer preceding the second network layer.
  • the acquisition unit 501 is configured to determine the initial scale of each network layer in the network using mixed-precision operations according to a preset underflow rate, or to receive configuration information that configures the initial scale of each network layer in the network using mixed-precision operations and determine the initial scale of each network layer according to that configuration information.
  • the third processing unit 504 is configured to: reduce the scaling scale of the target scale layer when the weight gradient of each operator in the target scale layer includes infinite values or invalid numbers; and increase the scaling scale of the target scale layer when the weight gradient of each operator in the target scale layer is within the expression range of the first precision operation.
  • the third processing unit 504 is further configured to: when the weight gradient of each operator is within the expression range of the first precision operation, perform, according to the scaling scale of the target scale layer, the inverse scaling operation of the scaling operation on the weight gradient of each operator in the target scale layer, to obtain the weight gradient of each operator in the target scale layer after the inverse scaling operation; and update the weight of each operator in the target scale layer according to the weight gradient of each operator after the inverse scaling operation.
  • the third processing unit 504 is also configured to determine the gradient to be output by the target scale layer according to the second gradient of the first operator, and to perform, according to the scaling scale of the target scale layer, the inverse scaling operation of the scaling operation on the gradient to be output, to obtain the output gradient of the target scale layer.
  • the third processing unit 504 is also configured to correct the output gradient of the target scale layer to a valid value within the expression range of the first precision operation when the output gradient of the target scale layer is an infinite value or an invalid number, and to transmit the corrected output gradient of the target scale layer to the adjacent scale layer of the target scale layer.
  • the third processing unit 504 is further configured to skip updating the target scale layer during forward propagation if the feature value of the target scale layer includes an infinite value or an invalid number.
  • the third processing unit 504 is further configured to re-scale the first neural network to obtain a second neural network when the training of the first neural network reaches a preset condition.
  • the acquiring unit 501, the first processing unit 502, the second processing unit 503, and the third processing unit 504 may be implemented by one unit or module, or by multiple units or modules; this is not limited in the embodiment of the present application, as long as the above method flow can be executed.
  • the neural network adjustment device provided by the embodiment of the present application can be understood by referring to the corresponding content in the foregoing neural network adjustment method part, and will not be repeated here.
  • FIG. 15 is a schematic diagram of a possible logical structure of the computer device 60 provided by the embodiment of the present application.
  • the computer device 60 can be an adjustment device of a neural network.
  • the computer device 60 includes: a processor 601 , a communication interface 602 , a memory 603 and a bus 604 .
  • the processor 601 , the communication interface 602 and the memory 603 are connected to each other through a bus 604 .
  • the processor 601 is used to control and manage the actions of the computer device 60; for example, the processor 601 is used to execute the neural network adjustment process in the method embodiments shown in FIG. 2 to FIG. 13, and the communication interface 602 is used to support the computer device 60 in communicating.
  • the memory 603 is used for storing program codes and data of the computer device 60 .
  • the processor 601 may be a central processing unit, a general processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic devices, transistor logic devices, hardware components or any combination thereof. It can implement or execute the various illustrative logical blocks, modules and circuits described in connection with the present disclosure.
  • the processor 601 may also be a combination that implements computing functions, for example, a combination of one or more microprocessors, a combination of a digital signal processor and a microprocessor, and the like.
  • the bus 604 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus or the like.
  • in another embodiment of the present application, a computer-readable storage medium is also provided, in which computer-executable instructions are stored; when the processor of a device executes the computer-executable instructions, the device executes the model training method in FIG. 3 to FIG. 8 above, or the neural network adjustment method in FIG. 2 to FIG. 13 above.
  • a computer program product includes computer-executable instructions stored in a computer-readable storage medium; when the processor of the device executes the computer-executable instructions , the device executes the neural network adjustment method in Figure 2-13 above.
  • a system-on-a-chip is further provided, and the system-on-a-chip includes a processor, and the processor is configured to implement the method for adjusting the neural network in FIGS. 2-13 above.
  • the system-on-a-chip may further include a memory, which is used for storing necessary program instructions and data of the device for inter-process communication.
  • the system-on-a-chip may consist of chips, or may include chips and other discrete devices.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of units is only a logical functional division; in actual implementation there may be other division methods, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • a unit described as a separate component may or may not be physically separated, and a component displayed as a unit may or may not be a physical unit, that is, it may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • if the functions are realized in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the embodiments of the present application, or the part that contributes to the prior art, or part of the technical solution, can essentially be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.

Abstract

本申请公开了一种神经网络的调整方法,应用于AI模型训练的过程中,该调整方法通过获取包含多个尺度层的第一神经网络,该第一神经网络采用混合精度运算,每个尺度层具有一个缩放尺度,在反向传播方向上,在每个尺度层按照对应的缩放尺度确定该尺度层上算子的权重梯度,例如,通过权重梯度在FP16中的表现,(如:是否有INF或NAN)来调整相应尺度层的缩放尺度,这样通过很小的计算量,就可以有效降低FP16的梯度的下溢率,可以使混合精度训练很好的应用于神经网络的训练。本申请提供的技术方案在保持了较高的训练精度的情况下,提高了训练效率,而且采用低精度运算的数据只需要占用芯片上较小的存储空间,也有利于芯片对低精度运算的数据的快速读写。

Description

一种神经网络的调整方法及相应装置
本申请要求于2021年12月15日提交中国专利局、申请号为202111535584.9、发明名称为“一种神经网络的调整方法及相应装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,具体涉及一种神经网络的调整方法及相应装置。
背景技术
将混合精度应用在对神经网络的训练中,指的是将两种或两种以上不同精度的运算混合应用在对神经网络的训练过程中,例如:将半精度浮点型(FP16)和单精度浮点型(FP32)相结合对神经网络进行训练,这样可以在尽可能减少精度损失的情况下利用FP16加速训练过程。
虽然FP16可以加速训练过程,但FP16的表达范围窄,而神经网络在训练过程中会产生很多梯度,这些梯度分布的范围较大,尤其是随着神经网络越来越复杂,会产生更多较小的梯度,其中,很多较小的梯度都超过了FP16的表达范围的下限,导致这些较小的梯度向下溢出了FP16的表达范围,这样就会出现梯度的下溢率较大的问题,而下溢率较大会影响神经网络训练的精度。但如果只采用FP32进行训练,因为FP32的精度很高,数据的读取和写入速度都比采用FP16慢很多,影响了芯片的计算能力,另外FP32格式的数据需要占用芯片上较大的存储空间,也影响了芯片向小型化发展的方向。
因此基于芯片的计算能力和存储能力,采用混合精度来训练神经网络是较好的选择,而如何降低神经网络在混合精度训练过程中梯度的下溢率,成为当前在混合精度神经网络训练过程中需要克服的一大挑战。
发明内容
本申请提供一种神经网络的调整方法,用于降低神经网络在训练过程中梯度的下溢率,提高神经网络训练的效率。本申请还提供了相应的装置、计算机可读存储介质、计算机程序产品以及芯片系统等。
本申请第一方面提供一种神经网络的调整方法,包括:获取采用混合精度运算的第一神经网络,第一神经网络包括多个尺度层,其中,每个尺度层具有一个缩放尺度,每个尺度层的缩放尺度指的是用于训练第一神经网络时,对反向传播方向上与每个尺度层相关的梯度进行放大或缩小的尺度,混合精度运算包括第一精度运算;对输入到第一神经网络的训练样本进行前向传播的处理,以得到损失函数的值;在反向传播方向上,按照目标尺度层的缩放尺度对目标尺度层中第一个算子的第一梯度进行缩放操作,以得到第一个算子的第二梯度,目标尺度层为多个尺度层中的任意一个尺度层,第一个算子的第一梯度来源于损失函数的值,缩放操作为放大操作或缩小操作,第一个算子的第二梯度用于确定目标尺度层中每个算子的权重梯度;根据目标尺度层中每个算子的权重梯度在第一精度运算的表 达范围内的表现,调整目标尺度层的缩放尺度。
本申请提供的神经网络的调整方法可以应用于神经网络的训练的过程中,在训练不同目标的神经网络模型时,都可以使用本申请的调整方法来调整神经网络中各层的缩放尺度。需要说明的是,本申请提供的方案可以是以软件包/插件的形式存储在网络(可以是云端)上,用户使用时通过下载,安装到计算机设备即可执行本申请的流程;本申请提供的方案也可以以云服务等形式提供给用户,用户可以将要训练的神经网络上传到云端,云端采用本申请提供的方案训练神经网络;也可以是将本申请的方案配置在芯片中,用于模型训练的计算机设备安装该芯片即可执行本申请的流程;也可以是将本申请方案配置到计算机设备中,计算机设备训练神经网络时即可执行本申请的流程。应理解,除了上述部署本申请提供的技术方案的方式以外,对于使用本申请提供的技术方案的具体形式,本申请不做限定。
本申请中,混合精度运算指的是将两种或两种以上不同精度的运算混合应用,如应用在对神经网络的训练过程中。其中,第一精度运算可以为半精度浮点型FP16运算,第二精度运算可以为单精度浮点型FP32运算,当然,第一精度运算和第二精度运算也可以是其他类型的精度运算,例如,第一精度运算可以为单精度浮点型FP32运算,第二精度运算可以为双精度浮点型FP64运算,只要第二精度运算的精度高于第一精度运算即可,也就是第二精度运算的表达范围大于第一精度运算的表达范围。
本申请中,第一神经网络指的是进行过尺度分层的神经网络,该第一神经网络可以是通过自动尺度分层或者手动尺度分层得到的。
本申请中,尺度层可以理解为是通过尺度分层后所得到的一个层,每个尺度层都具有一个缩放尺度,每个尺度层的缩放尺度通常不相同,当然,对此本申请中不做限定,不同尺度层的缩放尺度也可以相同。缩放尺度指的是对反向传播方向上与每个尺度层相关的梯度进行放大或缩小的尺度。
本申请中,涉及到前向传播和反向传播,前向传播指的是对输入到神经网络的训练样本进行处理直到得到损失函数的值(误差损失)的过程。反向传播(back propagation,BP)指的是:利用前向传播产生的损失函数,来更新神经网络中的参数,该过程可以是通过损失函数的值来确定每层算子的权重梯度,进而更新算子权重,从而使误差损失收敛。反向传播算法是以损失的值为主导的反向传播过程,旨在得到最优的神经网络的参数,例如权重矩阵。
本申请中,每个尺度层相关的梯度包括输入到尺度层的输入梯度,用于更新权重的权重梯度,以及要输出尺度层的待输出梯度等。
本申请中,目标尺度层中可能有一个或多个算子,目标尺度层至少包括上述第一个算子,如果目标尺度层有多个算子,那么多个算子会有逻辑顺序,一个算子的输出可能作为下一个算子的输入,第一个算子指的是多个算子中逻辑顺序上排在第一位的算子,上述目标尺度层中每个算子包括目标尺度层中的所有算子;如果目标尺度层只有一个算子,上述目标尺度层中每个算子即为上述第一个算子。
本申请中,缩放操作与逆缩放操作是相反的两个操作,如果缩放操作为放大操作,那 么逆缩放操作为缩小操作,如果缩放操作为缩小操作,那么逆缩放操作为放大操作。
本申请中,第一梯度可以理解为是输入梯度,第二梯度是按照目标尺度层的缩放尺度进行缩放操作后的梯度。通过第二梯度可以确定出该目标尺度层中每个算子的权重梯度。
本申请中,第一个算子的第一梯度来源于损失函数的值可以包括直接来源于损失函数的值,以及间接来源于损失函数的值。如果目标尺度层是反向传播方向上的第一个尺度层,则可以通过对损失函数的值进行求导得到目标尺度层中第一个算子的第一梯度,这种情况可以理解为目标尺度层中第一个算子的第一梯度直接来源于损失函数的值;如果目标尺度层不是反向传播方向上的第一个尺度层,则目标尺度层的第一个算子的第一梯度是通过上一个尺度层输出的梯度得到的,逐层类推,上一个尺度层输出的梯度会与反向传播方向上第一个尺度层中第一个算子的第一梯度有关联,这种情况可以理解为目标尺度层中第一个算子的第一梯度间接来源于损失函数的值。
本申请中,采用第一精度运算可以包括采用第一精度的类型进行计算和/或存储,如FP16运算可以理解为采用FP16计算和/或存储,采用第二精度运算可以包括采用第二精度的类型进行计算和/或存储,如FP32运算可以理解为采用FP32计算和/或存储。
本申请中,梯度的下溢率指的是向下超出某一精度运算的表达范围的梯度数量占总的梯度数量的比例。
由上述第一方面可知,在神经网络训练的过程中,可以按照尺度层的缩放尺度对第一个算子的第一梯度进行缩放操作,进而计算出尺度层中每个算子的权重梯度,然后观察每个算子的权重梯度在第一精度运算的表达范围内的表现,来调整相应尺度层的缩放尺度,这样通过很小的计算量,就可以有效降低第一精度运算的梯度的下溢率,这样可以使混合精度训练很好的应用于神经网络的训练,在保持了较高的训练精度的情况下,提高了训练效率,而且采用低精度运算的数据只需要占用较小的存储空间和运行内存,也有利于芯片对低精度运算的数据的快速读写,节省训练神经网络时的计算资源和成本。
本申请提供的技术方案在训练神经网络时,能够使用低精度和高精度相结合的硬件资源达到单纯使用高精度硬件资源训练神经网络时的精度,而且保持相当的训练效率;或者,能够在使用相同的混合精度的硬件资源训练神经网络时,将训练效率提升1-3倍。换句话说,使用本申请提供的技术方案训练神经网络,在保证训练精度的同时,要么可以节省计算资源,要么可以缩短训练时间。尤其是训练Low-level的神经网络、异构神经网络和深层神经网络等时效果更佳。
在第一方面的一种可能的实现方式中,上述步骤:获取采用混合精度运算的第一神经网络,包括:接收待训练的初始神经网络;将初始神经网络中第一类型算子标记为采用第一精度运算,以得到采用混合精度运算的网络,采用混合精度运算的网络中第二类型算子采用所述第二精度运算;对采用混合精度运算的网络进行尺度分层,以得到第一神经网络。
该种可能的实现方式中,初始神经网络可以为采用单一精度运算构建的神经网络,如ResNet50、MobileNet等神经网络。初始神经网络可以是用于对图像/视频数据、音频数据、文本数据等各种类型的数据执行分类/识别、压缩/解压缩、去噪、分割、增强、转换、特征提取等处理的神经网络,本申请对此不做限定。第一类型算子可以为卷积(Convolution, Conv)算子和/或全连接(Fully Connect,FC)算子,第二类型算子可以是初始神经网络中除第一类型算子之外的全部算子或部分算子,如:归一化算子。尺度分层指的是通过手动或自动的方式实现对初始神经网络的分层以及相应分层的缩放尺度的配置。由该种方式可知,将单精度运算的神经网络转换为混合精度运算的神经网络,并给不同网络层分配不同的缩放尺度,可以极大减小梯度的下溢率,能稳定训练各个网络层梯度分布动态范围大的神经网络。
在第一方面的一种可能的实现方式中,上述步骤:对采用混合精度运算的网络进行尺度分层,包括:获取采用混合精度运算的网络中每个网络层的初始尺度;将初始尺度相同的网络层进行合并,以得到上述第一神经网络。
该种可能的实现方式中,针对没进行尺度分层之前的网络中的层可以称为网络层。每个网络层都可以有一个初始尺度,该初始尺度可以是通过训练得到的,也可以是通过配置得到的。将初始尺度相同的网络层进行合并,合并后的网络可以称为层合并网络,该层合并网络即可作为第一神经网络。该种将相同初始尺度的网络层进行合并,可以进一步提高神经网络训练的效率。
在第一方面的一种可能的实现方式中,上述步骤:对采用混合精度运算的网络进行尺度分层,包括:获取采用混合精度运算的网络中每个网络层的初始尺度;将初始尺度相同的网络层进行合并,以得到层合并网络;将层合并网络中第一网络层的输出接口的第一缩放操作和第二网络层的输入接口的第二缩放操作进行合并,以得到第一神经网络,第一网络层和第二网络层相邻,且在反向传播方向上第一网络层是第二网络层的前一层。
相对于上一种可能的实现方式,在这种可能的实现方式中,对层合并网络进一步进行相邻两个网络层的缩放操作与逆缩放操作的合并,如网络层1的输出接口要执行缩放尺度为S的缩小操作,网络层2的输入接口要执行缩放尺度为M的放大操作,则可以将两个操作进行合并,执行M/S的缩放操作即可,该种方式相当于将一次缩放操作和一次逆缩放操作减少为一次缩放操作,缩放操作合并后的一次缩放操作可以在前一个尺度层的输出接口处执行,也可以在后一个尺度层的输入接口处执行。该种实现方式将相邻两个层进行缩放操作与逆缩放操作的合并,减少了计算步骤,进一步提高了神经网络训练的效率。
在第一方面的一种可能的实现方式中,上述步骤:获取采用混合精度运算的网络中每个网络层的初始尺度,包括:根据预设的下溢率确定采用混合精度运算的网络中每个网络层的初始尺度,或者,接收对采用混合精度运算的网络中每个网络层的初始尺度进行配置的配置信息,根据所述配置信息确定所述采用混合精度运算的网络中每个网络层的初始尺度。
该种可能的实现方式中,可以根据预设的下溢率来自动确定每个网络层的初始尺度,属于自动尺度分层。也可以接收用户配置的对每个网络层的初始尺度的配置信息,再根据配置信息确定每个网络层的初始尺度,属于手动尺度分层。由此可知,本申请中提供了多样的尺度分层方式,提高了尺度分层的灵活性。
在第一方面的一种可能的实现方式中,上述步骤:根据预设的下溢率确定采用混合精度运算的网络中每个网络层的初始尺度,包括:
以所述预设的下溢率为目标,输入不同的训练样本确定所述每个网络层的多组尺度值,所述每个网络层的多组尺度值的平均值为所述每个网络层的初始尺度。
该种可能的实现方式中,自动确定初始尺度时,可以是以预设的下溢率为目标,通过不同的训练样本来计算多组每个网络层的尺度值,进而求出平均值作为每个网络层的初始尺度。这样可以提高初始尺度的准确性。
在第一方面的一种可能的实现方式中,上述步骤:根据目标尺度层中每个算子的权重梯度在第一精度运算的表达范围内的表现,调整目标尺度层的缩放尺度,包括:当目标尺度层中每个算子的权重梯度中包括无穷大的数值或无效数字,则减小目标尺度层的缩放尺度;当目标尺度层中每个算子的权重梯度位于第一精度运算的表达范围内,则增大目标尺度层的缩放尺度。
该种可能的实现方式中,当每个算子的权重梯度中包括无穷大的数值或无效数字时,无效数字可以是分母为0的分数等无效的数字,则表示该当前目标尺度层的缩放尺度过大,需要减小该目标尺度层的缩放尺度。当每个算子的权重梯度位于第一精度运算的表达范围内,则还可以通过增大目标尺度层的缩放尺度来尝试进一步扩大该目标尺度层的缩放尺度。这样有利于找到适合各尺度层的最优缩放尺度,从而提高神经网络的收敛速度。
在第一方面的一种可能的实现方式中,当每个算子的权重梯度位于第一精度运算的表达范围内时,该方法还包括:按照目标尺度层的缩放尺度,对目标尺度层中每个算子的权重梯度执行缩放操作的逆缩放操作,以得到目标尺度层中逆缩放操作后的每个算子的权重梯度;根据目标尺度层中逆缩放操作后的每个算子的权重梯度更新目标尺度层中每个算子的权重。
该种可能的实现方式中,因为目标尺度层中每个算子的权重梯度是采用经过缩放操作得到的第一个算子的第二梯度计算得到的,所以在更新权重时,需要先执行对每个算子的权重梯度的逆缩放操作,然后再更新每个算子的权重,这样更有利于神经网络的收敛。
在第一方面的一种可能的实现方式中,该方法还包括:根据第一个算子的第二梯度确定目标尺度层的待输出梯度;按照目标尺度层的缩放尺度,对目标尺度层中待输出梯度执行缩放操作的逆缩放操作,以得到目标尺度层的输出梯度。
该种可能的实现方式中,因为目标尺度层中待输出梯度是通过采用缩放操作后的第一个算子的第二梯度得到的,所以要对该待输出梯度进行相应缩放尺度的逆缩放操作,得到合适的输出梯度用于下一个尺度层的计算。
在第一方面的一种可能的实现方式中,该方法还包括:当目标尺度层的输出梯度为无穷大的数值或无效数字时,将目标尺度层的输出梯度修正为在第一精度运算的表达范围内的有效值,并将修正后的目标尺度层的输出梯度传输给目标尺度层的相邻尺度层。
该种可能的实现方式中,如果目标尺度层的输出梯度为无穷大的数值或无效数字时,则表示该输出梯度不适用于每个尺度层中算子的权重更新,则直接跳过权重更新的步骤,但为了不影响后续的计算过程,可以将该输出梯度修正为在第一精度运算的表达范围内的有效值,再传输给反向传播方向上的下一个尺度层进行计算,有利于提高神经网络的尺度更新效率。
在第一方面的一种可能的实现方式中,该方法还包括:在前向传播的过程中,若所述目标尺度层的特征值包括无穷大的数值或无效数字,则跳过对所述目标尺度层的更新。
该种可能的实现方式中,目标尺度层的特征值是在前向传播过程中各尺度层产生的,在前向传播的过程中,也可以根据特征值在第一精度运算的表达范围内的表现来确定是否进行该尺度层中算子权重的更新,如果前向特征包括无穷大数值或无效数字则不需要更新算子权重,这样,有利于提高神经网络的训练效率。
在第一方面的一种可能的实现方式中,该方法还包括:当对第一神经网络的训练达到预设条件时,重新对第一神经网络进行尺度分层,以得到第二神经网络。
该种可能的实现方式中,预设条件可以是训练次数达到一定阈值,如已经训练了300个周期,也可以是神经网络已经训练到一定的程度,如:各尺度层的缩放尺度的差距小于预设值,可以重新对第一神经网络进行尺度分层,尺度分层的方式可以参阅前面的描述进行理解,然后得到一个新的第二神经网络,再对第二神经网络进行训练。这种动态更新尺度层的方式可以提高神经网络的训练效率。
本申请第二方面提供一种神经网络的调整装置,该神经网络的调整装置具有实现上述第一方面或第一方面任意一种可能实现方式的方法的功能。该功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。该硬件或软件包括一个或多个与上述功能相对应的模块,例如:获取单元、第一处理单元、第二处理单元和第三处理单元,这几个单元可以通过一个处理单元或多个处理单元来实现。
本申请第三方面提供一种计算机设备,该计算机设备包括至少一个处理器、存储器、输入/输出(input/output,I/O)接口以及存储在存储器中并可在处理器上运行的计算机执行指令,当计算机执行指令被处理器执行时,处理器执行如上述第一方面或第一方面任意一种可能的实现方式的方法。
本申请第四方面提供一种存储一个或多个计算机执行指令的计算机可读存储介质,当计算机执行指令被处理器执行时,一个或多个处理器执行如上述第一方面或第一方面任意一种可能的实现方式的方法。
本申请第五方面提供一种存储一个或多个计算机执行指令的计算机程序产品,当计算机执行指令被一个或多个处理器执行时,一个或多个处理器执行如上述第一方面或第一方面任意一种可能的实现方式的方法。
本申请第六方面提供了一种芯片系统,该芯片系统包括至少一个处理器,至少一个处理器用于支持神经网络的调整装置实现上述第一方面或第一方面任意一种可能的实现方式中所涉及的功能。在一种可能的设计中,芯片系统还可以包括存储器,存储器,用于神经网络的调整装置必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包含芯片和其他分立器件。
附图说明
图1是本申请实施例提供的神经网络处理器的一结构示意图;
图2是本申请实施例提供的神经网络的一调整架构示意图;
图3是本申请实施例提供的神经网络的训练过程的一示意图;
图4是本申请实施例提供的神经网络的调整方法的一实施例示意图;
图5是FP16和FP32的一结构示意图;
图6是本申请实施例提供的神经网络的调整方法的另一实施例示意图;
图7是本申请实施例提供的尺度分层一实施例示意图;
图8是本申请实施例提供的尺度分层另一实施例示意图;
图9是本申请实施例提供的尺度分层另一实施例示意图;
图10是本申请实施例提供的神经网络的调整方法的另一实施例示意图;
图11是本申请实施例提供的神经网络的调整方法的另一实施例示意图;
图12是本申请实施例提供的神经网络的调整方法的另一实施例示意图;
图13是本申请实施例提供的神经网络的调整方法的另一实施例示意图;
图14是本申请实施例提供的神经网络的调整装置的一结构示意图;
图15是本申请实施例提供的计算机设备的一结构示意图。
具体实施方式
下面结合附图,对本申请的实施例进行描述,显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。本领域普通技术人员可知,随着技术发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
本申请实施例提供一种神经网络的调整方法,用于降低神经网络在训练过程中的梯度的下溢率,提高神经网络训练的效率。本申请实施例还提供了相应的装置、计算机可读存储介质、计算机程序产品以及芯片系统等。以下分别进行详细说明。
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。
智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶、智慧城市、智能终端等。
神经网络模型通常是在模型所有者的计算机设备或平台(如:服务器、虚拟机(virtual machine,VM)或容器(container))中进行训练得到的,训练好的模型会以模型文件的形式存储。在模型使用者的设备(如:终端设备、服务器或边缘设备、VM或容器等)需要使用该模型时,模型使用者的设备主动加载该模型的模型文件或者模型所有者的设备主动发送给模型使用者的设备安装该模型的模型文件,进而使该模型在模型使用者的设备上应用,执行相应的功能。
服务器指的是物理机。
终端设备(也可以称为用户设备(user equipment,UE))是一种具有无线收发功能的设备,可以部署在陆地上,包括室内或室外、手持或车载;也可以部署在水面上(如轮船等);还可以部署在空中(例如飞机、气球和卫星上等)。所述终端可以是手机(mobile phone)、平板电脑(pad)、带无线收发功能的电脑、虚拟现实(virtual reality,VR)终端、增强现实(augmented reality,AR)终端、工业控制(industrial control)中的无线终端、无人驾驶(self driving)中的无线终端、远程医疗(remote medical)中的无线终端、智能电网(smart grid)中的无线终端、运输安全(transportation safety)中的无线终端、智慧城市(smart city)中的无线终端、智慧家庭(smart home)中的无线终端等。
VM或容器都可以是在物理机的硬件资源上采用虚拟化的方式划分出来的虚拟化的设备。
神经网络模型指的是为了一个或多个业务目标而构建的神经网络。
本申请实施例中,神经网络可以是由神经单元组成的,神经单元可以是指以x s和截距b为输入的运算单元,该运算单元的输出可以为:
$h_{W,b}(x) = f(W^{T}x) = f\left(\sum_{s=1}^{n} W_{s}x_{s} + b\right)$
其中,s=1、2、……n,n为大于1的自然数,W s为x s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层神经单元的输入。激活函数可以是sigmoid函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
本申请实施例中的神经网络可以是深度神经网络(Deep Neural Network,DNN),也可以是卷积神经网络(Convolutional Neural Network,CNN),也可以是其他神经网络。下面对深度神经网络和卷积神经网络进行简单的介绍。
1.深度神经网络
深度神经网络可以理解为具有很多层隐含层的神经网络,这里的“很多”并没有特别的度量标准,常说的多层神经网络和深度神经网络其本质是相同的。从DNN按不同层的位置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是 说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:
$\vec{y} = \alpha(W\vec{x} + \vec{b})$
其中，$\vec{x}$是输入向量，$\vec{y}$是输出向量，$\vec{b}$是偏移向量，W是权重矩阵（也称系数），α()是激活函数。每一层仅仅是对输入向量$\vec{x}$经过上述操作得到输出向量$\vec{y}$。由于DNN层数多，则系数W和偏移向量$\vec{b}$
2.卷积神经网络
卷积神经网络是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,我们都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
在训练神经网络的过程中,可以通过AI芯片来实现,如:神经网络处理器(neural network processor unit,NPU),如图1所示,该神经网络处理器10作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。神经网络处理器的核心部分为运算电路103,通过控制器104控制运算电路103提取存储器中的矩阵数据并进行乘法运算。
在一些实现中,运算电路103内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路103是二维脉动阵列。运算电路103还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路103是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器102中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器101中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器108accumulator中。
统一存储器106用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器105Direct Memory Access Controller,DMAC被搬运到权重存储器102中。输入数据也通过DMAC被搬运到统一存储器106中。
总线接口单元(Bus Interface Unit,BIU)110用于AXI总线与DMAC和取指存储器109Instruction Fetch Buffer的交互。
总线接口单元110,用于取指存储器109从外部存储器获取指令,还用于存储单元访问控制器101从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器106或将权重数据搬运到权重存储器102中或将输入数据数据搬运到输入存储器101中。
向量计算单元107多个运算处理单元,在需要的情况下,对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/FC层网络计算,如池化(Pooling),批归一化(Batch Normalization),局部响应归一化(Local Response Normalization)等。
在一些实现种,向量计算单元能107将经处理的输出的向量存储到统一缓存器106。例如,向量计算单元107可以将非线性函数应用到运算电路103的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元107生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路103的激活输入,例如用于在神经网络中的后续层中的使用。
控制器104连接的取指存储器(instruction fetch buffer)109,用于存储控制器104使用的指令。
统一存储器106,输入存储器101,权重存储器102以及取指存储器109均为On-Chip存储器。外部存储器私有于该神经网络处理器的硬件架构。
以上介绍了神经网络、深度神经网络、卷积神经网络以及神经网络训练可能会使用的AI芯片,下面结合图2介绍神经网络训练的过程中会涉及到的前向传播和反向传播的过程。
如图2所示的神经网络包括四层(这里只是以图2列举神经网络,实际上神经网络可以包括很多个层),分别为层1、层2、层3和层4,该神经网络中每层的结构以及层与层之间的关系可以参阅图2中下方示例的内容进行理解,当然,图2中下方示例的内容只是一种举例,不限于是此种层间关系。
前向传播的过程指的是对输入到神经网络的训练样本进行逐层处理,如:先经过层1处理,再到层2、层3,直到层4输出损失函数的值,该损失函数的值也可以称为误差损失。
反向传播(back propagation,BP)指的是利用前向传播产生的损失函数的值,来更新神经网络中的参数,该过程可以是通过损失函数的值来确定每层算子的权重梯度,进而更新算子权重,从而使误差损失收敛。在图2中,反向传播为从层4到层3,再到层2,再到层1的过程。反向传播算法是以损失函数为主导的反向传播过程,旨在得到最优的神经网络的参数,例如权重矩阵。
本申请实施例提供的神经网络的调整方法可以应用于如图3所示的神经网络的训练的过程中,如图3所示,计算机设备可以使用训练样本对初始神经网络进行训练,通过多轮训练可以得到目标神经网络,训练得到的目标神经网络可以应用于相应的终端设备上进行业务应用。在对初始神经网络进行训练的过程中,可以通过尺度分层得到第一神经网络,然后对第一神经网络中的尺度层的缩放尺度进行尺度调整。
需要说明的是,本申请提供的方案可以是以软件包/插件的形式存储在网络(可以是云端)上,用户使用时通过下载,安装到计算机设备即可执行本申请的流程;本申请提供的 方案也可以以云服务等形式提供给用户,用户可以将要训练的神经网络上传到云端,云端采用本申请提供的方案训练神经网络;也可以是将本申请的方案配置在芯片中,用于模型训练的计算机设备安装该芯片即可执行本申请的流程;也可以是将本申请方案配置到计算机设备中,计算机设备训练神经网络时即可执行本申请的流程。应理解,除了上述部署本申请提供的技术方案的方式以外,对于使用本申请提供的技术方案的具体形式,本申请不做限定。
下面结合图4介绍本申请实施例提供的神经网络的调整方法。
如图4所示,本申请实施例提供的神经网络的调整方法的一实施例包括:
201.计算机设备获取采用混合精度运算的第一神经网络,第一神经网络包括多个尺度层,其中,每个尺度层具有一个缩放尺度。
每个尺度层的缩放尺度指的是用于训练第一神经网络时,对反向传播方向上与每个尺度层相关的梯度进行放大或缩小的尺度。
混合精度运算指的是将两种或两种以上不同精度的运算混合应用,如应用在对神经网络的训练过程中。混合精度运算包括第一精度运算和第二精度运算,第二精度运算的表达范围大于第一精度运算的表达范围。
第一精度运算可以为半精度浮点型FP16运算,第二精度运算可以为单精度浮点型FP32运算,当然,第一精度运算和第二精度运算也可以是其他类型的精度运算,例如,第一精度运算可以为单精度浮点型FP32运算,第二精度运算可以为双精度浮点型FP64运算,只要第二精度运算的精度高于第一精度运算即可,也就是第二精度运算的表达范围大于第一精度运算的表达范围。应理解,第一精度还可以是比FP16更低的精度,例如FP8、INT8等。
采用第一精度运算可以包括采用第一精度的类型进行计算和/或存储,如FP16运算可以理解为采用FP16计算和/或存储,采用第二精度运算可以包括采用第二精度的类型进行计算和/或存储,如FP32运算可以理解为采用FP32计算和/或存储。
关于FP16和FP32的可以参阅图5进行理解,如图5所示,FP16是半精度浮点数,用1bit表示符号,如图5中FP16对应的比特位15,用5bit表示指数,如图5中FP16对应的比特位10到14,10bit表示小数,如图5中FP16对应的比特位0到9;FP32是单精度浮点数,用1bit表示符号,如图5中FP32对应的比特位31,用8bit表示指数,如图5中FP32对应的比特位23到30,23bit表示小数,如图5中FP32对应的比特位0到22。FP16的数据范围是(6×10 -8到65504),FP32的数据范围是(1.4×10 -45到1.7×10 38),FP32和FP16表示的数据范围不一样,在大数据计算中,FP16存在溢出风险。
梯度的下溢率指的是向下超出某一精度运算的表达范围的梯度数量占总的梯度数量的比例。
第一神经网络指的是进行过尺度分层的神经网络,该第一神经网络可以是通过自动尺度分层或者手动尺度分层得到的。
尺度层可以理解为是通过尺度分层后所得到的一个层,每个尺度层都具有一个缩放尺度,每个尺度层的缩放尺度通常不相同,当然,对此本申请中不做限定,不同尺度层的缩放尺度也可以相同。缩放尺度指的是对反向传播方向上与每个尺度层相关的梯度进行放大 或缩小的尺度。
每个尺度层相关的梯度包括输入到尺度层的输入梯度,用于更新权重的权重梯度,以及要输出尺度层的待输出梯度等。
202.计算机设备对输入到第一神经网络的训练样本进行前向传播的处理,以得到损失函数的值。根据第一神经网络执行的具体任务,可以选取适用的损失函数。例如,第一神经网络采用的损失函数可以包括但不限于以下种类:0-1损失、铰链损失(hinge loss)、softmax损失、逻辑斯谛损失(Logistic-loss)、交叉熵(cross entropy)、softmax交叉熵(softmax cross entropy)、三元组损失(triplet loss)、均方误差(mean squared error,MSE)、平均绝对误差(mean absolute error,MAE)、平滑L1损失、L1损失、L2损失、中心损失(center loss)等。
203.计算机设备在反向传播方向上,按照目标尺度层的缩放尺度对目标尺度层中第一个算子的第一梯度进行缩放操作,以得到第一个算子的第二梯度。
目标尺度层为多个尺度层中的任意一个尺度层,第一个算子的第一梯度来源于损失函数的值,缩放操作为放大操作或缩小操作,第一个算子的第二梯度用于确定目标尺度层中每个算子的权重梯度。
目标尺度层中可能有一个或多个算子,如果有多个算子,那么多个算子会有逻辑顺序,一个算子的输出可能作为下一个算子的输入,第一个算子指的是多个算子中逻辑顺序上排在第一位的算子。
第一梯度可以理解为是输入梯度,第二梯度是按照目标尺度层的缩放尺度进行缩放操作后的梯度。通过第二梯度可以确定出该目标尺度层中每个算子的权重梯度。
第一个算子的第一梯度来源于损失函数的值可以包括直接来源于损失函数的值,以及间接来源于损失函数的值。如果目标尺度层是反向传播方向上的第一个尺度层,则可以通过对损失函数的值进行求导得到该目标尺度层中第一个算子的第一梯度,这种情况可以理解为目标尺度层中第一个算子的第一梯度直接来源于损失函数的值;如果目标尺度层不是反向传播方向上的第一个尺度层,则目标尺度层的第一个算子的第一梯度是通过上一个尺度层输出的梯度得到的,逐层类推,上一个尺度层输出的梯度会与反向传播方向上第一个尺度层中第一个算子的第一梯度有关联,这种情况可以理解为目标尺度层中第一个算子的第一梯度间接来源于损失函数的值。
204.计算机设备根据目标尺度层中每个算子的权重梯度在第一精度运算的表达范围内的表现,调整目标尺度层的缩放尺度。
本申请实施例中,在神经网络训练的过程中,可以按照尺度层的缩放尺度对第一个算子的第一梯度进行缩放操作,进而计算出尺度层中每个算子的权重梯度,然后观察每个算子的权重梯度在第一精度运算的表达范围内的表现,来调整相应尺度层的缩放尺度,这样通过很小的计算量,就可以有效降低第一精度运算的梯度的下溢率,这样可以使混合精度训练很好的应用于神经网络的训练,在保持了较高的训练精度的情况下,提高了训练效率,而且采用低精度运算的数据只需要占用芯片上较小的存储空间,也有利于芯片对低精度运算的数据的快速读写。
本申请实施例提供的神经网络的调整方案可以参阅图6进行理解,如图6所示,该方案包括将初始神经网络模型进行混合精度的转换,以得到采用混合精度运算的网络,然后再对采用混合精度运算的网络进行尺度分层,以得到第一神经网络。在第一神经网络的基础上执行前向传播,以得到损失函数的值;根据损失函数的值再执行反向传播,在反向传播的过程中执行尺度层内的层内试错,以及缩放尺度的调整,另外,在前向传播和反向传播的过程中都可以对尺度层进行修正。
也就是说,上述神经网络的调整方法可以包括如下几个部分:一、对初始神经网络进行尺度分层得到第一神经网络;二、通过层内试错调整尺度层的缩放尺度;三、通过层间修正来优化调整过程。
一、对初始神经网络进行尺度分层得到第一神经网络。
本申请实施例中,计算机设备在接收待训练的初始神经网络后,可以将初始神经网络中第一类型算子标记为采用第一精度运算,默认初始神经网络中的第二类型算子采用第二精度运算,以得到采用混合精度运算的网络;然后对采用混合精度运算的网络进行尺度分层,以得到第一神经网络。
本申请实施例中,初始神经网络可以为采用单一精度运算构建的神经网络,如ResNet50、MobileNet等神经网络。第一类型算子可以为卷积(Convolution,Conv)算子和/或全连接(Fully Connect,FC)算子,第二类型算子可以是初始神经网络中除第一类型算子之外的全部算子或部分算子,如:归一化算子。
其中,尺度分层的方案可以包括自动尺度分层和手动尺度分层,下面分别进行介绍。
1.自动尺度分层。
该自动尺度分层的过程可以参阅图7进行理解,如图7所示,
301.以预设的下溢率为目标,输入不同的训练样本确定每个网络层的多组尺度值,每个网络层的多组尺度值的平均值为每个网络层的初始尺度。
当然,步骤301只是一种计算网络层的初始尺度的方式,也可以通过其他方式确定网络层的初始尺度。
需要说明的是,本申请实施例中的训练样本指的是一个batch的样本,不是一个训练样本,如图7中的训练样本1、训练样本2,…,训练样本n都分别是不同batch的样本。
每次输入训练样本,以预设的下溢率为目标都会得到一组尺度值,该组尺度值中包括每个网络层的一个尺度值,如:输入训练样本1,就会得到组1,如果采用混合精度运算的网络中包括m个网络层,则该组1中包括网络层1的尺度值11、网络层2的尺度值21,…,网络层m的尺度值m1。这样,输入训练样本2,就会得到组2,则该组2中包括网络层1的尺度值12、网络层2的尺度值22,…,网络层m的尺度值m2。以此类推,输入训练样本n,就会得到组n,则该组n中包括网络层1的尺度值1n、网络层2的尺度值2n,…,网络层m的尺度值mn。每个网络层的初始尺度只需要将该网络层的n个尺度值相加再求平均即可,如:网络层1的初始尺度=(尺度值11+尺度值12,…,+尺度值1n)/n;网络层2的初始尺度=(尺度值21+尺度值22,…,+尺度值2n)/n;…;网络层m的初始尺度=(尺度值m1+尺度值m2,…,+尺度值mn)/n。
302.将初始尺度相同的网络层进行合并,以得到层合并网络。
本申请实施例中,可以直接将层合并网络作为第一神经网络,也可以在层合并网络的基础上执行步骤303。
层合并的过程可以参阅图8进行理解,如图8所示,采用混合精度运算的网络包括m个网络层,分别为网络层1、网络层2、网络层3,…网络层m;其中,网络层1的初始尺度=a,网络层2的初始尺度=a,网络层3的初始尺度=b,…,网络层m的初始尺度=f。因为网络层1和网络层2的初始尺度都是a,可以将网络层1和网络层2合并,该处示例的只有网络层1和网络层2的初始尺度相同,其他网络层的初始尺度不相同,那么只合并网络层1和网络层2,如果还有其他网络层的初始尺度相同,则都可以参阅网络层1和网络层2的合并方案进行合并,合并后的网络可以称为层合并网络,该层合并网络可以直接作为第一神经网络,得到图8中右侧的包含(m-1)个尺度层的第一神经网络。(m-1)个尺度层中尺度层1的缩放尺度=a,尺度层2的缩放尺度=b,…,尺度层(m-1)的缩放尺度=f。图8中右侧的第一神经网络中,虚线箭头表示反向传播,实线箭头表示前向传播。
303.将层合并网络中第一网络层的输出接口的第一缩放操作和第二网络层的输入接口的第二缩放操作进行合并,以得到第一神经网络,第一网络层和第二网络层相邻,且在反向传播方向上第一网络层是第二网络层的前一层。
本申请实施例中,从层合并网络到第一神经网络的过程可以参阅图9进行理解,如图9所示,基于图8左侧的合并过程可以得到图9左侧的层合并网络,层合并网络中网络层1-2的初始尺度=a,网络层3的初始尺度=b,如果网络层1-2的输入接口处执行的是a倍的放大操作,那么第一缩放操作为a倍的缩小操作,如果网络层3的输入接口处执行b倍的放大操作,那么第二缩放操作为b倍的放大操作,可以将网络层1-2的a倍的缩小操作与网络层3的b倍的放大操作进行缩放合并,将这两个操作进行合并,相当于执行一次b/a倍的放大操作,这样就可以得到图9中右侧的第一神经网络。该第一神经网络中,尺度层1的缩放尺度=a,尺度层2的缩放尺度还是b但在尺度层1的输出接口或尺度层2的输入接口只需要执行一次b/a的缩放操作,以此类推,尺度层(m-1)的缩放尺度还是f,但在尺度层(m-2)的输出接口或尺度层(m-1)的输入接口只需要执行一次f/e的缩放操作,e为网络层m的前一层的初始尺度。图9中右侧的第一神经网络中,虚线箭头表示反向传播,实线箭头表示前向传播。
由该示例可知,当使用缩放操作合并网络作为第一神经网络时,将目标尺度层的一次缩放操作和一次逆缩放操作减少为一次缩放操作,减少了计算步骤,进一步提高了神经网络训练的效率。
需要说明的是,缩放操作合并后的一次缩放操作可以在前一个尺度层的输出接口处执行,也可以在后一个尺度层的输入接口处执行。
2.手动尺度分层。
自动尺度分层的网络层的初始尺度是根据预设的下溢率确定采用混合精度运算的网络中每个网络层的初始尺度的,而手动尺度分层是通过接收对采用混合精度运算的网络中每个网络层的初始尺度进行配置的配置信息,根据配置信息确定采用混合精度运算的网络中 每个网络层的初始尺度。其他的网络层的层合并,以及缩放操作合并都与自动尺度分层部分的描述相同,可以参阅自动尺度分层部分的内容进行理解,此处不再重复赘述。
本申请实施例中,将单精度运算的神经网络转换为混合精度运算的神经网络,并给不同网络层分配不同的缩放尺度,可以极大减小梯度的下溢率,能稳定训练各个网络层梯度分布动态范围大的神经网络。
二、通过层内试错调整尺度层的缩放尺度。
在反向传播方向上,使用以上步骤302或步骤303得到的第一神经网络进行尺度调整的过程略有不同,下面分别进行介绍。
1.采用步骤302的层合并网络作为第一神经网络。
如图10所示,该第一神经网络中,在反向传播方向上的尺度层分别为尺度层1、尺度层2,…,尺度层(m-1)。
其中,尺度层1包括3个算子,分别为算子1、算子2和算子3,三个算子之间的逻辑关系如图10中所示出的,算子1的输出为算子2的输入,算子2的输出为算子3的输入,其中,算子1为该尺度层1的第一个算子。当目标尺度层是尺度层1时,可以通过对损失函数的值进行求导得到算子1的第一梯度,然后按照尺度层1的缩放尺度a对第一梯度执行缩放操作,本申请实施例中,以该缩放操作是放大操作为例,对第一梯度进行a倍的放大操作,以得到算子1的第二梯度。然后使用第二梯度计算出算子1的权重梯度S1和算子1的输出梯度O1,算子1的输出梯度01输出给算子2,通过该输出梯度O1计算出算子2的权重梯度S2,还会计算出算子2的输出梯度O2,算子2的输出梯度02输出给算子3,通过算子2的输出梯度O2计算出算子3的权重梯度S3,以及算子3的输出梯度O3。
当目标尺度层中每个算子的权重梯度中包括无穷大的数值(infinite,INF)或无效数字(Not a number,NAN),则减小目标尺度层的缩放尺度,例如在缩放尺度a的基础上缩小
$2^{1/1000}$，调整后的缩放尺度就为$a\times 2^{-1/1000}$。
当每个算子的权重梯度位于第一精度运算的表达范围内,则增大目标尺度层的缩放尺度,如在缩放尺度a的基础上放大2 (1/1000),调整后的缩放尺度就为b×2 (1/1000)
需要说明的是,此处列举的通过
Figure PCTCN2022138377-appb-000011
以及通过b×2 (1/1000)得到调整后的缩放尺度只是一种方式,还可以通过其他方式来减小缩放尺度或增大缩放尺度,如,在当前缩放尺度上减掉一个数值,或者,在当前缩放尺度上加上一个数值。当然,还可以通过其他方式来得到,只要能减小或增大缩放尺度都可以适用于本申请的缩放尺度调整。
本申请实施例中,当每个算子的权重梯度中包括无穷大的数值或无效数字时,无效数字可以是分母为0的分数等无效的数字,则表示该当前目标尺度层的缩放尺度过大,需要减小该目标尺度层的缩放尺度。当每个算子的权重梯度位于第一精度运算的表达范围内,则还可以通过增大目标尺度层的缩放尺度来尝试进一步扩大该目标尺度层的缩放尺度。这样有利于找到适合各尺度层的最优缩放尺度,从而提高神经网络的收敛速度。
当目标尺度层中每个算子的权重梯度位于第一精度运算的表达范围内时,按照目标尺度层的缩放尺度,对目标尺度层中每个算子的权重梯度执行缩放操作的逆缩放操作,以得到目标尺度层中逆缩放操作后的每个算子的权重梯度;根据目标尺度层中逆缩放操作后的 每个算子的权重梯度更新目标尺度层中每个算子的权重。
结合图10,尺度层1中,当算子1的权重梯度S1、算子2的权重梯度S2,以及算子3的权重梯度S3都位于第一精度运算的表达范围内时,则按照尺度层1的缩放尺度a对权重梯度S1、权重梯度S2以及权重梯度S3执行逆缩放操作,即执行缩小a倍的操作,分别得到对应的缩放操作后的权重梯度U1、权重梯度U2、以及权重梯度U3,进一步,可以根据权重梯度U1更新算子1的权重,根据权重梯度U2更新算子2的权重,权重梯度U3更新算子3的权重。
另外,本申请实施例中,还可以根据第一个算子的第二梯度确定目标尺度层的待输出梯度;按照目标尺度层的缩放尺度,对目标尺度层中待输出梯度执行缩放操作的逆缩放操作,以得到目标尺度层的输出梯度。结合图10,可以将该算子3的输出梯度O3作为待输出梯度,然后将输出梯度O3缩小a倍,得到要输出给尺度层2的输出梯度,传输给尺度层2。
本申请实施例的方案在尺度层2的执行过程与在尺度层1基本相同,不同的是,尺度层2的第一个算子的第一梯度是接收到的尺度层1输出的输出梯度,而不像算子1通过损失函数的值直接计算第一梯度。
本申请实施例中,在尺度层中对缩放尺度的调整过程还可以参阅图11进行理解。
如图11所示,该过程包括:
401.初始化第一神经网络中的异常状态位,记为0。
402.逐层遍历每个尺度层中的算子,确定每个算子的权重梯度。
如:遍历第i层内的算子,确定每个算子的权重梯度,初始i=1,i表示尺度分层所在层编号,如果从0开是对尺度层编号,则初始i=0。
确定每个算子的权重梯度的过程可以参阅前面图10部分的介绍进行理解。
403.是否遍历到算子的权重梯度有NAN或INF。
404.当遍历到算子的权重梯度有NAN或INF,则将异常状态位修改为1,并在该尺度层的缩放尺度的基础上缩小
Figure PCTCN2022138377-appb-000012
405.当遍历完该尺度层的所有算子的权重梯度,都没有出现NAN或INF,则在该尺度层的缩放尺度的基础上放大2 (1/1000)
406.执行i=i+1,遍历下一个尺度层,然后重复执行上述402到406的过程,直到第一神经网络的所有尺度层都遍历完毕。
407.遍历中所有层都未出现异常状态时,对算子的权重进行更新。
2.采用步骤303的缩放操作合并网络作为第一神经网络。
如图12所示,该第一神经网络中,在反向传播方向上的尺度层分别为尺度层1、尺度层2,…,尺度层(m-1),尺度层1的缩放尺度为a,尺度层2的缩放尺度为b,…,尺度层(m-1)的缩放尺度为f,但在尺度层1的输出接口或尺度层2的输入接口只需要执行一次b/a的缩放操作,以此类推,在尺度层(m-2)的输出接口或尺度层(m-1)的输入接口只需要执行一次f/e的缩放操作。本申请实施例中,以在尺度层2的输入接口执行一次b/a的缩放操作为例进行说明。
其中,针对本申请的方案,在尺度层1中的执行过程可以参阅前面图10部分的内容进行理解,不同的是,在针对待输出梯度不做缩小a被的处理,而是在尺度层2的输入接口处, 直接放大b/a倍,而不是像图10中需要在尺度层2对尺度层1输出的输出梯度还要再放大b倍。
在尺度层2中的执行过程如图12所示,尺度层2包括三个算子,分别为算子4、算子5和算子6,三个算子之间的逻辑关系如图12中所示出的,算子4的输出为算子5的输入,算子5的输出为算子6的输入,其中,算子4为该尺度层4的第一个算子。当目标尺度层是尺度层2时,尺度层1的输出梯度即为算子4的第一梯度,然后按照尺度层2的缩放尺度b/a对算子4的第一梯度执行缩放操作,本申请实施例中,以该缩放操作是放大操作为例,对第一梯度进行b/a倍的放大操作,以得到算子4的第二梯度。然后使用第二梯度计算出算子4的权重梯度S4和算子4的输出梯度O4,算子4的输出梯度04输出给算子5,通过该输出梯度O4计算出算子5的权重梯度S5,还会计算出算子5的输出梯度O5,算子5的输出梯度05输出给算子6,通过算子5的输出梯度O5计算出算子6的权重梯度S6,以及算子6的输出梯度O6。
当尺度层2中每个算子的权重梯度中包括INF或NAN则缩小目标尺度层的缩放尺度,如在缩放尺度b的基础上缩小
$2^{1/1000}$，调整后的缩放尺度就为$b\times 2^{-1/1000}$。
当每个算子的权重梯度位于第一精度运算的表达范围内,则放大目标尺度层的缩放尺度,如在缩放尺度b的基础上放大2 (1/1000),调整后的缩放尺度就为b×2 (1/1000)
当每个算子的权重梯度位于第一精度运算的表达范围内时,按照目标尺度层的缩放尺度,对每个算子的权重梯度执行缩放操作的逆缩放操作,以得到逆缩放操作后的每个算子的权重梯度;根据逆缩放操作后的每个算子的权重梯度更新每个算子的权重。
结合图12,当算子4的权重梯度S4、算子5的权重梯度S5,以及算子6的权重梯度S6都位于第一精度运算的表达范围内时,则按照尺度层2的缩放尺度b为对权重梯度S4、权重梯度S5以及权重梯度S6执行逆缩放操作,即执行缩小b倍的操作,分别得到对应的缩放操作后的权重梯度U4、权重梯度U5、以及权重梯度U6,进一步,可以根据权重梯度U4更新算子4的权重,根据权重梯度U5更新算子5的权重,权重梯度U6更新算子6的权重。
图12中,针对尺度层2的待输出梯度不做处理,直接输出给下一个尺度层,在下一个尺度层的处理过程可以参阅在尺度层2的处理过程进行理解,只是下一个尺度层的缩放尺度可能不相同。
三、通过层间修正来优化调整过程。
本申请实施例中,在反向传播方向上可以进行层间修正,在前向传播方向上也可以进行层间修正,下面分别进行介绍。
1.反向传播的层间修正。
当目标尺度层的输出梯度为无穷大的数值或无效数字时,将目标尺度层的输出梯度修正为有效值,并将修正后的目标尺度层的输出梯度传输给目标尺度层的相邻尺度层。
如图13所示,当目标尺度层为尺度层1时,尺度层1的输出梯度的矩阵为
Figure PCTCN2022138377-appb-000015
其中,出现了两个inf,则需要对该输出梯度进行修正,可以将矩阵中的inf修正为0,然后 输出修正后的输出梯度,如:
Figure PCTCN2022138377-appb-000016
本申请实施例中,如果目标尺度层的输出梯度为无穷大的数值或无效数字时,则表示该输出梯度不适用于每个尺度层中算子的权重更新,则直接跳过权重更新的步骤,但为了不影响后续的计算过程,可以将该输出梯度修正为在第一精度范围内的有效值,在传输给下一层进行计算,有利于提高神经网络的训练效率。
2.前向传播的层间修正。
在前向传播的过程中,若所述目标尺度层的特征值包括无穷大的数值或无效数字,则跳过对所述目标尺度层的更新。
本申请实施例中,目标尺度层的特征值是在前向传播过程中各尺度层产生的特征的值,在前向传播的过程中,也可以根据特征值在第一精度运算的表达范围内的表现来确定是否进行该尺度层中算子权重的更新,如果前向特征包括无穷大数值或无效数字则不需要更新算子权重,这样,有利于提高神经网络的训练效率。
另外,在上述任一实施例的基础上,本申请实施例中,还可以执行如下过程:当对第一神经网络的训练达到预设条件时,重新对第一神经网络进行尺度分层,以得到第二神经网络。
本申请实施例中,预设条件可以是训练次数达到一定阈值,如已经训练了300个周期,也可以是神经网络已经训练到一定的程度,如:各尺度层的缩放尺度的差距小于预设值,可以重新对第一神经网络进行尺度分层,尺度分层的方式可以参阅前面的描述进行理解,然后得到一个新的第二神经网络,再对第二神经网络进行训练。这种动态更新尺度层的方式可以提高神经网络的训练效率。
以上介绍了本申请实施例提供的神经网络的调整方法,下面结合附图介绍本申请实施例提供的神经网络的调整装置。
如图14所示,本申请实施例提供的神经网络的调整装置50的一实施例包括:
如图14所示,神经网络的调整装置50包括:
获取单元501,用于获取采用混合精度运算的第一神经网络,第一神经网络包括多个尺度层,其中,每个尺度层具有一个缩放尺度,每个尺度层的缩放尺度指的是用于训练第一神经网络时,对反向传播方向上与每个尺度层相关的梯度进行放大或缩小的尺度,混合精度运算包括第一精度运算。该获取单元501可以执行上述图4对应的方法实施例中的步骤201。
第一处理单元502,用于对输入到获取单元501获取的第一神经网络的训练样本进行前向传播的处理,以得到损失函数的值;该第一处理单元502可以执行上述图4对应的方法实施例中的步骤202。
第二处理单元503,用于在反向传播方向上,按照目标尺度层的缩放尺度对目标尺度层中第一个算子的第一梯度进行缩放操作,以得到第一个算子的第二梯度,目标尺度层为多个尺度层中的任意一个尺度层,第一个算子的第一梯度来源于第一处理单元502得到的损失函数的值,缩放操作为放大操作或缩小操作,第一个算子的第二梯度用于确定目标尺度层 中每个算子的权重梯度;该第二处理单元503可以执行上述图4对应的方法实施例中的步骤203。
第三处理单元504,用于根据第二处理单元503得到目标尺度层中的每个算子的权重梯度在第一精度运算的表达范围内的表现,调整目标尺度层的缩放尺度。该第三处理单元504可以执行上述图4对应的方法实施例中的步骤204。
本申请实施例中,在神经网络训练的过程中,可以按照尺度层的缩放尺度对第一个算子的第一梯度进行缩放操作,进而计算出尺度层中每个算子的权重梯度,然后观察每个算子的权重梯度在第一精度运算的表达范围内的表现,来调整相应尺度层的缩放尺度,这样通过很小的计算量,就可以有效降低第一精度运算的梯度的下溢率,这样可以使混合精度训练很好的应用于神经网络的训练,在保持了较高的训练精度的情况下,提高了训练效率,而且采用低精度运算的数据只需要占用芯片上较小的存储空间,也有利于芯片对低精度运算的数据的快速读写。
可选地,混合精度运算还包括第二精度运算,第二精度运算的表达范围大于第一精度运算的表达范围,获取单元501用于:接收待训练的初始神经网络;将初始神经网络中第一类型算子标记为采用第一精度运算,以得到采用混合精度运算的网络,采用混合精度运算的网络中第二类型算子采用第二精度运算;对采用混合精度运算的网络进行尺度分层,以得到第一神经网络。
可选地,获取单元501用于:获取采用混合精度运算的网络中每个网络层的初始尺度;将初始尺度相同的网络层进行合并,以得到第一神经网络。
可选地,获取单元501用于:获取采用混合精度运算的网络中每个网络层的初始尺度;将初始尺度相同的网络层进行合并,以得到层合并网络;将层合并网络中第一网络层的输出接口的第一缩放操作和第二网络层的输入接口的第二缩放操作进行合并,以得到第一神经网络,第一网络层和第二网络层相邻,且在反向传播方向上第一网络层是第二网络层的前一层。
可选地,获取单元501,用于根据预设的下溢率确定采用混合精度运算的网络中每个网络层的初始尺度,或者,接收对采用混合精度运算的网络中每个网络层的初始尺度进行配置的配置信息,根据配置信息确定所述采用混合精度运算的网络中每个网络层的初始尺度。
可选地,第三处理单元504用于:当目标尺度层中每个算子的权重梯度中包括无穷大的数值或无效数字,则减小目标尺度层的缩放尺度;当目标尺度层中每个算子的权重梯度位于第一精度运算的表达范围内,则增大目标尺度层的缩放尺度。
可选地,第三处理单元504,还用于当每个算子的权重梯度位于第一精度运算的表达范围内时,按照目标尺度层的缩放尺度,对目标尺度层中每个算子的权重梯度执行缩放操作的逆缩放操作,以得到目标尺度层中逆缩放操作后的每个算子的权重梯度;根据目标尺度层中逆缩放操作后的每个算子的权重梯度更新目标尺度层中每个算子的权重。
可选地,第三处理单元504,还用于根据第一个算子的第二梯度确定目标尺度层的待输出梯度;按照目标尺度层的缩放尺度,对目标尺度层中待输出梯度执行缩放操作的逆缩放操作,以得到目标尺度层的输出梯度。
可选地,第三处理单元504,还用于当目标尺度层的输出梯度为无穷大的数值或无效数字时,将目标尺度层的输出梯度修正为在第一精度运算的表达范围内的有效值;将修正后的目标尺度层的输出梯度传输给目标尺度层的相邻尺度层。
可选地,第三处理单元504,还用于在前向传播的过程中,若目标尺度层的特征值包括无穷大的数值或无效数字,则跳过对目标尺度层的更新。
可选地,第三处理单元504,还用于当对第一神经网络的训练达到预设条件时,重新对第一神经网络进行尺度分层,以得到第二神经网络。
需要说明的是,获取单元501、第一处理单元502、第二处理单元503,以及第三处理单元504可以通过一个单元或模块,或者通过多个单元或模块来实现。对此,本申请实施例中不做限定,只要能执行上述方法流程即可。
以上,本申请实施例提供的神经网络的调整装置可以参阅前面的神经网络的调整方法部分的相应内容进行理解,此处不再重复赘述。
图15所示,为本申请的实施例提供的计算机设备60的一种可能的逻辑结构示意图。该计算机设备60可以是神经网络的调整装置。该计算机设备60包括:处理器601、通信接口602、存储器603以及总线604。处理器601、通信接口602以及存储器603通过总线604相互连接。在本申请的实施例中,处理器601用于对计算机设备60的动作进行控制管理,例如,处理器601用于执行图2至图13的方法实施例中神经网络的调整过程,通信接口602用于支持计算机设备60进行通信。存储器603,用于存储计算机设备60的程序代码和数据。
其中,处理器601可以是中央处理器单元,通用处理器,数字信号处理器,专用集成电路,现场可编程门阵列或者其他可编程逻辑器件、晶体管逻辑器件、硬件部件或者其任意组合。其可以实现或执行结合本申请公开内容所描述的各种示例性的逻辑方框,模块和电路。处理器601也可以是实现计算功能的组合,例如包含一个或多个微处理器组合,数字信号处理器和微处理器的组合等等。总线604可以是外设部件互连标准(Peripheral Component Interconnect,PCI)总线或扩展工业标准结构(Extended Industry Standard Architecture,EISA)总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示,图15中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
在本申请的另一实施例中,还提供一种计算机可读存储介质,计算机可读存储介质中存储有计算机执行指令,当设备的处理器执行该计算机执行指令时,设备执行上述图3至图8中模型训练的方法,或者执行上述图2-13中神经网络的调整方法。
在本申请的另一实施例中,还提供一种计算机程序产品,该计算机程序产品包括计算机执行指令,该计算机执行指令存储在计算机可读存储介质中;当设备的处理器执行该计算机执行指令时,设备执行上述图2-13中神经网络的调整方法。
在本申请的另一实施例中,还提供一种芯片系统,该芯片系统包括处理器,该处理器用于实现上述图2-13中神经网络的调整方法。在一种可能的设计中,芯片系统还可以包括存储器,存储器,用于保存进程间通信的装置必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包含芯片和其他分立器件。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及 算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请实施例的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请实施例所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请实施例各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请实施例各个实施例方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。

Claims (27)

  1. 一种神经网络的调整方法,其特征在于,包括:
    获取采用混合精度运算的第一神经网络,所述第一神经网络包括多个尺度层,其中,每个尺度层具有一个缩放尺度,所述每个尺度层的缩放尺度指的是用于训练所述第一神经网络时,对反向传播方向上与所述每个尺度层相关的梯度进行放大或缩小的尺度,所述混合精度运算包括第一精度运算;
    对输入到所述第一神经网络的训练样本进行前向传播的处理,以得到损失函数的值;
    在所述反向传播方向上,按照目标尺度层的缩放尺度对所述目标尺度层中第一个算子的第一梯度进行缩放操作,以得到所述第一个算子的第二梯度,所述目标尺度层为所述多个尺度层中的任意一个尺度层,所述第一个算子的第一梯度来源于所述损失函数的值,所述缩放操作为放大操作或缩小操作,所述第一个算子的第二梯度用于确定所述目标尺度层中每个算子的权重梯度;
    根据所述目标尺度层中每个算子的权重梯度在所述第一精度运算的表达范围内的表现,调整所述目标尺度层的缩放尺度。
  2. 根据权利要求1所述的调整方法,其特征在于,所述混合精度运算还包括第二精度运算,其中所述第二精度运算的表达范围大于所述第一精度运算的表达范围;
    则所述获取采用混合精度运算的第一神经网络,包括:
    接收待训练的初始神经网络;
    将所述初始神经网络中第一类型算子标记为采用所述第一精度运算,以得到采用混合精度运算的网络,所述采用混合精度运算的网络中第二类型算子采用所述第二精度运算;
    对所述采用混合精度运算的网络进行尺度分层,以得到所述第一神经网络。
  3. 根据权利要求2所述的调整方法,其特征在于,所述对所述采用混合精度运算的网络进行尺度分层,以得到所述第一神经网络,包括:
    获取采用混合精度运算的网络中每个网络层的初始尺度;
    将初始尺度相同的网络层进行合并,以得到所述第一神经网络。
  4. 根据权利要求2所述的调整方法,其特征在于,所述对所述采用混合精度运算的网络进行尺度分层,以得到所述第一神经网络,包括:
    获取采用混合精度运算的网络中每个网络层的初始尺度;
    将初始尺度相同的网络层进行合并,以得到层合并网络;
    将所述层合并网络中第一网络层的输出接口的第一缩放操作和第二网络层的输入接口的第二缩放操作进行合并,以得到所述第一神经网络,所述第一网络层和所述第二网络层相邻,且在反向传播方向上所述第一网络层是所述第二网络层的前一层。
  5. 根据权利要求3或4所述的调整方法,其特征在于,所述获取采用混合精度运算的网络中每个网络层的初始尺度,包括:
    根据预设的下溢率确定所述采用混合精度运算的网络中每个网络层的初始尺度;或者,
    接收对所述采用混合精度运算的网络中每个网络层的初始尺度进行配置的配置信息,根据所述配置信息确定所述采用混合精度运算的网络中每个网络层的初始尺度。
  6. 根据权利要求2-5任一项所述的调整方法,其特征在于,所述第一类型算子包括卷积算子和/或全连接算子。
  7. 根据权利要求1-6任一项所述的调整方法,其特征在于,所述根据所述目标尺度层中每个算子的权重梯度在所述第一精度运算的表达范围内的表现,调整所述目标尺度层的缩放尺度,包括:
    当所述目标尺度层中每个算子的权重梯度中包括无穷大的数值或无效数字,则减小所述目标尺度层的缩放尺度;
    当所述目标尺度层中每个算子的权重梯度位于所述第一精度运算的表达范围内,则增大所述目标尺度层的缩放尺度。
  8. 根据权利要求1-7任一项所述的调整方法,其特征在于,所述方法还包括:
    按照所述目标尺度层的缩放尺度,对所述目标尺度层中每个算子的权重梯度执行所述缩放操作的逆缩放操作,以得到所述目标尺度层中逆缩放操作后的每个算子的权重梯度;
    根据所述目标尺度层中逆缩放操作后的每个算子的权重梯度更新所述目标尺度层中每个算子的权重。
  9. 根据权利要求3所述的调整方法,其特征在于,所述方法还包括:
    根据所述第一个算子的第二梯度确定所述目标尺度层的待输出梯度;
    按照所述目标尺度层的缩放尺度,对所述目标尺度层的待输出梯度执行所述缩放操作的逆缩放操作,以得到所述目标尺度层的输出梯度。
  10. 根据权利要求9所述的调整方法,其特征在于,所述方法还包括:
    当所述目标尺度层的输出梯度为无穷大的数值或无效数字时,将所述目标尺度层的输出梯度修正为在所述第一精度运算的表达范围内的有效值;
    将修正后的所述目标尺度层的输出梯度传输给所述目标尺度层的相邻尺度层。
  11. 根据权利要求1-10任一项所述的调整方法,其特征在于,所述方法还包括:
    在所述前向传播的过程中,若所述目标尺度层的特征值包括无穷大的数值或无效数字,则跳过对所述目标尺度层的更新。
  12. 根据权利要求1-11任一项所述的调整方法,其特征在于,所述方法还包括:
    当对所述第一神经网络的训练达到预设条件时,重新对所述第一神经网络进行尺度分层,以得到第二神经网络。
  13. 一种神经网络的调整装置,其特征在于,包括:
    获取单元,用于获取采用混合精度运算的第一神经网络,所述第一神经网络包括多个尺度层,其中,每个尺度层具有一个缩放尺度,所述每个尺度层的缩放尺度指的是用于训练所述第一神经网络时,对反向传播方向上与所述每个尺度层相关的梯度进行放大或缩小的尺度,所述混合精度运算包括第一精度运算;
    第一处理单元,用于对输入到所述获取单元获取的第一神经网络的训练样本进行前向传播的处理,以得到损失函数的值;
    第二处理单元,用于在所述反向传播方向上,按照目标尺度层的缩放尺度对所述目标尺度层中第一个算子的第一梯度进行缩放操作,以得到所述第一个算子的第二梯度,所述 目标尺度层为所述多个尺度层中的任意一个尺度层,所述第一个算子的第一梯度来源于所述损失函数的值,所述缩放操作为放大操作或缩小操作,所述第一个算子的第二梯度用于确定所述目标尺度层中每个算子的权重梯度;
    第三处理单元,用于根据所述第二处理单元得到目标尺度层中的每个算子的权重梯度在所述第一精度运算的表达范围内的表现,调整所述目标尺度层的缩放尺度。
  14. 根据权利要求13所述的调整装置,其特征在于,所述混合精度运算还包括第二精度运算,其中所述第二精度运算的表达范围大于所述第一精度运算的表达范围;
    则所述获取单元用于:
    接收待训练的初始神经网络;
    将所述初始神经网络中第一类型算子标记为采用所述第一精度运算,以得到采用混合精度运算的网络,所述采用混合精度运算的网络中第二类型算子采用所述第二精度运算;
    对所述采用混合精度运算的网络进行尺度分层,以得到所述第一神经网络。
  15. 根据权利要求14所述的调整装置,其特征在于,
    所述获取单元用于:
    获取采用混合精度运算的网络中每个网络层的初始尺度;
    将初始尺度相同的网络层进行合并,以得到所述第一神经网络。
  16. 根据权利要求14所述的调整装置,其特征在于,
    所述获取单元用于:
    获取采用混合精度运算的网络中每个网络层的初始尺度;
    将初始尺度相同的网络层进行合并,以得到层合并网络;
    将所述层合并网络中第一网络层的输出接口的第一缩放操作和第二网络层的输入接口的第二缩放操作进行合并,以得到所述第一神经网络,所述第一网络层和所述第二网络层相邻,且在反向传播方向上所述第一网络层是所述第二网络层的前一层。
  17. 根据权利要求15或16所述的调整装置,其特征在于,
    所述获取单元,用于根据预设的下溢率确定所述采用混合精度运算的网络中每个网络层的初始尺度;或者,
    接收对所述采用混合精度运算的网络中每个网络层的初始尺度进行配置的配置信息,根据所述配置信息确定所述采用混合精度运算的网络中每个网络层的初始尺度。
  18. The adjustment apparatus according to any one of claims 13-17, wherein
    the third processing unit is configured to:
    when the weight gradient of each operator in the target scale layer includes an infinite value or an invalid number, reduce the scaling scale of the target scale layer; and
    when the weight gradient of each operator in the target scale layer lies within the expression range of the first-precision operation, increase the scaling scale of the target scale layer.
  19. The adjustment apparatus according to any one of claims 13-18, wherein
    the third processing unit is further configured to perform, according to the scaling scale of the target scale layer, an inverse scaling operation of the scaling operation on the weight gradient of each operator in the target scale layer, to obtain the weight gradient of each operator in the target scale layer after the inverse scaling operation, and to update the weight of each operator in the target scale layer according to the weight gradient of each operator in the target scale layer after the inverse scaling operation.
  20. The adjustment apparatus according to claim 15, wherein
    the third processing unit is further configured to determine a to-be-output gradient of the target scale layer according to the second gradient of the first operator, and to perform, according to the scaling scale of the target scale layer, the inverse scaling operation of the scaling operation on the to-be-output gradient of the target scale layer, to obtain an output gradient of the target scale layer.
  21. The adjustment apparatus according to claim 20, wherein
    the third processing unit is further configured to: when the output gradient of the target scale layer is an infinite value or an invalid number, correct the output gradient of the target scale layer to a valid value within the expression range of the first-precision operation, and transmit the corrected output gradient of the target scale layer to a scale layer adjacent to the target scale layer.
  22. The adjustment apparatus according to any one of claims 13-21, wherein
    the third processing unit is further configured to: during the forward propagation, if a feature value of the target scale layer includes an infinite value or an invalid number, skip the update of the target scale layer.
  23. The adjustment apparatus according to any one of claims 13-22, wherein
    the third processing unit is further configured to: when the training of the first neural network reaches a preset condition, perform scale layering on the first neural network again, to obtain a second neural network.
  24. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by one or more processors, implements the method according to any one of claims 1-12.
  25. A computing device, comprising one or more processors and a storage medium storing a computer program;
    the computer program, when executed by the one or more processors, implements the method according to any one of claims 1-12.
  26. A chip system, comprising one or more processors, the one or more processors being invoked to perform the method according to any one of claims 1-12.
  27. A computer program product, comprising a computer program which, when executed by one or more processors, implements the method according to any one of claims 1-12.
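
The claims above are stated in patent language; as a reading aid, the following minimal Python sketch illustrates the per-scale-layer behaviour described in claims 1, 7 and 8: the first gradient entering a scale layer is multiplied by that layer's scaling scale, the weight gradients derived from it are checked against the FP16 expression range, the scale is decreased on inf/NaN and increased otherwise, and the inverse scaling is applied before the weights are updated. The class name, the growth/backoff factors and the NumPy-based FP16 simulation are illustrative assumptions, not details taken from the application.

```python
import numpy as np

class ScaleLayer:
    """Minimal stand-in for one scale layer: a group of operators sharing a
    single loss scale. Names and factors are illustrative assumptions."""

    def __init__(self, init_scale=2.0 ** 10, growth=2.0, backoff=0.5):
        self.scale = init_scale   # current scaling scale of this layer
        self.growth = growth      # factor used to increase the scale
        self.backoff = backoff    # factor used to decrease the scale

    def backward_step(self, incoming_grad, compute_weight_grads):
        applied = self.scale

        # Scale the first operator's gradient of this layer (the "first
        # gradient" becomes the "second gradient"), then derive every
        # operator's weight gradient from it in FP16.
        scaled = (np.asarray(incoming_grad, dtype=np.float32) * applied).astype(np.float16)
        weight_grads = compute_weight_grads(scaled)

        # Adjust this layer's scale from how its weight gradients behave in
        # the FP16 range: shrink on inf/NaN, grow when all values are finite.
        if any(not np.all(np.isfinite(g)) for g in weight_grads):
            self.scale = applied * self.backoff
            return None            # overflow: skip this layer's weight update

        self.scale = applied * self.growth

        # Inverse scaling: undo the amplification before the optimizer uses
        # the gradients, so the update sees true-magnitude values.
        return [np.asarray(g, dtype=np.float32) / applied for g in weight_grads]
```

In practical implementations the scale is usually grown only after several consecutive overflow-free steps; growing it on every finite step, as claim 7 literally states, is kept here for fidelity to the claim.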
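Claims 3-5 describe how the mixed-precision network is partitioned into scale layers. A hedged sketch of one possible realisation follows: an initial scale is chosen per network layer from a preset underflow rate, consecutive layers with equal initial scales are merged into one scale layer, and the output-side and input-side scaling operations at each remaining boundary are folded into a single rescale factor, in the spirit of claim 4. The quantile heuristic, the FP16 constant and both function names are assumptions made for illustration.

```python
import numpy as np
from itertools import groupby

FP16_MIN = 2.0 ** -24   # smallest positive (subnormal) FP16 magnitude

def layer_scales_from_underflow_rate(layer_grad_stats, target_underflow=0.01):
    # For each network layer, pick an initial scale that lifts roughly the
    # target_underflow quantile of its sampled gradient magnitudes just above
    # the FP16 lower limit (illustrative heuristic, not the claimed rule).
    scales = {}
    for name, mags in layer_grad_stats.items():
        q = float(np.quantile(np.abs(mags), target_underflow))
        scales[name] = max(1.0, FP16_MIN / max(q, 1e-45))
    return scales

def merge_scale_layers(layer_order, scales):
    # Claim 3: consecutive network layers with the same initial scale become
    # one scale layer. Claim 4: at each remaining boundary, the previous
    # layer's output-side scaling and the next layer's input-side scaling are
    # folded into a single factor next_scale / prev_scale.
    merged = [{"layers": list(group), "scale": s}
              for s, group in groupby(layer_order, key=lambda n: scales[n])]
    boundary_factors = [nxt["scale"] / prv["scale"]
                        for prv, nxt in zip(merged, merged[1:])]
    return merged, boundary_factors
```

Rounding each initial scale to a power of two would make merges far more likely in practice; it is omitted here only to keep the sketch short.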
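Claims 9 and 10 concern the gradient handed from one scale layer to the next: it is inverse-scaled by the current layer's scale, and any infinite or invalid entries are corrected to a valid value before the hand-off. The short sketch below assumes the FP16 maximum (65504) as the "valid value" for infinities and zero for NaN entries; both choices, like the function name, are illustrative, since the claims leave the correction value open.

```python
import numpy as np

FP16_MAX = 65504.0   # largest finite FP16 value, used here as the "valid value"

def hand_off_between_scale_layers(to_output_grad, layer_scale):
    # Claim 9: apply the inverse scaling operation to the gradient that
    # leaves the scale layer, using the scale that layer actually applied.
    out = np.asarray(to_output_grad, dtype=np.float32) / layer_scale

    # Claim 10: if the output gradient contains inf or NaN, correct it to a
    # valid value inside the FP16 range before passing it to the adjacent
    # scale layer (0 for NaN and +/-FP16_MAX for +/-inf are assumed choices).
    out = np.nan_to_num(out, nan=0.0, posinf=FP16_MAX, neginf=-FP16_MAX)
    return out.astype(np.float16)
```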
PCT/CN2022/138377 2021-12-15 2022-12-12 Neural network adjustment method and corresponding apparatus WO2023109748A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111535584.9 2021-12-15
CN202111535584.9A CN116266274A (zh) 2021-12-15 2021-12-15 Neural network adjustment method and corresponding apparatus

Publications (1)

Publication Number Publication Date
WO2023109748A1 (zh)

Family

ID=86742944

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/138377 WO2023109748A1 (zh) 2021-12-15 2022-12-12 Neural network adjustment method and corresponding apparatus

Country Status (2)

Country Link
CN (1) CN116266274A (zh)
WO (1) WO2023109748A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116703729B * 2023-08-09 2023-12-19 Honor Device Co., Ltd. Image processing method, terminal, storage medium, and program product

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180322391A1 (en) * 2017-05-05 2018-11-08 Nvidia Corporation Loss-scaling for deep neural network training with reduced precision
CN113435520A (zh) * 2021-06-30 2021-09-24 Shenzhen SenseTime Technology Co., Ltd. Neural network training method and apparatus, device, and computer-readable storage medium
CN113762502A (zh) * 2021-04-22 2021-12-07 Tencent Technology (Shenzhen) Co., Ltd. Neural network model training method and apparatus

Also Published As

Publication number Publication date
CN116266274A (zh) 2023-06-20

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22906493

Country of ref document: EP

Kind code of ref document: A1