CN117910537A - Neural network training method and device

Neural network training method and device

Info

Publication number
CN117910537A
Authority
CN
China
Prior art keywords
matrix
data
layer
data format
conversion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211281740.8A
Other languages
Chinese (zh)
Inventor
张忠星
陈敏琪
罗元勇
伍玮翔
黄泽毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202211281740.8A priority Critical patent/CN117910537A/en
Priority to PCT/CN2023/104242 priority patent/WO2024082705A1/en
Publication of CN117910537A publication Critical patent/CN117910537A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The embodiment of the application discloses a neural network training method and device, relates to the field of neural networks, and aims to reduce training power consumption, shorten training time and improve the performance of the neural network. The specific scheme is as follows: a data matrix and a weight matrix in a first data format are converted into a data conversion matrix and a weight conversion matrix in a second data format, where the total bit width of the second data format is smaller than that of the first data format. At least one matrix multiplication calculation layer is calculated forward layer by layer according to the data conversion matrix and the weight conversion matrix to obtain error data. The at least one matrix multiplication calculation layer is back-propagated layer by layer according to the error data to determine an error gradient matrix corresponding to each matrix multiplication calculation layer. Parameters corresponding to the current calculation layer are updated according to the error gradient matrix corresponding to the previous matrix multiplication calculation layer in the at least one matrix multiplication calculation layer. The data corresponding to the at least one matrix multiplication calculation layer during forward calculation and back propagation are data in the second data format.

Description

Neural network training method and device
Technical Field
The embodiment of the application relates to the field of neural networks, in particular to a neural network training method and device.
Background
With the continuous development of artificial intelligence (AI), the size of neural networks (NNs) keeps increasing, and so does the demand for computing power when training them. Specifically, training an early neural network required about 100 Petaflops (1 Petaflops corresponds to 10^15 mathematical operations per second), whereas training a current Transformer model requires about 10 million Petaflops, an increase in the demand for computing power of roughly ten million times.
This surge in computing power during training increases the power consumption of neural network chips and lengthens training time, so reducing the power consumption and the training time of neural network training has become an urgent problem to be solved.
Disclosure of Invention
The embodiment of the application provides a neural network training method and device, which can reduce power consumption during training of a neural network and reduce training time.
In order to achieve the above purpose, the embodiment of the application adopts the following technical scheme:
In a first aspect of the embodiment of the present application, a neural network training method is provided, where the neural network includes at least one matrix multiplication calculation layer, and the method includes: converting a data matrix and a weight matrix in a first data format into a data conversion matrix and a weight conversion matrix in a second data format, where the total bit width of the second data format is smaller than that of the first data format; forward calculating the at least one matrix multiplication calculation layer by layer according to the data conversion matrix and the weight conversion matrix to obtain error data; back-propagating the at least one matrix multiplication calculation layer by layer according to the error data to determine an error gradient matrix corresponding to each matrix multiplication calculation layer in the at least one matrix multiplication calculation layer; and updating the parameters corresponding to the current calculation layer according to the error gradient matrix corresponding to the previous matrix multiplication calculation layer in the at least one matrix multiplication calculation layer. The data corresponding to the at least one matrix multiplication calculation layer during forward calculation and back propagation are data in the second data format.
According to the neural network training method provided by the embodiment of the application, the data matrix and the weight matrix in the first data format are converted into the data conversion matrix and the weight conversion matrix in the second data format, and the at least one matrix multiplication calculation layer is calculated forward layer by layer according to the data conversion matrix and the weight conversion matrix. Because the total bit width of the second data format is smaller than that of the first data format, training the matrix multiplication calculation layer with data in the second data format, compared with training it with data in the first data format, reduces the power consumption of training the neural network and shortens the training time.
With reference to the first aspect, in a possible implementation manner, the neural network further includes a non-matrix multiplication computation layer located after any one of the at least one matrix multiplication computation layer, and the at least one matrix multiplication computation layer is calculated forward layer by layer according to the data conversion matrix and the weight conversion matrix, including: and forward calculating at least one matrix multiplication calculation layer and at least one non-matrix multiplication calculation layer by layer according to the data conversion matrix and the weight conversion matrix. Counter-propagating at least one matrix multiplication computation layer by layer from the error data, comprising: and according to the error data, reversely propagating at least one matrix multiplication calculation layer and at least one non-matrix multiplication calculation layer by layer to determine an error gradient matrix corresponding to each non-matrix multiplication calculation layer in the at least one non-matrix multiplication calculation layer, and updating parameters corresponding to the current calculation layer by using the error gradient matrix corresponding to the previous non-matrix multiplication calculation layer in the at least one non-matrix multiplication calculation layer. The data corresponding to the non-matrix multiplication calculation layer in forward calculation and backward propagation are data in a first data format.
According to the neural network training method provided by the embodiment of the application, the non-matrix multiplication computation layer is trained with data in the first data format; compared with training it with data in the second data format, whose total bit width is smaller than that of the first data format, this improves the training precision of the neural network and the performance of the neural network.
With reference to the first aspect, in one possible implementation manner, the neural network includes a first matrix multiplication calculation layer and a first non-matrix multiplication calculation layer, and the method includes: forward calculating the first matrix multiplication calculation layer according to the data conversion matrix and the weight conversion matrix to obtain a first output conversion matrix in a third data format, where the total bit width of the third data format is greater than or equal to that of the first data format; forward calculating the first non-matrix multiplication calculation layer according to the first output conversion matrix to obtain a second output matrix in the first data format, where the second output matrix serves as the input matrix of the calculation layer following the first non-matrix multiplication calculation layer during forward calculation; acquiring a first output error gradient matrix in the third data format, where the first output error gradient matrix is determined by back-propagating the error data layer by layer, is used to update the parameters corresponding to the first non-matrix multiplication calculation layer, and, during back propagation, is the matrix output by the previous calculation layer of the first non-matrix multiplication calculation layer; back-propagating the first non-matrix multiplication calculation layer according to the first output error gradient matrix to obtain a second output error gradient matrix in the first data format, where the second output error gradient matrix is used to update the weight matrix; and converting the second output error gradient matrix into a second output error gradient conversion matrix in the second data format, and back-propagating the first matrix multiplication calculation layer according to the second output error gradient conversion matrix and the transposed matrix of the weight conversion matrix to obtain a third output error gradient conversion matrix in the third data format, where the third output error gradient conversion matrix is used to update the parameters corresponding to the previous calculation layer of the first matrix multiplication calculation layer in the forward calculation of the neural network.
According to the neural network training method provided by the embodiment of the application, the matrix multiplication calculation layer is trained with data in the second data format, which reduces the power consumption of training the neural network and shortens the training time. Meanwhile, the non-matrix multiplication calculation layer is trained with data in the first data format or the third data format, which, compared with training it with data in the second data format, improves the training precision of the neural network and the performance of the neural network.
With reference to the first aspect, in one possible implementation manner, the data matrix, the weight matrix, or the second output error gradient matrix is target data, the data conversion matrix, the weight conversion matrix, or the second output error gradient conversion matrix is target conversion data, and converting the target data in the first data format into the target conversion data in the second data format includes: converting the target data into the target conversion data by means of away-from-zero carry rounding, where away-from-zero carry rounding means carrying when the most significant bit of the discarded part is 1; and/or converting the target data into the target conversion data by means of stochastic rounding, where stochastic rounding means normalizing the discarded part to determine a first value and carrying when the first value is greater than or equal to a second value, the second value being a random number greater than 0 and less than 1.
According to the neural network training method provided by the embodiment of the application, the target data in the first data format are converted into the target conversion data in the second data format by away-from-zero carry rounding and/or stochastic rounding, which reduces the conversion error and its influence on the training precision of the neural network, and thus further improves the training precision of the neural network.
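For illustration only, the two rounding modes above can be sketched in Python as follows; the value is assumed to have already been split into a kept magnitude (in units of the last place of the target format) and a discarded fraction, and all function and parameter names are illustrative rather than part of the claimed method:

```python
import random

def round_away_from_zero(kept_ulps: int, discarded_fraction: float, sign: int) -> int:
    """Away-from-zero carry rounding: carry when the most significant bit of the
    discarded part is 1, i.e. when the discarded fraction is at least 0.5."""
    if discarded_fraction >= 0.5:
        kept_ulps += 1          # carry: move the magnitude away from zero
    return sign * kept_ulps

def stochastic_round(kept_ulps: int, discarded_fraction: float, sign: int) -> int:
    """Stochastic rounding: carry when the normalized discarded part (first value)
    is greater than or equal to a random threshold in (0, 1) (second value)."""
    if discarded_fraction >= random.random():
        kept_ulps += 1
    return sign * kept_ulps

# Example: magnitude 5.3 ULPs, positive sign:
# round_away_from_zero(5, 0.3, +1) -> 5 ; stochastic_round(5, 0.3, +1) -> 5 or 6
```

On average the stochastic variant carries with probability equal to the discarded fraction, which keeps the rounding error unbiased over many conversions.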
With reference to the first aspect, in one possible implementation manner, the first data format or the third data format includes sign bits, exponent bits, and mantissa bits. The second data format includes a symbol field, a bit field, a step field, and a mantissa field, the bit field being used to indicate a bit width occupied by the step field in a total bit width of the second data format. The embodiment of the application is not limited to the bit number specifically included in the total bit width of the second data format, and is not limited to the bit number specifically included in the bit position domain.
According to the neural network training method provided by the embodiment of the application, the bit field defined in the encoding process of the second data format indicates the bit width occupied by the step code field in the total bit width of the second data format, so the numerical range and numerical precision of the converted floating point number can be improved as much as possible without additionally increasing the total bit width, and the training precision of the neural network can therefore be improved.
With reference to the first aspect, in one possible implementation manner, after obtaining the second output error gradient conversion matrix of the second data format, the method further includes: and inputting the second output error gradient conversion matrix and the data conversion matrix into the first matrix multiplication calculation layer to obtain weight gradient conversion data in a third data format, and updating the weight matrix according to the weight gradient conversion data.
According to the neural network training method provided by the embodiment of the application, the matrix multiplication calculation layer calculation is performed by adopting the second output error gradient conversion matrix and the data conversion matrix of the second data format, so that the power consumption during training of the neural network can be reduced, and the training time is shortened. Meanwhile, the weight matrix is updated by adopting the weight gradient conversion data in the third data format, so that the training precision of the neural network can be ensured, and the performance of the neural network can be improved.
With reference to the first aspect, in one possible implementation manner, the matrix multiplication calculation includes at least one of convolution, deconvolution, matrix multiplication, and batch matrix multiplication, and the non-matrix multiplication calculation includes at least one of an activation function or a normalization function.
In a second aspect of the embodiment of the present application, there is provided a neural network training device, including: the conversion module is used for converting the data matrix and the weight matrix of the first data format into the data conversion matrix and the weight conversion matrix of the second data format, and the total bit width of the second data format is smaller than that of the first data format. The training module is used for forward calculating at least one matrix multiplication calculation layer by layer according to the data conversion matrix and the weight conversion matrix to obtain error data. The training module is further used for back-propagating the at least one matrix multiplication calculation layer by layer according to the error data to determine an error gradient matrix corresponding to each matrix multiplication calculation layer in the at least one matrix multiplication calculation layer. The training module is further configured to update parameters corresponding to the current calculation layer according to an error gradient matrix corresponding to a previous matrix multiplication calculation layer in the at least one matrix multiplication calculation layer. Wherein, at least one matrix multiplication calculation layer is corresponding to the data in the second data format during forward calculation and backward propagation.
With reference to the second aspect, in one possible implementation manner, the neural network further includes a non-matrix multiplication computation layer located after any one of the at least one matrix multiplication computation layer. The training module is specifically used for forward calculating at least one matrix multiplication calculation layer and at least one non-matrix multiplication calculation layer by layer according to the data conversion matrix and the weight conversion matrix. The training module is further configured to counter-propagate, layer by layer, at least one matrix multiplication computation layer and at least one non-matrix multiplication computation layer according to the error data, so as to determine an error gradient matrix corresponding to each non-matrix multiplication computation layer in the at least one non-matrix multiplication computation layer, and update a parameter corresponding to a current computation layer by using an error gradient matrix corresponding to a previous non-matrix multiplication computation layer in the at least one non-matrix multiplication computation layer. The data corresponding to the non-matrix multiplication calculation layer in forward calculation and backward propagation are data in a first data format.
With reference to the second aspect, in one possible implementation manner, the neural network includes a first matrix multiplication computation layer and a first non-matrix multiplication computation layer. The training module is further configured to forward calculate the first matrix multiplication computation layer according to the data conversion matrix and the weight conversion matrix to obtain a first output conversion matrix in a third data format, where the total bit width of the third data format is greater than or equal to that of the first data format. The training module is further configured to forward calculate the first non-matrix multiplication computation layer according to the first output conversion matrix to obtain a second output matrix in the first data format; the second output matrix serves as the input matrix of the computation layer following the first non-matrix multiplication computation layer during forward calculation. The training module is further configured to acquire a first output error gradient matrix in the third data format, where the first output error gradient matrix is used to update the parameters corresponding to the first non-matrix multiplication computation layer; during back propagation, the first output error gradient matrix is the matrix output by the previous computation layer of the first non-matrix multiplication computation layer. The training module is further configured to back-propagate the first non-matrix multiplication computation layer according to the first output error gradient matrix to obtain a second output error gradient matrix in the first data format, where the second output error gradient matrix is used to update the weight matrix. The conversion module is further configured to convert the second output error gradient matrix into a second output error gradient conversion matrix in the second data format. The training module is further configured to back-propagate the first matrix multiplication computation layer according to the second output error gradient conversion matrix and the transposed matrix of the weight conversion matrix to obtain a third output error gradient conversion matrix in the third data format, where the third output error gradient conversion matrix is used to update the parameters corresponding to the previous computation layer of the first matrix multiplication computation layer in the forward computation of the neural network.
With reference to the second aspect, in one possible implementation manner, the data matrix, the weight matrix, or the second output error gradient matrix is target data, and the data conversion matrix, the weight conversion matrix, or the second output error gradient conversion matrix is target conversion data. The conversion module is further configured to convert the target data into the target conversion data by means of away-from-zero carry rounding, where away-from-zero carry rounding means carrying when the most significant bit of the discarded part is 1; and/or to convert the target data into the target conversion data by means of stochastic rounding, where stochastic rounding means normalizing the discarded part to determine a first value and carrying when the first value is greater than or equal to a second value, the second value being a random number greater than 0 and less than 1.
With reference to the second aspect, in one possible implementation manner, each of the first data format or the third data format includes a sign bit, a exponent bit, and a mantissa bit. The second data format includes a symbol field, a bit field, a step field, and a mantissa field, the bit field being used to indicate a bit width occupied by the step field in a total bit width of the second data format.
With reference to the second aspect, in one possible implementation manner, after obtaining the second output error gradient conversion matrix of the second data format, the training module is further configured to input the second output error gradient conversion matrix and the data conversion matrix into the first matrix multiplication computation layer to obtain weight gradient conversion data of the third data format, and update the weight matrix according to the weight gradient conversion data.
With reference to the second aspect, in one possible implementation manner, the matrix multiplication calculation includes at least one of convolution, deconvolution, matrix multiplication, and batch matrix multiplication; the non-matrix multiplication computation includes at least one of an activation function or a normalization function.
In a third aspect of embodiments of the present application, an electronic device is provided that includes a memory for storing a set of computer instructions and at least one processor; the neural network training method of the first aspect or any of the possible implementations of the first aspect is performed when the processor executes the set of computer instructions.
In a fourth aspect of the embodiments of the present application, there is provided a computer-readable storage medium, comprising instructions. The instructions, when executed on a computer, cause the computer to perform the neural network training method as described above in the first aspect or any one of the possible implementations of the first aspect.
In a fifth aspect of embodiments of the present application, there is provided a computer program product, which when run on a computer causes the computer to perform the neural network training method as in the first aspect or any of the possible implementations of the first aspect.
The description of the second to fifth aspects of the present application may refer to the detailed description of the first aspect; also, the advantages described in the second to fifth aspects may refer to the analysis of the advantages of the first aspect, and will not be described here.
Drawings
FIG. 1 is a schematic diagram of a data structure according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a neural network training method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a step-mantissa distribution according to an embodiment of the present application;
FIG. 4 is a flowchart of another neural network training method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a neural network training process according to an embodiment of the present application;
FIG. 6 is a schematic diagram of another neural network training process according to an embodiment of the present application;
FIG. 7 is a schematic diagram of another neural network training process according to an embodiment of the present application;
FIG. 8 is a flowchart of another neural network training method according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a neural network training device according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The making and using of the various embodiments are discussed in detail below. It should be appreciated that the numerous applicable inventive concepts provided by the present application may be embodied in a wide variety of specific contexts. The specific embodiments discussed are merely illustrative of specific ways to make and use the description and technology, and do not limit the scope of the application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art.
Each circuit or other component may be described or referred to as "for" performing one or more tasks. In this case, "for" is used to connote structure by indicating that the circuit/component includes structure (e.g., circuitry) that performs the one or more tasks during operation. Thus, a given circuit/component may be said to be used to perform a task even when the circuit/component is not currently operational (e.g., not switched on). Circuits/components used with the term "for" include hardware, such as circuitry that performs the operations, and the like.
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. In the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship of associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A alone, both A and B, and B alone, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of" the following items or the like means any combination of these items, including any combination of a single item or a plurality of items. For example, at least one of a, b or c may represent: a; b; c; a and b; a and c; b and c; or a, b and c, where a, b and c may be singular or plural. In addition, in the embodiments of the present application, the words "first", "second", and the like do not limit the number and order.
In the present application, the words "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
First, some terms in the present application will be explained in order to be understood by those skilled in the art.
The original code (true form, TF) is a binary fixed-point sign-magnitude representation of numbers in a computer. In the conventional original code representation, one sign bit is placed in front of the magnitude (i.e., the most significant bit of the conventional original code is the sign bit) to represent whether the value is positive or negative. A sign bit of 0 represents a positive number and a sign bit of 1 represents a negative number (including +0 and -0), and the remaining bits of the original code other than the sign bit represent the magnitude of the value, i.e., the amplitude of the original code. For example, the original code 0001 represents +1 and 1011 represents -3. In some embodiments of the present application, the step code field of the floating point number may be encoded as a step code sign bit followed by the original code amplitude, for example {Se, TF[2:end]}, where the step code sign bit Se is the sign bit extracted from the original code and represents whether the value of the step code field is positive or negative, and TF represents the amplitude of the value of the step code field. It should be noted that, with the encoding provided by the present application, for different step code field bit widths the most significant bit of the amplitude TF in the step code field is always 1 (i.e., 1'b1), so the most significant bit 1'b1 does not need to occupy any bit width during encoding; that is, the most significant bit 1'b1 is hidden and not actually stored, and can be directly re-attached during subsequent decoding to obtain {Se, 1'b1, TF[2:end]}. Therefore, storage space can be greatly saved, and the cost of data storage and data transfer is reduced.
Prefix code encoding, i.e., prefix coding. If, in one coding scheme, no code is a prefix (leftmost substring) of any other code, the coding is referred to as a prefix code, for example the unequal-length codes 1, 01, 010, 0011 or 00, 01, 10, 1100, 1101, and the equal-length codes 00, 01, 10, 11, etc. It will be appreciated that equal-length codes are typically prefix codes. Prefix coding ensures that no ambiguity arises when a compressed file is decoded, guaranteeing correct decoding. In some embodiments of the present application, the bit field added to the floating point number may be encoded by prefix encoding; moreover, embodiments of the present application may encode the bit field with different conventional or non-conventional prefix encoding schemes based on actual requirements, which will not be described in detail here; please refer to the following description of the embodiments.
Before describing the embodiments of the present application, the background art to which the present application relates will be described in detail again.
The neural network training process mainly includes forward computation, backward propagation and weight update. The computation modes in the training process include general matrix multiplication (GEMM) and non-general matrix multiplication; a layer corresponding to general matrix multiplication computation may be called a matrix multiplication computation layer, and a layer corresponding to non-general matrix multiplication computation may be called a non-matrix multiplication computation layer. Specifically, general matrix multiplication includes convolution, deconvolution (transposed convolution), matrix multiplication (matmul) and batch matrix multiplication (batch matmul); non-general matrix multiplication includes activation functions such as sigmoid, tanh and relu, normalization functions such as batch normalization, layer normalization and instance normalization, optimizer gradient update calculations, and the like.
General matrix multiplication may be computed with FP16, BF16 or lower-precision data, while non-general matrix multiplication may be computed with FP32 or other higher-precision data. FP16 may also be referred to as half-precision floating point, BF16 as 16-bit brain floating point, and FP32 as single-precision floating point. Specifically, as shown in fig. 1, FP16 includes 16 bits, of which the most significant bit is the sign bit, 5 bits are exponent bits and the remaining 10 bits are mantissa bits used to represent the fraction; BF16 includes 16 bits, of which the most significant bit is the sign bit, 8 bits are exponent bits and the remaining 7 bits are mantissa bits; FP32 includes 32 bits, of which the most significant bit is the sign bit, 8 bits are exponent bits and the remaining 23 bits are mantissa bits.
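As a concrete illustration of these bit layouts, a small Python sketch (the helper names are ours, not from the patent) that splits a raw 16-bit pattern into the sign, exponent and mantissa fields of FP16 and BF16:

```python
def split_fp16(bits: int):
    """FP16: 1 sign bit, 5 exponent bits, 10 mantissa bits."""
    return (bits >> 15) & 0x1, (bits >> 10) & 0x1F, bits & 0x3FF

def split_bf16(bits: int):
    """BF16: 1 sign bit, 8 exponent bits, 7 mantissa bits."""
    return (bits >> 15) & 0x1, (bits >> 7) & 0xFF, bits & 0x7F

# 0x3C00 is the FP16 bit pattern of 1.0: sign 0, biased exponent 15, mantissa 0
assert split_fp16(0x3C00) == (0, 15, 0)
```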
As the scale of neural networks grows, the amount of general matrix multiplication and non-general matrix multiplication computation during training increases, the demand for computing power grows, the power consumption of neural network chips keeps rising, and the training time of neural networks becomes longer and longer.
In order to reduce power consumption during neural network training, one neural network training method defines two low-precision floating point data formats, FP8 (E5M2) and FP8 (E4M3), and trains the neural network with mixed precision, thereby reducing the power consumption of the neural network chip. Specifically, data are converted into the above two low-precision floating point data formats before general matrix multiplication is computed, and are converted into the FP32 or FP16 data format before non-general matrix multiplication is computed. It can be understood that, by performing general matrix multiplication with data in these two low-precision floating point formats rather than with FP16/BF16 data, and because the total bit width of the two low-precision formats is half that of FP16/BF16, the amount of data handled during general matrix multiplication is halved, so the power consumption of the neural network chip can be reduced.
However, converting data into the above two low-precision floating point data formats for general matrix multiplication and then converting them into a higher-precision data format, for example FP16/BF16, for non-general matrix multiplication causes precision loss during the data format conversion and reduces the training precision of the neural network. In order to reduce this precision loss, the method introduces a scaling operation, but the distribution of the data generated by the tensor cores then needs to be statistically analyzed after the scaling operation, which increases the power consumption of the neural network chip. Moreover, the two low-precision floating point formats differ in precision and dynamic range, so the user has to select one of the two data formats for neural network training; at present the method provides no guidance for this selection, resulting in poor user experience and poor generalization.
Another neural network training method defines a block floating point representation for low-precision data and trains the neural network with mixed precision, thereby reducing the power consumption of the neural network chip. Specifically, data are converted into the block floating point data format before general matrix multiplication is computed, and into the FP32 or FP16 data format before non-general matrix multiplication is computed. When converting data into the block floating point data format, the high-precision data are first divided into a plurality of blocks, the data distribution within each block is then analyzed, and finally the data in a block are split into a common exponent and a sign and mantissa for each datum. Because the data in each block share a common exponent, the total bit width of each datum in the block is reduced, so training with data in the block floating point data format can reduce the power consumption of the neural network chip.
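To make the block floating point representation concrete, here is a simplified Python sketch; the block size, mantissa width and function name are illustrative assumptions and are not taken from either training method:

```python
import math

def to_block_float(values, block_size=16, mantissa_bits=7):
    """Convert a flat list into block floating point: each block of `block_size`
    elements shares one exponent, and each element keeps a signed integer mantissa."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_mag = max(abs(v) for v in block) or 1.0        # avoid log2(0) for all-zero blocks
        shared_exp = math.floor(math.log2(max_mag))        # common exponent of the block
        step = 2.0 ** (shared_exp - mantissa_bits)         # value of one mantissa unit
        mantissas = [round(v / step) for v in block]       # per-element sign + mantissa
        blocks.append((shared_exp, mantissas))
    return blocks
```

The sketch also shows why the expression range is limited: every element in a block must be represented relative to the block's single shared exponent.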
However, when training with data in the block floating point data format, the limited expression range of the block floating point data format causes precision loss and thus reduces the training precision of the neural network. Meanwhile, different computation layers of the neural network require different block division schemes, which complicates the algorithm flow; quantization (quant) and dequantization (dequant) operations have to be introduced before and after the general matrix multiplication, and the data distribution has to be statistically analyzed, which affects the training performance of the neural network.
In summary, the existing neural network training methods can reduce the power consumption of the neural network chip when training the neural network, but they reduce the training precision of the neural network. Therefore, the embodiment of the application provides a neural network training method, which can reduce the power consumption during training of the neural network, shorten the training time, improve the training precision of the neural network and improve the performance of the neural network. The neural network training method can be applied to neural network chips such as a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU) and a neural network processing unit (NPU).
As shown in fig. 2, a flow chart of a neural network training method provided by an embodiment of the present application is applied to a neural network including at least one matrix multiplication calculation layer; the embodiment of the present application does not limit the specific number of matrix multiplication calculation layers included in the neural network. The method includes steps S201 to S204.
S201, converting the data matrix and the weight matrix of the first data format into a data conversion matrix and a weight conversion matrix of the second data format. The total bit width of the second data format is less than the total bit width of the first data format.
Optionally, the matrix multiplication includes at least one of convolution, deconvolution, matrix multiplication and batch matrix multiplication, and the embodiment of the present application is not limited to the specific type of the matrix multiplication, and the matrix multiplication may be referred to as a general matrix multiplication.
The first data format includes sign bits, exponent bits, and mantissa bits.
Optionally, the first data format may be other data formats with higher precision, such as FP16, FB16, or FP32, and the specific type of the first data format in the embodiment of the present application is not limited.
The second data format includes a symbol field, a bit field, a step field, and a mantissa field, where the bit field is used to indicate a bit width occupied by the step field in a total bit width of the second data format. The embodiment of the application is not limited to the bit number specifically included in the total bit width of the second data format, and is not limited to the bit number specifically included in the bit position domain.
By way of example, when the total bit width of the first data format is 16 bits, the total bit width of the second data format may be 8 bits.
For example, taking the example that the total bit width of the second data format includes 8 bits, the coding manner of the second data format may be as shown in table 1.
TABLE 1

Symbol field: 1 bit
Bit field: 2 bits or 3 bits (prefix code indicating the value D)
Step code field: D bits, D ∈ [0, 4]
Mantissa field: remaining bits, i.e., (8-1-2-D) bits or (8-1-3-D) bits
The symbol field occupies a bit width of 1 bit and represents whether the data are positive or negative; by default 0 represents positive and 1 represents negative, although 0 may represent negative and 1 positive according to actual requirements, which is not specifically limited in the embodiment of the present application. The bit field may occupy 2 bits or 3 bits and represents 5 different values D (the 5 values in [0, 4]); the value D represents the bit width of the step code field, the bit width of the step code field varies with the value D of the bit field, and the remaining bit width of the 8-bit total, (8-1-2-D) or (8-1-3-D) bits, is reserved for the mantissa field.
The bit field is encoded with a prefix code: 2 bits are used for D in {2, 3, 4} and 3 bits for D in {0, 1}, as shown in the example encoding in table 2.
TABLE 2

Bit field code "11" (2 bits): D = 4
Bit field code "10" (2 bits): D = 3
Bit field code "01" (2 bits): D = 2
Bit field code "001" (3 bits): D = 1
Bit field code "000" (3 bits): D = 0
As can be seen from table 2, when the bit field occupies 2 bits, the code "11" represents the value 4, "10" represents 3 and "01" represents 2; when the bit field occupies 3 bits, the code "001" represents the value 1 and "000" represents 0. The encoding scheme shown in table 2 is merely an exemplary illustration and does not limit the embodiment of the present application.
Further, the code value distribution of the step code field is shown in table 3 below.
TABLE 3

D = 0: Es = 0, Ei = 0, Ev = 0, mantissa field 4 bits
D = 1: Es = {Se}, Ei = {Se, 1}, Ev = ±1, mantissa field 3 bits
D = 2: Es = {Se, TF[2]}, Ei = {Se, 1, TF[2]}, Ev = ±[2, 3], mantissa field 3 bits
D = 3: Es = {Se, TF[2:3]}, Ei = {Se, 1, TF[2:3]}, Ev = ±[4, 7], mantissa field 2 bits
D = 4: Es = {Se, TF[2:4]}, Ei = {Se, 1, TF[2:4]}, Ev = ±[8, 15], mantissa field 1 bit
Here Es represents the encoded value of the step code field, Ei represents the decoded value of the step code field, and Ev represents the number of positions by which the radix point is shifted. Se is the sign bit of Ei, which may also be called the step code sign bit of the step code field; Se occupies 1 bit of the bit width D of the step code field and represents whether the value of the step code field is positive or negative. By default Se of 0 represents positive and Se of 1 represents negative, although 0 may represent negative and 1 positive according to actual requirements, which is not specifically limited in the embodiment of the present application.
As shown in table 3, when D=0 the mantissa field bit width is 8-1-3-0=4; when D=1 it is 8-1-3-1=3; when D=2 it is 8-1-2-2=3; when D=3 it is 8-1-2-3=2; and when D=4 it is 8-1-2-4=1. It will be appreciated that the smaller the bit width of the step code field (i.e., the smaller the value range), the larger the bit width occupied by the mantissa field and the higher the numerical precision; conversely, when the bit width of the step code field is larger (i.e., the larger the value range), the bit width occupied by the mantissa field is smaller and the numerical precision is lower.
As can be seen from table 3, when D=0, the encoded value of the step code field is Es=0; when D=1, Es={Se}; and when D>1, Es={Se, TF[2:end]}. The most significant bit 1 of TF is hidden and not stored. TF[2:end] includes the remaining bits of TF other than the most significant bit, i.e., the second bit to the last bit. As shown in table 3, the end value is the value D currently corresponding to the step code field, so TF[2:end] can also be written as TF[2:D], indicating that the second to the D-th bits of TF are included. For example, as shown in table 3, when D=2, Es={Se, TF[2]}; when D=3, Es={Se, TF[2:3]}; and when D=4, Es={Se, TF[2:4]}.
Parsing Es gives the decoded value Ei of the step code field: when D=0, Ei=0; when D=1, Ei={Se, 1}; and when D>1, Ei={Se, 1, TF[2:end]}. For example, when D=2, Ei={Se, 1, TF[2]}; when D=3, Ei={Se, 1, TF[2:3]}; and when D=4, Ei={Se, 1, TF[2:4]}.
At this time, the true value corresponding to the data may be normalized as expressed by the following equation:
X = (-1)^S × 2^(Ei+Ec) × (1+M)
Here S is the value (0 or 1) of the symbol field, Ei+Ec=Ev, Ec is a preset level center (generally 0, although it may be set to a value such as -2, 2 or 3 according to actual requirements), and M represents the decoded value of the mantissa field. As can be seen from table 3, when D=0, Ev=0; when D=1, Ev=±1; when D=2, Ev=±[2,3]; when D=3, Ev=±[4,7]; and when D=4, Ev=±[8,15].
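To make the decoding concrete, the following Python sketch decodes one 8-bit value of the second data format under the prefix codes of table 2 and a level center Ec of 0; it is an illustrative reconstruction from the description above (the special values discussed below, such as zero, NaN and ±∞, are omitted) and not a normative definition:

```python
def decode_tp8(byte: int, ec: int = 0) -> float:
    """Decode one 8-bit value: sign (1 bit) | prefix-coded bit field |
    step code field (D bits) | mantissa field (remaining bits)."""
    bits = f"{byte & 0xFF:08b}"
    sign = -1.0 if bits[0] == "1" else 1.0

    # Prefix-coded bit field (table 2): "11"->D=4, "10"->D=3, "01"->D=2, "001"->D=1, "000"->D=0
    if bits[1:3] == "11":
        d, pos = 4, 3
    elif bits[1:3] == "10":
        d, pos = 3, 3
    elif bits[1:3] == "01":
        d, pos = 2, 3
    elif bits[1:4] == "001":
        d, pos = 1, 4
    else:  # "000"
        d, pos = 0, 4

    # Step code field: sign bit Se plus (D-1) stored amplitude bits; the leading 1 of TF is hidden
    if d == 0:
        ei = 0
    else:
        se = -1 if bits[pos] == "1" else 1
        tf_tail = bits[pos + 1:pos + d]        # TF[2:end], empty when D=1
        ei = se * int("1" + tf_tail, 2)        # re-attach the hidden leading 1
    pos += d

    # Mantissa field: whatever bit width remains
    m_bits = bits[pos:]
    m = int(m_bits, 2) / (1 << len(m_bits)) if m_bits else 0.0

    return sign * 2.0 ** (ei + ec) * (1.0 + m)

# 0b00000000: D=0, Ei=0, M=0 -> +2^0 * 1.0 = 1.0 with Ec = 0
assert decode_tp8(0b00000000) == 1.0
```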
As shown in fig. 3, a mantissa-step code distribution diagram is provided in the embodiment of the present application, and according to fig. 3, it can be known that the smaller the absolute value of the step code, the larger the bit width of the mantissa field, so that the data in the second data format may also be referred to as cone precision floating point data.
Alternatively, in some possible embodiments, the value X of the floating point representation may be selected to represent various types of special values by custom settings in addition to the normalized representation described above.
For example, when S=0, D=4, Es=4'b1111=-15 and M=1'b0, X represents zero; when S=1, D=4, Es=4'b1111=-15 and M=1'b0, X may represent a non-numerical value (not a number, NaN); and when D=4, Es=4'b0111=15 and M=1'b1, X represents positive or negative infinity (±∞).
It can be appreciated that, because the bit field is defined in the encoding process of the second data format to indicate the bit width occupied by the step code field in the total bit width of the second data format, the numerical range and numerical precision of the converted floating point number can be improved as much as possible without increasing the total bit width.
S202, forward calculating at least one matrix multiplication calculation layer by layer according to the data conversion matrix and the weight conversion matrix to obtain error data.
Specifically, when forward calculation is performed layer by layer according to the data conversion matrix and the weight conversion matrix, before each matrix multiplication calculation layer is calculated forward, the data are converted into a data matrix in the second data format, and the matrix multiplication calculation layer is then calculated forward.
It can be appreciated that, because the total bit width of the second data format is smaller than the total bit width of the first data format, the forward computation is performed by adopting the data of the second data format, and compared with the forward computation matrix multiplication computation layer by adopting the data of the first data format, the power consumption in training the neural network can be reduced, and the training time is reduced. In addition, in the encoding process of the second data format, the bit width occupied by the bit position field in the total bit width of the second data format is indicated by defining the bit position field, so that the numerical range and the numerical precision of the converted floating point number can be improved as much as possible on the premise of not additionally increasing the total bit width, and therefore, the training precision of the neural network can be improved and the performance of the neural network can be improved by adopting the data of the second data format to perform forward computation.
In a possible embodiment, the neural network further includes a non-matrix multiplication layer located after any one of the at least one matrix multiplication layer, and step S202 further includes forward computing the at least one matrix multiplication layer and the at least one non-matrix multiplication layer by layer according to the data conversion matrix and the weight conversion matrix.
Optionally, the non-matrix multiplication includes at least one of an activation function or a normalization function. Wherein the activation function comprises at least one of sigmoid, tanh, and relu and the normalization function comprises at least one of batch normalization, layer normalization, and instance normalization. The embodiment of the application is not limited to the specific type of the non-matrix multiplication calculation, and the non-matrix multiplication calculation may also be called as a non-universal matrix multiplication calculation.
Specifically, when the at least one non-matrix multiplication calculation layer is calculated forward layer by layer according to the data conversion matrix and the weight conversion matrix, the data are converted into data in the first data format before the non-matrix multiplication calculation layer is calculated forward.
It can be appreciated that, since the total bit width of the second data format is smaller than the total bit width of the first data format, the data forward computing non-matrix multiplication computation layer adopting the first data format can improve the training precision of the neural network and can improve the performance of the neural network compared with the data forward computing non-matrix multiplication computation layer adopting the second data format.
S203, back-propagating at least one matrix multiplication calculation layer by layer according to the error data to determine an error gradient matrix corresponding to each matrix multiplication calculation layer in the at least one matrix multiplication calculation layer.
In one possible embodiment, the neural network further includes a non-matrix multiplication calculation layer located after any one of the at least one matrix multiplication calculation layer, and step S203 further includes back-propagating the at least one matrix multiplication calculation layer and the at least one non-matrix multiplication calculation layer by layer according to the error data to determine an error gradient matrix corresponding to each non-matrix multiplication calculation layer in the at least one non-matrix multiplication calculation layer, where the error gradient matrix corresponding to the previous non-matrix multiplication calculation layer in the at least one non-matrix multiplication calculation layer is used to update the parameters corresponding to the current calculation layer.
The data corresponding to the at least one matrix multiplication computation layer in the back propagation is data in the second data format, and the data corresponding to the non-matrix multiplication computation layer in the back propagation is data in the first data format.
It can be appreciated that, since the total bit width of the second data format is smaller than the total bit width of the first data format, the data training matrix multiplication computation layer adopting the second data format can reduce the power consumption when training the neural network and can reduce the training time compared with the data training matrix multiplication computation layer adopting the first data format. Compared with the data training non-matrix multiplication calculation layer adopting the second data format, the training precision of the neural network can be improved, and the performance of the neural network can be improved.
In one possible embodiment, after the error data are scaled, the at least one matrix multiplication calculation layer is back-propagated layer by layer according to the scaled error data to determine an error gradient matrix corresponding to each matrix multiplication calculation layer in the at least one matrix multiplication calculation layer.
Specifically, the error data may be multiplied by an appropriate value for scaling; scaling the error data before calculation reduces rounding errors in the data conversion process, so the training precision of the neural network can be further improved.
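As a minimal sketch of this scaling step, assuming the back-propagation routine returns a dictionary of gradients and that a fixed scaling factor has been chosen beforehand (both the names and the factor are illustrative):

```python
SCALE = 1024.0  # illustrative factor; in practice chosen so the scaled error data
                # stay within the representable range of the second data format

def backward_with_scaling(error_data, backprop_fn):
    """Scale the error data before back-propagation, then unscale the resulting
    gradients so the parameter update itself is numerically unchanged."""
    scaled_grads = backprop_fn(error_data * SCALE)   # low-precision backward pass
    return {name: g / SCALE for name, g in scaled_grads.items()}
```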
S204, updating parameters corresponding to the current calculation layer according to the error gradient matrix corresponding to the previous matrix multiplication calculation layer in the at least one matrix multiplication calculation layer.
Alternatively, the current calculation layer may be a matrix multiplication calculation layer or a non-matrix multiplication calculation layer, which is not limited by the embodiment of the present application.
Alternatively, the matrix multiplication calculation layer and the non-matrix multiplication calculation layer may be located in different computation units of one neural network chip, or in different neural network chips, which is not limited by the embodiment of the present application.
It can be appreciated that the embodiment of the application performs matrix multiplication calculation layer training by adopting the data in the second data format, so that the data transmitted in the training process is the data in the second data format, and the requirements on bandwidth and storage space in the neural network training process can be reduced.
According to the neural network training method provided by the embodiment of the application, the data matrix and the weight matrix in the first data format are converted into the data conversion matrix and the weight conversion matrix in the second data format, and the at least one matrix multiplication calculation layer is calculated forward layer by layer according to the data conversion matrix and the weight conversion matrix. Training the matrix multiplication calculation layer with data in the second data format and the non-matrix multiplication calculation layer with data in the first data format reduces the power consumption of training the neural network and shortens the training time. In addition, because the bit field defined in the encoding process of the second data format indicates the bit width occupied by the step code field in the total bit width of the second data format, the numerical range and numerical precision of the converted floating point number can be improved as much as possible without additionally increasing the total bit width, so the training precision of the neural network and the performance of the neural network can be improved.
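Pulling S201 to S204 together, a schematic NumPy training step is sketched below for one matrix multiplication calculation layer followed by a ReLU as the non-matrix multiplication calculation layer. FP32 is assumed as the first data format, and np.float16 is used only as a runnable placeholder for the narrower second data format; every helper and variable name is illustrative and not part of the claimed method:

```python
import numpy as np

def to_second_format(x):
    # Placeholder for the S201 conversion: NumPy has no 8-bit float, so a lossy
    # cast to float16 stands in for the second data format here.
    return x.astype(np.float16)

def to_first_format(x):
    return x.astype(np.float32)   # first data format (FP32 assumed for this sketch)

def train_step(x, w, lr=1e-3):
    """One schematic iteration of S201-S204."""
    # S201: convert the data matrix and the weight matrix to the narrower format
    x_q, w_q = to_second_format(x), to_second_format(w)

    # S202: forward calculation; the matrix multiplication layer uses converted data,
    # the non-matrix multiplication layer (ReLU) runs in the first data format
    y = to_first_format(x_q) @ to_first_format(w_q)
    a = np.maximum(y, 0.0)
    err = a - 1.0                                  # toy error data from a dummy loss

    # S203: back-propagate layer by layer; matmul gradients use converted data
    grad_y = err * (y > 0)                         # through the ReLU, first format
    grad_w = to_first_format(x_q).T @ to_first_format(to_second_format(grad_y))

    # S204: update the parameters of the matrix multiplication calculation layer
    return w - lr * grad_w

# Example usage:
# w_new = train_step(np.random.randn(4, 8).astype(np.float32),
#                    np.random.randn(8, 3).astype(np.float32))
```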
In a possible embodiment, the neural network includes a first matrix multiplication calculation layer and a first non-matrix multiplication calculation layer, as shown in fig. 4, and the embodiment of the present application further provides a neural network training method, which includes steps S401 to S405.
S401, forward-calculating the first matrix multiplication computation layer according to the data conversion matrix and the weight conversion matrix to obtain a first output conversion matrix in a third data format. The total bit width of the third data format is greater than or equal to the total bit width of the first data format.
The third data format includes sign bits, exponent bits, and mantissa bits.
Optionally, the third data format may be another data format with higher precision, such as FP32; the embodiment of the present application does not limit the specific type of the third data format.
For example, when the first data format is FP16, the third data format may be FP32.
It can be understood that forward-calculating the first matrix multiplication computation layer with the data conversion matrix and the weight conversion matrix, which have a smaller total bit width, reduces the power consumption and the training time of the neural network. In addition, the encoding of the second data format defines a bit field that indicates the bit width occupied by the step (exponent) field within the total bit width of the second data format, so the numeric range and precision of the converted floating-point numbers can be made as large as possible without increasing the total bit width, which improves the training accuracy and therefore the performance of the neural network.
S402, forward-calculating the first non-matrix multiplication computation layer according to the first output conversion matrix to obtain a second output matrix in the first data format. During forward calculation, the second output matrix serves as the input matrix of the computation layer following the first non-matrix multiplication computation layer.
It can be appreciated that forward-calculating the non-matrix multiplication computation layer with the first output conversion matrix in the third data format improves the training accuracy, and therefore the performance of the neural network, compared with forward-calculating it with a first output conversion matrix in the second data format.
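The following Python sketch illustrates steps S401-S402 under stated assumptions: FP16 stands in for the first data format, FP32 for the third data format, and to_second_format is a crude stand-in that merely truncates the mantissa; it does not reproduce the bit-field encoding of the second data format. All variable names and shapes are illustrative.

    import numpy as np

    def to_second_format(x, mantissa_bits=3):
        # Crude stand-in for conversion to the narrower second data format:
        # keep only a few mantissa bits by rounding (illustrative only).
        scale = 2.0 ** np.floor(np.log2(np.abs(x) + 1e-30))
        return (np.round(x / scale * 2**mantissa_bits) / 2**mantissa_bits * scale).astype(np.float32)

    data = np.random.randn(4, 8).astype(np.float16)      # data matrix, first data format (FP16)
    weight = np.random.randn(8, 8).astype(np.float16)    # weight matrix, first data format

    data_conv = to_second_format(data.astype(np.float32))      # data conversion matrix
    weight_conv = to_second_format(weight.astype(np.float32))  # weight conversion matrix

    # S401: the matrix multiplication computation layer accumulates in the wider third format.
    first_output_conv = (data_conv @ weight_conv).astype(np.float32)

    # S402: the non-matrix multiplication computation layer (a ReLU here, as an example)
    # produces the second output matrix in the first data format.
    second_output = np.maximum(first_output_conv, 0).astype(np.float16)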
S403, acquiring a first output error gradient matrix in the third data format, wherein the first output error gradient matrix is used for updating parameters corresponding to the first non-matrix multiplication computation layer. During back propagation, the first output error gradient matrix is the matrix output by the previous computation layer of the first non-matrix multiplication computation layer.
The first output error gradient matrix is determined by the layer-by-layer back propagation of the error data. Specifically, the previous computation layer of the first non-matrix multiplication computation layer may use a back propagation algorithm to determine the first output error gradient matrix, for example chain-rule differentiation; the embodiment of the present application does not limit which back propagation algorithm is used.
S404, back-propagating the first non-matrix multiplication calculation layer according to the first output error gradient matrix to obtain a second output error gradient matrix of the first data format, wherein the second output error gradient matrix is used for updating the weight matrix.
S405, converting the second output error gradient matrix into a second output error gradient conversion matrix in the second data format, and back-propagating the first matrix multiplication computation layer according to the second output error gradient conversion matrix and the transposed weight conversion matrix to obtain a third output error gradient conversion matrix in the third data format. The third output error gradient conversion matrix is used for updating the parameters corresponding to the computation layer that precedes the first matrix multiplication computation layer in forward calculation.
It can be appreciated that back-propagating the first matrix multiplication computation layer according to the second output error gradient conversion matrix and the transposed weight conversion matrix reduces the power consumption and the training time of the neural network compared with back-propagating it with the second output error gradient matrix in the first data format.
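Continuing the sketch above, steps S403-S405 might look as follows; the ReLU derivative and the array shapes are carried over from the assumed forward pass, and again FP16 and FP32 stand in for the first and third data formats.

    # S403: first output error gradient matrix arrives in the third data format (FP32 here).
    first_out_err_grad = np.random.randn(4, 8).astype(np.float32)

    # S404: back-propagate the non-matrix multiplication layer (ReLU derivative here) to get
    # the second output error gradient matrix in the first data format.
    second_out_err_grad = (first_out_err_grad * (first_output_conv > 0)).astype(np.float16)

    # S405: convert it into the second data format and back-propagate the matrix
    # multiplication layer with the transposed weight conversion matrix.
    second_out_err_grad_conv = to_second_format(second_out_err_grad.astype(np.float32))
    third_out_err_grad_conv = (second_out_err_grad_conv @ weight_conv.T).astype(np.float32)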
It should be noted that the description of the neural network training method shown in fig. 2 provided above may be referred to for the neural network training method shown in fig. 4, and details are not repeated here.
According to the neural network training method provided by the embodiment of the application, training the matrix multiplication computation layers with data in the second data format reduces the power consumption and the training time of the neural network. Meanwhile, training the non-matrix multiplication computation layers with data in the first data format or the third data format improves the training accuracy, and therefore the performance, of the neural network compared with training them with data in the second data format.
In one possible embodiment, the first matrix multiplication computation layer is any matrix multiplication computation layer of the plurality of matrix multiplication computation layers included in the neural network other than the last matrix multiplication computation layer.
Training the last matrix multiplication computation layer with high-precision data can further improve the training accuracy of the neural network. For example, when training a neural network for image processing, training the last matrix multiplication computation layer with high-precision data improves the training accuracy of the neural network.
The training process of the neural network can be represented by the flow chart shown in fig. 5. In the figure, L-1 denotes the training process corresponding to the matrix multiplication computation layer preceding the first matrix multiplication computation layer in forward calculation, L+1 denotes the training process corresponding to the matrix multiplication computation layer following it, and the training processes of L-1 and L+1 are the same as the training process of the data matrix and the weight matrix corresponding to the first matrix multiplication computation layer.
Alternatively, the neural network training method provided in the embodiment of the present application may be applied to a neural network that is about to start training, or to a neural network that has already been partially trained, which is not limited by the embodiment of the present application.
For example, when the neural network training method is applied to a neural network that is about to start training, parameters such as the weight matrix, bias parameters, normalization layer parameters, and optimizer parameters of the neural network can be initialized first, and the above neural network training method is then performed in a loop to complete the training of the neural network.
For another example, when the neural network training method is applied to a partially trained neural network, parameters such as the data matrix and the weight matrix corresponding to each matrix multiplication computation layer in the neural network can be acquired first, and the neural network training method is then performed in a loop to complete the training of the neural network.
It can be understood that switching the training of a partially trained neural network to the neural network training method provided by the embodiment of the application reduces the power consumption of training while improving the training accuracy of the neural network.
According to the neural network training method provided by the embodiment of the application, forward calculation and back propagation of the matrix multiplication computation layers are performed with data in the second data format, which has a smaller total bit width, so the power consumption and the training time of the neural network are reduced. Meanwhile, the encoding of the second data format defines a bit field that indicates the bit width occupied by the step (exponent) field within the total bit width of the second data format, so the numeric range and precision of the converted floating-point numbers can be made as large as possible without increasing the total bit width, which improves the training accuracy and therefore the performance of the neural network. In addition, training the non-matrix multiplication computation layers with high-precision data in the first data format or the third data format further improves the training accuracy of the neural network.
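The patent does not fix a concrete bit layout for the second data format, so the following decoder is only a sketch of the bit-field idea under an assumed 8-bit layout: one sign bit, a 2-bit bit field giving the step (exponent) width, then the step bits, with the remaining bits as the mantissa. The field widths, bias, and bit positions are all assumptions made for illustration.

    def decode_second_format(byte_value):
        # Assumed 8-bit layout (illustrative only): bit 7 = sign, bits 6-5 = bit field
        # giving the exponent width (1..4 bits), then exponent bits, then mantissa bits.
        sign = -1.0 if (byte_value >> 7) & 0x1 else 1.0
        exp_width = ((byte_value >> 5) & 0x3) + 1     # bit field selects the step-field width
        mant_width = 5 - exp_width                    # remaining bits hold the mantissa
        exponent = (byte_value >> mant_width) & ((1 << exp_width) - 1)
        mantissa = byte_value & ((1 << mant_width) - 1)
        bias = (1 << (exp_width - 1)) - 1             # IEEE-style bias, assumed
        return sign * (1.0 + mantissa / (1 << mant_width)) * 2.0 ** (exponent - bias)

    # A larger bit-field value widens the exponent (more numeric range) at the cost of
    # mantissa precision, without changing the total bit width.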
In one possible embodiment, the data matrix, the weight matrix, or the second output error gradient matrix may be used as target data, and the data conversion matrix, the weight conversion matrix, or the second output error gradient conversion matrix may be used as target conversion data. Converting the target data in the first data format into the target conversion data in the second data format includes: converting the target data into the target conversion data by rounding away from zero on carry (round half away from zero, TA) and/or by stochastic rounding (SR).
Specifically, rounding away from zero on carry means carrying when the most significant bit of the discarded portion is 1 and not carrying when the most significant bit of the discarded portion is 0.
For example, when the lower 3 bits of "110111" are discarded, the most significant bit of the discarded portion is 1, so a carry of 1 is added to give "111" (that is, "111000" before the lower bits are dropped), and the data is then converted into the second data format.
Stochastic rounding means normalizing the discarded portion to determine a first value and carrying when the first value is greater than or equal to a second value, where the second value is a random number greater than 0 and less than 1.
For example, the discarded portion of the data is normalized to determine a first value of 0.5, and a random number 0.32 is generated as the second value; since the first value 0.5 is greater than the second value 0.32, a carry of 1 is added, and the carried data is then converted into the second data format.
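The two rounding modes can be sketched on integer bit patterns as follows; the function names and the use of Python's random module are assumptions for illustration, and the "110111" case reproduces the example above.

    import random

    def round_away_ta(value_bits, drop_bits):
        # Rounding away from zero on carry (TA): carry when the most significant
        # discarded bit is 1, otherwise truncate.
        kept = value_bits >> drop_bits
        msb_of_dropped = (value_bits >> (drop_bits - 1)) & 0x1
        return kept + msb_of_dropped

    def round_stochastic_sr(value_bits, drop_bits, rng):
        # Stochastic rounding (SR): normalize the discarded part to a first value
        # in [0, 1) and carry when it is >= a random second value.
        kept = value_bits >> drop_bits
        dropped = value_bits & ((1 << drop_bits) - 1)
        first_value = dropped / (1 << drop_bits)
        second_value = rng.random()          # random number in [0, 1)
        return kept + (1 if first_value >= second_value else 0)

    rng = random.Random(0)
    print(bin(round_away_ta(0b110111, 3)))           # 0b111: the "110111" example above
    print(bin(round_stochastic_sr(0b110100, 3, rng)))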
Alternatively, forward calculation and back propagation may convert the target data in the first data format into the target conversion data in the second data format using the same rounding mode, or using different rounding modes, which is not limited by the embodiment of the present application.
Example one: as shown in fig. 5, both forward calculation and back propagation may convert the target data in the first data format into the target conversion data in the second data format by rounding away from zero (TA).
Example two: as shown in fig. 6, forward calculation may convert the target data in the first data format into the target conversion data in the second data format by rounding away from zero (TA), while back propagation may do so by stochastic rounding (SR).
Example three: as shown in fig. 7, both forward calculation and back propagation may convert the target data in the first data format into the target conversion data in the second data format by stochastic rounding (SR).
According to the neural network training method provided by the embodiment of the application, converting the target data in the first data format into the target conversion data in the second data format by rounding away from zero on carry and/or stochastic rounding reduces the conversion error and its influence on the training accuracy of the neural network, so the training accuracy of the neural network can be further improved.
In one possible embodiment, as shown in fig. 8, after the second output error gradient conversion matrix in the second data format is obtained, the neural network training method further includes steps S406 to S407 following step S405.
S406, inputting the second output error gradient conversion matrix and the data conversion matrix into the first matrix multiplication calculation layer to obtain weight gradient conversion data in a third data format.
S407, updating the weight matrix according to the weight gradient conversion data.
Specifically, the training process of the neural network may be represented by the flowchart shown in fig. 5. For example, the optimizer may update the weight matrix according to the weight gradient conversion data; the embodiment of the application does not limit the specific calculation used by the optimizer to update the weight matrix.
In one possible embodiment, as shown in fig. 5, the weight matrix and other parameters may be converted into data in the first data format and backed up in a local storage device. For example, the weight matrix in the first data format may be stored in a first storage unit W-master, the state parameters of the optimizer may be stored in a second storage unit Momentum, and the other parameters may be stored in a third storage unit Other states.
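Continuing the earlier sketch, steps S406-S407 together with the W-master backup might look as follows; the plain SGD rule and the learning rate are assumptions, since the embodiment does not fix a particular optimizer.

    # S406: weight gradient conversion data in the third data format (FP32 here).
    weight_grad_conv = (data_conv.T @ second_out_err_grad_conv).astype(np.float32)

    # S407: the optimizer updates the weight matrix; a plain SGD step is assumed here.
    lr = 0.01
    w_master = weight.astype(np.float16)     # W-master backup, first data format
    w_master = (w_master.astype(np.float32) - lr * weight_grad_conv).astype(np.float16)

    # The updated master weights are converted back for the next training iteration.
    weight_conv = to_second_format(w_master.astype(np.float32))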
According to the neural network training method provided by the embodiment of the application, performing the matrix multiplication computation layer calculation with the second output error gradient conversion matrix and the data conversion matrix in the second data format reduces the power consumption of training the neural network. Meanwhile, updating the weight matrix with the weight gradient conversion data in the third data format preserves the training accuracy of the neural network.
As shown in fig. 9, an embodiment of the present application further provides a neural network training device 900, where the neural network training device 900 includes a conversion module 910 and a training module 920.
The conversion module 910 is configured to convert the data matrix and the weight matrix in the first data format into a data conversion matrix and a weight conversion matrix in a second data format, where the total bit width of the second data format is smaller than the total bit width of the first data format. The training module 920 is configured to forward-calculate at least one matrix multiplication computation layer layer by layer according to the data conversion matrix and the weight conversion matrix to obtain error data. The training module 920 is further configured to back-propagate the at least one matrix multiplication computation layer layer by layer according to the error data to determine an error gradient matrix corresponding to each of the at least one matrix multiplication computation layer. The training module 920 is further configured to update parameters corresponding to the current computation layer according to the error gradient matrix corresponding to the previous matrix multiplication computation layer in the at least one matrix multiplication computation layer. The data corresponding to the at least one matrix multiplication computation layer during forward calculation and back propagation is data in the second data format.
In one possible embodiment, the neural network further includes a non-matrix multiplication computation layer located after any one of the at least one matrix multiplication computation layer, and the training module 920 is specifically configured to forward calculate the at least one matrix multiplication computation layer and the at least one non-matrix multiplication computation layer by layer according to the data transformation matrix and the weight transformation matrix. The training module 920 is further configured to counter-propagate, layer by layer, at least one matrix-by-computation layer and at least one non-matrix-by-computation layer according to the error data, so as to determine an error gradient matrix corresponding to each non-matrix-by-computation layer in the at least one non-matrix-by-computation layer, and update a parameter corresponding to a current computation layer by using an error gradient matrix corresponding to a previous non-matrix-by-computation layer in the at least one non-matrix-by-computation layer. The data corresponding to the non-matrix multiplication calculation layer in forward calculation and backward propagation are data in a first data format.
In one possible embodiment, the neural network includes a first matrix-by-compute layer and a first non-matrix-by-compute layer.
The training module 920 is further configured to forward-calculate the first matrix multiplication computation layer according to the data conversion matrix and the weight conversion matrix to obtain a first output conversion matrix in a third data format, where the total bit width of the third data format is greater than or equal to the total bit width of the first data format. The training module 920 is further configured to forward-calculate the first non-matrix multiplication computation layer according to the first output conversion matrix to obtain a second output matrix in the first data format, where the second output matrix serves, during forward calculation, as the input matrix of the computation layer following the first non-matrix multiplication computation layer. The training module 920 is further configured to obtain a first output error gradient matrix in the third data format, where the first output error gradient matrix is used to update parameters corresponding to the first non-matrix multiplication computation layer; during back propagation, the first output error gradient matrix is the matrix output by the previous computation layer of the first non-matrix multiplication computation layer. The training module 920 is further configured to back-propagate the first non-matrix multiplication computation layer according to the first output error gradient matrix to obtain a second output error gradient matrix in the first data format, where the second output error gradient matrix is used to update the weight matrix. The conversion module 910 is further configured to convert the second output error gradient matrix into a second output error gradient conversion matrix in the second data format. The training module 920 is further configured to back-propagate the first matrix multiplication computation layer according to the second output error gradient conversion matrix and the transposed weight conversion matrix to obtain a third output error gradient conversion matrix in the third data format, where the third output error gradient conversion matrix is used to update parameters corresponding to the computation layer that precedes the first matrix multiplication computation layer in forward calculation.
The first data format or the third data format includes a symbol field, a step field, and a mantissa field. The second data format includes a symbol field, a bit field, a step field, and a mantissa field, where the bit field indicates the bit width occupied by the step field in the total bit width of the second data format.
The data matrix, the weight matrix, or the second output error gradient matrix is used as target data, the data conversion matrix, the weight conversion matrix, or the second output error gradient conversion matrix is used as target conversion data, the conversion module 910 is further configured to convert the target data into target conversion data in a mode of rounding away from 0 carry, where rounding away from 0 carry means carry when the most significant bit of the reject part is 1; and/or converting the target data into target conversion data by adopting a random rounding mode, wherein the random rounding refers to normalizing the reject part to determine a first numerical value, carrying when the first numerical value is greater than or equal to a second numerical value, and the second numerical value is a random number which is greater than 0 and less than 1.
After obtaining the second output error gradient conversion matrix of the second data format, the training module 920 is further configured to input the second output error gradient conversion matrix and the data conversion matrix into the first matrix multiplication calculation layer to obtain weight gradient conversion data of the third data format, and update the weight matrix according to the weight gradient conversion data.
Optionally, the first matrix multiplication computation layer is any matrix multiplication computation layer other than the last matrix multiplication computation layer among the plurality of matrix multiplication computation layers included in the neural network.
Optionally, the matrix multiplication calculation includes at least one of convolution, deconvolution, matrix multiplication, and batch matrix multiplication; the non-matrix multiplication computation includes at least one of an activation function or a normalization function.
It should be noted that the description of the neural network training method provided above may be referred to for the neural network training device 900, and the embodiments of the present application are not repeated here.
As shown in fig. 10, an embodiment of the present application also provides an electronic device 1000, the electronic device 1000 comprising a memory 1100 and at least one processor 1200, the memory 1100 being for storing a set of computer instructions; when the processor 1200 executes the set of computer instructions, the neural network training method illustrated in fig. 2, 4, or 8 is performed.
Embodiments of the present application also provide a computer readable storage medium having computer program code stored therein, which when executed by the above-described processor, causes an electronic device to perform the neural network training method shown in fig. 2, 4, or 8.
Embodiments of the present application also provide a computer program product which, when run on a computer, causes the computer to perform the neural network training method shown in fig. 2, 4, or 8.
The foregoing is merely illustrative of specific embodiments of the present application, and the scope of the present application is not limited thereto, but any changes or substitutions within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

1. A neural network training method, wherein the neural network comprises at least one matrix multiplication computation layer, the method comprising:
Converting the data matrix and the weight matrix of the first data format into a data conversion matrix and a weight conversion matrix of a second data format, wherein the total bit width of the second data format is smaller than that of the first data format;
forward calculating the at least one matrix multiplication calculation layer by layer according to the data conversion matrix and the weight conversion matrix to obtain error data;
The at least one matrix multiplication calculation layer is counter-propagated layer by layer according to the error data so as to determine an error gradient matrix corresponding to each matrix multiplication calculation layer in the at least one matrix multiplication calculation layer;
Updating parameters corresponding to the current calculation layer according to the error gradient matrix corresponding to the previous matrix multiplication calculation layer in the at least one matrix multiplication calculation layer;
Wherein the data corresponding to the at least one matrix multiplication computation layer in forward computation and back propagation is the data in the second data format.
2. The method of claim 1, wherein the neural network further comprises a non-matrix-multiply computation layer located after any one of the at least one matrix-multiply computation layer, the computing the at least one matrix-multiply computation layer-by-layer forward from the data conversion matrix and the weight conversion matrix, comprising:
Forward calculating the at least one matrix multiplication calculation layer and the at least one non-matrix multiplication calculation layer by layer according to the data conversion matrix and the weight conversion matrix;
said back-propagating said at least one matrix multiplication computation layer by layer in accordance with said error data, comprising:
The at least one matrix multiplication calculation layer and the at least one non-matrix multiplication calculation layer are reversely propagated layer by layer according to the error data to determine an error gradient matrix corresponding to each non-matrix multiplication calculation layer in the at least one non-matrix multiplication calculation layer, and the error gradient matrix corresponding to the previous non-matrix multiplication calculation layer in the at least one non-matrix multiplication calculation layer is used for updating parameters corresponding to the current calculation layer;
And the data corresponding to the non-matrix multiplication calculation layer in forward calculation and backward propagation are the data in the first data format.
3. The method according to claim 1 or 2, wherein the neural network comprises a first matrix-by-computation layer and a first non-matrix-by-computation layer, the method comprising:
according to the data conversion matrix and the weight conversion matrix, forward calculating the first matrix multiplied by a calculation layer to obtain a first output conversion matrix of a third data format, wherein the total bit width of the third data format is larger than or equal to that of the first data format;
The first non-matrix multiplication calculation layer is calculated according to the first output conversion matrix forward direction, and a second output matrix of the first data format is obtained; the second output matrix is used as an input matrix of a next calculation layer of the first non-matrix multiplication calculation layer when forward calculation is performed;
acquiring a first output error gradient matrix of the third data format, wherein the first output error gradient matrix is used for updating parameters corresponding to the first non-matrix multiplication calculation layer; during back propagation, the first output error gradient matrix is the matrix output by the previous calculation layer of the first non-matrix multiplication calculation layer;
the first non-matrix multiplication calculation layer is reversely propagated according to the first output error gradient matrix to obtain a second output error gradient matrix of the first data format, wherein the second output error gradient matrix is used for updating the weight matrix;
And converting the second output error gradient matrix into a second output error gradient conversion matrix of the second data format, and back-propagating the first matrix multiplied by the calculation layer according to the second output error gradient conversion matrix and the transposed matrix of the weight conversion matrix to obtain a third output error gradient conversion matrix of a third data format, wherein the third output error gradient conversion matrix is used for updating parameters corresponding to a last calculation layer of the first matrix multiplied by the calculation layer when the neural network forward calculation is performed.
4. The method of claim 3, wherein the data matrix, the weight matrix, or the second output error gradient matrix is target data, wherein the data conversion matrix, the weight conversion matrix, or the second output error gradient conversion matrix is target conversion data, wherein converting the target data in the first data format to the target conversion data in the second data format comprises:
Converting the target data into the target conversion data in a mode of far-from-0 carry rounding, wherein the far-from-0 carry rounding refers to carry when the most significant bit of the reject part is 1;
And/or converting the target data into the target conversion data in a random rounding mode, wherein the random rounding means that the reject part is normalized to determine a first numerical value, and the first numerical value is carried when being greater than or equal to a second numerical value, and the second numerical value is a random number which is greater than 0 and less than 1.
5. The method of claim 3 or 4, wherein the first data format or the third data format comprises sign bits, exponent bits, and mantissa bits;
the second data format includes a symbol field, a bit field, a step field, and a mantissa field, the bit field being used to indicate a bit width occupied by the step field in a total bit width of the second data format.
6. The method according to any of claims 3-5, wherein after deriving the second output error gradient transformation matrix of the second data format, the method further comprises:
inputting the second output error gradient conversion matrix and the data conversion matrix into the first matrix multiplication calculation layer to obtain weight gradient conversion data in the third data format;
and updating the weight matrix according to the weight gradient conversion data.
7. The method of any of claims 1-6, wherein the matrix multiplication computation comprises at least one of convolution, deconvolution, matrix multiplication, and bulk matrix multiplication; the non-matrix multiplication computation includes at least one of an activation function or a normalization function.
8. A neural network training device, the device comprising:
the conversion module is used for converting the data matrix and the weight matrix of the first data format into the data conversion matrix and the weight conversion matrix of the second data format, and the total bit width of the second data format is smaller than that of the first data format;
the training module is used for forward calculating the at least one matrix multiplication calculation layer by layer according to the data conversion matrix and the weight conversion matrix to obtain error data;
the training module is further configured to counter-propagate the at least one matrix multiplication computation layer by layer according to the error data, so as to determine an error gradient matrix corresponding to each matrix multiplication computation layer in the at least one matrix multiplication computation layer;
the training module is further configured to update parameters corresponding to a current calculation layer according to an error gradient matrix corresponding to a previous matrix multiplication calculation layer in the at least one matrix multiplication calculation layer;
Wherein the data corresponding to the at least one matrix multiplication computation layer in forward computation and back propagation is the data in the second data format.
9. The apparatus of claim 8, wherein the neural network further comprises a non-matrix-multiply-calculate layer located after any of the at least one matrix-multiply-calculate layers,
The training module is specifically configured to forward calculate the at least one matrix multiplication calculation layer and the at least one non-matrix multiplication calculation layer by layer according to the data conversion matrix and the weight conversion matrix;
The training module is further configured to counter-propagate the at least one matrix multiplication computation layer and the at least one non-matrix multiplication computation layer by layer according to the error data, so as to determine an error gradient matrix corresponding to each of the at least one non-matrix multiplication computation layer, where the error gradient matrix corresponding to a previous non-matrix multiplication computation layer in the at least one non-matrix multiplication computation layer is used to update parameters corresponding to a current computation layer;
And the data corresponding to the non-matrix multiplication calculation layer in forward calculation and backward propagation are the data in the first data format.
10. The apparatus of claim 8 or 9, wherein the neural network comprises a first matrix-by-computation layer and a first non-matrix-by-computation layer,
The training module is further configured to forward calculate the first matrix multiplied by the calculation layer according to the data conversion matrix and the weight conversion matrix to obtain a first output conversion matrix in a third data format, where a total bit width of the third data format is greater than or equal to a total bit width of the first data format;
The training module is further configured to calculate the first non-matrix multiplication calculation layer according to the first output conversion matrix forward direction, so as to obtain a second output matrix in the first data format; the second output matrix is used as an input matrix of a next calculation layer of the first non-matrix multiplication calculation layer when forward calculation is performed;
the training module is further configured to obtain a first output error gradient matrix of the third data format, where the first output error gradient matrix is used to update parameters corresponding to the first non-matrix multiplication computation layer; during back propagation, the first output error gradient matrix is the matrix output by the previous computation layer of the first non-matrix multiplication computation layer;
The training module is further configured to counter-propagate the first non-matrix multiplication computation layer according to the first output error gradient matrix to obtain a second output error gradient matrix in the first data format, where the second output error gradient matrix is used to update the weight matrix;
The conversion module is further configured to convert the second output error gradient matrix into a second output error gradient conversion matrix in the second data format;
The training module is further configured to counter-propagate the first matrix multiplied by the calculation layer according to the second output error gradient conversion matrix and the transpose matrix of the weight conversion matrix, to obtain a third output error gradient conversion matrix in a third data format, where the third output error gradient conversion matrix is used to update a parameter corresponding to a previous calculation layer of the first matrix multiplied by the calculation layer when the neural network forward calculation is performed.
11. The apparatus of claim 10, wherein the data matrix, the weight matrix, or the second output error gradient matrix is target data, the data transformation matrix, the weight transformation matrix, or the second output error gradient transformation matrix is target transformation data, the transformation module further configured to:
Converting the target data into the target conversion data in a mode of far-from-0 carry rounding, wherein the far-from-0 carry rounding refers to carry when the most significant bit of the reject part is 1;
And/or converting the target data into the target conversion data in a random rounding mode, wherein the random rounding means that the reject part is normalized to determine a first numerical value, and the first numerical value is carried when being greater than or equal to a second numerical value, and the second numerical value is a random number which is greater than 0 and less than 1.
12. The apparatus of claim 10 or 11, wherein the first data format or the third data format each comprises a sign bit, an exponent bit, and a mantissa bit;
the second data format includes a symbol field, a bit field, a step field, and a mantissa field, the bit field being used to indicate a bit width occupied by the step field in a total bit width of the second data format.
13. The apparatus according to any one of claims 10-12, wherein after obtaining the second output error gradient conversion matrix of the second data format, the training module is further configured to input the second output error gradient conversion matrix and the data conversion matrix into the first matrix multiplication computation layer to obtain weight gradient conversion data of the third data format; and updating the weight matrix according to the weight gradient conversion data.
14. The apparatus of any of claims 8-13, wherein the matrix multiplication computation comprises at least one of convolution, deconvolution, matrix multiplication, and bulk matrix multiplication; the non-matrix multiplication computation includes at least one of an activation function or a normalization function.
15. An electronic device comprising a memory and at least one processor, the memory being for storing a set of computer instructions; wherein, when the processor executes the set of computer instructions, the neural network training method of any one of claims 1-7 is performed.
16. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when run on a computing device, causes the computing device to perform the neural network training method of any of claims 1-7.



