WO2021082725A1 - Winograd convolution operation method and related products - Google Patents

Winograd convolution operation method and related products

Info

Publication number
WO2021082725A1
WO2021082725A1 (PCT/CN2020/113168, CN2020113168W)
Authority
WO
WIPO (PCT)
Prior art keywords: transformation, result, sub, data, layer
Prior art date
Application number
PCT/CN2020/113168
Other languages
English (en)
French (fr)
Inventor
张英男
曾洪博
张尧
刘少礼
黄迪
周诗怡
张曦珊
刘畅
郭家明
高钰峰
Original Assignee
中科寒武纪科技股份有限公司
Priority date
Filing date
Publication date
Application filed by 中科寒武纪科技股份有限公司
Publication of WO2021082725A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology

Definitions

  • This application relates to the field of deep learning technology, and in particular to a Winograd convolution operation method and related products.
  • The neural network model is a computational model in deep learning technology that uses a multi-layer architecture to process input data and output corresponding operation results.
  • Training a neural network model is a necessary step before the model can be used for computation. During training, the neural network repeatedly performs iterative convolution operations on massive training data to obtain the trained neural network model.
  • However, the convolution operation involves a large number of matrix multiplications, which occupy substantial computing resources, so computing efficiency is low; the low computing efficiency in turn makes training the neural network model take a long time, so training efficiency is low as well.
  • To this end, this application provides a Winograd convolution operation method, including:
  • in the process of training the neural network based on a pre-configured Winograd convolution algorithm, disassembling the forward transformation operations of the reverse input gradient of the j-th layer in the neural network and of the forward input feature data of the j-th layer in the neural network into summation operations, to obtain, based on the summation operations, the transformation result of the forward transformation operation of the reverse input gradient of the j-th layer and the transformation result of the forward transformation operation of the forward input feature data of the j-th layer;
  • performing bitwise multiplication on the two transformation results to obtain a first multiplication operation result, disassembling the inverse transformation operation of the first multiplication operation result into a summation operation, and using the result of the summation operation as the weight difference of the j-th layer;
  • completing the training of the neural network according to the weight difference of the j-th layer.
  • This application further provides a Winograd convolution operation device, including:
  • the data receiving module is used to obtain the forward input feature data for training the neural network;
  • the transformation module is used to, during the training of the neural network by the training module based on the pre-configured Winograd convolution algorithm, disassemble the forward transformation operations of the reverse input gradient of the j-th layer and of the forward input feature data of the j-th layer in the neural network into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the reverse input gradient of the j-th layer and the transformation result of the forward transformation operation of the forward input feature data of the j-th layer;
  • the bitwise multiplication module is used to perform bitwise multiplication on the transformation result of the forward transformation operation of the reverse input gradient of the j-th layer and the transformation result of the forward transformation operation of the forward input feature data of the j-th layer to obtain the first multiplication operation result;
  • the transformation module is also used to disassemble the inverse transformation operation of the first multiplication operation result into a summation operation, and use the result of the summation operation as the weight difference of the jth layer;
  • the weight update module is used to complete the training of the neural network according to the weight difference of the jth layer.
  • the present application provides an artificial intelligence chip, which includes the Winograd convolution operation device as described in any one of the preceding items.
  • the present application provides an electronic device including the artificial intelligence chip as described above.
  • the present application provides a board card, the board card includes: a storage device, an interface device, a control device, and the aforementioned artificial intelligence chip;
  • the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively;
  • the storage device is used to store data
  • the interface device is used to implement data transmission between the artificial intelligence chip and external equipment
  • the control device is used to monitor the state of the artificial intelligence chip.
  • The Winograd convolution operation method and related products provided in this application use the Winograd algorithm to train the weight data in the neural network with the feature data after receiving a training instruction and obtaining the feature data, thereby obtaining a trained neural network. Compared with the prior art, this application takes advantage of the Winograd algorithm's ability to convert a large number of matrix multiplication operations into matrix addition operations, which effectively improves the computational efficiency of processing neural network training data and reduces the computing resources occupied by the training process.
  • Figure 1 is a schematic diagram of a neural network architecture in the prior art
  • FIG. 2 shows a schematic diagram of the computing system on which the Winograd convolution operation method according to an embodiment of the present disclosure is based;
  • FIG. 3 is a schematic flowchart of a neural network training method provided by this application.
  • FIG. 4 is a schematic flowchart of a Winograd convolution operation method provided by this application.
  • FIG. 5 is a schematic flowchart of another Winograd convolution operation method provided by this application.
  • FIG. 6 is a schematic flowchart of another Winograd convolution operation method provided by this application.
  • FIG. 7 is a schematic structural diagram of a Winograd convolution operation device provided by this application.
  • Fig. 8 shows a structural block diagram of a board according to an embodiment of the present disclosure.
  • As used herein, the term "if" can be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context.
  • Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" can be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
  • Convolution operation refers to opening an active window of the same size as the template (the convolution kernel) starting from the upper-left corner of the image; the active window corresponds to a window image.
  • The window image and the corresponding pixels of the convolution kernel are multiplied and then added, and the calculation result is used as the first pixel value of the new image produced by the convolution operation.
  • The active window then moves one column to the right, the window image corresponding to the active window and the corresponding pixels are again multiplied and then added, and the calculation result is used as the second pixel value of the new image produced by the convolution operation, and so on.
  • Winograd convolution operation is a convolution acceleration implementation based on a polynomial interpolation algorithm. It applies the Winograd convolution forward transformation to the two inputs of the convolution operation, the first target matrix and the second target matrix, then performs bitwise multiplication on the forward-transformed first and second target matrices, and finally applies the Winograd convolution inverse transformation to the result of the bitwise multiplication, obtaining a convolution result equivalent to that of the original convolution operation.
  • Convolutional neural network model is a type of feedforward neural network model that involves convolution computation and has a deep structure, and is one of the representative models of deep learning.
  • In the convolutional layers, fully connected layers, and other network layers of a convolutional neural network model, convolution operations need to be performed on neurons and convolution kernels to obtain feature data; such models are widely used in image classification and image recognition.
  • Figure 1 is a schematic diagram of a neural network architecture in the prior art.
  • As shown in Figure 1, the neural network architecture includes convolutional layers, pooling layers, fully connected layers, and a classifier output layer, and the layers of the neural network are processed sequentially.
  • The convolutional layer is used for feature extraction on the feature data output by the previous layer (the first convolutional layer performs feature extraction on the feature data of the original input).
  • The pooling layer computes over the features output by the previous layer, using a manually set pooling window size and stride, so as to reduce the dimensionality of the features and aggregate them.
  • After the first convolutional layer of the neural network receives the original input data (such as an image to be processed), the neural network starts the convolutional neural network processing of that input, and each convolutional layer performs its convolution processing; except for the last convolutional layer, each convolutional layer outputs its processing result to the next convolutional layer after completing its processing.
  • The next convolutional layer can use this processing result as its own input data and continue the subsequent processing of the convolutional neural network.
  • Whether in the inference process of the neural network or in its training process, convolution operations must be performed on the input feature data and the weight data in the neural network, and the convolution operation is generally performed many times.
  • As the amount of feature data increases, the complexity of the operations increases accordingly, the computing device performing the neural network operation enters an overloaded computing state, and the computing time and computing resources increase greatly.
  • Taking feature data of size 4x4 and a convolution kernel of size 3x3 as an example, the convolution kernel can slide over the feature data, so that the feature data is split into four pieces of sub-data of size 3x3, and 2x2 output data can be obtained through the convolution operation of the convolution kernel with each piece of sub-data.
  • Specifically, the convolution kernel and the sub-data are multiplied element by element, and the nine products are then added to obtain one value of the output data.
  • In total, 36 multiplications and 32 additions are required to obtain the output result.
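  • As a concrete illustration (a minimal sketch, not part of the original disclosure; all names are chosen for the example), the following Python/NumPy code performs this direct sliding-window convolution and counts the scalar multiplications:

```python
import numpy as np

def direct_conv2d(d, g):
    """Direct sliding-window convolution (stride 1, no padding),
    counting scalar multiplications for illustration."""
    n, r = d.shape[0], g.shape[0]
    m = n - r + 1                      # output size: 4 - 3 + 1 = 2
    out = np.zeros((m, m))
    mults = 0
    for i in range(m):
        for j in range(m):
            # multiply the 3x3 sub-data by the kernel element-wise, then sum
            out[i, j] = np.sum(d[i:i+r, j:j+r] * g)
            mults += r * r             # 9 multiplications per output value
    return out, mults

d = np.arange(16, dtype=float).reshape(4, 4)  # example 4x4 feature data
g = np.arange(9, dtype=float).reshape(3, 3)   # example 3x3 convolution kernel
out, mults = direct_conv2d(d, g)
print(out.shape, mults)  # (2, 2) 36, matching the count above
```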
  • During the training of the neural network, the neural network is first processed based on forward propagation training: the feature input is fed into the neural network and convolved with the weight data to obtain the output result of the forward propagation.
  • Subsequently, the neural network is trained based on back propagation: the gradient data is convolved with the feature data to obtain the weight difference used to adjust the weight data.
  • Following the above example, the forward propagation training process requires 36 multiplications and 32 additions to obtain the output result of the forward propagation, and the back propagation training process likewise requires at least 36 multiplications and 32 additions to obtain the output result of the back propagation.
  • The multiplication operations consume a large share of the computing system's resources, which directly causes low computational efficiency and reduced training efficiency.
  • Therefore, this application applies a pre-configured Winograd convolution algorithm to the convolution operations of the neural network training process, especially to the back propagation stage of the training.
  • the pre-configured Winograd convolution algorithm used in this application is an operation method that can convert a convolution operation into a large number of matrix addition operations and a small number of matrix multiplication operations.
  • When an arithmetic unit processes data, the time and resources required for matrix multiplication operations are far greater than those required for matrix addition operations.
  • Therefore, the Winograd convolution algorithm is used to reduce the number of matrix multiplication operations, which consume large computational resources, and relatively increase the number of matrix addition operations, which consume less, thereby reducing the computational resource consumption and computation time of the back propagation processing in the entire neural network training process and improving computational efficiency.
  • FIG. 2 shows a schematic diagram of a computing system on which the Winograd convolution operation method according to an embodiment of the present disclosure is based.
  • the Winograd convolution operation method provided in this application can be applied to the computing system shown in FIG. 2, as shown in FIG. 2,
  • the computing system 100 includes multiple processors 101 and a memory 102.
  • the multiple processors 101 are used to execute instruction sequences, and the memory 102 is used to store data, and may include a random access memory (RAM, Random Access Memory) and a register file.
  • The multiple processors 101 in the computing system 100 can share part of the storage space (for example, part of the RAM storage space and the register file) and can also have their own storage spaces at the same time.
  • The computing system is used to execute the various steps of the operation method of this application.
  • the Winograd convolution operation method can be applied to any processor of an operation system (for example, an artificial intelligence chip) including multiple processors (multi-core).
  • the processor may be a general-purpose processor, such as a CPU (Central Processing Unit, central processing unit), or an artificial intelligence processor (IPU) for performing artificial intelligence operations.
  • Artificial intelligence operations can include machine learning operations, brain-like operations, and so on. Among them, machine learning operations include neural network operations, k-means operations, support vector machine operations, and so on.
  • The artificial intelligence processor may include, for example, one or a combination of a GPU (Graphics Processing Unit), an NPU (Neural-network Processing Unit), a DSP (Digital Signal Processing unit), and an FPGA (Field-Programmable Gate Array) chip.
  • The processor mentioned in the present disclosure may include multiple processing units, and each processing unit can independently run the various tasks assigned to it, such as convolution tasks, pooling tasks, or fully connected tasks.
  • the present disclosure does not limit the processing unit and the tasks run by the processing unit.
  • For different application scenarios, the network layers in the neural network are adjusted accordingly, and the weight data and the feature data change accordingly.
  • The Winograd convolution operation method, Winograd convolution operation device, and operation chip described in this application can be used for training neural networks in different scenarios, such as the training of face recognition models, color classifiers, audio and video data conversion models, image boundary division models, and so on.
  • Correspondingly, the trained neural network can be used to implement the inference process in the corresponding scenarios, such as face recognition, color classification, audio and video data conversion based on specific needs, image boundary division, and so on.
  • FIG. 3 is a schematic flowchart of a neural network training method provided by this application. As shown in FIG. 3, the method includes:
  • Step 101: Receive a training instruction and obtain feature data;
  • Step 102: Use the feature data and the weight data of the neural network to train the neural network with the Winograd algorithm, so as to obtain a trained neural network.
  • The execution body of the training method can be a Winograd convolution operation device, which can interact with chips (including computing chips, neural network chips, and the like) to receive training instructions initiated by a user's electronic equipment, a combined processing device, or the like.
  • The training instruction is used to instruct the operation device to start performing the operations for training the neural network.
  • the Winograd convolution operation method provided in this application can be applied in step 102 to improve the training efficiency of the neural network training process.
  • the Winograd algorithm is an operation method that can convert the convolution operation into a large number of matrix addition operations and a small number of matrix multiplication operations.
  • the operation formula can be expressed in the form of formula (1): S = A^T [(G g G^T) ⊙ (B^T d B)] A (1), where ⊙ denotes bitwise (element-wise) multiplication, and where:
  • S is used to represent the result matrix of the convolution, that is, the result matrix obtained by using the feature data and the weight data to perform the convolution operation;
  • d is used to represent the input feature data;
  • g is used to represent the weight data in the neural network;
  • B is used to represent the feature transformation matrix that transforms the feature data from the original domain to the Winograd domain;
  • B T is used to represent the feature inverse transformation matrix that transforms the feature data from the Winograd domain to the original domain;
  • G is used to represent the weight transformation matrix that transforms the weight data from the original domain to the Winograd domain;
  • G^T is used to represent the inverse transformation matrix that transforms the weight data from the Winograd domain back to the original domain;
  • A and A^T are used to represent the transformation matrices of the inverse transformation operation that converts the result of the bitwise multiplication from the Winograd domain to the original domain.
  • the above-mentioned original domain refers to a domain that has not been transformed by Winograd
  • the Winograd domain refers to a domain that has been transformed by Winograd.
  • The above-mentioned feature data d and weight data g can be matrices of fixed sizes; for example, the size of the feature data d may be 4*4 and the size of the weight data g may be 3*3.
  • The selection of these sizes is related to the size of the result matrix, and this application does not limit it.
  • A, B, G, A^T, B^T, G^T are all constant transformation matrices, and the size and values of each constant matrix are related to the sizes of the aforementioned result matrix S and weight data g.
  • For different sizes, the element values will differ, but for each type of constant matrix at a given size, the element values are fixed.
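  • For illustration only, the following sketch instantiates formula (1) with the widely used F(2x2, 3x3) transform constants; these particular matrices are an assumption of the example (the application fixes them only by the data scales), and the result is checked against direct convolution:

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd constants (one common choice, assumed here).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)   # 4x4, transforms d
G = np.array([[1,    0,   0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0,    0,   1]], dtype=float)      # 4x3, transforms g
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)    # 2x4, inverse transform

def winograd_conv(d, g):
    # Formula (1): S = A^T [(G g G^T) ⊙ (B^T d B)] A
    U = G @ g @ G.T        # forward transformation of the weight data
    V = B_T @ d @ B_T.T    # forward transformation of the feature data
    return A_T @ (U * V) @ A_T.T   # bitwise multiply, then inverse transform

d = np.random.rand(4, 4)
g = np.random.rand(3, 3)
ref = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                for i in range(2)])              # direct convolution
print(np.allclose(winograd_conv(d, g), ref))     # True
```

  • Note that only the 16 element-wise products in U * V are genuine multiplications (versus 36 in the direct method); these transform matrices contain only 0, 1, -1, and 0.5, so the transformations themselves can be reduced to additions and shifts, which is precisely the saving described above.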
  • the Winograd convolution operation device will obtain feature data.
  • The feature data can be carried in the training instruction and sent to the operation device along with the training instruction; alternatively, after receiving the training instruction, the operation device can respond to it and read the feature data from the data storage location corresponding to a pre-stored data storage address, or to a data storage address carried in the received training instruction. It can be seen that the feature data used for training has a relatively large amount of data.
  • the computing device can read the required feature data from the cloud server according to the training mode and storage address indicated by the training instruction, and perform processing and computing.
  • The Winograd convolution operation method involved in this application can be applied in the aforementioned step 102, especially in the back propagation training process of training the neural network.
  • Specifically, after acquiring the feature data, the Winograd convolution operation device first uses the feature data and the weight data of each layer to perform forward propagation processing on the neural network to obtain the forward output feature data of the n-th layer.
  • Then the Winograd convolution operation device uses the following steps 201 to 204 to perform back propagation processing on the neural network using the input gradient of each layer, the forward input feature data of each layer, and the weight data of each layer, so as to update the weight data of each layer.
  • FIG. 4 is a schematic flowchart of a Winograd convolution operation method provided by this application. As shown in FIG. 4, the method includes:
  • Step 201: Disassemble the forward transformation operations of the reverse input gradient of the j-th layer of the neural network and of the forward input feature data of the j-th layer into summation operations, to obtain, based on the summation operations, the transformation result of the forward transformation operation of the reverse input gradient of the j-th layer and the transformation result of the forward transformation operation of the forward input feature data of the j-th layer;
  • Step 202: Perform a bitwise multiplication operation on the transformation result of the forward transformation operation of the reverse input gradient of the j-th layer and the transformation result of the forward transformation operation of the forward input feature data of the j-th layer to obtain the first multiplication operation result;
  • Step 203: Disassemble the inverse transformation operation of the first multiplication operation result into a summation operation, and use the result of the summation operation as the weight difference of the j-th layer;
  • Step 204: Complete the training of the neural network according to the weight difference of the j-th layer.
  • For the j-th layer of the n-layer neural network (1 ≤ j ≤ n), when the j-th layer undergoes back propagation training, the reverse input gradient of the j-th layer and the forward input feature data of the j-th layer are input, the weight difference of the j-th layer is obtained by a convolution operation on the reverse input gradient and the forward input feature data, and the update of the weight data of the j-th layer is then completed according to the weight difference of the j-th layer. This process is repeated until the weight data of each layer in the n-layer neural network has been updated, completing the training of the neural network.
  • In a possible implementation, the Winograd algorithm can be used to transform the convolution operation of the reverse input gradient and the forward input feature data during back propagation training.
  • This can be expressed in the form of formula (2): Δw_j = G^T [(A top_diff_j A^T) ⊙ (B^T top_data_j B)] G (2). In formula (2):
  • Δw_j is used to represent the weight difference, that is, the result matrix obtained by the convolution operation of the reverse input gradient and the forward input feature data;
  • top_diff_j is used to represent the reverse input gradient of the j-th layer, which is the same as the reverse output gradient of the (j+1)-th layer;
  • top_data_j is used to represent the forward input feature data of the j-th layer in the neural network, which is the same as the forward output feature data of the (j-1)-th layer;
  • B is used to represent the transformation matrix that converts the forward input feature data from the original domain to the Winograd domain;
  • B^T is used to represent the corresponding inverse transformation matrix that converts the forward input feature data from the Winograd domain back to the original domain;
  • A is used to represent the transformation matrix that converts the reverse input gradient from the original domain to the Winograd domain;
  • A^T is used to represent the corresponding inverse transformation matrix that converts the reverse input gradient from the Winograd domain back to the original domain;
  • G and G^T are used to represent the transformation matrices of the inverse transformation operation that converts the result of the bitwise multiplication from the Winograd domain to the original domain.
  • a winograd domain refers to a domain that has been transformed by winograd.
  • Based on the method provided in this embodiment, the transformation matrices A, A^T, B, B^T for performing the forward transformation operations on the reverse input gradient and the forward input feature data, and the transformation matrices G, G^T for performing the inverse transformation operation on the result of the bitwise multiplication, can be obtained.
  • A, A^T, B, B^T, G, and G^T are all fixed matrices.
  • The sizes of top_diff_j and top_data_j can be determined according to the size of the required output result Δw_j and the sliding stride of the convolution process, and the corresponding A, A^T, B, B^T, G, G^T can then be determined according to the sizes of these data.
  • In a possible implementation, the Winograd convolution operation device can use the transformation matrices B and B^T to perform the forward transformation operation on the forward input feature data top_data_j to obtain the operation result B^T top_data_j B, and use the transformation matrices A and A^T to perform the forward transformation operation on the reverse input gradient top_diff_j to obtain the operation result A top_diff_j A^T.
  • The Winograd convolution operation device then performs bitwise multiplication on the transformation result A top_diff_j A^T of the forward transformation operation of the reverse input gradient of the j-th layer and the transformation result B^T top_data_j B of the forward transformation operation of the forward input feature data of the j-th layer, obtaining the first multiplication operation result (A top_diff_j A^T) ⊙ (B^T top_data_j B).
  • Finally, the Winograd convolution operation device uses the transformation matrices G and G^T to perform the inverse transformation operation on the first multiplication operation result, obtaining the final weight difference G^T [(A top_diff_j A^T) ⊙ (B^T top_data_j B)] G.
  • The weight difference is used to adjust the weight data of the j-th layer.
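  • The following sketch is again illustrative: the constants shown are one possible choice consistent with a 2*2 reverse input gradient, 4*4 forward input feature data, and a 3*3 weight difference (they are an assumption of the example, not listed in the application). It instantiates formula (2) and checks the result against the directly computed weight gradient:

```python
import numpy as np

# Hypothetical constants for formula (2). In the patent's notation,
# A/A^T transform the 2x2 reverse input gradient, B^T/B transform the
# 4x4 forward input feature data, and G^T/G invert the bitwise product.
A = np.array([[1,    0],
              [0.5,  0.5],
              [0.5, -0.5],
              [0,    1]], dtype=float)           # 4x2
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0, -1,  0,  1]], dtype=float)   # 4x4
G_T = np.array([[1, 1,  1, 0],
                [0, 1, -1, 0],
                [0, 1,  1, 1]], dtype=float)     # 3x4

top_data = np.random.rand(4, 4)   # forward input feature data of layer j
top_diff = np.random.rand(2, 2)   # reverse input gradient of layer j

# Formula (2): dw = G^T [(A top_diff A^T) ⊙ (B^T top_data B)] G
dw = G_T @ ((A @ top_diff @ A.T) * (B_T @ top_data @ B_T.T)) @ G_T.T

# Reference: the weight gradient is the correlation of the input with the
# output gradient: dw[u, v] = sum_{k,l} top_data[u+k, v+l] * top_diff[k, l]
ref = np.array([[np.sum(top_data[u:u+2, v:v+2] * top_diff) for v in range(3)]
                for u in range(3)])
print(np.allclose(dw, ref))  # True
```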
  • Based on the method provided in this embodiment, the transformation operations can be disassembled into summation operations, and the results of the forward transformation operation and the inverse transformation operation are determined according to the summation operations.
  • Compared with traditional convolution operations, which involve many multiplication operations, using the Winograd algorithm for convolution processing effectively reduces the number of multiplication operations in the transformation operations and relatively increases the number of addition operations, thereby improving operation efficiency and reducing the performance loss caused by operation bandwidth.
  • It should be noted that the reverse input gradient of the j-th layer is equivalent to the reverse output gradient of the (j+1)-th layer, and the reverse output gradient of the j-th layer is equivalent to the reverse input gradient of the (j-1)-th layer.
  • The reverse output gradient of the j-th layer can be obtained by a convolution operation on the reverse input gradient of the j-th layer and the weight data of the j-th layer.
  • the convolution operation can use the aforementioned pre-configured Winograd convolution algorithm.
  • Figure 5 is a schematic flowchart of another Winograd convolution operation method provided by this application. The method shown in Figure 5 can be applied to obtain the reverse input gradient of any layer in Figure 4. The method includes:
  • Step 301: In the process of training the neural network based on the pre-configured Winograd convolution algorithm, disassemble the forward transformation operations of the reverse input gradient of the j-th layer and of the weight data of the j-th layer into summation operations, to obtain, based on the summation operations, the transformation result of the forward transformation operation of the reverse input gradient of the j-th layer and the transformation result of the forward transformation operation of the weight data of the j-th layer;
  • Step 302: Perform a bitwise multiplication operation on the transformation result of the forward transformation operation of the reverse input gradient of the j-th layer and the transformation result of the forward transformation operation of the weight data of the j-th layer to obtain the second multiplication operation result;
  • Step 303: Disassemble the inverse transformation operation of the second multiplication operation result into a summation operation, and use the result of the summation operation as the reverse output gradient of the j-th layer.
  • In a possible implementation, the Winograd algorithm can be used to transform the convolution operation of the weight data and the reverse input gradient during back propagation training to obtain the reverse output gradient.
  • This can be expressed in the form of formula (3): bottom_diff_j = A^T [(G g_j G^T) ⊙ (B^T top_diff_j B)] A (3). In formula (3):
  • bottom_diff_j is used to represent the reverse output gradient of the j-th layer, that is, the result matrix obtained by the convolution operation on the reverse input gradient of the j-th layer and the weight data of the j-th layer; the reverse output gradient also serves as the reverse input gradient of the (j-1)-th layer;
  • top_diff_j is used to represent the reverse input gradient of the j-th layer, which is the same as the reverse output gradient of the (j+1)-th layer;
  • g_j is used to represent the weight data of the j-th layer in the neural network;
  • B is used to represent the transformation matrix that converts the reverse input gradient from the original domain to the Winograd domain;
  • B^T is used to represent the corresponding inverse transformation matrix that converts the reverse input gradient from the Winograd domain back to the original domain;
  • G is used to represent the transformation matrix that converts the weight data from the original domain to the Winograd domain;
  • G^T is used to represent the corresponding inverse transformation matrix that converts the weight data from the Winograd domain back to the original domain;
  • A and A^T are used to represent the transformation matrices of the inverse transformation operation that converts the result of the bitwise multiplication from the Winograd domain to the original domain.
  • a winograd domain refers to a domain that has been transformed by winograd.
  • Based on the method provided in this embodiment, the transformation matrices B, B^T, G, G^T for performing the forward transformation operations on the reverse input gradient and the weight data, and the transformation matrices A, A^T for performing the inverse transformation operation on the result of the bitwise multiplication, can be obtained.
  • A, A^T, B, B^T, G, and G^T are all fixed matrices.
  • The sizes of top_diff_j and g_j can be determined according to the size of the required output result bottom_diff_j and the sliding stride of the convolution process, and the corresponding A, A^T, B, B^T, G, G^T can then be determined according to the sizes of these data.
  • In a possible implementation, the Winograd convolution operation device can use the transformation matrices B and B^T to perform the forward transformation operation on the reverse input gradient top_diff_j to obtain the operation result B^T top_diff_j B, and use the transformation matrices G and G^T to perform the forward transformation operation on the weight data g_j to obtain the operation result G g_j G^T.
  • Subsequently, the Winograd convolution operation device performs bitwise multiplication on the transformation result B^T top_diff_j B of the forward transformation operation of the reverse input gradient of the j-th layer and the transformation result G g_j G^T of the forward transformation operation of the weight data of the j-th layer, obtaining the second multiplication operation result (G g_j G^T) ⊙ (B^T top_diff_j B).
  • Finally, the Winograd convolution operation device uses the transformation matrices A and A^T to perform the inverse transformation operation on the second multiplication operation result, obtaining the final reverse output gradient A^T [(G g_j G^T) ⊙ (B^T top_diff_j B)] A.
  • The reverse output gradient serves as the reverse input gradient of the (j-1)-th layer, so that the (j-1)-th layer can use this reverse input gradient to compute the reverse output gradient of the (j-1)-th layer.
  • Correspondingly, the reverse input gradient of the j-th layer is obtained by the (j+1)-th layer, which performs arithmetic processing on the reverse input gradient and the weight data of the (j+1)-th layer and outputs the result to the j-th layer.
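  • Formula (3) can be illustrated the same way. The constants below were derived for a 2*2 reverse input gradient, 3*3 weight data, and a 4*4 reverse output gradient via Toom-Cook interpolation; they are an assumption of the example, not taken from the application. The result is checked against the directly computed data gradient (a full, zero-padded convolution of the output gradient with the weights):

```python
import numpy as np

# Hypothetical constants for formula (3). In the patent's notation,
# B^T/B transform top_diff_j, G/G^T transform g_j, and A^T/A invert
# the bitwise product.
B_T = np.array([[1,  0],
                [1,  1],
                [1, -1],
                [0,  1]], dtype=float)              # 4x2
G = np.array([[1,  0, 0],
              [1,  1, 1],
              [1, -1, 1],
              [0,  0, 1]], dtype=float)             # 4x3
A_T = np.array([[ 1, 0.0,  0.0,  0],
                [ 0, 0.5, -0.5, -1],
                [-1, 0.5,  0.5,  0],
                [ 0, 0.0,  0.0,  1]], dtype=float)  # 4x4

top_diff = np.random.rand(2, 2)   # reverse input gradient of layer j
g = np.random.rand(3, 3)          # weight data of layer j

# Formula (3): bottom_diff = A^T [(G g G^T) ⊙ (B^T top_diff B)] A
bottom_diff = A_T @ ((G @ g @ G.T) * (B_T @ top_diff @ B_T.T)) @ A_T.T

# Reference: full convolution; each output-gradient element scatters
# a copy of the 3x3 weights into the 4x4 data gradient.
ref = np.zeros((4, 4))
for k in range(2):
    for l in range(2):
        ref[k:k+3, l:l+3] += top_diff[k, l] * g
print(np.allclose(bottom_diff, ref))  # True
```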
  • In a possible implementation, the training of the neural network also includes forward propagation training, whose convolution operation can likewise be disassembled into summation operations. Figure 6 is a schematic flowchart of another Winograd convolution operation method provided by this application. The method shown in Figure 6 can be used in combination with Figure 4 or Figure 5 to act jointly on the training process of the neural network, and it can also be used independently on the training process of the neural network.
  • The method includes:
  • Step 401: Based on the pre-configured Winograd algorithm, disassemble the forward transformation operations of the forward input feature data of the i-th layer and of the weight data of the i-th layer into summation operations, to obtain the transformation result of the forward transformation operation of the forward input feature data of the i-th layer and the transformation result of the forward transformation operation of the weight data of the i-th layer;
  • Step 402: Perform a bitwise multiplication operation on the transformation result of the forward transformation operation of the forward input feature data of the i-th layer and the transformation result of the forward transformation operation of the weight data of the i-th layer to obtain the third multiplication operation result;
  • Step 403: Disassemble the inverse transformation operation of the third multiplication operation result into a summation operation, and use the result of the summation operation as the forward output feature data of the i-th layer.
  • In a possible implementation, the Winograd algorithm can be used to transform the convolution operation of the weight data and the forward input feature data during forward propagation training to obtain the forward output feature data, which can be expressed as: bottom_data_i = A^T [(G g_i G^T) ⊙ (B^T top_data_i B)] A. In this formula:
  • bottom_data_i is used to represent the forward output feature data of the i-th layer (1 ≤ i ≤ m), that is, the result matrix obtained by the convolution operation on the forward input feature data of the i-th layer and the weight data of the i-th layer; the forward output feature data of the i-th layer also serves as the forward input feature data of the (i+1)-th layer;
  • top_data_i is used to represent the forward input feature data of the i-th layer, which is the same as the forward output feature data of the (i-1)-th layer;
  • g_i is used to represent the weight data of the i-th layer in the neural network;
  • B is used to represent the transformation matrix that converts the forward input feature data from the original domain to the Winograd domain;
  • B^T is used to represent the corresponding inverse transformation matrix that converts the forward input feature data from the Winograd domain back to the original domain;
  • G is used to represent the transformation matrix that converts the weight data from the original domain to the Winograd domain, and G^T the corresponding inverse transformation matrix;
  • A and A^T are used to represent the transformation matrices of the inverse transformation operation that converts the result of the bitwise multiplication from the Winograd domain to the original domain.
  • a winograd domain refers to a domain that has been transformed by winograd.
  • Based on the method provided in this embodiment, the transformation matrices B, B^T, G, G^T for performing the forward transformation operations on the forward input feature data and the weight data, and the transformation matrices A, A^T for performing the inverse transformation operation on the result of the bitwise multiplication, can be obtained.
  • A, A^T, B, B^T, G, and G^T are all fixed matrices.
  • The sizes of top_data_i and g_i can be determined according to the size of the required output result bottom_data_i and the sliding stride of the convolution process, and the corresponding A, A^T, B, B^T, G, G^T can then be determined according to the sizes of these data.
  • In a possible implementation, the Winograd convolution operation device can use the transformation matrices B and B^T to perform the forward transformation operation on the forward input feature data top_data_i to obtain the operation result B^T top_data_i B, and use the transformation matrices G and G^T to perform the forward transformation operation on the weight data g_i to obtain the operation result G g_i G^T.
  • The Winograd convolution operation device then performs bitwise multiplication on the transformation result B^T top_data_i B of the forward transformation operation of the forward input feature data of the i-th layer and the transformation result G g_i G^T of the forward transformation operation of the weight data of the i-th layer, obtaining the third multiplication operation result (G g_i G^T) ⊙ (B^T top_data_i B). Finally, the Winograd convolution operation device uses the transformation matrices A and A^T to perform the inverse transformation operation on the third multiplication operation result, obtaining the final forward output feature data A^T [(G g_i G^T) ⊙ (B^T top_data_i B)] A.
  • The forward output feature data serves as the forward input feature data of the (i+1)-th layer, so that the (i+1)-th layer can use this forward input feature data to compute the forward output feature data of the (i+1)-th layer.
  • Correspondingly, the forward input feature data of the i-th layer is obtained by the (i-1)-th layer, which performs arithmetic processing on its own forward input feature data and weight data and outputs the result to the i-th layer.
  • Based on the method provided in this embodiment, the transformation operations can be disassembled into summation operations, and the results of the forward transformation operation and the inverse transformation operation are determined according to the summation operations.
  • Compared with traditional convolution operations, which involve many multiplication operations, using the Winograd algorithm for convolution processing effectively reduces the number of multiplication operations in the transformation operations and relatively increases the number of addition operations, thereby improving operation efficiency and reducing the performance loss caused by operation bandwidth.
  • In a possible implementation, the processing method of disassembling the forward transformation operation or the inverse transformation operation into a summation operation is as follows: decompose the target data into multiple sub-tensors corresponding to the target data, transform the multiple sub-tensors and sum the results, and obtain the transformation result corresponding to the target data according to the result of the summation operation.
  • The target data includes one of the following: the reverse input gradient, the forward input feature data, the weight data, the first multiplication operation result, the second multiplication operation result, and the third multiplication operation result.
  • That is, the aforementioned forward transformation operations on the reverse input gradient, the forward input feature data, and the weight data can each be disassembled into multiple sub-transformation results, and the summation of the sub-transformation results is determined as the corresponding transformation result.
  • Likewise, the inverse transformation operations on the first multiplication operation result, the second multiplication operation result, and the third multiplication operation result can each be disassembled into multiple sub-transformation results, and the summation of the sub-transformation results is determined as the corresponding transformation result.
  • Taking the computation of B^T top_data_j B as an example, a replacement matrix corresponding to each element in top_data_j can be preset; for example, d_00 corresponds to a matrix D_00, d_01 corresponds to D_01, ..., and d_33 corresponds to D_33.
  • The replacement matrix can be a matrix whose elements are 0, 1, and -1.
  • Replacing the matrix-by-matrix multiplication with multiplications of single elements by replacement matrices reduces the number of multiplications; especially when the replacement matrix consists only of 0, 1, and -1, the amount of calculation can be greatly reduced.
  • For example, the feature data is a 4x4 matrix containing 16 elements d_00, d_01, ..., d_33; in this case there can be 16 replacement matrices D_00, D_01, ..., D_33 corresponding to these elements.
  • Taking the target data being the forward input feature data top_data_j as an example, top_data_j can be split into 16 sub-tensors (assuming that all elements of the feature data are non-zero), where each sub-tensor keeps one element d_ij of top_data_j at its original position and sets all other positions to zero.
  • A transformation matrix can be used to transform each sub-tensor, and the transformation results of the sub-tensors can then be added to obtain the feature transformation result; the result of transforming the sub-tensors and then adding the transformation results is the same as the transformation result of the target data itself.
  • Specifically, each sub-tensor can be expressed as its non-zero element multiplied by a meta-sub-tensor, where the meta-sub-tensor is the tensor obtained by setting the non-zero element of the sub-tensor to 1; the transformation results of the sub-tensors are then summed to obtain the feature transformation result.
  • During the operation, the non-zero element in each sub-tensor can be identified and the corresponding position set to 1 to obtain the meta-sub-tensor; the transformation result of the sub-tensor can then be determined from the transformation matrix, the meta-sub-tensor, and the corresponding non-zero element.
  • B^T and B can be determined according to the size of the forward input feature data, and the meta-sub-tensors can likewise be determined in advance according to the forward input feature data. Therefore, the replacement matrix corresponding to each element position in the feature data can be determined in advance from B^T, B, and the meta-sub-tensors: the replacement matrix for the element at position (i, j) is B^T D_ij B, where D_ij is the corresponding meta-sub-tensor.
  • In this way, a corresponding replacement matrix can be determined for each element position in the forward input feature data; for data of the same size, the corresponding replacement matrix set can be determined directly according to the data size, and the transformation result can then be determined according to the replacement matrix set.
  • In this way, the multiplication of an element by the replacement matrix reduces to directly writing data: for example, a 1 in the replacement matrix multiplied by d_00 means that d_00 is written directly into the corresponding position of the result. Therefore, based on the method provided in this embodiment, the transformation process in the Winograd algorithm can be converted into addition operations, thereby further reducing the computational complexity of the convolution process.
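  • A minimal sketch of this disassembly (reusing the example B^T from the earlier F(2x2, 3x3) illustration as an assumed choice of constants) shows that the forward transformation equals a weighted sum of fixed replacement matrices, so that only additions of (signed) elements remain:

```python
import numpy as np

B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)

def replacement_matrices(B_T):
    """Precompute R_ij = B^T D_ij B for every element position (i, j),
    where D_ij is the meta-sub-tensor with a single 1 at (i, j)."""
    n = B_T.shape[0]
    R = {}
    for i in range(n):
        for j in range(n):
            D = np.zeros((n, n))
            D[i, j] = 1.0                  # meta-sub-tensor
            R[(i, j)] = B_T @ D @ B_T.T    # fixed matrix with entries 0, 1, -1
    return R

R = replacement_matrices(B_T)
d = np.random.rand(4, 4)                   # forward input feature data

# Disassembled transformation: B^T d B = sum_ij d_ij * R_ij. Because each
# R_ij contains only 0, 1 and -1, scaling by d_ij reduces to writing (or
# negating) d_ij into fixed positions, i.e. the transform becomes additions.
disassembled = sum(d[i, j] * R[(i, j)] for i in range(4) for j in range(4))
print(np.allclose(disassembled, B_T @ d @ B_T.T))  # True
```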
  • For the reverse input gradient and the other target data, the process of performing the forward transformation operation is similar to that for top_data_j and will not be repeated here.
  • After the transformation results are obtained through the disassembly, bitwise multiplication can be performed on the two results according to the operation target. For example, after obtaining B^T top_data_j B and A top_diff_j A^T by the method shown in Figure 4, the two operation results can be multiplied bit by bit to determine the result (A top_diff_j A^T) ⊙ (B^T top_data_j B).
  • That is, the values at the corresponding positions of the transformation result of the reverse input gradient and the transformation result of the forward input feature data are multiplied to obtain a new matrix as the result of the multiplication operation.
  • Based on the method provided in this embodiment, the inverse transformation matrices A and A^T used to inversely transform the result of the multiplication operation can be obtained; the inverse transformation matrices can be determined according to the size of the operation result and then used to inversely transform the multiplication operation result.
  • In this embodiment, the transformation operation on the multiplication operation result can also be disassembled into a summation operation, and the operation result determined according to the summation operation.
  • Specifically, the first multiplication operation result can be transformed based on the formula A^T p A, where A^T and A are the inverse transformation matrices and p is the first multiplication operation result.
  • The multiplication operation result can likewise be split into multiple sub-tensors: the sum of the multiple sub-tensors equals the multiplication operation result, the number of sub-tensors is the same as the number of non-zero elements in the multiplication operation result, each sub-tensor has a single non-zero element, and that element is the same as the non-zero element at the corresponding position in the multiplication operation result.
  • For example, the first multiplication operation result p can be split into 16 sub-tensors (assuming that all of its elements are non-zero), each of which keeps one element of p at its original position and sets all other positions to zero.
  • The inverse transformation matrix can be used to transform each resulting sub-tensor, and the transformation results of the sub-tensors can then be added to obtain the operation result.
  • Specifically, each sub-tensor of the multiplication operation result can be expressed as its non-zero element multiplied by a meta-sub-tensor, where the meta-sub-tensor is the tensor obtained by setting the non-zero element of the sub-tensor to 1. The non-zero element in a sub-tensor of the first multiplication operation result can be identified and its position set to 1 to obtain the corresponding meta-sub-tensor; the operation result can then be determined from the inverse transformation matrix, the meta-sub-tensor, and the corresponding non-zero element.
  • Specifically, the meta-sub-tensor can be multiplied on the left by the left-multiplication matrix of the inverse transformation and on the right by the right-multiplication matrix of the inverse transformation, and the result can then be multiplied by the non-zero element corresponding to the sub-tensor to obtain the transformation result of the sub-tensor; the left-multiplication matrix and the right-multiplication matrix are both determined by the size of the sub-tensor.
  • A^T and A can be determined according to the size of the operation result, and the meta-sub-tensors can likewise be determined in advance according to the size of the operation result. Therefore, the replacement matrix corresponding to each element position in the multiplication operation result can be determined in advance from A^T, A, and the meta-sub-tensors.
  • In this way, a corresponding replacement matrix can be determined for each element position in the first multiplication operation result; for results of the same size, the corresponding replacement matrix set can be determined directly according to the size of the multiplication result or of the final operation result, and the operation result can then be determined according to the replacement matrix set.
  • In this process, the weight difference can be expressed as in formula (2), that is, Δw_j = G^T [(A top_diff_j A^T) ⊙ (B^T top_data_j B)] G. The replacement matrix corresponding to each element involved in each multiplication can be determined, so that the inverse transformation operation can be disassembled into a summation operation based on these replacement matrices and the operation result determined according to the summation operation.
  • The specific disassembly is similar to the disassembly of the feature transformation operation, and the convolution operation result can thus be obtained with fewer multiplications.
  • the processing procedure of the inverse transformation operation of the second multiplication operation result and the third multiplication operation result is similar to the aforementioned processing of the inverse transformation operation of the first multiplication operation result, and will not be repeated here.
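  • A corresponding sketch for the inverse side (using the same assumed inverse-transform constants as in the formula (2) illustration above) disassembles the inverse transformation of a 4x4 multiplication result into replacement-matrix summations:

```python
import numpy as np

# Hypothetical 3x4 inverse transform (same G^T as in the formula (2) sketch).
G_T = np.array([[1, 1,  1, 0],
                [0, 1, -1, 0],
                [0, 1,  1, 1]], dtype=float)

# One fixed 3x3 replacement matrix G^T E_ij G per element position (i, j)
# of the multiplication result, with E_ij the meta-sub-tensor (a single 1).
R_inv = {}
for i in range(4):
    for j in range(4):
        E = np.zeros((4, 4))
        E[i, j] = 1.0
        R_inv[(i, j)] = G_T @ E @ G_T.T

p = np.random.rand(4, 4)   # first multiplication operation result

# Disassembled inverse transformation: G^T p G = sum_ij p_ij * R_inv[i, j];
# every entry of R_inv is 0, 1 or -1, so only signed additions of p_ij remain.
dw = sum(p[i, j] * R_inv[(i, j)] for i in range(4) for j in range(4))
print(np.allclose(dw, G_T @ p @ G_T.T))  # True
```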
  • The method provided in this embodiment is used to perform convolution operations; it is executed by a device configured with the method, and the device is usually implemented by means of hardware and/or software.
  • Owing to the adoption of the aforementioned Winograd algorithm, the training time required for training will be greatly shortened.
  • The scale of the acquired feature data is related to the scale of the operation result output by the neural network and to the scale of the weight data in the neural network. That is, in the process of obtaining feature data for training the neural network, the training data can be disassembled based on the training computation task, splitting large-scale multi-dimensional training data into several two-dimensional feature data of fixed scale; for each feature data, the operation device can use the methods provided above to perform the training operations.
  • During the acquisition, the operation device can read the required feature data precisely by storage address, reading and processing the data stored at the specific storage address bits in the cloud server.
  • As for the storage address, the operation device can obtain it from the training instruction; that is, the training instruction can carry the storage address of the feature data required for this operation process, so that the operation device can read the feature data from the cloud server based on that storage address.
  • It should be noted that the Winograd algorithm is used in at least one convolution operation of the operation device during the aforementioned forward propagation processing and back propagation processing.
  • Since each layer involves a large number of convolution operations, some or all of these convolution operations can use the Winograd algorithm; when only part of the convolution operations use the Winograd algorithm, the other convolution operations can be processed with the ordinary convolution operation.
  • In addition, the data format involved in the operation is not limited; for example, floating-point arithmetic can be used for each step of the operation, or floating-point numbers can be converted to fixed-point numbers before each step is performed.
  • In a possible implementation, the training instruction received by the operation device can also include selected target layers, and the operation device will use the Winograd algorithm to process the convolution operations of the target layers based on the selection.
  • The target layer is selected using preset rules. The preset rule may specifically depend on the complexity of the training data, that is, the degree to which the training data is split into feature data.
  • When the complexity of the training data is high, the data scale is large and the processing time of each operation is correspondingly long. If the Winograd algorithm is used to perform the convolution operations on the huge number of split feature data, the processing time of each operation is shortened compared with the traditional processing method, and the improvement in computational efficiency over the entire training data grows geometrically, which has a significant effect on improving the computational efficiency of the neural network training process.
  • Specifically, the number of feature data after splitting can be determined first, and then, based on the number of feature data, one or more target layers are selected for convolution using the Winograd algorithm.
  • In the selection process, the Winograd algorithm may be applied only to the convolution operations in the forward propagation process of one or more target layers, or only to the convolution operations in the back propagation process of one or more target layers, or to the convolution operations in both the forward propagation process and the back propagation process of one or more target layers.
  • The neural network obtained by the above operation method can be used in various scenarios. A face recognition model used for inference is taken as an example for illustration:
  • Step 501: Receive a recognition instruction and obtain feature data of a face sample;
  • Step 502: Use the feature data of the face sample as the input of the face recognition model obtained by training, and perform neural network processing with the trained face recognition model;
  • Step 503: Use the result output by the face recognition model as the face recognition result.
  • In summary, the operation method and related products provided in this application use the Winograd algorithm to train the weight data in the neural network with the feature data after receiving a training instruction and obtaining the feature data, thereby obtaining a trained neural network.
  • Compared with the prior art, this application takes advantage of the Winograd algorithm's ability to convert a large number of matrix multiplication operations into matrix addition operations, which effectively improves the computational efficiency of processing neural network training data and reduces the computing resources occupied by the training process.
  • It should be understood that although the steps in the flowcharts are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily executed at the same moment but may be executed at different moments; their execution order is not necessarily sequential, and they may be executed in turn or alternately with at least part of the other steps or of the sub-steps or stages of other steps.
  • FIG. 7 is a schematic structural diagram of a Winograd convolution operation device provided by this application.
  • the Winograd convolution operation device of the present application includes: a data receiving module 10, a transformation module 20, and The position multiplication module 30 and the weight update module 40;
  • the data receiving module 10 is used to obtain positive input feature data for training the neural network
  • the transformation module 20 is used to respectively disassemble, while the training module trains the neural network based on the pre-configured Winograd convolution algorithm, the forward transformation operations of the reverse input gradient of the j-th layer and of the forward input feature data of the j-th layer in the neural network into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's forward input feature data;
  • the bitwise multiplication module 30 is used to perform bitwise multiplication on the transformation result of the j-th layer's reverse input gradient forward transformation operation and the transformation result of the j-th layer's forward input feature data forward transformation operation, to obtain a first multiplication operation result;
  • the transformation module 20 is also used to disassemble the inverse transformation operation of the first multiplication operation result into a summation operation, and use the result of the summation operation as the weight difference of the jth layer;
  • the weight update module 40 is configured to complete the training of the neural network according to the weight difference of the jth layer.
  • the transformation module 20 is also used to respectively disassemble, in the process of training the neural network based on the pre-configured Winograd convolution algorithm, the forward transformation operations of the j-th layer's reverse input gradient and of the j-th layer's weight data into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's weight data;
  • the bitwise multiplication module 30 is also used to perform bitwise multiplication on the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's weight data, to obtain a second multiplication operation result;
  • the transformation module 20 is also used to disassemble the inverse transformation operation of the second multiplication operation result into a summation operation, and use the result of the summation operation as the inverse output gradient of the jth layer.
  • the transformation module 20 is further configured to respectively disassemble, based on the pre-configured Winograd algorithm, the forward transformation operations of the i-th layer's forward input feature data and of the i-th layer's weight data into summation operations, so as to obtain the transformation result of the forward transformation operation of the i-th layer's forward input feature data and the transformation result of the forward transformation operation of the i-th layer's weight data;
  • the bitwise multiplication module 30 is also used to perform bitwise multiplication on the transformation result of the forward transformation operation of the i-th layer's forward input feature data and the transformation result of the forward transformation operation of the i-th layer's weight data, to obtain a third multiplication operation result;
  • the transformation module 20 is also used to disassemble the inverse transformation operation of the third multiplication operation result into a summation operation, and use the result of the summation operation as the positive output feature data of the i-th layer.
  • the transformation module 20 is specifically configured to disassemble a forward transformation operation or an inverse transformation operation into a summation operation as follows: disassemble the target data into multiple sub-tensors corresponding to the target data, perform the transformation operation on the multiple sub-tensors and sum the results, and obtain the transformation result corresponding to the target data according to the result of the summation operation; the target data includes one of the following: the reverse input gradient, the forward input feature data, the weight data, the first multiplication operation result, the second multiplication operation result, and the third multiplication operation result.
  • the sum of the multiple sub-tensors corresponding to the target data is the target data; the number of the multiple sub-tensors is the same as the number of non-zero elements in the target data, each sub-tensor has a single non-zero element, and the non-zero element in the sub-tensor is the same as the non-zero element at the corresponding position in the target data.
  • the transformation module 20 is specifically configured to: determine the meta-sub-tensor corresponding to each sub-tensor of the target data, where the meta-sub-tensor is a tensor in which the non-zero element of the sub-tensor is set to 1; obtain the transformation result of the meta-sub-tensor corresponding to each sub-tensor; multiply the transformation result of the corresponding meta-sub-tensor by the non-zero element value of the sub-tensor as a coefficient to obtain the transformation result of the sub-tensor; and add the transformation results of the multiple sub-tensors to obtain the result of the summation operation, from which the transformation result of the transformation operation on the target data is obtained.
  • the transformation module 20 is specifically configured to: for each sub-tensor, multiply the meta-sub-tensor corresponding to the sub-tensor by a left-multiplication matrix on the left and by a right-multiplication matrix on the right to obtain the transformation result of the meta-sub-tensor, where both the left-multiplication matrix and the right-multiplication matrix are determined by the scale of the sub-tensor and the transformation type, and the transformation type includes the transformation type of the forward transformation operation and the transformation type of the inverse transformation operation.
  • the functional units/modules in the various embodiments of the present disclosure may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together.
  • the above-mentioned integrated unit/module can be implemented in the form of hardware or software program module.
  • the hardware may be a digital circuit, an analog circuit, and so on.
  • the physical realization of the hardware structure includes but is not limited to transistors, memristors and so on.
  • the artificial intelligence processor may be any appropriate hardware processor, such as CPU, GPU, FPGA, DSP, ASIC, and so on.
  • the storage unit may be any suitable magnetic storage medium or magneto-optical storage medium, such as resistive random access memory RRAM, dynamic random access memory DRAM, static random access memory SRAM, enhanced dynamic random access memory EDRAM, high-bandwidth memory HBM, hybrid memory cube HMC, and so on.
  • if the integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory.
  • based on such an understanding, the technical solution of the present disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
  • an artificial intelligence chip is also disclosed, which includes the aforementioned Winograd convolution operation device.
  • a board card is disclosed, which includes a storage device, an interface device, a control device, and the aforementioned artificial intelligence chip, where the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively; the storage device is used to store data; the interface device is used to implement data transmission between the artificial intelligence chip and external equipment; and the control device is used to monitor the state of the artificial intelligence chip.
  • Fig. 8 shows a structural block diagram of a board card according to an embodiment of the present disclosure.
  • the board card may include other supporting components in addition to the chip 389 described above.
  • the supporting components include, but are not limited to: a storage device 390, an interface device 391, and a control device 392;
  • the storage device 390 is connected to the artificial intelligence chip through a bus for storing data.
  • the storage device may include multiple groups of storage units 393; each group of storage units is connected to the artificial intelligence chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate synchronous dynamic random access memory).
  • the storage device may include 4 groups of storage units, and each group may include a plurality of DDR4 chips.
  • the artificial intelligence chip may include four 72-bit DDR4 controllers; of those 72 bits, 64 bits are used for data transmission and 8 bits for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s, as the arithmetic below shows.
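  • (The quoted figure follows from simple interface arithmetic — the derivation, not the figure, is added here for clarity: 3200 MT/s on the 64-bit data portion of the bus gives 3200 × 64 ÷ 8 = 25600 MB/s per group of storage units.)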
  • each group of the storage unit includes a plurality of double-rate synchronous dynamic random access memories arranged in parallel.
  • DDR can transmit data twice in one clock cycle.
  • a controller for controlling the DDR is provided in the chip, which is used to control the data transmission and data storage of each storage unit.
  • the interface device is electrically connected with the artificial intelligence chip.
  • the interface device is used to implement data transmission between the artificial intelligence chip and an external device (such as a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transferred from the server to the chip through a standard PCIE interface to realize data transfer.
  • the interface device may also be another interface; the present disclosure does not limit the specific form of such other interfaces, as long as the interface unit can implement the transfer function.
  • the calculation result of the artificial intelligence chip is still transmitted by the interface device back to an external device (such as a server).
  • the control device is electrically connected with the artificial intelligence chip.
  • the control device is used to monitor the state of the artificial intelligence chip.
  • the artificial intelligence chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a single-chip microcomputer (Micro Controller Unit, MCU).
  • the artificial intelligence chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and can drive multiple loads; therefore, the artificial intelligence chip can be in different working states such as multi-load and light-load.
  • the control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the artificial intelligence chip.
  • an electronic device which includes the aforementioned artificial intelligence chip.
  • Electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, camera modules, servers, cloud servers, cameras, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment.
  • the transportation means include airplanes, ships, and/or vehicles;
  • the household appliances include TVs, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound instruments, and/or electrocardiographs.
  • a Winograd convolution operation method comprising:
  • the reverse input gradient of the j-th layer in the neural network and the forward transformation operation of the forward input feature data of the j-th layer in the neural network are disassembled into A summation operation to obtain the transformation result of the forward transformation operation of the reverse input gradient of the jth layer and the transformation result of the forward transformation operation of the forward input feature data of the jth layer based on the summation operation;
  • the training of the neural network is completed according to the weight difference of the jth layer.
  • the forward transformation operations of the j-th layer's reverse input gradient and of the j-th layer's weight data are respectively disassembled into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's weight data;
  • the inverse transform operation of the second multiplication operation result is disassembled into a summation operation, and the result of the summation operation is used as the inverse output gradient of the jth layer.
  • the forward transformation operations of the i-th layer's forward input feature data and of the i-th layer's weight data are respectively disassembled into summation operations, so as to obtain the transformation result of the forward transformation operation of the i-th layer's forward input feature data and the transformation result of the forward transformation operation of the i-th layer's weight data;
  • the inverse transform operation of the third multiplication operation result is disassembled into a summation operation, and the result of the summation operation is used as the positive output feature data of the i-th layer.
  • the processing method of disassembling the forward transformation operation or the inverse transformation operation into the summation operation is: disassembling the target data into a plurality of sub-tensors corresponding to the target data, and performing a transformation operation on the plurality of sub-tensors and calculating And, obtaining the transformation result corresponding to the target data according to the result of the summation operation;
  • the target data includes one of the following: reverse input gradient, forward input characteristic data, weight data, first multiplication operation result, second multiplication operation result, and third multiplication operation result.
  • the number of the multiple sub-tensors is the same as the number of non-zero elements in the target data; each sub-tensor has a single non-zero element, and the non-zero element in the sub-tensor is the same as the non-zero element at the corresponding position in the target data.
  • the transformation results of the multiple sub-tensors are added to obtain the result of the summation operation, and the transformation result of the transformation operation on the target data is obtained according to the result of the summation operation.
  • obtaining the transformation result of the meta-sub-tensor corresponding to each sub-tensor includes: for each sub-tensor, multiplying the meta-sub-tensor corresponding to the sub-tensor by a left-multiplication matrix on the left and a right-multiplication matrix on the right to obtain the transformation result of the meta-sub-tensor, where both matrices are determined by the scale of the sub-tensor and the transformation type, and the transformation type includes the transformation type of a forward transformation operation and the transformation type of an inverse transformation operation.
  • a Winograd convolution operation device including:
  • the data receiving module is used to obtain the positive input characteristic data for training the neural network
  • the transformation module is used to respectively disassemble, while the training module trains the neural network based on the pre-configured Winograd convolution algorithm, the forward transformation operations of the reverse input gradient of the j-th layer and of the forward input feature data of the j-th layer in the neural network into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's forward input feature data;
  • the bitwise multiplication module is used to perform bitwise multiplication on the transformation result of the j-th layer's reverse input gradient forward transformation operation and the transformation result of the j-th layer's forward input feature data forward transformation operation, to obtain a first multiplication operation result;
  • the transformation module is also used to disassemble the inverse transformation operation of the first multiplication operation result into a summation operation, and use the result of the summation operation as the weight difference of the jth layer;
  • the weight update module is used to complete the training of the neural network according to the weight difference of the jth layer.
  • the transformation module is also used to respectively disassemble, in the process of training the neural network based on the pre-configured Winograd convolution algorithm, the forward transformation operations of the j-th layer's reverse input gradient and of the j-th layer's weight data into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's weight data;
  • the bitwise multiplication module is also used to perform bitwise multiplication on the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's weight data, to obtain a second multiplication operation result;
  • the transformation module is also used to disassemble the inverse transformation operation of the second multiplication operation result into a summation operation, and use the result of the summation operation as the reverse output gradient of the jth layer.
  • the transformation module is also used to respectively disassemble, based on the pre-configured Winograd algorithm, the forward transformation operations of the i-th layer's forward input feature data and of the i-th layer's weight data into summation operations, so as to obtain the transformation result of the forward transformation operation of the i-th layer's forward input feature data and the transformation result of the forward transformation operation of the i-th layer's weight data;
  • the bitwise multiplication module is also used to perform bitwise multiplication on the transformation result of the forward transformation operation of the i-th layer's forward input feature data and the transformation result of the forward transformation operation of the i-th layer's weight data, to obtain a third multiplication operation result;
  • the transformation module is also used to disassemble the inverse transformation operation of the third multiplication operation result into a summation operation, and use the result of the summation operation as the positive output feature data of the i-th layer.
  • the processing method of disassembling the forward transformation operation or the inverse transformation operation into the summation operation is: disassembling the target data into a plurality of sub-tensors corresponding to the target data, and performing a transformation operation on the plurality of sub-tensors and calculating And, obtaining the transformation result corresponding to the target data according to the result of the summation operation;
  • the target data includes one of the following: reverse input gradient, forward input characteristic data, weight data, first multiplication operation result, second multiplication operation result, and third multiplication operation result.
  • the number of the multiple sub-tensors is the same as the number of non-zero elements in the target data; each sub-tensor has a single non-zero element, and the non-zero element in the sub-tensor is the same as the non-zero element at the corresponding position in the target data.
  • for each sub-tensor, the meta-sub-tensor corresponding to the sub-tensor is multiplied by a left-multiplication matrix on the left and a right-multiplication matrix on the right to obtain the transformation result of the sub-tensor, where both the left-multiplication matrix and the right-multiplication matrix are determined by the scale of the sub-tensor and the transformation type, and the transformation type includes the transformation type of a forward transformation operation and the transformation type of an inverse transformation operation.
  • An artificial intelligence chip comprising the Winograd convolution operation device according to any one of clauses A6 to A10.
  • a board card comprising: a storage device, an interface device, a control device, and the artificial intelligence chip as described in clause A15;
  • the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively;
  • the storage device is used to store data
  • the interface device is used to implement data transmission between the artificial intelligence chip and external equipment
  • the control device is used to monitor the state of the artificial intelligence chip.
  • the chip includes: a DDR controller, which is used to control the data transmission and data storage of each storage unit;
  • the interface device is: a standard PCIE interface.
  • the Winograd operation method and related products provided by this application, in the process of training the neural network based on the pre-configured Winograd convolution algorithm, respectively disassemble the forward transformation operations of the reverse input gradient of the j-th layer and of the forward input feature data of the j-th layer of the neural network into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's forward input feature data; perform bitwise multiplication on these two transformation results to obtain a first multiplication operation result; disassemble the inverse transformation operation of the first multiplication operation result into a summation operation and take the result of the summation operation as the weight difference of the j-th layer; and complete the training of the neural network according to the weight difference of the j-th layer.
  • compared with the prior art, this application exploits the property of the Winograd algorithm that a large number of matrix multiplications are converted into matrix additions, which effectively improves the computational efficiency of processing the training data of the neural network and reduces the computing resources occupied by the training process.


Abstract

A Winograd convolution operation method and related products. The method includes: in the process of training a neural network based on a pre-configured Winograd convolution algorithm, respectively disassembling the forward transformation operations of the reverse input gradient of the j-th layer of the neural network and of the forward input feature data of the j-th layer into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's forward input feature data (201); performing bitwise multiplication on these two transformation results to obtain a first multiplication operation result (202); disassembling the inverse transformation operation of the first multiplication operation result into a summation operation, and taking the result of the summation operation as the weight difference of the j-th layer (203); and completing the training of the neural network according to the weight difference of the j-th layer (204).

Description

Winograd convolution operation method and related products
This application claims priority to Chinese patent application No. 201911061089.1, entitled "Winograd convolution operation method and related products" and filed with the China Patent Office on November 1, 2019, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
This application relates to the field of deep learning technology, and in particular to a Winograd convolution operation method and related products.
BACKGROUND
In recent years, deep learning technology has developed rapidly and has been widely applied in fields such as image recognition, speech recognition, natural language analysis, intelligent robots, and big data analysis, becoming a focus of research.
A neural network model is an operation model in deep learning technology that uses a multi-layer architecture to process input data and output corresponding operation results. In the prior art, training a neural network model is a necessary step before using it for computation; during training, the neural network to be trained uses a convolution algorithm to repeatedly perform iterative operations on massive training data to obtain the trained neural network model.
However, convolution involves a large number of matrix multiplications, which occupy substantial computing resources, so the computational efficiency is low; the low computational efficiency in turn makes the training of the neural network model time-consuming and the training efficiency poor.
SUMMARY
Based on this, it is necessary to address the above technical problems by providing a Winograd convolution operation method and related products capable of improving the training efficiency of neural network models and reducing the computing resources consumed by training operations.
In a first aspect, this application provides a Winograd convolution operation method, including:
in the process of training a neural network based on a pre-configured Winograd convolution algorithm, respectively disassembling the forward transformation operations of the reverse input gradient of the j-th layer in the neural network and of the forward input feature data of the j-th layer into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's forward input feature data;
performing bitwise multiplication on the transformation result of the j-th layer's reverse input gradient forward transformation operation and the transformation result of the j-th layer's forward input feature data forward transformation operation to obtain a first multiplication operation result;
disassembling the inverse transformation operation of the first multiplication operation result into a summation operation, and taking the result of the summation operation as the weight difference of the j-th layer;
completing the training of the neural network according to the weight difference of the j-th layer.
In a second aspect, this application provides a Winograd convolution operation device, including:
a data receiving module, used to obtain the forward input feature data for training the neural network;
a transformation module, used to respectively disassemble, while a training module trains the neural network based on the pre-configured Winograd convolution algorithm, the forward transformation operations of the reverse input gradient of the j-th layer and of the forward input feature data of the j-th layer in the neural network into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's forward input feature data;
a bitwise multiplication module, used to perform bitwise multiplication on the transformation result of the j-th layer's reverse input gradient forward transformation operation and the transformation result of the j-th layer's forward input feature data forward transformation operation to obtain a first multiplication operation result;
the transformation module being further used to disassemble the inverse transformation operation of the first multiplication operation result into a summation operation and take the result of the summation operation as the weight difference of the j-th layer;
a weight update module, used to complete the training of the neural network according to the weight difference of the j-th layer.
In a third aspect, this application provides an artificial intelligence chip that includes the Winograd convolution operation device of any of the preceding items.
In a fourth aspect, this application provides an electronic device that includes the aforementioned artificial intelligence chip.
In a fifth aspect, this application provides a board card that includes a storage device, an interface device, a control device, and the aforementioned artificial intelligence chip;
the artificial intelligence chip being connected to the storage device, the control device, and the interface device respectively;
the storage device being used to store data;
the interface device being used to implement data transmission between the artificial intelligence chip and external equipment;
the control device being used to monitor the state of the artificial intelligence chip.
With the Winograd convolution operation method and related products provided by this application, after a training instruction is received and feature data obtained, the Winograd algorithm is used to train the weight data in the neural network with the feature data, yielding a trained neural network. Compared with the prior art, this application exploits the property of the Winograd algorithm that a large number of matrix multiplications are converted into matrix additions, which effectively improves the computational efficiency of processing the training data of the neural network and reduces the computing resources occupied by the training process.
BRIEF DESCRIPTION OF THE DRAWINGS
To explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below.
FIG. 1 is a schematic diagram of a neural network architecture in the prior art;
FIG. 2 is a schematic diagram of the computing system on which the Winograd convolution operation method according to an embodiment of the present disclosure is based;
FIG. 3 is a schematic flowchart of a neural network training method provided by this application;
FIG. 4 is a schematic flowchart of a Winograd convolution operation method provided by this application;
FIG. 5 is a schematic flowchart of another Winograd convolution operation method provided by this application;
FIG. 6 is a schematic flowchart of yet another Winograd convolution operation method provided by this application;
FIG. 7 is a schematic structural diagram of a Winograd convolution operation device provided by this application;
FIG. 8 is a structural block diagram of a board card according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative work fall within the protection scope of the present disclosure.
It should be understood that the terms "first", "second", "third", "fourth", and the like in the claims, the specification, and the drawings of the present disclosure are used to distinguish different objects rather than to describe a specific order. The terms "include" and "comprise" used in the specification and claims of the present disclosure indicate the existence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the existence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terms used in this specification are only for the purpose of describing specific embodiments and are not intended to limit the present disclosure. As used in the specification and claims of the present disclosure, unless the context clearly indicates otherwise, the singular forms "a", "an", and "the" are intended to include the plural forms. It should further be understood that the term "and/or" used in the specification and claims of the present disclosure refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this specification and the claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrases "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as meaning "once determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
To clearly understand the technical solution of this application, the technical terms involved in the prior art and in the embodiments of this application are explained below:
Convolution operation: a convolution operation starts from the upper-left corner of an image and opens an active window of the same size as the template; the window corresponds to a window image matched against the convolution kernel. The window image and the corresponding image pixels are multiplied element by element and the products are added, and the result is used as the first pixel value of the new image after convolution. The active window then moves one column to the right, the window image is again multiplied element-wise with the corresponding image pixels and summed, and the result becomes the second pixel value of the new image. Proceeding in this way from left to right and from top to bottom yields a complete new image.
Winograd convolution operation: the Winograd convolution operation is a convolution acceleration technique based on polynomial interpolation. It performs the Winograd forward transformation on the two inputs of the convolution operation, a first target matrix and a second target matrix; then performs bitwise (element-wise) multiplication on the two transformed matrices; and finally performs the Winograd inverse transformation on the bitwise multiplication result, obtaining a convolution result equivalent to that of the original convolution operation.
Convolutional neural network model: a convolutional neural network model is a class of feed-forward neural network models that contain convolution computations and have a deep structure; it is one of the representative models of deep learning. In network layers such as the convolutional layers and fully-connected layers of a convolutional neural network model, neurons must be convolved with convolution kernels to obtain feature data; such models are widely used for image classification, image recognition, and so on.
As an example, FIG. 1 is a schematic diagram of a neural network architecture in the prior art. As shown in FIG. 1, the neural network adopts a network architecture including convolutional layers, pooling layers, fully-connected layers, and an output layer with a classifier, and the layers of the neural network process data in sequence. For example, a convolutional layer extracts features from the feature data input to it (the first convolutional layer extracts features from the original input feature data); a pooling layer pools the features output by the previous layer with a manually set pooling window size and stride, reducing the dimensionality of the features and aggregating them.
Specifically, after the first convolutional layer of the neural network receives the original input data (for example, an image to be processed), the neural network starts convolutional-neural-network processing of the image, each convolutional layer performing its part of the processing; except for the last convolutional layer, each layer outputs its processing result to the next layer, which uses it as its own input data and continues the subsequent processing.
Regarding the above convolutional-neural-network processing, both in the inference process and in the training process of the neural network, the input feature data must be convolved with the weight data in the neural network, and this convolution is generally performed many times. As the amount of feature data grows, the complexity of the operation rises accordingly; a traditional matrix-multiplication-based convolution would then put the computing device executing the neural network operation into an overloaded state, greatly increasing both the computation time and the resources required.
For example, convolving feature data with the weights of a neural network requires a large number of multiplications. If the feature data input to a convolutional layer has dimensions 4×4, the convolution kernel of that layer has dimensions 3×3, and the sliding stride is 1, then by sliding the kernel over the feature data, the feature data can be split into four 3×3 sub-blocks, and convolving the kernel with each sub-block yields a 2×2 output. Concretely, the kernel is multiplied element-wise with a sub-block and the nine products are added, giving one value of the output. In this example, a total of 36 multiplications and 32 additions are needed to obtain the output.
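The count decomposes as follows (the arithmetic, not the figures, is added here for clarity): each of the 2×2 = 4 output values costs one 3×3 multiply-accumulate, i.e. 9 multiplications and 8 additions, so the tile needs 4 × 9 = 36 multiplications and 4 × 8 = 32 additions in total; the Winograd form of the same tile, formula (1) below, needs only the 4 × 4 = 16 element-wise multiplications of the transformed operands.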
Similarly, in the training process, the neural network is first trained by forward propagation: the feature input is fed into the neural network and convolved with the weight data to obtain the forward-propagation output. The network is then trained by back propagation: gradient data is convolved with the feature data to obtain the weight difference used to adjust the weight data. In the above example, the forward-propagation part of training needs 36 multiplications and 32 additions to obtain its output, and the back-propagation part likewise needs at least 36 multiplications and 32 additions to obtain its output.
For a computing system, multiplications are comparatively expensive, which directly causes low computational efficiency and hence low training efficiency.
To solve this problem, this application applies the pre-configured Winograd convolution algorithm to the convolution operations of the neural network training process, and in particular to the back-propagation part of training. Specifically, the pre-configured Winograd convolution algorithm adopted in this application is an operation method that converts a convolution into a large number of matrix additions plus a small number of matrix multiplications. Considering that the time and resources an arithmetic unit needs for matrix multiplication far exceed those needed for matrix addition, this application uses the Winograd convolution algorithm to reduce the number of resource-intensive matrix multiplications while relatively increasing the number of inexpensive matrix additions, thereby lowering the computing resource consumption and computation time of the back-propagation part of the whole training process and improving computational efficiency.
Taking the scenario of neural network training as an example, the Winograd convolution operation method and related products provided by this application are explained below with reference to the drawings:
FIG. 2 is a schematic diagram of the computing system on which the Winograd convolution operation method according to an embodiment of the present disclosure is based. The method provided by this application can be applied to the computing system shown in FIG. 2. As shown there, the computing system 100 includes multiple processors 101 and a memory 102; the processors 101 execute instruction sequences, and the memory 102, which may include random access memory (RAM) and a register file, stores data. The processors 101 in the system may share part of the storage space, for example part of the RAM and the register file, while also having their own storage spaces; the computing system executes the steps of the operation method of this application.
The Winograd convolution operation method according to an embodiment of the present disclosure can be applied to any processor of a computing system (for example, an artificial intelligence chip) that includes multiple processors (multiple cores). The processor may be a general-purpose processor such as a CPU (Central Processing Unit), or an artificial intelligence processor (IPU) for performing artificial intelligence operations. Artificial intelligence operations may include machine learning operations, brain-like operations, and so on, where machine learning operations include neural network operations, k-means operations, support vector machine operations, and so on. The artificial intelligence processor may include, for example, one or a combination of a GPU (Graphics Processing Unit), an NPU (Neural-Network Processing Unit), a DSP (Digital Signal Processing unit), and an FPGA (Field-Programmable Gate Array) chip. The present disclosure does not restrict the specific type of the processor, and the types of the multiple processors in the computing system may be the same or different.
In a possible implementation, the processor mentioned in the present disclosure may include multiple processing units, each of which can independently run the various tasks assigned to it, such as convolution tasks, pooling tasks, or fully-connected tasks. The present disclosure does not restrict the processing units or the tasks they run.
In addition, for different usage scenarios, the network layers in the neural network are adjusted accordingly, and the weight data and feature data change accordingly. That is, the Winograd convolution operation method, device, and operation chip described in this application can be used to train neural networks for different scenarios, such as training a face recognition model, a color classifier, an audio/video data conversion model, an image boundary division model, and so on. Correspondingly, the trained neural network can be used for inference in the matching scenarios, such as face recognition, color classification, special-purpose audio/video data conversion, image boundary division, and so on.
In a first aspect, FIG. 3 is a schematic flowchart of a neural network training method provided by this application. As shown in FIG. 3, the method includes:
Step 101: receiving a training instruction and obtaining feature data;
Step 102: using the feature data and the weight data of the neural network, training the neural network with the Winograd algorithm to obtain a trained neural network.
It should be noted that the execution subject of this training method may be a Winograd convolution operation device, which can interact with chips including operation chips and neural network chips to receive training instructions initiated by a user's electronic device or a combined processing device; the training instruction instructs the operation device to start the training operation of the neural network.
The Winograd convolution operation method provided by this application can be applied in step 102 to improve the efficiency of the neural network training process.
For ease of understanding, the Winograd algorithm involved in this application is first introduced:
The Winograd algorithm is an operation method that converts a convolution into a large number of matrix additions plus a small number of matrix multiplications. Its operation formula can be expressed in the form of formula (1):
S = A^T[(G g G^T) ⊙ (B^T d B)]A      formula (1)
where S denotes the convolution result matrix, i.e. the result of convolving the feature data with the weight data; d denotes the input feature data; g denotes the weight data in the neural network; B denotes the feature transformation matrix that converts the feature data from the original domain to the Winograd domain; B^T denotes the feature inverse-transformation matrix that converts the feature data from the Winograd domain back to the original domain; G denotes the weight transformation matrix that converts the weight data from the original domain to the Winograd domain; G^T denotes the weight inverse-transformation matrix that converts the weight data from the Winograd domain back to the original domain; A denotes the transformation matrix of the inverse transformation operation that converts the bitwise multiplication result from the original domain to the Winograd domain; and A^T denotes the inverse-transformation matrix of the inverse transformation operation that converts the bitwise multiplication result from the Winograd domain back to the original domain.
It should be noted that the original domain refers to the domain that has not undergone the Winograd transformation, and the Winograd domain refers to the domain that has. The feature data d and weight data g may be matrices of fixed scale; for example, d may be of scale 4×4 and g of scale 3×3, the choice of scale being related to the scale of the result matrix and not restricted by this application. In the Winograd algorithm, A, B, G, A^T, B^T, and G^T are all conversion matrices whose scales and values are related to the scales of the result matrix S and the weight data g. Moreover, for constant matrices of different scales the element values differ, but for each class of constant matrix at a given scale the element values are fixed.
Throughout the training process involved in this application, the Winograd convolution operation device obtains the feature data. The feature data may be carried in the training instruction and sent to the operation device along with it; alternatively, after receiving the training instruction, the operation device may respond to it and read the feature data from the data storage location corresponding to a pre-stored data storage address or to a data storage address carried in the training instruction. Understandably, the amount of feature data used for training is large; in an optional implementation, after those skilled in the art finish collecting and labeling the training feature data, it can be stored on a cloud server, and after receiving the training instruction the operation device can read the required feature data from the cloud server according to the training mode and storage address indicated by the instruction, then process and operate on it.
Specifically, the Winograd convolution operation method involved in this application can be applied in the aforementioned step 102, especially in the back-propagation part of training the neural network.
Taking a neural network with n layers (n being an integer greater than or equal to 2) as an example, after obtaining the feature data the Winograd convolution operation device first performs forward propagation on the neural network using the feature data and the weight data of each layer, obtaining the forward output feature data of the n-th layer.
Subsequently, the Winograd convolution operation device uses the method of steps 201 to 204 below to perform back propagation on the neural network using the input gradient of each layer, the forward input feature data of each layer, and the weight data of each layer, so as to update the weight data of each layer.
FIG. 4 is a schematic flowchart of a Winograd convolution operation method provided by this application. As shown in FIG. 4, the method includes:
Step 201: respectively disassembling the forward transformation operations of the reverse input gradient of the j-th layer in the neural network and of the forward input feature data of the j-th layer into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's forward input feature data;
Step 202: performing bitwise multiplication on the transformation result of the j-th layer's reverse input gradient forward transformation operation and the transformation result of the j-th layer's forward input feature data forward transformation operation to obtain a first multiplication operation result;
Step 203: disassembling the inverse transformation operation of the first multiplication operation result into a summation operation, and taking the result of the summation operation as the weight difference of the j-th layer;
Step 204: completing the training of the neural network according to the weight difference of the j-th layer.
Specifically, for the j-th layer of the neural network (j belonging to n), when back-propagation training is performed on that layer, the reverse input gradient of the j-th layer and its forward input feature data are input to the network layer, so that the weight difference of the j-th layer is obtained by convolving the reverse input gradient with the forward input feature data, and the update of the j-th layer's weight data is then completed according to that weight difference. This process is repeated until the weight data of every layer of the n-layer neural network has been updated, completing the training of the neural network.
In the method provided by this embodiment, the Winograd algorithm can be used to transform the convolution of the reverse input gradient and the forward input feature data during back-propagation training.
The computation of the weight difference with the Winograd algorithm can be expressed by formula (2):
Δw_j = G^T[(A top_diff_j A^T) ⊙ (B^T top_data_j B)]G      formula (2)
where Δw_j denotes the weight difference, i.e. the result matrix of convolving the reverse input gradient with the forward input feature data; top_diff_j denotes the reverse input gradient of the j-th layer, which is identical to the reverse output gradient of the (j+1)-th layer; top_data_j denotes the forward input feature data of the j-th layer of the neural network, which is identical to the forward output feature data of the (j-1)-th layer; B denotes the transformation matrix converting the forward output feature data from the original domain to the Winograd domain, and B^T the corresponding inverse-transformation matrix; A denotes the transformation matrix converting the reverse input gradient from the original domain to the Winograd domain, and A^T the corresponding inverse-transformation matrix; G denotes the transformation matrix of the inverse transformation operation that converts the bitwise multiplication result from the original domain to the Winograd domain, and G^T the corresponding inverse-transformation matrix.
It should be noted that the original domain refers to the domain that has not undergone the Winograd transformation, and the Winograd domain refers to the domain that has.
In practice, the Winograd algorithm requires forward transformation operations to be performed separately on the reverse input gradient and the forward input feature data; therefore, the method provided by this embodiment can obtain the transformation matrices A, A^T, B^T, and B used for the forward transformations of the reverse input gradient and the forward input feature data, as well as the transformation matrices G and G^T used for the inverse transformation of the bitwise multiplication result.
In this embodiment, if the sizes of top_diff_j and top_data_j are fixed, the matrices A, A^T, B, B^T, G, and G^T are also fixed. Specifically, the sizes of top_diff_j and top_data_j can be determined from the required size of the output result Δw_j and the sliding stride of the convolution, and the corresponding A, A^T, B, B^T, G, and G^T are then determined from the sizes of these data.
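(For reference, in the standard Winograd notation F(m×m, r×r) — terminology used here for explanation and not by this application — an m×m output tile computed with an r×r kernel requires an (m+r-1)×(m+r-1) input tile, which is what ties the scales of A, B, and G to the data sizes; m = 2, r = 3 gives the 4×4 input tiles of the examples in this description.)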
The Winograd convolution operation device can use the transformation matrices B and B^T to perform the forward transformation of the forward input feature data top_data_j, obtaining the result B^T top_data_j B, and use the transformation matrices A and A^T to perform the forward transformation of the reverse input gradient top_diff_j, obtaining the result A top_diff_j A^T. The device then performs bitwise multiplication on the transformation result A top_diff_j A^T of the j-th layer's reverse input gradient and the transformation result B^T top_data_j B of the j-th layer's forward input feature data, obtaining the first multiplication operation result (A top_diff_j A^T) ⊙ (B^T top_data_j B). Finally, the device uses the transformation matrices G and G^T to perform the inverse transformation of the first multiplication operation result, obtaining the final weight difference G^T[(A top_diff_j A^T) ⊙ (B^T top_data_j B)]G, which is used to adjust the weight data of the j-th layer.
That is, in the embodiments of the present disclosure, both the forward transformation operation and the inverse transformation operation can be carried out by disassembling the transformation operation into summation operations, the results of the forward and inverse transformations being determined from those summations. Traditional convolution contains many multiplications; by performing the convolution with the Winograd algorithm, the number of multiplications in the transformation operations is effectively reduced while the number of additions is relatively increased, which improves computational efficiency and reduces the performance cost of the operation, as the following sketch illustrates.
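The pipeline of the preceding two paragraphs can be sketched in Python/NumPy as follows (a sketch under assumptions, not the claimed implementation: it reuses the F(2×2, 3×3) matrices B_T, G, and A_T from the earlier sketch, and the tile sizes and variable names are illustrative). The final assertion checks the per-tile weight-gradient identity Δw[u, v] = Σ_{i,j} top_diff[i, j] · top_data[i+u, j+v]:

```python
import numpy as np
# Reuses B_T, G, A_T from the F(2x2, 3x3) sketch earlier in this document.

def weight_diff(top_diff, top_data):
    """dW_j = G^T [(A top_diff A^T) (.) (B^T top_data B)] G  -- formula (2)."""
    Dg = A_T.T @ top_diff @ A_T   # A (4x2) @ (2x2) @ A^T (2x4) -> 4x4
    Df = B_T @ top_data @ B_T.T   # B^T (4x4) @ (4x4) @ B (4x4) -> 4x4
    return G.T @ (Dg * Df) @ G    # G^T (3x4) @ (4x4) @ G (4x3) -> 3x3

top_diff = np.random.randn(2, 2)  # gradient arriving at the 2x2 output tile
top_data = np.random.randn(4, 4)  # input tile saved from the forward pass
ref = np.array([[sum(top_diff[i, j] * top_data[i + u, j + v]
                     for i in range(2) for j in range(2))
                 for v in range(3)] for u in range(3)])
assert np.allclose(weight_diff(top_diff, top_data), ref)
```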
Optionally, during training of the neural network, the reverse input gradient of the j-th layer equals the reverse output gradient of the (j+1)-th layer, and the reverse output gradient of the j-th layer serves as the reverse input gradient of the (j-1)-th layer. The reverse output gradient of the j-th layer can be obtained by convolving the j-th layer's reverse input gradient with the j-th layer's weight data, and this convolution can use the aforementioned pre-configured Winograd convolution algorithm. FIG. 5 is a schematic flowchart of another Winograd convolution operation method provided by this application; the method shown in FIG. 5 is applicable to obtaining the reverse input gradient of any layer in FIG. 4 and includes:
Step 301: in the process of training the neural network based on the pre-configured Winograd convolution algorithm, respectively disassembling the forward transformation operations of the j-th layer's reverse input gradient and of the j-th layer's weight data into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's weight data;
Step 302: performing bitwise multiplication on the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's weight data to obtain a second multiplication operation result;
Step 303: disassembling the inverse transformation operation of the second multiplication operation result into a summation operation, and taking the result of the summation operation as the reverse output gradient of the j-th layer.
In the method provided by this embodiment, the Winograd algorithm can be used to transform the convolution of the weight data and the reverse input gradient during back-propagation training to obtain the reverse output gradient.
The computation of the reverse output gradient with the Winograd algorithm can be expressed by formula (3):
bottom_diff_j = A^T[(G g_j G^T) ⊙ (B^T top_diff_j B)]A      formula (3)
where bottom_diff_j denotes the reverse output gradient of the j-th layer, i.e. the result matrix of convolving the j-th layer's reverse input gradient with the j-th layer's weight data; this reverse output gradient also serves as the reverse input gradient of the (j-1)-th layer. top_diff_j denotes the reverse input gradient of the j-th layer, identical to the reverse output gradient of the (j+1)-th layer; g_j denotes the weight data of the j-th layer; B denotes the transformation matrix that converts the reverse input gradient from the original domain to the Winograd domain, and B^T the corresponding inverse-transformation matrix; G denotes the transformation matrix that converts the weight data from the original domain to the Winograd domain, and G^T the corresponding inverse-transformation matrix; A denotes the transformation matrix of the inverse transformation operation that converts the bitwise multiplication result from the original domain to the Winograd domain, and A^T the corresponding inverse-transformation matrix.
It should be noted that the original domain refers to the domain that has not undergone the Winograd transformation, and the Winograd domain refers to the domain that has.
In practice, the Winograd algorithm requires forward transformation operations to be performed separately on the reverse input gradient and the weight data; therefore, the method provided by this embodiment can obtain the transformation matrices G, G^T, B^T, and B used for those forward transformations, as well as the transformation matrices A and A^T used for the inverse transformation of the bitwise multiplication result.
In this embodiment, if the sizes of top_diff_j and g_j are fixed, the matrices A, A^T, B, B^T, G, and G^T are also fixed. Specifically, the sizes of top_diff_j and g_j can be determined from the required size of the output result bottom_diff_j and the sliding stride of the convolution, and the corresponding matrices are then determined from the sizes of these data.
The Winograd convolution operation device can use B and B^T to perform the forward transformation of the reverse input gradient top_diff_j, obtaining B^T top_diff_j B, and use G and G^T to perform the forward transformation of the weight data g_j, obtaining G g_j G^T. The device then performs bitwise multiplication on these two transformation results, obtaining the second multiplication operation result (G g_j G^T) ⊙ (B^T top_diff_j B). Finally, the device uses A and A^T to perform the inverse transformation of the second multiplication operation result, obtaining the final reverse output gradient A^T[(G g_j G^T) ⊙ (B^T top_diff_j B)]A. This reverse output gradient serves as the reverse input gradient of the (j-1)-th layer, which that layer uses to compute its own reverse output gradient. Similarly, in the above process, the j-th layer's reverse input gradient is obtained and output to the j-th layer by the (j+1)-th layer operating on the (j+1)-th layer's reverse input gradient and weight data.
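Formula (3) evaluated verbatim can be sketched as below, again reusing the F(2×2, 3×3) matrices from the earlier sketch. This is an illustration under assumptions: in a complete implementation the incoming gradient would first be zero-padded into 4×4 tiles and, for layers computing cross-correlation, the kernel rotated by 180° before the transform — details this application does not spell out and which are therefore omitted here.

```python
import numpy as np
# Reuses B_T, G, A_T from the F(2x2, 3x3) sketch earlier in this document.

def backward_output_gradient(top_diff_tile, g):
    """bottom_diff = A^T [(G g G^T) (.) (B^T top_diff B)] A  -- formula (3).

    top_diff_tile: a 4x4 tile taken from the (padded) incoming gradient;
    g: the layer's 3x3 weight data; returns a 2x2 outgoing-gradient tile.
    """
    U = G @ g @ G.T                  # forward transformation of the weight data
    V = B_T @ top_diff_tile @ B_T.T  # forward transformation of the gradient tile
    return A_T @ (U * V) @ A_T.T     # inverse transformation of the product
```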
Optionally, training the neural network also includes forward-propagation training, in which the convolution can likewise be processed by disassembling it into summations. FIG. 6 is a schematic flowchart of yet another Winograd convolution operation method provided by this application; the method of FIG. 6 can be used in combination with FIG. 4 or FIG. 5, acting together on the training process of the neural network, or it can be used independently, acting on the training process alone.
The method includes:
Step 401: based on the pre-configured Winograd algorithm, respectively disassembling the forward transformation operations of the i-th layer's forward input feature data and of the i-th layer's weight data into summation operations, so as to obtain the transformation result of the forward transformation operation of the i-th layer's forward input feature data and the transformation result of the forward transformation operation of the i-th layer's weight data;
Step 402: performing bitwise multiplication on the transformation result of the forward transformation operation of the i-th layer's forward input feature data and the transformation result of the forward transformation operation of the i-th layer's weight data to obtain a third multiplication operation result;
Step 403: disassembling the inverse transformation operation of the third multiplication operation result into a summation operation, and taking the result of the summation operation as the forward output feature data of the i-th layer.
In the method provided by this embodiment, the Winograd algorithm can be used to transform the convolution of the weight data and the forward input feature data during forward-propagation training to obtain the forward output feature data.
The computation of the forward output feature data with the Winograd algorithm can be expressed by formula (4):
bottom_data_i = A^T[(G g_i G^T) ⊙ (B^T top_data_i B)]A      formula (4)
where bottom_data_i denotes the forward output feature data of the i-th layer (i belonging to m), i.e. the result matrix of convolving the i-th layer's forward input feature data with the i-th layer's weight data; this forward output feature data also serves as the forward input feature data of the (i+1)-th layer. top_data_i denotes the forward input feature data of the i-th layer, identical to the forward output feature data of the (i-1)-th layer; g_i denotes the weight data of the i-th layer; B denotes the transformation matrix that converts the forward input feature data from the original domain to the Winograd domain, and B^T the corresponding inverse-transformation matrix; G denotes the transformation matrix that converts the weight data from the original domain to the Winograd domain, and G^T the corresponding inverse-transformation matrix; A denotes the transformation matrix of the inverse transformation operation that converts the bitwise multiplication result from the original domain to the Winograd domain, and A^T the corresponding inverse-transformation matrix.
It should be noted that the original domain refers to the domain that has not undergone the Winograd transformation, and the Winograd domain refers to the domain that has.
In practice, the Winograd algorithm requires forward transformation operations to be performed separately on the forward input feature data and the weight data; therefore, the method provided by this embodiment can obtain the transformation matrices G, G^T, B^T, and B used for those forward transformations, as well as the transformation matrices A and A^T used for the inverse transformation of the bitwise multiplication result.
In this embodiment, if the sizes of top_data_i and g_i are fixed, the matrices A, A^T, B, B^T, G, and G^T are also fixed. Specifically, the sizes of top_data_i and g_i can be determined from the required size of the output result bottom_data_i and the sliding stride of the convolution, and the corresponding matrices are then determined from the sizes of these data.
The Winograd convolution operation device can use B and B^T to perform the forward transformation of the forward input feature data top_data_i, obtaining B^T top_data_i B, and use G and G^T to perform the forward transformation of the weight data g_i, obtaining G g_i G^T. The device then performs bitwise multiplication on these two transformation results, obtaining the third multiplication operation result (G g_i G^T) ⊙ (B^T top_data_i B). Finally, the device uses A and A^T to perform the inverse transformation of the third multiplication operation result, obtaining the final forward output feature data A^T[(G g_i G^T) ⊙ (B^T top_data_i B)]A. This forward output feature data serves as the forward input feature data of the (i+1)-th layer, which that layer uses to compute its own forward output feature data. Similarly, in the above process, the i-th layer's forward input feature data is obtained and output to the i-th layer by the (i-1)-th layer operating on the (i-1)-th layer's forward input feature data and weight data.
That is, in the embodiments of the present disclosure, the forward and inverse transformation operations involved in FIG. 4, FIG. 5, and FIG. 6 can all be carried out by disassembling the transformation operation into summation operations, the results of the forward and inverse transformations being determined from those summations. Traditional convolution contains many multiplications; by performing the convolution with the Winograd algorithm, the number of multiplications in the transformation operations is effectively reduced while the number of additions is relatively increased, which improves computational efficiency and reduces the performance cost of the operation.
Further, to reduce the number of multiplications and the performance cost they bring, in the method provided by this embodiment the forward or inverse transformation operation can be disassembled into summation operations as follows: the target data is disassembled into multiple sub-tensors corresponding to the target data, the transformation operation is performed on the multiple sub-tensors and the results are summed, and the transformation result corresponding to the target data is obtained from the result of the summation. The target data is one of the following: the reverse input gradient, the forward input feature data, the weight data, the first multiplication operation result, the second multiplication operation result, and the third multiplication operation result.
That is, in practice the forward transformations of the reverse input gradient, the forward input feature data, and the weight data mentioned above can be disassembled into multiple sub-transformation results whose sum is determined as the corresponding transformation result, and likewise the inverse transformations of the first, second, and third multiplication operation results can be disassembled into multiple sub-transformation results whose sum is determined as the corresponding transformation result.
Taking B^T top_data_j B as an example, suppose the forward input feature data top_data_j is a 4×4 matrix. A replacement matrix can then be pre-configured for each element of top_data_j; for example, d_00 corresponds to a matrix D_00, d_01 to D_01, ..., and d_33 to D_33. A replacement matrix may be a matrix whose entries are 0, 1, or -1.
When transforming top_data_j, the replacement matrices corresponding to top_data_j can be read directly; each element of top_data_j is extracted, multiplied by its replacement matrix, and the products are added, giving the transformation result. Concretely, the replacement matrices can be determined from the size of the forward input feature data, the feature transformation matrix B, and the feature inverse-transformation matrix B^T; when transforming the forward input feature data, the pre-stored feature-transformation replacement matrices can be read directly.
Specifically, multiplying a single element by a replacement matrix reduces the number of multiplications, and when the replacement matrix consists of 0, 1, and -1, the amount of computation drops sharply. For example, if the feature data is a 4×4 matrix comprising the 16 elements d_00, d_01, ..., d_33, there can be 16 corresponding replacement matrices D_00, D_01, ..., D_33, whose precomputation and use the sketch below makes concrete.
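A minimal sketch of that construction (illustrative only: it assumes the F(2×2, 3×3) matrix B_T from the earlier sketch, and the names D, E, and transform_by_summation are ours, not the application's):

```python
import numpy as np
# Reuses B_T from the F(2x2, 3x3) sketch earlier in this document.

# Precompute one replacement matrix per element position:
# D[i, j] = B^T E_ij B, where E_ij is the meta-sub-tensor with a 1 at (i, j).
D = np.empty((4, 4, 4, 4))
for i in range(4):
    for j in range(4):
        E = np.zeros((4, 4))
        E[i, j] = 1.0
        D[i, j] = B_T @ E @ B_T.T

def transform_by_summation(d):
    """B^T d B expressed as sum_ij d[i, j] * D_ij over the non-zero elements."""
    out = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            if d[i, j] != 0:      # one sub-tensor per non-zero element
                out += d[i, j] * D[i, j]
    return out

d = np.random.randn(4, 4)
assert np.allclose(transform_by_summation(d), B_T @ d @ B_T.T)  # linearity
```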
Further, the sum of the multiple sub-tensors corresponding to the target data equals the target data; the number of sub-tensors equals the number of non-zero elements in the target data; each sub-tensor has a single non-zero element, equal to the non-zero element at the corresponding position in the target data. Taking the forward input feature data top_data_j as the target data, it can be written as:
top_data_j =
  [ d_00  d_01  d_02  d_03
    d_10  d_11  d_12  d_13
    d_20  d_21  d_22  d_23
    d_30  d_31  d_32  d_33 ]
According to the above rule, top_data_j can then be split into 16 sub-tensors (assuming all elements of the feature data are non-zero), each of which is a 4×4 matrix retaining exactly one element d_ij at position (i, j) and containing 0 everywhere else, from the sub-tensor holding d_00 through the sub-tensor holding d_33.
In practice, after the target data is split into sub-tensors, each sub-tensor can be transformed with the transformation matrices and the transformation results of the sub-tensors added, giving the feature transformation result.
Since the sum of the sub-tensors equals the target data, transforming the sub-tensors and adding the results gives the same outcome as transforming the target data directly.
For example, one of the sub-tensors can be converted based on the following expression:
B^T [ d_00 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ] B
Each sub-tensor can be transformed in this way, and the transformation results of all sub-tensors are then added, giving the transformation result.
To further reduce the multiplications in the operation, when transforming the sub-tensors and summing to obtain the transformation result, it is also possible to:
determine the meta-sub-tensor corresponding to each sub-tensor, where the meta-sub-tensor is a tensor in which the non-zero element of the sub-tensor is set to 1;
determine the transformation result of each meta-sub-tensor;
and sum the transformation results of the sub-tensors to obtain the feature transformation result.
Concretely, the non-zero element in a sub-tensor can be identified and the corresponding position set to 1 to obtain the meta-sub-tensor. For example, for the sub-tensor
[ d_00 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ]
the corresponding meta-sub-tensor is:
[ 1 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ]
The corresponding meta-sub-tensor can be determined for every sub-tensor.
When transforming a sub-tensor, its transformation result can be determined from the transformation matrices, the meta-sub-tensor, and the corresponding non-zero element.
Concretely, the meta-sub-tensor is multiplied on the left by the left-multiplication matrix and on the right by the right-multiplication matrix of the feature transformation, and the result is multiplied by the non-zero element corresponding to the meta-sub-tensor, giving the transformation result of the sub-tensor; both the left-multiplication matrix and the right-multiplication matrix are determined by the scale of the sub-tensor.
For example,
B^T [ d_00 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ] B
can be converted into
d_00 × ( B^T [ 1 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ] B )
Since the entries of the meta-sub-tensor are either 0 or 1, the performance cost of evaluating this expression is small.
Further, since B^T and B can be determined from the size of the forward input feature data, and the meta-sub-tensors can also be determined in advance from the forward input feature data, the replacement matrix corresponding to each element position of the feature data can be determined in advance from B^T, B, and the meta-sub-tensors.
For example, for the element position in the first row and first column, the replacement matrix is:
D_00 = B^T [ 1 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ] B
Based on the above, the transformation result of the sub-tensor
[ d_00 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ]
becomes:
d_00 × D_00
A corresponding replacement matrix can be determined for every element position of the forward input feature data; when transforming the forward input feature data, the corresponding set of replacement matrices can be determined directly from the data size, and the transformation result determined from that set.
In the concrete operation this gives
B^T top_data_j B = d_00 × D_00 + d_01 × D_01 + ... + d_33 × D_33
When the replacement matrices consist of 0, 1, and -1, the multiplication of an element by a replacement matrix turns into a direct data write; for example, multiplying a 1 in the replacement matrix by d_00 actually amounts to writing d_00 directly. Therefore, based on the method provided by this embodiment, the transformation process in the Winograd algorithm can be converted into an addition algorithm, further reducing the amount of computation of the convolution process; the sketch below makes the add-only evaluation concrete.
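A sketch of that add-only evaluation, reusing the replacement matrices D precomputed in the previous sketch. That every D_ij has entries in {-1, 0, +1} holds for the F(2×2, 3×3) matrices used in these sketches (each entry of D_ij is a product of two entries of B^T, all of which are 0 or ±1); it is an assumption about this tile size, not a statement about every scale:

```python
def transform_add_only(d):
    """B^T d B evaluated with additions/subtractions only."""
    out = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            v = d[i, j]
            if v == 0:
                continue               # zero elements contribute nothing
            out[D[i, j] == 1] += v     # a "+1" entry: write/accumulate d[i, j]
            out[D[i, j] == -1] -= v    # a "-1" entry: accumulate its negation
    return out

d = np.random.randn(4, 4)
assert np.allclose(transform_add_only(d), B_T @ d @ B_T.T)
```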
The forward transformation of other types of data, such as top_diff_j or g_j, proceeds similarly to that of top_data_j and is not repeated here.
In addition, after the transformation results of the forward transformations are determined, bitwise multiplication can be performed on the two results according to the operation objective. For example, after obtaining B^T top_data_j B and A top_diff_j A^T in the method shown in FIG. 4, bitwise multiplication can be performed on these two operation results, i.e. the result of (A top_diff_j A^T) ⊙ (B^T top_data_j B) is determined.
In practice, the values at corresponding positions of the two transformation results can be multiplied, giving a new matrix as the multiplication operation result. For example, if the transformation result of the reverse input gradient is a 4×4 matrix with elements t_ij and the transformation result of the forward input feature data is a 4×4 matrix with elements f_ij, then the first multiplication operation result can be expressed as the 4×4 matrix whose element at position (i, j) is t_ij × f_ij.
The inverse-transformation matrices A and A^T used to inversely transform the multiplication result can then be obtained; as described above, the inverse-transformation matrices can be determined from the size of the operation result.
Further, the inverse-transformation matrices can be used to inversely transform the multiplication result; for example, the transformation result of the inverse transformation of the first multiplication operation result, i.e. the weight difference Δw_j = G^T[(A top_diff_j A^T) ⊙ (B^T top_data_j B)]G, can be determined by split-and-sum.
Specifically, in the inverse transformation, to reduce multiplications, the transformation of the multiplication result can, as in the forward transformation, be disassembled into summation operations from which the operation result is determined.
Taking the disassembly of the inverse transformation of the first multiplication operation result as an example, the first multiplication operation result can be transformed based on the expression
A p A^T
where A^T and A are the inverse-transformation matrices and p is the first multiplication operation result. To disassemble this transformation into summation operations, the first multiplication operation result is split into multiple sub-tensors, which are transformed with the inverse-transformation matrices and summed, giving the operation result.
Specifically, the sum of the sub-tensors equals the multiplication result; the number of sub-tensors equals the number of non-zero elements in the multiplication result; each sub-tensor has a single non-zero element, equal to the non-zero element at the corresponding position in the multiplication result.
That is, with the first multiplication operation result p written as
p =
  [ p_00  p_01  p_02  p_03
    p_10  p_11  p_12  p_13
    p_20  p_21  p_22  p_23
    p_30  p_31  p_32  p_33 ]
it can be split into 16 sub-tensors, each retaining exactly one element p_ij at position (i, j) and containing 0 everywhere else.
In practice, after the multiplication result is split into result sub-tensors, each result sub-tensor can be transformed with the inverse-transformation matrices and the transformation results added, giving the operation result.
For example, one of the result sub-tensors can be converted based on the expression
A [ p_00 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ] A^T
Each sub-tensor can be transformed in this way, and the transformation results of all sub-tensors are then added, giving the operation result.
To further reduce the multiplications in the operation, when transforming the sub-tensors and summing to obtain the operation result, it is also possible to:
determine the meta-sub-tensor corresponding to each sub-tensor, where the meta-sub-tensor is a tensor in which the non-zero element of the sub-tensor is set to 1;
determine the transformation result of each result sub-tensor from the inverse-transformation matrices, the meta-sub-tensor, and the corresponding non-zero element;
and sum the transformation results of the sub-tensors to obtain the operation result.
Similarly to the transformation of the forward input feature data described above, the non-zero element in a sub-tensor of the first multiplication operation result can be identified and the corresponding position set to 1, giving the meta-sub-tensor. For example, for the sub-tensor
[ p_00 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ]
the corresponding meta-sub-tensor is:
[ 1 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ]
The corresponding meta-sub-tensor can be determined for every sub-tensor.
When transforming a sub-tensor, the operation result can be determined from the inverse-transformation matrices, the meta-sub-tensor, and the corresponding non-zero element.
Concretely, the meta-sub-tensor is multiplied on the left by the left-multiplication matrix and on the right by the right-multiplication matrix of the inverse transformation, and the result is multiplied by the non-zero element corresponding to the meta-sub-tensor, giving the transformation result of the sub-tensor; both matrices are determined by the scale of the sub-tensor.
For example,
A [ p_00 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ] A^T
can be converted into
p_00 × ( A [ 1 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ] A^T )
Further, since A^T and A can be determined from the size of the operation result, and the meta-sub-tensors can also be determined in advance from that size, the replacement matrix corresponding to each element position of the multiplication result can be determined in advance from A^T, A, and the result meta-sub-tensors.
For example, for the element position in the first row and first column, the replacement matrix is:
D''_00 = A [ 1 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ] A^T
Based on the above, the transformation result of the sub-tensor
[ p_00 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ; 0 0 0 0 ]
becomes:
p_00 × D''_00
A corresponding replacement matrix can be determined for every element position of the first multiplication operation result; when transforming the first multiplication operation result, the corresponding set of replacement matrices can be determined directly from the size of that result or of the final operation result, and the operation result determined from that set.
Based on the above, the operation result, i.e. the weight difference, can be expressed as:
A p A^T = p_00 × D''_00 + p_01 × D''_01 + ... + p_33 × D''_33
That is, a replacement matrix can be determined for each element included in each multiplication result, so that the inverse transformation operation can be disassembled into summation operations according to these replacement matrices and the operation result determined from the summations; a sketch of this inverse disassembly follows.
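A sketch of the same disassembly for the inverse transformation (illustrative only, reusing A_T from the first sketch; with that 2×4 matrix A^T the inverse transform reads A^T p A, and the 2×2 replacement matrices, named D2 here, play the role of the D''_ij above):

```python
# Reuses A_T from the F(2x2, 3x3) sketch earlier in this document.
D2 = np.empty((4, 4, 2, 2))
for i in range(4):
    for j in range(4):
        E = np.zeros((4, 4))
        E[i, j] = 1.0
        D2[i, j] = A_T @ E @ A_T.T    # 2x2 replacement matrix for position (i, j)

def inverse_by_summation(p):
    """A^T p A as sum_ij p[i, j] * D2_ij over the non-zero elements of p."""
    out = np.zeros((2, 2))
    for i in range(4):
        for j in range(4):
            if p[i, j] != 0:
                out += p[i, j] * D2[i, j]
    return out

p = np.random.randn(4, 4)  # stands in for a bitwise multiplication result
assert np.allclose(inverse_by_summation(p), A_T @ p @ A_T.T)
```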
The concrete disassembly is similar to that of the feature transformation operation, so the convolution result can be obtained with fewer multiplications. The processing of the inverse transformations of the second and third multiplication operation results is similar to that of the first multiplication operation result and is not repeated here.
The method provided by this embodiment is used to perform convolution operations and is executed by a device provided with the method, typically implemented in hardware and/or software.
In the neural network training process on which this application is based, the training time required is greatly shortened thanks to the aforementioned Winograd algorithm. In particular, since the amount of training feature data is large, in this application the scale of the obtained feature data is tied to the scale of the operation results output by the neural network and to the scale of the weight data in the network. That is, when feature data is obtained for the training operation, the training data can be disassembled according to the training task, splitting large multi-dimensional training data into a number of two-dimensional feature data tiles of fixed scale; for each such tile, the operation device can perform the training operation with the method provided above, as the tiling sketch below illustrates.
As for how the operation device obtains two-dimensional feature data from a cloud server storing multi-dimensional training data, data can be read precisely by storage address: the device reads the data stored at specific address bits in the cloud server and processes it. The storage address of the feature data needed for processing can be obtained from the training instruction, i.e. the training instruction can carry the storage address of the feature data needed for this operation, from which the operation device reads the feature data from the cloud server.
In particular, in one optional implementation, at least one convolution operation performed by the operation device during the aforementioned forward-propagation and back-propagation processing uses the Winograd algorithm.
Specifically, since the neural network includes n layers, each containing a large number of convolution operations, some or all of those convolutions can use the Winograd algorithm; when only some do, the remaining convolutions can use ordinary convolution processing. The data format participating in the operation is also unrestricted; for example, each step can be computed with floating-point arithmetic, or floating-point numbers can be converted to fixed-point numbers before each step is computed.
Further, when the operation device applies the Winograd algorithm to only some of the convolutions, the training instruction received by the device also includes the selected target layers, and the device processes the convolution operations of those target layers with the Winograd algorithm, the target layers being selected according to a preset rule.
The preset rule may depend on the complexity of the training data, i.e. on the degree to which the training data is split into feature data.
Specifically, if the training data is complex, its data scale is relatively large; to perform convolution it must be split, producing a huge number of pieces of split feature data.
At that point, if the traditional convolution processing method were applied to the huge number of split feature data, the processing time would be very long. Correspondingly, if the Winograd algorithm is applied to them instead, the processing time of every single operation is shortened relative to the traditional method, and because the number of feature data is huge, the improvement in computational efficiency over the entire training data grows geometrically, significantly improving the computational efficiency of the neural network training process.
Therefore, in an optional implementation, after the training data is split, the number of pieces of split feature data can be determined first, and one or more target layers that will use the Winograd algorithm for convolution are then selected based on that number. During selection, the Winograd algorithm may be applied only to the convolutions in the forward propagation of the one or more target layers, only to those in the back propagation, or to the convolutions in both the forward propagation and the back propagation of the one or more target layers.
Further, the neural network obtained by the above operation method can be used in various scenarios; inference with a face recognition model is taken as an example here:
Step 501: receiving a recognition instruction and obtaining feature data of a face sample;
Step 502: using the feature data of the face sample as the input of the trained face recognition model and performing neural network processing with the trained face recognition model;
Step 503: using the result output by the face recognition model as the face recognition result.
With the operation method and related products provided by this application, after a training instruction is received and feature data obtained, the Winograd algorithm is used to train the weight data in the neural network with the feature data, yielding a trained neural network. Compared with the prior art, this application exploits the property of the Winograd algorithm that a large number of matrix multiplications are converted into matrix additions, which effectively improves the computational efficiency of processing the training data of the neural network and reduces the computing resources occupied by the training process.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the present disclosure is not limited by the described order of actions, because according to the present disclosure some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.
It should further be noted that although the steps in the flowcharts are displayed in sequence according to the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, their execution is not strictly ordered and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may comprise multiple sub-steps or stages, which need not be completed at the same moment but may be executed at different times; nor need they be performed sequentially, but may be performed in turn or alternately with at least part of the other steps, or of the sub-steps or stages of other steps.
In a second aspect, FIG. 7 is a schematic structural diagram of a Winograd convolution operation device provided by this application. As shown in FIG. 7, the device includes: a data receiving module 10, a transformation module 20, a bitwise multiplication module 30, and a weight update module 40;
where
the data receiving module 10 is used to obtain the forward input feature data for training the neural network;
the transformation module 20 is used to respectively disassemble, while the training module trains the neural network based on the pre-configured Winograd convolution algorithm, the forward transformation operations of the reverse input gradient of the j-th layer and of the forward input feature data of the j-th layer in the neural network into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's forward input feature data;
the bitwise multiplication module 30 is used to perform bitwise multiplication on the transformation result of the j-th layer's reverse input gradient forward transformation operation and the transformation result of the j-th layer's forward input feature data forward transformation operation, to obtain a first multiplication operation result;
the transformation module 20 is also used to disassemble the inverse transformation operation of the first multiplication operation result into a summation operation and take the result of the summation operation as the weight difference of the j-th layer;
the weight update module 40 is used to complete the training of the neural network according to the weight difference of the j-th layer.
Optionally, the transformation module 20 is also used to respectively disassemble, in the process of training the neural network based on the pre-configured Winograd convolution algorithm, the forward transformation operations of the j-th layer's reverse input gradient and of the j-th layer's weight data into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's weight data;
the bitwise multiplication module 30 is also used to perform bitwise multiplication on the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's weight data, to obtain a second multiplication operation result;
the transformation module 20 is also used to disassemble the inverse transformation operation of the second multiplication operation result into a summation operation and take the result of the summation operation as the reverse output gradient of the j-th layer.
Optionally, the transformation module 20 is also used to respectively disassemble, based on the pre-configured Winograd algorithm, the forward transformation operations of the i-th layer's forward input feature data and of the i-th layer's weight data into summation operations, so as to obtain the transformation result of the forward transformation operation of the i-th layer's forward input feature data and the transformation result of the forward transformation operation of the i-th layer's weight data;
the bitwise multiplication module 30 is also used to perform bitwise multiplication on the transformation result of the forward transformation operation of the i-th layer's forward input feature data and the transformation result of the forward transformation operation of the i-th layer's weight data, to obtain a third multiplication operation result;
the transformation module 20 is also used to disassemble the inverse transformation operation of the third multiplication operation result into a summation operation and take the result of the summation operation as the forward output feature data of the i-th layer.
Optionally, the transformation module 20 is specifically used to disassemble a forward or inverse transformation operation into a summation operation as follows: disassemble the target data into multiple sub-tensors corresponding to the target data, perform the transformation operation on the multiple sub-tensors and sum the results, and obtain the transformation result corresponding to the target data from the result of the summation; the target data is one of the following: the reverse input gradient, the forward input feature data, the weight data, the first multiplication operation result, the second multiplication operation result, and the third multiplication operation result.
Optionally, the sum of the multiple sub-tensors corresponding to the target data is the target data; the number of sub-tensors is the same as the number of non-zero elements in the target data, each sub-tensor has a single non-zero element, and the non-zero element in the sub-tensor is the same as the non-zero element at the corresponding position in the target data.
Optionally, the transformation module 20 is specifically used to: determine the meta-sub-tensor corresponding to each sub-tensor of the target data, where the meta-sub-tensor is a tensor in which the non-zero element of the sub-tensor is set to 1; obtain the transformation result of the meta-sub-tensor corresponding to each sub-tensor; multiply the transformation result of the corresponding meta-sub-tensor by the non-zero element value of the sub-tensor as a coefficient, giving the transformation result of the sub-tensor; and add the transformation results of the multiple sub-tensors to obtain the result of the summation operation, from which the transformation result of the transformation operation on the target data is obtained.
Optionally, the transformation module 20 is specifically used to: for each sub-tensor, multiply the meta-sub-tensor corresponding to the sub-tensor by a left-multiplication matrix on the left and a right-multiplication matrix on the right, giving the transformation result of the meta-sub-tensor, where both matrices are determined by the scale of the sub-tensor and the transformation type, and the transformation type includes the transformation type of a forward transformation operation and the transformation type of an inverse transformation operation.
It should be understood that the above device embodiments are only illustrative, and the device of the present disclosure can also be implemented in other ways. For example, the division of units/modules in the above embodiments is only a logical-function division; in actual implementation there may be other divisions, for instance multiple units, modules, or components may be combined or integrated into another system, or some features may be ignored or not implemented.
In addition, unless otherwise specified, the functional units/modules in the various embodiments of the present disclosure may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together. The above integrated units/modules can be implemented either in the form of hardware or in the form of software program modules.
If the integrated unit/module is implemented in the form of hardware, the hardware may be a digital circuit, an analog circuit, and so on; physical implementations of the hardware structure include but are not limited to transistors, memristors, and so on. Unless otherwise specified, the artificial intelligence processor may be any appropriate hardware processor such as a CPU, GPU, FPGA, DSP, or ASIC. Unless otherwise specified, the storage unit may be any suitable magnetic or magneto-optical storage medium, such as resistive random access memory RRAM, dynamic random access memory DRAM, static random access memory SRAM, enhanced dynamic random access memory EDRAM, high-bandwidth memory HBM, hybrid memory cube HMC, and so on.
If the integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory. Based on such an understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure. The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
In a possible implementation, an artificial intelligence chip is also disclosed, which includes the above Winograd convolution operation device.
In a possible implementation, a board card is also disclosed, which includes a storage device, an interface device, a control device, and the above artificial intelligence chip; the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively; the storage device is used to store data; the interface device is used to implement data transmission between the artificial intelligence chip and external equipment; and the control device is used to monitor the state of the artificial intelligence chip.
FIG. 8 shows a structural block diagram of a board card according to an embodiment of the present disclosure. Referring to FIG. 8, in addition to the above chip 389 the board card may include other supporting components, including but not limited to: a storage device 390, an interface device 391, and a control device 392;
the storage device 390 is connected to the artificial intelligence chip through a bus and is used to store data. The storage device may include multiple groups of storage units 393, each group connected to the artificial intelligence chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate synchronous dynamic random access memory).
DDR doubles the speed of SDRAM without increasing the clock frequency; it allows data to be read on both the rising and the falling edge of the clock pulse, making it twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of storage units, each of which may include a plurality of DDR4 chips. In one embodiment, the artificial intelligence chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
In one embodiment, each group of storage units includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling DDR is provided in the chip to control the data transmission and data storage of each storage unit.
The interface device is electrically connected to the artificial intelligence chip and is used to implement data transmission between the artificial intelligence chip and external equipment (such as a server or a computer). For example, in one embodiment the interface device may be a standard PCIE interface, the data to be processed being transferred from the server to the chip through the standard PCIE interface to realize the data transfer. Preferably, when a PCIE 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface; the present disclosure does not restrict the concrete form of such other interfaces, as long as the interface unit can implement the transfer function. In addition, the computation results of the artificial intelligence chip are still transmitted back to the external equipment (such as a server) by the interface device.
The control device is electrically connected to the artificial intelligence chip and is used to monitor its state. Specifically, the artificial intelligence chip and the control device may be electrically connected through an SPI interface, and the control device may include a micro controller unit (MCU). Since the artificial intelligence chip may include multiple processing chips, multiple processing cores, or multiple processing circuits and can drive multiple loads, it can be in different working states such as multi-load and light-load; through the control device, the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the artificial intelligence chip can be regulated.
In a possible implementation, an electronic device is disclosed that includes the above artificial intelligence chip. Electronic devices include data processing apparatus, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, camera modules, servers, cloud servers, cameras, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment.
Vehicles include airplanes, ships, and/or cars; household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound instruments, and/or electrocardiographs.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the relevant descriptions of other embodiments. The technical features of the above embodiments can be combined arbitrarily; for brevity, not all possible combinations are described, but as long as such combinations are not contradictory they shall be considered within the scope of this specification.
The foregoing can be better understood according to the following clauses:
A1. A Winograd convolution operation method, the method including:
in the process of training a neural network based on a pre-configured Winograd convolution algorithm, respectively disassembling the forward transformation operations of the reverse input gradient of the j-th layer in the neural network and of the forward input feature data of the j-th layer into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's forward input feature data;
performing bitwise multiplication on the transformation result of the j-th layer's reverse input gradient forward transformation operation and the transformation result of the j-th layer's forward input feature data forward transformation operation to obtain a first multiplication operation result;
disassembling the inverse transformation operation of the first multiplication operation result into a summation operation, and taking the result of the summation operation as the weight difference of the j-th layer;
completing the training of the neural network according to the weight difference of the j-th layer.
A2. The method according to clause A1, further including:
in the process of training the neural network based on the pre-configured Winograd convolution algorithm, respectively disassembling the forward transformation operations of the j-th layer's reverse input gradient and of the j-th layer's weight data into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's weight data;
performing bitwise multiplication on the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's weight data to obtain a second multiplication operation result;
disassembling the inverse transformation operation of the second multiplication operation result into a summation operation, and taking the result of the summation operation as the reverse output gradient of the j-th layer.
A3. The method according to clause A1 or A2, further including:
based on the pre-configured Winograd algorithm, respectively disassembling the forward transformation operations of the i-th layer's forward input feature data and of the i-th layer's weight data into summation operations, so as to obtain the transformation result of the forward transformation operation of the i-th layer's forward input feature data and the transformation result of the forward transformation operation of the i-th layer's weight data;
performing bitwise multiplication on the transformation result of the forward transformation operation of the i-th layer's forward input feature data and the transformation result of the forward transformation operation of the i-th layer's weight data to obtain a third multiplication operation result;
disassembling the inverse transformation operation of the third multiplication operation result into a summation operation, and taking the result of the summation operation as the forward output feature data of the i-th layer.
A4. The method according to any one of clauses A1 to A3, further including:
disassembling a forward or inverse transformation operation into a summation operation as follows: disassembling the target data into multiple sub-tensors corresponding to the target data, performing the transformation operation on the multiple sub-tensors and summing the results, and obtaining the transformation result corresponding to the target data from the result of the summation operation;
the target data including one of the following: the reverse input gradient, the forward input feature data, the weight data, the first multiplication operation result, the second multiplication operation result, and the third multiplication operation result.
A5. The method according to clause A4, wherein the sum of the multiple sub-tensors corresponding to the target data is the target data;
the number of the multiple sub-tensors is the same as the number of non-zero elements in the target data, each sub-tensor has a single non-zero element, and the non-zero element in the sub-tensor is the same as the non-zero element at the corresponding position in the target data.
A6. The method according to clause A4 or A5, wherein performing the transformation operation on the multiple sub-tensors and summing, and obtaining the transformation result corresponding to the target data from the result of the summation operation, includes:
determining the meta-sub-tensor corresponding to each sub-tensor of the target data, where the meta-sub-tensor is a tensor in which the non-zero element of the sub-tensor is set to 1;
obtaining the transformation result of the meta-sub-tensor corresponding to each sub-tensor;
multiplying the transformation result of the corresponding meta-sub-tensor by the non-zero element value of the sub-tensor as a coefficient, giving the transformation result of the sub-tensor;
adding the transformation results of the multiple sub-tensors to obtain the result of the summation operation, and obtaining the transformation result of the transformation operation on the target data from the result of the summation operation.
A7. The method according to clause A6, wherein obtaining the transformation result of the meta-sub-tensor corresponding to each sub-tensor includes:
for each sub-tensor, multiplying the meta-sub-tensor corresponding to the sub-tensor by a left-multiplication matrix on the left and a right-multiplication matrix on the right, giving the transformation result of the meta-sub-tensor, where both the left-multiplication matrix and the right-multiplication matrix are determined by the scale of the sub-tensor and the transformation type, the transformation type including the transformation type of a forward transformation operation and the transformation type of an inverse transformation operation.
A8. A Winograd convolution operation device, including:
a data receiving module, used to obtain the forward input feature data for training the neural network;
a transformation module, used to respectively disassemble, while the training module trains the neural network based on the pre-configured Winograd convolution algorithm, the forward transformation operations of the reverse input gradient of the j-th layer and of the forward input feature data of the j-th layer in the neural network into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's forward input feature data;
a bitwise multiplication module, used to perform bitwise multiplication on the transformation result of the j-th layer's reverse input gradient forward transformation operation and the transformation result of the j-th layer's forward input feature data forward transformation operation to obtain a first multiplication operation result;
the transformation module being also used to disassemble the inverse transformation operation of the first multiplication operation result into a summation operation and take the result of the summation operation as the weight difference of the j-th layer;
a weight update module, used to complete the training of the neural network according to the weight difference of the j-th layer.
A9. The device according to clause A8, wherein
the transformation module is also used to respectively disassemble, in the process of training the neural network based on the pre-configured Winograd convolution algorithm, the forward transformation operations of the j-th layer's reverse input gradient and of the j-th layer's weight data into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's weight data;
the bitwise multiplication module is also used to perform bitwise multiplication on the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's weight data to obtain a second multiplication operation result;
the transformation module is also used to disassemble the inverse transformation operation of the second multiplication operation result into a summation operation and take the result of the summation operation as the reverse output gradient of the j-th layer.
A10. The device according to clause A8 or A9, wherein
the transformation module is also used to respectively disassemble, based on the pre-configured Winograd algorithm, the forward transformation operations of the i-th layer's forward input feature data and of the i-th layer's weight data into summation operations, so as to obtain the transformation result of the forward transformation operation of the i-th layer's forward input feature data and the transformation result of the forward transformation operation of the i-th layer's weight data;
the bitwise multiplication module is also used to perform bitwise multiplication on the transformation result of the forward transformation operation of the i-th layer's forward input feature data and the transformation result of the forward transformation operation of the i-th layer's weight data to obtain a third multiplication operation result;
the transformation module is also used to disassemble the inverse transformation operation of the third multiplication operation result into a summation operation and take the result of the summation operation as the forward output feature data of the i-th layer.
A11. The device according to any one of clauses A8 to A10, wherein the transformation module is specifically used to:
disassemble a forward or inverse transformation operation into a summation operation as follows: disassemble the target data into multiple sub-tensors corresponding to the target data, perform the transformation operation on the multiple sub-tensors and sum the results, and obtain the transformation result corresponding to the target data from the result of the summation operation;
the target data including one of the following: the reverse input gradient, the forward input feature data, the weight data, the first multiplication operation result, the second multiplication operation result, and the third multiplication operation result.
A12. The device according to clause A11, wherein the sum of the multiple sub-tensors corresponding to the target data is the target data;
the number of the multiple sub-tensors is the same as the number of non-zero elements in the target data, each sub-tensor has a single non-zero element, and the non-zero element in the sub-tensor is the same as the non-zero element at the corresponding position in the target data.
A13. The device according to clause A10 or A11, wherein the transformation module is specifically used to:
determine the meta-sub-tensor corresponding to each sub-tensor of the target data, where the meta-sub-tensor is a tensor in which the non-zero element of the sub-tensor is set to 1;
obtain the transformation result of the meta-sub-tensor corresponding to each sub-tensor;
multiply the transformation result of the corresponding meta-sub-tensor by the non-zero element value of the sub-tensor as a coefficient, giving the transformation result of the sub-tensor;
add the transformation results of the multiple sub-tensors to obtain the result of the summation operation, and obtain the transformation result of the transformation operation on the target data from the result of the summation operation.
A14. The device according to clause A13, wherein the transformation module is specifically used to:
for each sub-tensor, multiply the meta-sub-tensor corresponding to the sub-tensor by a left-multiplication matrix on the left and a right-multiplication matrix on the right, giving the transformation result of the meta-sub-tensor, where both the left-multiplication matrix and the right-multiplication matrix are determined by the scale of the sub-tensor and the transformation type, the transformation type including the transformation type of a forward transformation operation and the transformation type of an inverse transformation operation.
A15. An artificial intelligence chip, the chip including the Winograd convolution operation device according to any one of clauses A6 to A10.
A16. An electronic device including the artificial intelligence chip according to clause A15.
A17. A board card including: a storage device, an interface device, a control device, and the artificial intelligence chip according to clause A15;
the artificial intelligence chip being connected to the storage device, the control device, and the interface device respectively;
the storage device being used to store data;
the interface device being used to implement data transmission between the artificial intelligence chip and external equipment;
the control device being used to monitor the state of the artificial intelligence chip.
A18. The board card according to clause A17, wherein the storage device includes multiple groups of storage units, each group connected to the artificial intelligence chip through a bus, the storage units being DDR SDRAM;
the chip includes a DDR controller for controlling the data transmission and data storage of each storage unit;
the interface device is a standard PCIE interface.
With the Winograd operation method and related products provided by this application, in the process of training the neural network based on the pre-configured Winograd convolution algorithm, the forward transformation operations of the reverse input gradient of the j-th layer of the neural network and of the forward input feature data of the j-th layer are respectively disassembled into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's forward input feature data; bitwise multiplication is performed on these two transformation results to obtain a first multiplication operation result; the inverse transformation operation of the first multiplication operation result is disassembled into a summation operation whose result is taken as the weight difference of the j-th layer; and the training of the neural network is completed according to the weight difference of the j-th layer. Compared with the prior art, this application exploits the property of the Winograd algorithm that a large number of matrix multiplications are converted into matrix additions, which effectively improves the computational efficiency of processing the training data of the neural network and reduces the computing resources occupied by the training process.
The embodiments of the present disclosure have been introduced in detail above; specific examples have been used herein to explain the principles and implementations of the present disclosure, and the descriptions of the above embodiments are only meant to help understand the method of the present disclosure and its core idea. Meanwhile, changes or modifications made by those skilled in the art based on the ideas, the specific embodiments, and the scope of application of the present disclosure all fall within its protection scope. In summary, the content of this specification should not be construed as limiting the present disclosure.

Claims (18)

  1. A Winograd convolution operation method, characterized in that the method comprises:
    in the process of training a neural network based on a pre-configured Winograd convolution algorithm, respectively disassembling the forward transformation operations of the reverse input gradient of the j-th layer in the neural network and of the forward input feature data of the j-th layer into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's forward input feature data;
    performing bitwise multiplication on the transformation result of the j-th layer's reverse input gradient forward transformation operation and the transformation result of the j-th layer's forward input feature data forward transformation operation to obtain a first multiplication operation result;
    disassembling the inverse transformation operation of the first multiplication operation result into a summation operation, and taking the result of the summation operation as the weight difference of the j-th layer;
    completing the training of the neural network according to the weight difference of the j-th layer.
  2. The method according to claim 1, characterized in that the method further comprises:
    in the process of training the neural network based on the pre-configured Winograd convolution algorithm, respectively disassembling the forward transformation operations of the j-th layer's reverse input gradient and of the j-th layer's weight data into summation operations, so as to obtain, based on the summation operations, the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's weight data;
    performing bitwise multiplication on the transformation result of the forward transformation operation of the j-th layer's reverse input gradient and the transformation result of the forward transformation operation of the j-th layer's weight data to obtain a second multiplication operation result;
    disassembling the inverse transformation operation of the second multiplication operation result into a summation operation, and taking the result of the summation operation as the reverse output gradient of the j-th layer.
  3. The method according to claim 1 or 2, characterized in that the method further comprises:
    based on the pre-configured Winograd algorithm, respectively disassembling the forward transformation operations of the i-th layer's forward input feature data and of the i-th layer's weight data into summation operations, so as to obtain the transformation result of the forward transformation operation of the i-th layer's forward input feature data and the transformation result of the forward transformation operation of the i-th layer's weight data;
    performing bitwise multiplication on the transformation result of the forward transformation operation of the i-th layer's forward input feature data and the transformation result of the forward transformation operation of the i-th layer's weight data to obtain a third multiplication operation result;
    disassembling the inverse transformation operation of the third multiplication operation result into a summation operation, and taking the result of the summation operation as the forward output feature data of the i-th layer.
  4. The method according to any one of claims 1-3, characterized in that a forward or inverse transformation operation is disassembled into a summation operation as follows: the target data is disassembled into multiple sub-tensors corresponding to the target data, the transformation operation is performed on the multiple sub-tensors and the results are summed, and the transformation result corresponding to the target data is obtained from the result of the summation operation;
    the target data comprises one of the following: the reverse input gradient, the forward input feature data, the weight data, the first multiplication operation result, the second multiplication operation result, and the third multiplication operation result.
  5. The method according to claim 4, characterized in that the sum of the multiple sub-tensors corresponding to the target data is the target data;
    the number of the multiple sub-tensors is the same as the number of non-zero elements in the target data, each sub-tensor has a single non-zero element, and the non-zero element in the sub-tensor is the same as the non-zero element at the corresponding position in the target data.
  6. The method according to claim 4 or 5, characterized in that performing the transformation operation on the multiple sub-tensors and summing, and obtaining the transformation result corresponding to the target data from the result of the summation operation, comprises:
    determining the meta-sub-tensor corresponding to each sub-tensor of the target data, where the meta-sub-tensor is a tensor in which the non-zero element of the sub-tensor is set to 1;
    obtaining the transformation result of the meta-sub-tensor corresponding to each sub-tensor;
    multiplying the transformation result of the corresponding meta-sub-tensor by the non-zero element value of the sub-tensor as a coefficient, giving the transformation result of the sub-tensor;
    adding the transformation results of the multiple sub-tensors to obtain the result of the summation operation, and obtaining the transformation result of the transformation operation on the target data from the result of the summation operation.
  7. The method according to claim 6, characterized in that obtaining the transformation result of the meta-sub-tensor corresponding to each sub-tensor comprises:
    for each sub-tensor, multiplying the meta-sub-tensor corresponding to the sub-tensor by a left-multiplication matrix on the left and a right-multiplication matrix on the right, giving the transformation result of the meta-sub-tensor, where both the left-multiplication matrix and the right-multiplication matrix are determined by the scale of the sub-tensor and the transformation type, the transformation type comprising the transformation type of a forward transformation operation and the transformation type of an inverse transformation operation.
  8. 一种Winograd卷积运算装置,其特征在于,包括:
    数据接收模块,用于获取对神经网络进行训练的正向输入特征数据;
    变换模块,用于在训练模块基于预配置的Winograd卷积算法对神经网络进行训练的过程中,分别将所述神经网络中第j层的反向输入梯度以及第j层的正向输入特征数据的正变换运算拆解为求和运算,以基于求和运算获取所述第j层的反向输入梯度正变换运算的变换结果,以及第j层的正向输入特征数据正变换运算的变换结果;
    对位乘模块,用于对第j层的反向输入梯度正变换运算的变换结果和第j层的正向输入特征数据正变换运算的变换结果执行对位乘法运算,获得第一乘法运算结果;
    变换模块,还用于将第一乘法运算结果的逆变换运算拆解为求和运算,并将求和运算所得结果作为第j层的权值差;
    权值更新模块,用于根据所述第j层的权值差完成所述神经网络的训练。
  9. The apparatus according to claim 8, wherein:
    the transformation module is further configured to, during training of the neural network based on the pre-configured Winograd convolution algorithm, disassemble the forward transformation operations of the reverse input gradient of the j-th layer and of the weight data of the j-th layer into summation operations, respectively, so as to obtain, on the basis of the summation operations, the transformation result of the forward transformation operation of the reverse input gradient of the j-th layer and the transformation result of the forward transformation operation of the weight data of the j-th layer;
    the element-wise multiplication module is further configured to perform an element-wise multiplication on the transformation result of the forward transformation operation of the reverse input gradient of the j-th layer and the transformation result of the forward transformation operation of the weight data of the j-th layer to obtain a second multiplication result; and
    the transformation module is further configured to disassemble the inverse transformation operation of the second multiplication result into a summation operation and take the result of the summation operation as the reverse output gradient of the j-th layer.
  10. The apparatus according to claim 8 or 9, wherein:
    the transformation module is further configured to, based on the pre-configured Winograd algorithm, disassemble the forward transformation operations of the forward input feature data of the i-th layer and of the weight data of the i-th layer into summation operations, respectively, to obtain the transformation result of the forward transformation operation of the forward input feature data of the i-th layer and the transformation result of the forward transformation operation of the weight data of the i-th layer;
    the element-wise multiplication module is further configured to perform an element-wise multiplication on the transformation result of the forward transformation operation of the forward input feature data of the i-th layer and the transformation result of the forward transformation operation of the weight data of the i-th layer to obtain a third multiplication result; and
    the transformation module is further configured to disassemble the inverse transformation operation of the third multiplication result into a summation operation and take the result of the summation operation as the forward output feature data of the i-th layer.
  11. The apparatus according to any one of claims 8 to 10, wherein the transformation module is specifically configured to:
    disassemble a forward transformation operation or an inverse transformation operation into a summation operation by disassembling the target data into multiple sub-tensors corresponding to the target data, performing the transformation operation on the multiple sub-tensors and summing the results, and obtaining the transformation result corresponding to the target data according to the result of the summation operation;
    the target data comprising one of the following: the reverse input gradient, the forward input feature data, the weight data, the first multiplication result, the second multiplication result, and the third multiplication result.
  12. The apparatus according to claim 11, wherein the sum of the multiple sub-tensors corresponding to the target data is the target data;
    the number of the multiple sub-tensors being the same as the number of non-zero elements in the target data, each sub-tensor having a single non-zero element, and the non-zero element in each sub-tensor being identical to the non-zero element at the corresponding position in the target data.
  13. The apparatus according to claim 10 or 11, wherein the transformation module is specifically configured to:
    determine the meta sub-tensor corresponding to each sub-tensor of the target data, wherein a meta sub-tensor is a tensor obtained by setting the non-zero element of the sub-tensor to 1;
    obtain the transformation result of the meta sub-tensor corresponding to each sub-tensor;
    multiply the transformation result of the corresponding meta sub-tensor by the non-zero element value of the sub-tensor, taken as a coefficient, to obtain the transformation result of the sub-tensor; and
    add the transformation results of the multiple sub-tensors to obtain the result of the summation operation, and obtain, according to the result of the summation operation, the transformation result of performing the transformation operation on the target data.
  14. The apparatus according to claim 13, wherein the transformation module is specifically configured to:
    for each sub-tensor, multiply the meta sub-tensor corresponding to the sub-tensor by a left-multiplication matrix on the left and by a right-multiplication matrix on the right to obtain the transformation result of the meta sub-tensor, wherein both the left-multiplication matrix and the right-multiplication matrix are determined by the scale of the sub-tensor and by the transformation type, the transformation type including the transformation type of the forward transformation operation and the transformation type of the inverse transformation operation.
  15. An artificial intelligence chip, wherein the chip comprises the Winograd convolution operation apparatus according to any one of claims 8 to 14.
  16. An electronic device, wherein the electronic device comprises the artificial intelligence chip according to claim 15.
  17. A board card, wherein the board card comprises: a storage device, an interface apparatus, a control device, and the artificial intelligence chip according to claim 15;
    wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface apparatus, respectively;
    the storage device is configured to store data;
    the interface apparatus is configured to implement data transmission between the artificial intelligence chip and an external device; and
    the control device is configured to monitor a state of the artificial intelligence chip.
  18. The board card according to claim 17, wherein:
    the storage device comprises multiple groups of storage units, each group of storage units being connected to the artificial intelligence chip by a bus, the storage units being DDR SDRAM;
    the chip comprises a DDR controller configured to control data transmission to and data storage in each of the storage units; and
    the interface apparatus is a standard PCIe interface.
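For claims 3 and 10, which cover forward propagation, here is a hedged single-tile Python sketch. It assumes the standard F(2x2, 3x3) transform matrices from Lavin & Gray (CVPR 2016), since the claims fix no concrete matrices, and for brevity writes the transforms as direct matrix products rather than as summation disassemblies; all names are illustrative assumptions.

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd transforms (Lavin & Gray, CVPR 2016).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float64)

def winograd_tile_forward(d, g):
    """Forward output feature tile: inverse-transform the element-wise
    product of the two forward transforms (the third multiplication result)."""
    V = B_T @ d @ B_T.T           # forward transform of the 4x4 input feature tile
    U = G @ g @ G.T               # forward transform of the 3x3 weight tile
    return A_T @ (U * V) @ A_T.T  # inverse transform of the product

# Cross-check against direct 'valid' correlation on one tile.
d = np.random.randn(4, 4)
g = np.random.randn(3, 3)
direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_tile_forward(d, g), direct)
```

The cross-check confirms that inverse-transforming the element-wise product of the two forward transforms reproduces the 2x2 valid correlation of the tile, i.e. the forward output feature data for that tile.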
PCT/CN2020/113168 2019-11-01 2020-09-03 Winograd convolution operation method and related product WO2021082725A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911061089.1A CN112784951B (zh) 2019-11-01 2019-11-01 Winograd convolution operation method and related product
CN201911061089.1 2019-11-01

Publications (1)

Publication Number Publication Date
WO2021082725A1 true WO2021082725A1 (zh) 2021-05-06

Family

ID=75715762

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113168 WO2021082725A1 (zh) 2019-11-01 2020-09-03 Winograd convolution operation method and related product

Country Status (2)

Country Link
CN (1) CN112784951B (zh)
WO (1) WO2021082725A1 (zh)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180189237A1 (en) * 2016-12-30 2018-07-05 Intel Corporation Winograd algorithm on a matrix processing architecture
CN109388777A (zh) * 2017-08-07 2019-02-26 英特尔公司 一种用于经优化的Winograd卷积加速器的系统和方法
US20190149134A1 (en) * 2019-01-14 2019-05-16 Intel Corporation Filter optimization to improve computational efficiency of convolution operations
CN110222760A (zh) * 2019-06-04 2019-09-10 东南大学 一种基于winograd算法的快速图像处理方法

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LAVIN ANDREW; GRAY SCOTT: "Fast Algorithms for Convolutional Neural Networks", 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 27 June 2016 (2016-06-27), pages 4013 - 4021, XP033021587, DOI: 10.1109/CVPR.2016.435 *
XIANG YANG: "Summary of Experience: Caffe Backward", 22 July 2017 (2017-07-22), pages 1 - 6, XP009527754, Retrieved from the Internet <URL:https://blog.csdn.net/m0_37407756/article/details/75807664> *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114399036A (zh) * 2022-01-12 2022-04-26 电子科技大学 一种基于一维Winograd算法的高效卷积计算单元
CN114399036B (zh) * 2022-01-12 2023-08-22 电子科技大学 一种基于一维Winograd算法的高效卷积计算单元
CN116415103A (zh) * 2023-06-09 2023-07-11 之江实验室 一种数据处理的方法、装置、存储介质以及电子设备
CN116415103B (zh) * 2023-06-09 2023-09-05 之江实验室 一种数据处理的方法、装置、存储介质以及电子设备

Also Published As

Publication number Publication date
CN112784951B (zh) 2024-04-19
CN112784951A (zh) 2021-05-11

Similar Documents

Publication Publication Date Title
CN109543832B (zh) Computing device and board card
WO2021036905A1 (zh) Data processing method and apparatus, computer device, and storage medium
WO2021036908A1 (zh) Data processing method and apparatus, computer device, and storage medium
WO2021036890A1 (zh) Data processing method and apparatus, computer device, and storage medium
TWI795519B (zh) Computing device, machine learning operation device, combined processing device, neural network chip, electronic device, board card, and method for performing machine learning computation
WO2022111002A1 (zh) Method, device, and computer-readable storage medium for training a neural network
WO2021082725A1 (zh) Winograd convolution operation method and related product
US20220108150A1 (en) Method and apparatus for processing data, and related products
WO2021083101A1 (zh) Data processing method and apparatus, and related product
WO2021185262A1 (zh) Computing device and method, board card, and computer-readable storage medium
WO2021114903A1 (zh) Data processing method and apparatus, computer device, and storage medium
CN109740730B (zh) Operation method and apparatus, and related product
WO2021082746A1 (zh) Operation device and related product
WO2021082747A1 (zh) Operation device and related product
CN109711538B (zh) Operation method and apparatus, and related product
WO2021082723A1 (zh) Operation device
WO2021082724A1 (zh) Operation method and related product
US20220414183A1 (en) Winograd convolution operation method, apparatus, and device, and storage medium
WO2021036904A1 (zh) Data processing method and apparatus, computer device, and storage medium
WO2021082722A1 (zh) Operation device and method, and related product
WO2021169914A1 (zh) Data quantization processing method and apparatus, electronic device, and storage medium
WO2021037083A1 (zh) Method and apparatus for processing data, and related product
WO2021212972A1 (zh) Operation method, processor, and related product
JP7269382B2 (ja) Computing device, method, printed circuit board, and computer-readable recording medium
WO2023279946A1 (zh) Processing apparatus, device, method, and related product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20880796
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 20880796
    Country of ref document: EP
    Kind code of ref document: A1