WO2022227024A1 - Operation method, training method and apparatus for a neural network model

Operation method, training method and apparatus for a neural network model

Info

Publication number
WO2022227024A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
transformation
weight
neural network
data
Prior art date
Application number
PCT/CN2021/091574
Other languages
English (en)
French (fr)
Inventor
李文硕
王云鹤
伍玮翔
辛晨
王璇
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/CN2021/091574 (published as WO2022227024A1)
Priority to CN202180094093.7A (published as CN116888605A)
Publication of WO2022227024A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present application relates to the field of artificial intelligence, and more particularly, to an operation method, training method and apparatus of a neural network model.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that responds in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.
  • a large number of matrix operations are usually involved in neural network models.
  • the convolution operation involves a multiplication operation, which has high computational complexity and a long delay in the operation process.
  • Matrix operations usually occupy most of the computation time of a neural network model, and the delay of matrix operations becomes a major factor restricting computational efficiency, affecting the overall processing efficiency of the neural network model and resulting in greater power consumption.
  • the present application provides an operation method, training method and device for a neural network model, which can reduce the operation cost of the neural network model and improve the processing efficiency.
  • In a first aspect, a method for computing a neural network model is provided, comprising performing the following operations in at least one feature extraction layer of the neural network model: performing input data transformation on the input data matrix of the data to be processed through the input transformation matrix of the winograd algorithm to obtain a transformed input data matrix, where the data to be processed includes image data, voice data or text data; performing feature extraction on the transformed input data matrix using a transformed weight matrix to obtain an intermediate matrix, where the transformed weight matrix is obtained by performing weight transformation on the weight matrix of the at least one feature extraction layer through the weight transformation matrix of the winograd algorithm, and each element in the intermediate matrix is determined according to the L1 distance between the elements at corresponding positions in the transformed input data matrix and the transformed weight matrix; and performing output transformation on the intermediate matrix through the output transformation matrix of the winograd algorithm to obtain the output data matrix.
  • In the solution of the embodiments of the present application, the element-wise multiplication in the winograd algorithm is replaced with addition-type operations such as computing the L1 distance, which reduces the amount of computation in the feature extraction process, improves the running speed of the model, and reduces the computation overhead.
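  • As an illustrative sketch only (not the claimed implementation), the following Python/NumPy code shows one way such a feature extraction step could look for a single 4x4 input tile and a 3x3 weight matrix: the commonly published winograd F(2x2,3x3) transform matrices are assumed, and the element-wise multiplication of classic winograd is replaced by the negative L1 distance between elements at corresponding positions. Function and variable names are placeholders.

```python
import numpy as np

# Commonly published winograd F(2x2,3x3) transform matrices (assumed for illustration).
B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

def adder_winograd_tile(d, g):
    """One 4x4 input tile d and one 3x3 weight matrix g -> one 2x2 output tile.

    The element-wise multiplication of classic winograd is replaced by the
    negative L1 distance between elements at corresponding positions.
    """
    V = B_T @ d @ B_T.T      # transformed input data matrix (B^T d B)
    U = G @ g @ G.T          # transformed weight matrix (G g G^T), can be precomputed offline
    M = -np.abs(U - V)       # intermediate matrix: only subtract/abs/negate, no multiplications
    return A_T @ M @ A_T.T   # output data matrix (A^T M A)

d = np.arange(16, dtype=float).reshape(4, 4)  # toy input tile
g = np.ones((3, 3))                           # toy weight matrix
print(adder_winograd_tile(d, g))
```

  • The transformed weight matrix U above depends only on g, so it can be computed once (offline) and reused for every input tile, as described below.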
  • the data to be processed includes image data, voice data, or text data.
  • the type of data to be processed is related to the task of the neural network model.
  • the data to be processed may be images.
  • image processing tasks include image classification, image detection, image segmentation, image recognition or image generation, etc.
  • the data to be processed may be text.
  • text processing tasks include text recognition or text translation.
  • the data to be processed may be speech data.
  • speech processing tasks include speech recognition and the like. This embodiment of the present application does not limit the type of data to be processed.
  • the input data matrix of the data to be processed refers to the data matrix input to the at least one feature extraction layer.
  • the input data matrix can be part or all of the input feature map.
  • the input feature map may be the data itself input into the neural network model, eg, the image to be processed.
  • the input feature map can also be a feature map obtained after one or more feature extractions are performed by some feature extraction layers in the neural network model.
  • the weight matrix is a parameter of one or more feature extraction kernels in the at least one feature extraction layer in the neural network model.
  • By performing weight transformation on the weight matrix through the weight transformation matrix of the winograd algorithm, the transformed weight matrix can be obtained.
  • winograd transformation is performed on the intermediate matrix through the output transformation matrix to obtain the output data matrix.
  • the output data matrix can be part or all of the output feature map.
  • the transformed weight matrix is obtained through offline transformation.
  • the transformed weight matrix is transformed before performing the operation of the neural network model.
  • the transformed weight matrix can be transformed before the neural network model is deployed.
  • the weight matrix is unchanged, and obtaining the transformed weight matrix through offline calculation can further improve the operation speed and reduce the operation cost.
  • each element in the intermediate matrix is the opposite (negative) of the L1 distance between the elements at corresponding positions in the transformed input data matrix and the transformed weight matrix.
  • Optionally, the output data matrix satisfies a formula of the form Y = A^T·M·A, where M is the intermediate matrix obtained from GgG^T and B^T dB (for example, M = -|GgG^T - B^T dB| when each element of M is the negative of the L1 distance between elements at corresponding positions). Here, Y represents the output data matrix, A represents the output transformation matrix, A^T represents the transpose matrix of A, G represents the weight transformation matrix, G^T represents the transpose matrix of G, g represents the weight matrix before transformation, B represents the input transformation matrix, B^T represents the transpose matrix of B, and d represents the input data matrix before transformation.
  • the value of an element in the output transformation matrix is any one of 0, -1, or 1.
  • the value of the element in the output transformation matrix is 0, 1 or -1, which can reduce the multiplication operation, further reduce the number of calculations, and help reduce the calculation amount of the model.
  • the output transformation matrix is:
  • c0, c1 and c2 are each any one of 0, -1 and 1.
  • the elements of at least one row in the output transformation matrix are the opposites (negatives) of the elements at the corresponding positions of that row in the first matrix, and the elements of the other rows in the output transformation matrix are the same as the elements at the corresponding positions of those rows in the first matrix, where the first matrix is:
  • c0, c1 and c2 are each any one of 0, -1 and 1.
  • any row of A' can be negated without affecting the final result of winograd transformation.
  • c0 is 0, c1 is -1, and c2 is 1.
  • the weight transformation matrix is:
  • the above-mentioned output transformation matrix and weight transformation matrix satisfy the general solution form of the winograd algorithm, which can ensure that the result of the convolution calculation obtained by the winograd algorithm is the same as the result of the conventional convolution calculation.
  • the above-mentioned output transformation matrix and weight transformation matrix can still be applied to the convolution calculation.
  • the number of positive numbers in each column of the output transformation matrix is the same, and the number of negative numbers in each column is the same.
  • Making the number of +1 elements in each column of the output transformation matrix the same, and the number of -1 elements in each column the same, balances the magnitude of each position of the output data matrix, that is, reduces the imbalance in the accumulation of eigenvalues, which is beneficial to the training of the model.
  • it is advantageous to perform subsequent processing on the output data matrix eg, batchnorm processing.
  • In a second aspect, a training method for a neural network model is provided, comprising: performing input data transformation on an input data matrix of training data through the input transformation matrix of the winograd algorithm to obtain a transformed input data matrix; performing feature extraction on the transformed input data matrix using a transformed weight matrix to obtain an intermediate matrix, where the transformed weight matrix is obtained by performing weight transformation on the weight matrix of at least one feature extraction layer through the weight transformation matrix of the winograd algorithm, and each element in the intermediate matrix is determined according to the Lp distance between the elements at corresponding positions in the transformed input data matrix and the transformed weight matrix; and performing output transformation on the intermediate matrix through the output transformation matrix of the winograd algorithm to obtain the output data matrix.
  • the L2 distance is used to assist the training, and the L2 distance is more friendly to the winograd algorithm, which can improve the convergence speed of the training process, thereby improving the training effect of the model.
  • training is performed based on the L1 distance to further improve the training effect of the model using the L1 distance.
  • the trained model uses the L1 distance, which is more friendly to hardware.
  • the type of training data is related to the task of the neural network model.
  • the training data may be images.
  • image processing tasks include image classification, image detection, image segmentation, or image generation, etc.
  • the training data may be text.
  • text processing tasks include text recognition or text translation.
  • the training data may be speech data.
  • speech processing tasks include speech recognition and the like. This embodiment of the present application does not limit the type of training data.
  • the output data matrix is part or all of the output feature map
  • the processing result of the training data can be determined according to the output feature map
  • the value of the loss function is calculated according to the processing result of the training data.
  • the processing result of the training data is related to the type of training data and the task of the neural network model.
  • the training data is image data
  • the image processing may include image super-resolution processing, image denoising processing, image recognition processing, etc.
  • the image processing results include image super-resolution results, image denoising results, or image classification results, etc. This embodiment of the present application does not limit this.
  • the training data is speech data
  • the speech processing may include speech recognition and the like.
  • the speech processing results include speech recognition results and the like. This embodiment of the present application does not limit this.
  • the initial value of p is 2, and the value of p decreases as the number of iterations increases.
  • training the neural network model according to the value of the loss function includes: adjusting the first weight matrix according to the partial derivative of the loss function with respect to the weights in the first weight matrix.
  • the first weight matrix includes the weight matrix before transformation or the weight matrix after transformation.
  • the gradient of the weight matrix before transformation can be calculated, and then the value of the weight matrix before transformation can be adjusted.
  • the gradient of the transformed weight matrix may also be calculated, and then the value of the transformed weight matrix may be adjusted.
  • the partial derivative of the loss function with respect to the weights in the first weight matrix satisfies a formula in which p is the order of the norm used in the calculation, w represents a weight, x represents the data in the input data matrix before transformation or the data in the transformed input data matrix, L represents the loss function, i represents the layer index of the neural network model, and sign() represents the sign function.
  • In some implementations, the above formula is the partial derivative, with respect to the first weight matrix, of the loss function obtained based on the L2 distance; that is to say, backpropagation is performed based on the L2 distance.
  • In other implementations, the above formula is the partial derivative, with respect to the first weight matrix, of the loss function obtained based on the L1 distance; that is to say, backpropagation is performed based on the L1 distance.
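  • The exact partial-derivative formula of the application is not reproduced here; as a hedged sketch under stated assumptions, the following code illustrates an element-wise Lp-distance gradient of the kind referred to above, together with a hypothetical schedule that starts p at 2 and lowers it towards 1 as the number of iterations increases. The function names and the linear schedule are illustrative assumptions, not the formula of the application.

```python
import numpy as np

def lp_distance_grad_wrt_w(x, w, p):
    """d/dw of |x - w|**p, element-wise.

    For p = 2 this equals 2 * (w - x); for p = 1 it reduces to -sign(x - w),
    matching the sign() term mentioned above.
    """
    return -p * np.abs(x - w) ** (p - 1) * np.sign(x - w)

def p_schedule(step, total_steps, p_start=2.0, p_end=1.0):
    """Hypothetical schedule: p starts at 2 and decreases as iterations increase."""
    frac = min(step / max(total_steps, 1), 1.0)
    return p_start + (p_end - p_start) * frac

x = np.array([0.5, -1.0, 2.0])   # transformed input data (toy values)
w = np.array([0.0, 0.0, 0.0])    # weights of the first weight matrix (toy values)
for step in (0, 500, 1000):
    p = p_schedule(step, 1000)
    print(p, lp_distance_grad_wrt_w(x, w, p))
```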
  • a third aspect provides an apparatus for computing a neural network model, where the apparatus includes a module or unit for executing the method in the first aspect and any one of the implementation manners of the first aspect.
  • an apparatus for training a neural network model comprising a module or unit for performing the method in the above-mentioned second aspect and any one of the implementation manners of the second aspect.
  • a computing device for a neural network model comprising: a memory for storing a program; a processor for executing the program stored in the memory, when the program stored in the memory is executed, The processor is configured to execute the method in the first aspect and any one of the implementation manners of the first aspect.
  • the processor in the fifth aspect above may be either a central processing unit (CPU), or a combination of a CPU and a neural network computing processor, where the neural network computing processor may include a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), and so on.
  • TPU is Google's fully customized artificial intelligence accelerator application-specific integrated circuit for machine learning.
  • a training device for a neural network model comprising: a memory for storing a program; a processor for executing the program stored in the memory, when the program stored in the memory is executed, The processor is configured to execute the method in the second aspect and any one of the implementation manners of the second aspect.
  • the processor in the sixth aspect above can be either a central processing unit, or a combination of a CPU and a neural network computing processor, where the neural network computing processor can include a graphics processor, a neural network processor, a tensor processor, and so on.
  • TPU is Google's fully customized artificial intelligence accelerator application-specific integrated circuit for machine learning.
  • A computer-readable storage medium is provided, where the storage medium stores program code for execution by a device, and the program code includes instructions for executing the method in any one of the implementations of the first aspect or the second aspect.
  • a computer program product containing instructions, which, when the computer program product runs on a computer, causes the computer to execute the method in any one of the implementation manners of the first aspect or the second aspect.
  • A ninth aspect provides a chip, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface and executes the method in any one of the implementations of the first aspect or the second aspect above.
  • the chip may further include a memory, in which instructions are stored, the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the The processor is configured to execute the method in any one of the implementations of the first aspect or the second aspect.
  • the above chip may specifically be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
  • FIG. 1 is a schematic diagram of an artificial intelligence main body framework provided by an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a hardware structure of a chip according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 6 is a schematic block diagram of a computing device for a neural network model provided by an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a method for computing a neural network model provided by an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a method for training a neural network model provided by an embodiment of the present application.
  • FIG. 9 is a schematic block diagram of a training device for a neural network model provided by an embodiment of the present application.
  • FIG. 10 is a schematic block diagram of a computing device for a neural network model provided by an embodiment of the present application.
  • FIG. 11 is a schematic block diagram of another apparatus for training a neural network model provided by an embodiment of the present application.
  • FIG. 12 is a schematic block diagram of another computing device of a neural network model provided by an embodiment of the present application.
  • Figure 1 shows a schematic diagram of an artificial intelligence main frame, which describes the overall workflow of an artificial intelligence system and is suitable for general artificial intelligence field requirements.
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, data has gone through the process of "data-information-knowledge-wisdom".
  • the "IT value chain” reflects the value brought by artificial intelligence to the information technology industry from the underlying infrastructure of human intelligence, information (providing and processing technology implementation) to the industrial ecological process of the system.
  • the infrastructure provides computing power support for artificial intelligence systems, realizes communication with the outside world, and supports through the basic platform.
  • the infrastructure can communicate with the outside through sensors, and the computing power of the infrastructure can be provided by smart chips.
  • the smart chip here can be a hardware acceleration chip such as a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), or a field programmable gate array (FPGA).
  • the basic platform of the infrastructure can include distributed computing framework and network and other related platform guarantee and support, and can include cloud storage and computing, interconnection network, etc.
  • data can be obtained through sensors and external communication, and then these data can be provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
  • the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, text, and IoT data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • the above data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other processing methods.
  • machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image identification, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of the overall artificial intelligence solution, and the productization of intelligent information decision-making and implementation of applications. Its application areas mainly include: intelligent manufacturing, intelligent transportation, Smart home, smart medical care, smart security, autonomous driving, safe city, smart terminals, etc.
  • the embodiments of the present application can be applied in many fields of artificial intelligence, for example, smart manufacturing, smart transportation, smart home, smart medical care, smart security, automatic driving, safe city and other fields.
  • the embodiments of the present application can be specifically applied to fields that require the use of (deep) neural networks, such as automatic driving, image classification, image retrieval, image semantic segmentation, image quality enhancement, image super-resolution, and natural language processing.
  • deep neural networks such as automatic driving, image classification, image retrieval, image semantic segmentation, image quality enhancement, image super-resolution, and natural language processing.
  • identifying the images in the album can facilitate the user or the system to classify and manage the album and improve user experience.
  • Using the computing method of the neural network model in the embodiment of the present application can reduce hardware overhead and be more friendly to terminal equipment.
  • the speed of classifying pictures by using the neural network can be improved, which is conducive to labeling pictures of different categories in real time, which is convenient for users to view and search.
  • the classification tags of these pictures can also be provided to the album management system for classification management, which saves the user's management time, improves the efficiency of album management, and enhances the user experience.
  • Monitoring scenarios include: smart city, field monitoring, indoor monitoring, outdoor monitoring, in-vehicle monitoring, etc.
  • various attribute recognition is required, such as pedestrian attribute recognition and cycling attribute recognition.
  • Deep neural network plays an important role in various attribute recognition with its powerful capabilities.
  • the processing efficiency of the neural network model can be improved, which is beneficial to real-time processing of the input road picture, and can identify different attribute information in the road picture more quickly, while reducing the power consumption.
  • a neural network can be composed of neural units, and a neural unit can refer to an operation unit that takes x_s and an intercept 1 as input, and the output of the operation unit can be: h_{W,b}(x) = f(W^T x + b) = f(Σ_{s=1}^{n} W_s·x_s + b), where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to transform the input signal in the neural unit into an output signal.
  • the output signal of this activation function can be used as the input of the next layer.
  • the activation function can be a ReLU, tanh or sigmoid function.
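  • For illustration, a minimal sketch of a single neural unit computing f(Σ_s W_s·x_s + b) with a ReLU activation (all values chosen arbitrarily):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def neural_unit(x, W, b, f=relu):
    """Output of a single neural unit: f(sum_s W_s * x_s + b)."""
    return f(np.dot(W, x) + b)

x = np.array([0.2, -1.0, 0.7])   # inputs x_s
W = np.array([0.5, 0.3, -0.8])   # weights W_s
b = 0.1                          # bias of the neural unit
print(neural_unit(x, W, b))
```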
  • a neural network is a network formed by connecting a plurality of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
  • a deep neural network also known as a multi-layer neural network, can be understood as a neural network with multiple hidden layers.
  • The DNN is divided according to the positions of different layers: the layers inside the DNN can be divided into three categories, namely the input layer, the hidden layers and the output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • Although a DNN looks complicated, the work of each layer is not complicated. In short, each layer computes the following linear relationship expression: y = α(Wx + b), where x is the input vector, y is the output vector, b is the offset vector, W is the weight matrix (also called coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Since a DNN has a large number of layers, the numbers of coefficient matrices W and offset vectors b are also large.
  • These parameters are defined in the DNN as follows, taking the coefficient W as an example: suppose that in a three-layer DNN, the linear coefficient from the 4th neuron in the second layer to the 2nd neuron in the third layer is defined as W^3_{24}, where the superscript 3 represents the number of the layer where the coefficient W is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4.
  • In summary, the coefficient from the kth neuron in the (L-1)th layer to the jth neuron in the Lth layer is defined as W^L_{jk}.
  • the input layer does not have a W parameter.
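  • A minimal sketch of a small fully connected DNN forward pass in which each layer computes y = α(Wx + b); the layer sizes and the ReLU activation are arbitrary choices for illustration:

```python
import numpy as np

def dnn_forward(x, layers):
    """Forward pass of a fully connected DNN: each layer computes y = alpha(W x + b)."""
    relu = lambda z: np.maximum(z, 0.0)
    for W, b in layers:
        x = relu(W @ x + b)
    return x

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),   # hidden layer: W is 4x3
          (rng.standard_normal((2, 4)), np.zeros(2))]   # output layer: W is 2x4
print(dnn_forward(np.array([1.0, -0.5, 2.0]), layers))
```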
  • more hidden layers allow the network to better capture the complexities of the real world.
  • a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vectors W of many layers).
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of convolutional layers and subsampling layers, which can be viewed as a filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • In a convolutional layer of a convolutional neural network, a neuron may be connected to only some of the neurons in the adjacent layer.
  • A convolutional layer usually contains several feature planes, and each feature plane may be composed of some neural units arranged in a rectangle. Neural units in the same feature plane share weights, and the shared weights here are the convolution kernel. Sharing weights can be understood as meaning that the way image information is extracted is independent of location.
  • the convolution kernel can be formalized in the form of a matrix of random size, and the convolution kernel can be learned to obtain reasonable weights during the training process of the convolutional neural network.
  • the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • A loss function (loss function), also called an objective function (objective function), is used to measure the difference between the predicted value and the target value; accordingly, the training of the deep neural network becomes the process of reducing the loss as much as possible.
  • the smaller the loss the higher the training quality of the deep neural network, and the larger the loss, the lower the training quality of the deep neural network.
  • the smaller the loss fluctuation the more stable the training; the larger the loss fluctuation, the more unstable the training.
  • the neural network can use the error back propagation (BP) algorithm to modify the size of the parameters in the neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, passing the input signal forward until the output will generate an error loss, and by back-propagating the error loss information to update the parameters in the neural network model, so that the error loss converges.
  • the back-propagation algorithm is a back-propagation movement dominated by error loss, aiming to obtain the parameters of the optimal neural network model, such as the weight matrix.
  • the loss value generated by each training iteration of the neural network model is passed from back to front in the neural network model, layer by layer; as the loss value is passed, the update amount (a partial derivative operation) of each layer's parameters is calculated, and this update amount is related to the gradient.
  • the structure of AdderNet is similar to that of CNN.
  • the convolutional layer in CNN can be used for feature extraction or filtering of input data.
  • Convolutional layers in CNN can be used as feature extraction layers in CNN.
  • the adder layer in AdderNet can also be used to perform feature extraction or filtering on the input data.
  • the adder layer in AdderNet can be used as the feature extraction layer in AdderNet.
  • the parameters of the convolution kernel in CNN can be understood as the parameters of the adder in AdderNet.
  • Both the convolution kernel in CNN and the adder in AdderNet can be understood as filters.
  • Each convolutional layer in the CNN extracts feature information from the input data through convolution operations, whereas AdderNet uses the L1 distance to calculate the output features.
  • the adder layer in AdderNet extracts feature information from the input data through addition operations (or subtraction operations) and absolute value operations.
  • Since the computational complexity of the addition operation is much smaller than that of the multiplication operation, the computational power consumption of AdderNet is much smaller than that of a CNN with comparable performance. For example, replacing the convolution operation in a CNN with an addition operation (or subtraction operation) and an absolute-value operation yields an AdderNet, which can greatly reduce the computational power consumption of the CNN while maintaining performance.
  • The output of the adder layer satisfies Y(m,n,t) = -Σ_i Σ_j Σ_k |X(m+i, n+j, k) - F(i, j, k, t)|, where Y(m,n,t) represents the element in the mth row, nth column and tth page (channel) of the output feature map, X(m+i, n+j, k) represents the element in the (m+i)th row, (n+j)th column and kth page of the input feature map, F(i, j, k, t) represents the element in the ith row, jth column and kth page of the tth filter, t represents the channel index of the filter, d represents the number of rows of the filter (the summation over i and j runs over the d rows and d columns of the filter), and c_in represents the number of channels of the input feature map (the summation over k runs over the c_in channels).
  • AdderNet uses the L1 distance to extract features in the forward calculation process, that is, the addition operation is used to replace the multiplication operation, which reduces the computational complexity, power consumption and hardware area.
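  • The following sketch illustrates an adder-layer style forward computation as described in the AdderNet literature: for each output position, the negative L1 distance between the input window and the filter is computed using only additions, subtractions and absolute values. Shapes and names are illustrative assumptions.

```python
import numpy as np

def adder_layer_output(X, F):
    """Single-filter adder layer: negative L1 distance instead of convolution.

    X: input feature map of shape (H, W, c_in)
    F: filter of shape (d, d, c_in)
    Returns an output map of shape (H - d + 1, W - d + 1).
    """
    H, W, _ = X.shape
    d = F.shape[0]
    out = np.empty((H - d + 1, W - d + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            window = X[m:m + d, n:n + d, :]
            out[m, n] = -np.abs(window - F).sum()   # Y(m, n) = -sum |X - F|
    return out

X = np.random.rand(6, 6, 3)
F = np.random.rand(3, 3, 3)
print(adder_layer_output(X, F).shape)   # (4, 4)
```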
  • the Winograd algorithm is a commonly used fast convolution calculation method, which can greatly reduce the calculation amount of CNN and improve the calculation efficiency without affecting the calculation results.
  • the winograd algorithm satisfies the following formula: Y = A^T[(GgG^T) ⊙ (B^T dB)]A, where Y represents the output data matrix, g represents the convolution kernel, that is, the weight matrix, d represents a tile of the input feature map, that is, the input data matrix, and ⊙ represents element-wise multiplication, that is, the dot product between matrices. A, G and B all represent transformation matrices; specifically, A represents the output transformation matrix, G represents the weight transformation matrix, and B represents the input transformation matrix.
  • During forward computation, g does not change, so the transformed weight matrix GgG^T can be pre-computed, that is, computed before the forward computation starts. This can further reduce the computational power consumption of the neural network model.
  • the winograd algorithm in the shape of F(m, r) is used to quickly calculate the convolution operation with the size of the convolution kernel of r and the size of the output feature map of m.
  • The size here can also be understood as the dimension.
  • the transformation matrix may be determined according to the dimension and stride of the convolution kernel.
  • B, G, and A are fixed for the combination of convolution kernel and stride of a certain size, and can be derived according to the winograd algorithm.
  • the common form of the winograd algorithm is F(2×2, 3×3).
  • the output data matrix Y is a 2×2 matrix
  • the weight matrix g is a 3×3 matrix.
  • For a combination of a 3×3 weight matrix and a stride of 1, the transformation matrices satisfy the following formulas:
  • the input data matrix d satisfies the following formula.
  • the transformed input data matrix satisfies the following formula:
  • the weight matrix g satisfies the following formula.
  • the transformed weight matrix satisfies the following formula:
  • each result requires 9 (3*3) multiplication operations, and 4 results require 36 multiplication operations.
  • the winograd algorithm is used, in addition to the transformation overhead, only 16 (4*4) multiplication operations in step 3) are required, and the speedup ratio of the multiplication times reaches 2.25 (36/16) times.
  • the elements in the transformation matrices all take values of 0, ±1 and ±1/2, so the transformations can be accomplished by light-weight hardware operations such as sign-bit changes and shift operations; that is, the transformation overhead is usually small.
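  • For reference, a short sketch using the commonly published F(2x2,3x3) transformation matrices (elements 0, ±1 and ±1/2) to check that the winograd result matches direct convolution on one tile; the direct method needs 36 multiplications per 4x4 tile versus 16 element-wise multiplications for winograd, plus the transform overhead:

```python
import numpy as np

B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f22_33(d, g):
    """Y = A^T [(G g G^T) .* (B^T d B)] A for one 4x4 tile and one 3x3 kernel."""
    U = G @ g @ G.T          # transformed weight matrix
    V = B_T @ d @ B_T.T      # transformed input tile
    M = U * V                # 16 element-wise multiplications
    return A_T @ M @ A_T.T   # 2x2 output tile

def direct_conv(d, g):
    """Direct 3x3 convolution (stride 1) on a 4x4 tile: 4 outputs x 9 = 36 multiplications."""
    out = np.empty((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = (d[i:i + 3, j:j + 3] * g).sum()
    return out

d = np.random.rand(4, 4)
g = np.random.rand(3, 3)
print(np.allclose(winograd_f22_33(d, g), direct_conv(d, g)))   # True
```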
  • an embodiment of the present application provides a system architecture 100 .
  • a data collection device 160 is used to collect training data.
  • the training data may include the training image and the processing result corresponding to the training image.
  • the classification result corresponding to the training image, where the classification result of the training image may be the result of manual pre-labeling.
  • the data collection device 160 After collecting the training data, the data collection device 160 stores the training data in the database 130 , and the training device 120 obtains the target model/rule 101 by training based on the training data maintained in the database 130 .
  • the training device 120 processes the input raw data and compares the output value with the target value until the difference between the value output by the training device 120 and the target value is less than a certain threshold, thus completing the training of the target model/rule 101.
  • the target model/rule 101 in this embodiment of the present application may specifically be a neural network model.
  • For example, the neural network model may be a convolutional neural network or a residual network.
  • the training data maintained in the database 130 may not necessarily come from the collection of the data collection device 160, and may also be received from other devices.
  • the training device 120 may not necessarily train the target model/rule 101 completely based on the training data maintained by the database 130, and may also obtain training data from the cloud or other places for model training. The above description should not be construed as a limitation on the embodiments of the present application.
  • the target model/rule 101 trained by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 2. The execution device 110 may be a terminal such as a laptop, an augmented reality (AR)/virtual reality (VR) device or an in-vehicle terminal, and may also be a server or the cloud.
  • In FIG. 2, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices. A user can input data to the I/O interface 112 through the client device 140, and the input data may include data to be processed input by the client device.
  • When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculation and other related processing, the execution device 110 may call the data, code, etc. in the data storage system 150 for corresponding processing, and may also store the data and instructions obtained by the corresponding processing into the data storage system 150.
  • the I/O interface 112 returns the processing result, such as the processing result of the data obtained above, to the client device 140, so as to be provided to the user.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete the above tasks, so as to provide the user with the desired result.
  • the user can manually specify the input data, which can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send the input data to the I/O interface 112 . If the user's authorization is required to request the client device 140 to automatically send the input data, the user can set the corresponding permission in the client device 140 .
  • the user can view the result output by the execution device 110 on the client device 140, and the specific presentation form can be a specific manner such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal to collect the input data of the input I/O interface 112 and the output result of the output I/O interface 112 as new sample data as shown in the figure, and store them in the database 130 .
  • Alternatively, the I/O interface 112 may directly store the input data input into the I/O interface 112 and the output result of the I/O interface 112, as shown in the figure, into the database 130 as new sample data.
  • FIG. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, components, modules, etc. shown in the figure does not constitute any limitation.
  • For example, in FIG. 2, the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed in the execution device 110.
  • As shown in FIG. 2, the target model/rule 101 is obtained by training according to the training device 120, and the target model/rule 101 may be the neural network in the embodiments of the present application.
  • Specifically, the neural network in the embodiments of the present application may be a CNN, a residual network, or the like.
  • CNN is a very common neural network.
  • a convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture; a deep learning architecture refers to learning at multiple levels of abstraction through machine learning algorithms.
  • CNN is a feed-forward artificial neural network in which individual neurons can respond to images fed into it.
  • a convolutional neural network (CNN) 200 may include an input layer 210 , a convolutional/pooling layer 220 (where the pooling layer is optional), and a fully connected layer 230 .
  • the convolutional layer/pooling layer 220 may include layers 221-226, as examples. In one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or it can be used as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 221 may include many convolution operators.
  • the convolution operator is also called a kernel. Its role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator is essentially a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is typically slid along the horizontal direction on the input image one pixel after another (or two pixels after two pixels, depending on the value of the stride), so as to complete the work of extracting a specific feature from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • the feature maps extracted by the weight matrices are stacked to form the depth dimension of the convolved image, where the dimension can be understood as being determined by the "multiple" described above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to extract unwanted noise in the image.
  • the multiple weight matrices have the same size (row × column), the feature maps extracted by the multiple weight matrices with the same size also have the same size, and the multiple extracted feature maps with the same size are then combined to form the output of the convolution operation.
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained by training can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions .
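  • A small sketch of how multiple weight matrices of the same size each extract a feature map, and how the extracted feature maps are stacked to form the depth dimension of the convolved output (the kernel count and sizes are arbitrary choices for illustration):

```python
import numpy as np

def conv_single(x, k):
    """Direct 2D cross-correlation of a single-channel image with one kernel."""
    H, W = x.shape
    d = k.shape[0]
    out = np.empty((H - d + 1, W - d + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + d, j:j + d] * k).sum()
    return out

x = np.random.rand(6, 6)
kernels = [np.random.rand(3, 3) for _ in range(5)]               # 5 weight matrices of the same size
feature_maps = np.stack([conv_single(x, k) for k in kernels])    # stacked along the depth dimension
print(feature_maps.shape)   # (5, 4, 4): depth 5 determined by the number of weight matrices
```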
  • The initial convolutional layer (e.g., 221) extracts relatively low-level features, while the features extracted by the later convolutional layers (e.g., 226) become more and more complex, such as high-level semantic features.
  • features with higher semantics are more suitable for the problem to be solved.
  • A pooling layer may directly follow a convolutional layer, i.e., one convolutional layer followed by one pooling layer; alternatively, multiple convolutional layers may be followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the pixel values in the image within a certain range to produce an average value as the result of average pooling.
  • the max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
  • the size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
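  • A minimal sketch of 2x2 average pooling and max pooling with stride 2, where each output pixel is the average or the maximum of the corresponding 2x2 sub-region of the input:

```python
import numpy as np

def pool2x2(x, mode="max"):
    """2x2 pooling with stride 2 on a single-channel feature map."""
    H, W = x.shape
    x = x[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2)
    return x.max(axis=(1, 3)) if mode == "max" else x.mean(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2x2(fmap, "max"))    # each output pixel = max of its 2x2 sub-region
print(pool2x2(fmap, "mean"))   # each output pixel = average of its 2x2 sub-region
```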
  • After the processing of the convolutional layer/pooling layer 220, the convolutional neural network 200 is still not sufficient to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other relevant information), the convolutional neural network 200 needs to use the fully connected layer 230 to generate one output or a set of outputs whose number equals the number of required classes. Therefore, the fully connected layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 3), and the parameters contained in the multiple hidden layers may be obtained by pre-training based on relevant training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
  • After the multiple hidden layers in the fully connected layer 230, the last layer of the entire convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to the categorical cross entropy, which is specifically used to calculate the prediction error.
  • Once the forward propagation of the entire convolutional neural network 200 is completed (as shown in FIG. 3, the propagation from the 210 direction to the 240 direction is the forward propagation), the back propagation (as shown in FIG. 3, the propagation from the 240 direction to the 210 direction is the back propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
  • the convolutional neural network 200 shown in FIG. 3 is only used as an example of a convolutional neural network.
  • In a specific application, the convolutional neural network may also exist in the form of other network models, including only a part of the network structure shown in FIG. 3; for example, the convolutional neural network adopted in this embodiment of the present application may include only an input layer 210, a convolutional layer/pooling layer 220 and an output layer 240.
  • FIG. 4 is a hardware structure of a chip provided by an embodiment of the application, and the chip includes a neural network processor 50 .
  • the chip can be set in the execution device 110 as shown in FIG. 2 to complete the calculation work of the calculation module 111 .
  • the chip can also be set in the training device 120 as shown in FIG. 2 to complete the training work of the training device 120 and output the target model/rule 101 .
  • the methods in the embodiments of the present application can be implemented in the chip as shown in FIG. 4 .
  • the neural network processor 50 may be a neural-network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or another processor suitable for large-scale neural network operations.
  • the NPU is mounted on the main central processing unit (CPU) (host CPU) as a co-processor, and tasks are allocated by the main CPU.
  • the core part of the NPU is the operation circuit 503, and the controller 504 controls the operation circuit 503 to extract the data in the memory (weight memory or input memory) and perform operations.
  • TPU is Google's fully customized artificial intelligence accelerator application-specific integrated circuit for machine learning.
  • the arithmetic circuit 503 includes multiple processing units (process engines, PEs). In some implementations, arithmetic circuit 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 503 is a general-purpose matrix processor.
  • the operation circuit fetches the data corresponding to the matrix B from the weight memory 502 and buffers it on each PE in the operation circuit.
  • the arithmetic circuit fetches the data of matrix A and matrix B from the input memory 501 to perform matrix operation, and stores the partial result or final result of the matrix in the accumulator 508 .
  • the vector calculation unit 507 can further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector computing unit 507 can be used for network computation of non-convolutional/non-FC layers in the neural network, such as pooling, batch normalization (BN), local response normalization (LRN), and the like.
  • vector computation unit 507 can store the processed output vectors to unified buffer 506 .
  • the vector calculation unit 507 may apply a nonlinear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate activation values.
  • vector computation unit 507 generates normalized values, merged values, or both.
  • the vector of processed outputs can be used as activation input to the arithmetic circuit 503, eg, for use in subsequent layers in a neural network.
  • Unified memory 506 is used to store input data and output data.
  • The storage unit access controller (direct memory access controller, DMAC) 505 is used to transfer the input data in the external memory to the input memory 501 and/or the unified memory 506, store the weight data in the external memory into the weight memory 502, and store the data in the unified memory 506 into the external memory.
  • a bus interface unit (BIU) 510 is used to realize the interaction between the main CPU, the DMAC and the instruction fetch memory 509 through the bus.
  • the instruction fetch memory (instruction fetch buffer) 509 connected with the controller 504 is used to store the instructions used by the controller 504;
  • the controller 504 is used for invoking the instructions cached in the memory 509 to control the working process of the operation accelerator.
  • the unified memory 506, the input memory 501, the weight memory 502 and the instruction fetch memory 509 are all on-chip (On-Chip) memories, and the external memory is the memory outside the NPU, and the external memory can be double data rate synchronous dynamic random access Memory (double data rate synchronous dynamic random access memory, DDR SDRAM), high bandwidth memory (high bandwidth memory, HBM) or other readable and writable memory.
  • the execution device 110 in FIG. 2 or the chip in FIG. 4 described above can execute each step of the method for computing the neural network model of the embodiment of the present application.
  • the training device 120 in FIG. 2 or the chip in FIG. 4 described above can execute various steps of the neural network model training method according to the embodiment of the present application.
  • an embodiment of the present application provides a system architecture 300 .
  • the system architecture includes a local device 301, a local device 302, an execution device 310 and a data storage system 350, wherein the local device 301 and the local device 302 are connected with the execution device 310 through a communication network.
  • the execution device 310 may be implemented by one or more servers.
  • the execution device 310 may be used in conjunction with other computing devices, such as data storage, routers, load balancers and other devices.
  • the execution device 310 may be arranged on one physical site, or distributed across multiple physical sites.
  • the execution device 310 may use the data in the data storage system 350 or call the program code in the data storage system 350 to implement the neural network model computing method or the neural network model training method in the embodiments of the present application.
  • The execution device 110 may perform the following process: performing input data transformation on the input data matrix of the data to be processed through the input transformation matrix of the winograd algorithm to obtain a transformed input data matrix; performing feature extraction on the transformed input data matrix using a transformed weight matrix to obtain an intermediate matrix, wherein the transformed weight matrix is obtained by performing weight transformation on the weight matrix of at least one feature extraction layer through the weight transformation matrix of the winograd algorithm, and each element in the intermediate matrix is determined according to the L1 distance between the elements at corresponding positions in the transformed input data matrix and the transformed weight matrix; and performing output data transformation on the intermediate matrix through the output transformation matrix of the winograd algorithm to obtain the output data matrix.
  • a user may operate respective user devices (eg, local device 301 and local device 302 ) to interact with execution device 310 .
  • Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, gaming console, etc.
  • Each user's local device can interact with the execution device 310 through a communication network of any communication mechanism/communication standard; the communication network may be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
  • the local device 301 and the local device 302 obtain the relevant parameters of the neural network from the execution device 310, deploy the neural network on the local device 301 and the local device 302, and use the neural network to perform image classification, image processing, speech processing, text processing, or the like.
  • a neural network may be directly deployed on the execution device 310, and the execution device 310 obtains the data to be processed from the local device 301 and the local device 302, and uses the neural network model to process the data to be processed.
  • the above execution device 310 may also be a cloud device, in which case the execution device 310 may be deployed in the cloud; or the above execution device 310 may also be a terminal device, in which case the execution device 310 may be deployed on the user terminal side. This is not limited in this embodiment of the present application.
  • a large number of matrix operations are usually involved in neural network models, such as convolution operations, and the delay of matrix operations will become the main factor restricting computing efficiency, affecting the overall processing efficiency of the neural network model and making it difficult to deploy the neural network model in scenarios with demanding real-time requirements.
  • because the computational load of the neural network model is large, high computing power is required of the hardware, which makes it difficult to deploy the neural network model on hardware devices with low computing power, such as mobile phones and other terminal devices.
  • the embodiment of the present application provides an operation method for a neural network model, which replaces the element-wise multiplication operation in the winograd algorithm with an addition operation; this further reduces the computational load of the neural network model, improves the operation speed of the neural network model, and reduces the computing overhead.
  • FIG. 6 shows a computing device for a neural network model provided by an embodiment of the present application.
  • the device 600 includes an input data preprocessing module 610 , a weight preprocessing module 620 , and an acceleration module 630 .
  • the input data preprocessing module 610 performs input data transformation on the input data matrix by using the winograd algorithm to obtain the transformed input data matrix.
  • the input data matrix can be part or all of the input feature map. For example, if the dimension of the input feature map is 8*8, the input data matrix can be a 4*4 matrix in the input feature map, or the input data matrix can be the entire 8*8 input feature map.
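  • As an illustration only (a sketch; the document does not specify the tiling scheme), an 8*8 input feature map can be split into overlapping 4*4 input data matrices. For winograd F(2×2, 3×3), tiles are commonly taken with a stride of 2 so that adjacent 2×2 output tiles cover the output feature map; the function and parameter names below are illustrative:

        import numpy as np

        def split_into_tiles(feature_map: np.ndarray, tile: int = 4, stride: int = 2):
            """Yield overlapping tile x tile input data matrices from a 2-D feature map."""
            h, w = feature_map.shape
            for r in range(0, h - tile + 1, stride):
                for c in range(0, w - tile + 1, stride):
                    yield feature_map[r:r + tile, c:c + tile]

        # An 8x8 input feature map yields nine 4x4 input data matrices with this scheme.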
  • the weight preprocessing module 620 performs weight transformation on the weight matrix through the winograd algorithm to obtain the transformed weight matrix.
  • weight preprocessing module 620 is an optional module.
  • the transformed weight matrix may be stored in the apparatus 600 in advance, or the transformed weight matrix may also be sent to the apparatus 600 by another apparatus.
  • the transformed weight matrix may be calculated before executing the operation of the neural network model, that is, obtained by offline calculation, or may be calculated when executing the operation of the neural network model, that is, obtained by online calculation.
  • the weight matrix is unchanged, and obtaining the transformed weight matrix through offline calculation can further improve the operation speed and reduce the operation cost.
  • the acceleration module 630 is configured to perform feature extraction on the transformed input data matrix by using the transformed weight matrix to obtain an intermediate matrix; perform winograd transformation on the intermediate matrix to obtain an output data matrix.
  • Each element in the intermediate matrix is determined according to the L1 distance between the transformed input data matrix and the element at the corresponding position in the transformed weight matrix
  • the output data matrix can be part or all of the output feature map.
  • the acceleration module can obtain the intermediate matrix by performing a subtraction operation, an operation of calculating an absolute value, and an inversion operation on the transformed input data matrix and the transformed weight matrix.
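  • A minimal sketch of this element-wise computation, assuming the transformed weight matrix U and the transformed input data matrix V are already available as arrays of the same shape (the names U and V are illustrative, not taken from the document):

        import numpy as np

        def intermediate_matrix(U: np.ndarray, V: np.ndarray) -> np.ndarray:
            """Subtraction, absolute value and negation, applied element-wise: X = -|U - V|."""
            return -np.abs(U - V)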
  • the acceleration module 630 may include a matrix operation module and a post-processing module.
  • the matrix operation module is used to perform feature extraction on the transformed input data matrix by using the transformed weight matrix to obtain an intermediate matrix.
  • the post-processing module is used to transform the output data of the intermediate matrix through the winograd algorithm to obtain the output data matrix.
  • the methods of the embodiments of the present application may be executed by a device capable of executing the winograd algorithm.
  • the operations in each module in FIG. 6 may be performed by software, may also be performed by hardware, or may be performed jointly by software and hardware. This embodiment of the present application does not limit this.
  • each module in FIG. 6 may be performed by the chip in FIG. 4 .
  • the external memory in FIG. 4 can be used to store input data matrices, calculation results, and the like.
  • the neural network processor 50 may obtain the input data matrix from external memory.
  • the operations in the input data preprocessing module 610 may be performed by the vector calculation unit 507 .
  • the vector calculation unit 507 includes an input data preprocessing module 610 .
  • the external memory can be used to store the transformed weight matrix obtained by offline calculation.
  • the neural network processor 50 may obtain the input data matrix and the transformed weight matrix from the external memory.
  • the operations in the weight preprocessing module 620 can be performed by the vector calculation module 507 .
  • the vector calculation module 507 includes a weight preprocessing module 620 .
  • the operation of the matrix operation module in the acceleration module 630 may be performed by the operation circuit 503 .
  • the operation circuit 503 includes the matrix operation module.
  • the operations of the post-processing module in the acceleration module 630 may be performed by the vector calculation module 507 .
  • the vector calculation module 507 includes this post-processing module.
  • the acceleration module 630 can also be a special module set on the basis of the chip shown in FIG. 4 . Compared with using the operation circuit and the vector calculation module to perform the operations in the acceleration module 630, setting this special module can further improve the processing speed of the acceleration module 630.
  • FIG. 7 shows a method 700 for computing a neural network model provided by an embodiment of the present application.
  • the method shown in FIG. 7 can be executed by an execution device of a neural network model, which can be a cloud service device or a terminal device, for example, a device with sufficient computing power to execute the operation of the neural network model, such as a computer or a server; it may also be a system composed of cloud service devices and terminal devices.
  • the method 700 may be executed by the execution device 110 in FIG. 2 , the neural network processor 50 in FIG. 4 , or the execution device 310 in FIG. 5 , or a local device.
  • method 700 may also be performed by a device that provides AutoML services.
  • the device providing the AutoML service may be a cloud service device.
  • the method 700 may be specifically executed by the execution device 110 shown in FIG. 2 , and the data to be processed in the method 700 may be input data given by the client device 140 shown in FIG. 2 .
  • the method 700 includes steps S710 to S730. Steps S710 to S730 will be described in detail below.
  • the data to be processed includes image data, voice data, or text data.
  • the type of data to be processed is related to the task of the neural network model.
  • the data to be processed may be images.
  • image processing tasks include image classification, image detection, image segmentation, image recognition or image generation, etc.
  • the data to be processed can be text.
  • text processing tasks include text recognition or text translation.
  • the data to be processed may be speech data.
  • speech processing tasks include speech recognition and the like. This embodiment of the present application does not limit the type of data to be processed.
  • the data to be processed is an image
  • the image to be processed may be an image captured by a terminal device (or another apparatus such as a computer or a server) through a camera, or the image to be processed may be an image obtained from inside a terminal device (or a computer, a server, etc.), for example, an image stored in an album of the terminal device, or an image acquired by the terminal device from the cloud, which is not limited in this embodiment of the present application.
  • the neural network model in this embodiment of the present application may be an existing neural network model, for example, a residual network.
  • the neural network model may also be a self-constructed neural network model of other structures. This embodiment of the present application does not limit this.
  • the input data matrix of the data to be processed refers to the data matrix input to the at least one feature extraction layer.
  • in some embodiments, the feature extraction layer includes an adder layer, that is, an addition operation is used as a filter to perform feature extraction.
  • in other embodiments, the feature extraction layer includes a convolution layer, that is, feature extraction is performed with a convolution kernel as a filter.
  • the convolutional layer can be replaced with an adder layer, and then the method 700 is performed on the feature extraction layer.
  • winograd transformation is performed on the input data matrix by using the input transformation matrix to obtain the transformed input data matrix.
  • Winograd transformation includes input data transformation, weight transformation, and output data transformation performed by the winograd algorithm. That is to say, the above three transformations can be understood as winograd transformations.
  • the transformed input data matrix can be written as B^T dB, where B represents the input transformation matrix, B^T represents the transpose matrix of B, and d represents the input data matrix before transformation.
  • the input transformation matrix may be an existing winograd input transformation matrix.
  • the input data matrix d before transformation is a 4 ⁇ 4 matrix
  • the weight matrix g before transformation is a 3 ⁇ 3 matrix
  • the input transformation matrix can be:
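  • The concrete matrices are not reproduced in this extract. For reference only, the transformation matrices commonly quoted for winograd F(2×2, 3×3) (a 4×4 input data matrix d and a 3×3 weight matrix g) are listed below as an assumed illustration; the patent may use different but equivalent forms:

        import numpy as np

        # Commonly cited winograd F(2x2, 3x3) transformation matrices (assumed here,
        # not copied from the patent). Transformed input: B_T @ d @ B_T.T;
        # transformed weight: G @ g @ G.T; the output transformation uses A_T.
        B_T = np.array([[1,  0, -1,  0],
                        [0,  1,  1,  0],
                        [0, -1,  1,  0],
                        [0,  1,  0, -1]], dtype=float)
        G = np.array([[1.0,  0.0, 0.0],
                      [0.5,  0.5, 0.5],
                      [0.5, -0.5, 0.5],
                      [0.0,  0.0, 1.0]])
        A_T = np.array([[1, 1,  1,  0],
                        [0, 1, -1, -1]], dtype=float)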
  • the input data matrix can be part or all of the input feature map.
  • the dimension of the input feature map is 8*8, and the input data matrix may be a 4*4 matrix in the input feature map, or the input data matrix may also be the input feature map.
  • the input feature map may be the data itself input into the neural network model, for example, an image to be processed.
  • the input feature map may also be a feature map obtained after one or more feature extractions are performed by some feature extraction layers in the neural network model, for example, a feature map obtained after one or more feature extractions are performed on the image to be processed.
  • the way of feature extraction can use existing solutions, for example, convolutional layer processing. That is to say, the input feature map in this embodiment of the present application may be a feature map obtained after performing one or more convolution processing on the image to be processed.
  • the feature extraction method may also adopt the method 700 in this embodiment of the present application. That is to say, the input feature map in the embodiment of the present application may be a feature map obtained after processing the image to be processed by the method 700 in the embodiment of the present application.
  • the embodiment of the present application does not limit the acquisition method of the input feature map.
  • step S710 may be performed by the input data preprocessing module 610 in FIG. 6 .
  • the transformed weight matrix is obtained by performing weight transformation on the weight matrix of the at least one feature extraction layer by using the winograd algorithm.
  • Each element in the intermediate matrix is determined according to the L1 distance between the transformed input data matrix and the element at the corresponding position in the transformed weight matrix.
  • L1 distance can also be called L1 regular distance, L1 norm distance, Manhattan distance or taxi distance.
  • by performing winograd transformation on the weight matrix through the weight transformation matrix, the transformed weight matrix can be obtained.
  • the weight matrix is a parameter of one or more feature extraction kernels in the at least one feature extraction layer in the neural network model.
  • a feature extraction layer may include one or more feature extraction kernels, ie filters.
  • the feature extraction kernel is used to perform feature extraction on the data input to the neural network model.
  • the transformed weight matrix can be written as GgG^T, where G represents the weight transformation matrix, G^T represents the transpose matrix of G, and g represents the weight matrix before transformation.
  • the weight transformation matrix may be an existing winograd weight transformation matrix.
  • the input data matrix d before transformation is a 4 ⁇ 4 matrix
  • the weight matrix g before transformation is a 3 ⁇ 3 matrix
  • the weight transformation matrix can be:
  • the transformed weight matrix can be obtained through offline transformation or through online transformation.
  • the transformed weight matrix is obtained through offline transformation, which means that the transformed weight matrix is transformed before performing the operation of the neural network model.
  • the transformed weight matrix can be transformed before the neural network model is deployed.
  • the weight matrix is unchanged, and obtaining the transformed weight matrix through offline calculation can further improve the operation speed and reduce the operation cost.
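  • A sketch of the offline weight transformation, under the assumption that the conventional G listed earlier is used (the function name is illustrative): the transformed weight matrix is computed once before deployment and stored, so that only the input transformation and the addition operations remain at inference time.

        import numpy as np

        def transform_weight_offline(g: np.ndarray, G: np.ndarray) -> np.ndarray:
            """Precompute the transformed weight matrix GgG^T once, before running the model."""
            return G @ g @ G.T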
  • the transformed weight matrix is obtained through online transformation, which means that the transformed weight matrix is obtained through transformation during the operation of the neural network model.
  • each element in the intermediate matrix is the opposite number (negative) of the L1 distance between the transformed input data matrix and the element at the corresponding position in the transformed weight matrix. For example, the intermediate matrix X satisfies the following formula: X = -|GgG^T - B^T dB|, where the minus sign between the two items in the formula indicates element-wise subtraction (element-wise minus), and |·| indicates taking the absolute value of each element.
  • Step S720 may be implemented by performing a subtraction operation, an operation of calculating an absolute value, and an inversion operation on the transformed input data matrix and the transformed weight matrix.
  • step S720 includes steps S721 to S723.
  • addition operation and the subtraction operation are substantially the same, and the result of the subtraction operation can also be obtained through the addition operation.
  • the addition operation and the subtraction operation are collectively referred to as the addition operation.
  • step S720 may be performed by the acceleration module 630 in FIG. 6 .
  • the neural network model may include multiple feature extraction layers, and each feature extraction layer may include one or more feature extraction kernels, that is, weight matrices.
  • each feature extraction layer may include one or more feature extraction kernels, that is, weight matrices.
  • one or more times of feature extraction processing may be performed in each feature extraction layer, and one or more times of the multiple times of feature extraction may use the method 700 to perform a corresponding operation process. That is, performing the method 700 on a feature extraction layer of the neural network model may include performing the method 700 on a feature extraction kernel in the feature extraction layer.
  • winograd transformation is performed on the intermediate matrix through the output transformation matrix to obtain the output data matrix.
  • for example, the output data matrix Y satisfies the following formula: Y = A^T XA, where A represents the output transformation matrix, and A^T represents the transpose matrix of A.
  • the output transformation matrix may be the existing winograd output transformation matrix.
  • the input data matrix d before transformation is a 4 ⁇ 4 matrix
  • the weight matrix g before transformation is a 3 ⁇ 3 matrix
  • the output transformation matrix can be:
  • the output data matrix may be part or all of the output feature map.
  • step S730 may be performed by the acceleration module 630 in FIG. 6 .
  • in the solution of the embodiment of the present application, the element-wise multiplication (dot multiplication) operation in winograd is replaced with addition operations such as the operation of calculating the L1 distance, which reduces the calculation amount of the feature extraction process, improves the running speed of the model, and reduces the calculation overhead.
  • the solution of the embodiment of the present application can be understood as a solution of combining the winograd algorithm and the AdderNet.
  • it can also be understood as a scheme that uses the winograd algorithm to optimize the adder layer of AdderNet. Specifically, from the perspective of AdderNet, this scheme reduces the number of addition operations in the adder layer of AdderNet and reduces the computational complexity of the model; from the perspective of the winograd algorithm, this scheme replaces multiplication operations with addition operations, which reduces the computational complexity of the model.
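  • The following sketch puts the three steps together for a single 4×4 tile, assuming the conventional F(2×2, 3×3) matrices B_T, G and A_T listed earlier; it illustrates the combined winograd/adder scheme rather than the patent's exact implementation, and (as discussed later in this document) its result is not numerically identical to a plain adder layer:

        import numpy as np

        def winograd_adder_tile(d: np.ndarray, g: np.ndarray,
                                B_T: np.ndarray, G: np.ndarray, A_T: np.ndarray) -> np.ndarray:
            """One 4x4 input tile d and one 3x3 weight g produce a 2x2 output tile.

            The element-wise multiplication of standard winograd is replaced by the
            negative L1 distance between corresponding elements."""
            V = B_T @ d @ B_T.T      # input data transformation
            U = G @ g @ G.T          # weight transformation (can be done offline)
            X = -np.abs(U - V)       # intermediate matrix: subtraction, |.|, negation
            return A_T @ X @ A_T.T   # output data transformation

  • For example, winograd_adder_tile(np.random.randn(4, 4), np.random.randn(3, 3), B_T, G, A_T) returns a 2×2 output tile; a full output feature map is assembled from such tiles.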
  • the value of an element in the output transformation matrix is any one of the following: 0, -1, 1.
  • the value of the element in the output transformation matrix is 0, 1 or -1, which can reduce the multiplication operation, further reduce the number of calculations, and help reduce the calculation amount of the model.
  • the output transformation matrix may be:
  • the elements of at least one row of the output transformation matrix are the opposite numbers (negatives) of the elements at the positions corresponding to the at least one row in the first matrix, and the elements of the other rows of the output transformation matrix are the same as the elements at the positions corresponding to the other rows in the first matrix.
  • the first matrix can be:
  • c 0 , c 1 and c 2 are any of 0, -1, and 1, respectively.
  • any row of A' can be negated without affecting the final result of winograd transformation.
  • c 0 , c 1 and c 2 are any of 0, -1, and 1, respectively.
  • c 0 is 0, c 1 is -1, and c 2 is 1.
  • the weight transformation matrix can be:
  • the above-mentioned output transformation matrix and weight transformation matrix satisfy the general solution form of the winograd algorithm, which can ensure that the result of the convolution calculation obtained by the winograd algorithm is the same as the result of the conventional convolution calculation. In this way, in the case of using the existing input transformation matrix, the above-mentioned output transformation matrix and weight transformation matrix can still be applied to the convolution calculation.
  • the transformation matrix in the winograd algorithm can include various forms, as long as the general solution form of winograd is satisfied, the result of the convolution calculation obtained by the winograd algorithm can be guaranteed to be the same as the result of the conventional convolution calculation.
  • the above-mentioned output transformation matrix and weight transformation matrix are only examples in the case of using the input transformation matrix of the conventional winograd algorithm. In the case where the input transformation matrix is adjusted, the output transformation matrix and the weight transformation matrix can also take other forms.
  • the elements at various positions in the output data matrix may be unbalanced, resulting in a slower decline of the loss function during the training process and a lower accuracy of the trained model.
  • this situation also affects the subsequent processing of the output data matrix, such as normalization (batchnorm) processing, which affects the performance of the model.
  • the intermediate matrix X is a 4 ⁇ 4 matrix and the output data matrix Y is a 2 ⁇ 2 matrix.
  • the intermediate matrix X can be represented as follows:
  • x0, x1, ..., x15 represent the elements in the intermediate matrix X.
  • the output data matrix can be represented as:
  • y 0 , y 1 , y 2 and y 3 represent elements in the output data matrix Y.
  • y0 = x0 + x1 + x2 + x4 + x5 + x6 + x8 + x9 + x10;
  • y1 = x1 - x2 - x3 + x5 - x6 - x7 + x9 - x10 - x11;
  • y2 = x4 + x5 + x6 - x8 - x9 - x10 - x12 - x13 - x14;
  • y3 = x5 - x6 - x7 - x9 + x10 + x11 - x13 + x14 + x15;
  • it can be seen that the number of addition operations in the equation corresponding to each element in Y is different; accordingly, the number of subtraction operations in the equation corresponding to each element in Y is also different. In other words, the numbers of positive signs and negative signs in front of the elements in the equation corresponding to each element in Y differ from one element of Y to another.
  • the equation corresponding to y 0 includes 9 addition operations
  • the equation corresponding to y 1 includes 3 addition operations
  • the number of addition operations in the equation corresponding to y 0 and the equation corresponding to y 1 is different.
  • the magnitude of each element in X is usually the same, and each element in X is always a non-positive number.
  • the number of addition and subtraction operations in the equation corresponding to each element in Y is determined by the signs of the elements in the output transformation matrix; that is, the signs of the elements in the output transformation matrix affect the distribution of the eigenvalues in the output data matrix.
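  • This imbalance can be checked numerically. The sketch below assumes the conventional output transformation matrix, with A^T = [[1, 1, 1, 0], [0, 1, -1, -1]] (an assumption consistent with the equations above), and counts the positive and negative coefficients attached to the elements of X for each output position:

        import numpy as np

        A_T = np.array([[1, 1,  1,  0],
                        [0, 1, -1, -1]], dtype=float)
        A = A_T.T

        # In Y = A^T X A, the coefficient of X[k, l] in Y[i, j] is A[k, i] * A[l, j].
        for i in range(2):
            for j in range(2):
                coeff = np.outer(A[:, i], A[:, j])
                pos, neg = int((coeff > 0).sum()), int((coeff < 0).sum())
                print(f"y{2 * i + j}: {pos} positive, {neg} negative coefficients")
        # Prints 9/0, 3/6, 3/6 and 5/4 -- the counts differ across output positions.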
  • the embodiment of the present application also provides an output transformation matrix, which can improve the operation speed of the neural network model and avoid the performance degradation of the neural network model.
  • the number of positive numbers in each column element in the output transformation matrix is the same, and the number of negative numbers in each column element is the same.
  • the value of an element in the output transformation matrix can be any of 0, 1 or -1, in which case the number of +1's in each column of elements in the output transformation matrix is the same, and the number of -1's in each column of elements is the same.
  • the output transformation matrix may include any of the following:
  • A 0 , A 1 , A 2 , and A 3 respectively represent four kinds of output transformation matrices. It should be understood that the four output transformation matrices are only examples; in the case of other combinations of c 0 , c 1 and c 2 taking the values 0, 1 and -1, that is, in the case where the rows of the four output transformation matrices are exchanged with one another, other forms of output transformation matrices can be obtained, which are not listed here.
  • y0 = x0 - x1 - x2 - x4 + x5 + x6 - x8 + x9 + x10;
  • y1 = -x1 + x2 - x3 + x5 - x6 + x7 + x9 - x10 + x11;
  • y2 = x4 + x5 + x6 + x8 - x9 - x10 - x12 + x13 + x14;
  • y3 = x5 - x6 + x7 - x9 + x10 - x11 + x13 - x14 + x15;
  • the number of addition operations in the equation corresponding to each element in Y is the same. Accordingly, the number of subtraction operations in the equation corresponding to each element in Y is the same.
  • the number of positive and negative signs in the elements in the intermediate matrix in the equation corresponding to each element in Y is the same.
  • the equation corresponding to y 0 includes 5 positive signs
  • the equation corresponding to y 1 includes 5 positive signs.
  • the equation corresponding to y 0 has the same number of positive signs as the equation corresponding to y 1 .
  • the balance of elements in each position in the output data matrix is guaranteed. It should be understood that A 0 is only used as an example here, and A 1 , A 2 , and A 3 can all achieve the same effect.
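  • The balance condition can be expressed with a small helper (the concrete matrices A 0 to A 3 are not reproduced in this extract; the function below merely checks the stated property for any candidate output transformation matrix):

        import numpy as np

        def columns_balanced(A: np.ndarray) -> bool:
            """True if every column of A contains the same number of positive entries
            and every column contains the same number of negative entries."""
            pos = (A > 0).sum(axis=0)
            neg = (A < 0).sum(axis=0)
            return bool(pos.min() == pos.max() and neg.min() == neg.max())

  • For the conventional output transformation matrix shown earlier this check returns False, while the balanced matrices described here are expected to return True.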
  • the weight transformation matrix may include any of the following:
  • G 0 , G 1 , G 2 , and G 3 respectively represent four kinds of weight transformation matrices. It should be understood that the four weight transformation matrices are only examples; in the case of other combinations of c 0 , c 1 and c 2 taking the values 0, 1 and -1, that is, in the case where the rows of the four weight transformation matrices are exchanged with one another, other forms of weight transformation matrices can be obtained, which are not listed here.
  • the weight transformation matrices G 0 , G 1 , G 2 , and G 3 are in one-to-one correspondence with the output transformation matrices A 0 , A 1 , A 2 , and A 3 , and the corresponding weight transformation matrices and output transformation matrices are used together.
  • G 0 and A 0 are used together.
  • the input transformation matrix can be the existing input transformation matrix in winograd.
  • the output transformation matrix and the weight transformation matrix can adopt the above four forms.
  • the output transformation matrix and weight transformation matrix can also take other forms; that is to say, any form may be used as long as the transformation matrices satisfy the general solution form of winograd and can ensure the balance of elements in each position in the output data matrix.
  • in the solution of the embodiment of the present application, the number of positive numbers in each column of elements in the output transformation matrix is the same, and the number of negative numbers in each column of elements is the same; for example, the number of +1 in each column of elements is the same, and the number of -1 in each column of elements is the same. This can balance the magnitude of each position of the output data matrix, that is, reduce the imbalance in the accumulation of eigenvalues, which is beneficial to the training of the model.
  • it is also advantageous for performing subsequent processing on the output data matrix, for example, normalization (batchnorm) processing.
  • the output transformation matrix and the weight transformation matrix satisfy the general solution form of winograd, which can ensure that the output result is consistent with the actual result of the convolution operation during the convolution operation, that is, it is suitable for the convolution operation in the model.
  • winograd algorithm to accelerate the convolution operation will not affect the calculation results.
  • in the adder layer, the operation of taking the absolute value is adopted, so the distributive law of multiplication is no longer applicable. That is to say, if the winograd algorithm is directly used to accelerate the adder layer, there is a certain gap between the calculation result and the original calculation result, and combining the winograd algorithm with the adder layer in that way will result in a performance degradation of the neural network model.
  • the embodiments of the present application also provide a method for training a neural network model, which can improve the performance of the neural network model.
  • FIG. 8 shows a training method 800 of a neural network model provided by an embodiment of the present application.
  • the method shown in FIG. 8 can be performed by a training device for a neural network model, which can be a cloud service device or a terminal device, such as a computer, a server, or another device with sufficient computing power to perform neural network model training; it may also be a system composed of cloud service devices and terminal devices.
  • the method 800 may be performed by the training device 120 in FIG. 2 , the neural network processor 50 in FIG. 4 , or the execution device 310 in FIG. 5 .
  • the operation method of the training method 800 in FIG. 8 during the forward propagation process of the neural network model is consistent with the method in FIG. 7 . It is sufficient to replace "data to be processed" in method 700 with "training data".
  • for the forward propagation process in the method 800, reference may be made to the foregoing method 700. In order to avoid unnecessary repetition, repeated descriptions are appropriately omitted when introducing the method 800 below.
  • the method 800 includes steps S810 to S850. Steps S810 to S850 will be described in detail below.
  • S810 Perform input data transformation on the input data matrix of the training data by using the winograd algorithm to obtain a transformed input data matrix.
  • the type of training data is related to the task of the neural network model.
  • the training data may be images.
  • image processing tasks include image classification, image detection, image segmentation, or image generation, etc.
  • the training data may be text.
  • text processing tasks include text recognition or text translation.
  • the training data may be speech data.
  • speech processing tasks include speech recognition and the like. This embodiment of the present application does not limit the type of training data.
  • the training data may be pre-stored.
  • the training data may be training data maintained in the database 130 shown in FIG. 2 .
  • the input data matrix refers to the data matrix input to the at least one feature extraction layer.
  • the transformed weight matrix is obtained by performing weight transformation on the weight matrix of the at least one feature extraction layer through the winograd algorithm, and each element in the intermediate matrix is determined according to the Lp distance between the transformed input data matrix and the element at the corresponding position in the transformed weight matrix.
  • the Lp distance can also be called the Minkowski distance, which is the definition of a family of distances; p is a parameter.
  • each element in the intermediate matrix is determined according to the L1 distance between the transformed input data matrix and the element at the corresponding position in the transformed weight matrix.
  • the intermediate matrix X satisfies the following formula:
  • each element in the intermediate matrix is determined according to the L2 distance between the transformed input data matrix and the element at the corresponding position in the transformed weight matrix.
  • the intermediate matrix X satisfies the following formula:
  • the L2 distance can also be called the Euclidean distance.
  • the output data matrix is part or all of the output feature map
  • the processing result of the training data can be determined according to the output feature map
  • the value of the loss function is calculated according to the processing result of the training data.
  • the processing result of the training data is related to the type of training data and the task of the neural network model.
  • the training data is image data
  • the image processing may include image super-resolution processing, image denoising processing, image recognition processing, etc.
  • the image processing results include an image super-resolution result, an image denoising result, or an image classification result, etc. This embodiment of the present application does not limit this.
  • the training data is speech data
  • the speech processing may include speech recognition and the like.
  • the speech processing results include speech recognition results and the like. This embodiment of the present application does not limit this.
  • the output feature map can also be subjected to further processing, for example, activation function processing, etc., to obtain the processing result of the training data.
  • the neural network model is trained according to the value of the loss function.
  • during the m-th iteration of training the neural network model, p is 2; during the n-th iteration of training the neural network model, p is 1, where m and n are positive integers, and m is less than n.
  • the L2 distance is used to calculate the intermediate matrix in the forward calculation process, and the partial derivative of the loss function to the weights in the first weight matrix is calculated based on the L2 distance in the backward propagation process.
  • in other words, the forward calculation is performed according to the L2 distance between the transformed input data matrix and the elements at the corresponding positions in the transformed weight matrix, and backpropagation is also performed according to the L2 distance between the transformed input data matrix and the elements at the corresponding positions in the transformed weight matrix.
  • the forward calculation and the backward calculation are performed based on the L2 distance.
  • the L1 distance is used to calculate the intermediate matrix in the forward calculation process, and the partial derivative of the loss function to the weights in the first weight matrix is calculated based on the L1 distance in the backward propagation process.
  • in other words, the forward calculation is performed according to the L1 distance between the transformed input data matrix and the elements at the corresponding positions in the transformed weight matrix, and backpropagation is also performed according to the L1 distance between the transformed input data matrix and the elements at the corresponding positions in the transformed weight matrix.
  • the forward calculation and the backward calculation are performed based on the L1 distance.
  • the first weight matrix includes the weight matrix before transformation or the transformed weight matrix. That is, the first weight matrix can be the weight matrix g before transformation, or the transformed weight matrix. During the training process, the gradient of the weight matrix before transformation can be calculated, and then the value of the weight matrix before transformation can be adjusted. Alternatively, the gradient of the transformed weight matrix may be calculated, and then the value of the transformed weight matrix may be adjusted.
  • if the L1 distance is used to perform forward calculation and backpropagation throughout the training process, the winograd algorithm will be difficult to optimize, which may cause the network training to fail to converge. In the solution of the embodiment of the present application, the L2 distance is used to approximate the L1 distance during training.
  • in the early stage of training, the L2 distance is used to assist the training; the L2 distance is more friendly to the winograd algorithm, which can improve the convergence speed of the training process and thereby improve the training effect of the model.
  • in the later stage of training, training is performed based on the L1 distance to further improve the training effect of the model that uses the L1 distance.
  • the trained model uses the L1 distance, which is more friendly to hardware.
  • the partial derivatives of the loss function with respect to the weights in the first weight matrix satisfy the following formula:
  • p is the norm used in the calculation, and p∈[1,2]; w represents a weight in the first weight matrix, and w can be a weight in the weight matrix g before transformation, or a weight in the transformed weight matrix
  • x represents the data in the input data matrix before transformation or the data in the input data matrix after transformation
  • the data in the input data matrix before transformation is the eigenvalue in the input feature map
  • L represents the loss function
  • i represents the number of layers of the neural network model, and i is an integer.
  • sign() represents the sign function.
  • when p is 2, the above formula is the partial derivative of the loss function with respect to the first weight matrix obtained based on the L2 distance; that is to say, backpropagation is performed based on the L2 distance.
  • when p is 1, the above formula is the partial derivative of the loss function with respect to the first weight matrix obtained based on the L1 distance; that is to say, backpropagation is performed based on the L1 distance.
  • the value of p is determined according to the number of iterations of the training process.
  • the initial value of p is 2, and the value of p decreases as the number of iterations increases.
  • the value of p can be decreased by a step a at every iteration or every few iterations.
  • a can be fixed, that is, the decrease in the value of p is the same at each iteration.
  • a can be 0.05, i.e., at each iteration, p decreases by 0.05.
  • a can also be varied. For example, as the number of iterations increases, the decrease in the value of p increases gradually: at the first iteration, p is 2; at the second iteration, a is 0.01, so p becomes 1.99; at the third iteration, a is 0.02, so p becomes 1.97.
  • the value of p is decreased by a every k iterations.
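  • A minimal sketch of such a schedule (the step size a, the interval k and the function name are illustrative assumptions; the document only specifies that p starts at 2 in earlier iterations and reaches 1 in later iterations):

        def p_schedule(iteration: int, p0: float = 2.0, a: float = 0.05, k: int = 1) -> float:
            """Decrease p by a every k iterations, never going below 1."""
            return max(1.0, p0 - a * (iteration // k))

        # With a = 0.05 and k = 1, p is 2.0 at iteration 0, 1.95 at iteration 1,
        # and reaches 1.0 after 20 iterations; the forward calculation and
        # backpropagation then use the Lp distance with the current value of p.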
  • the trained neural network model can be used to perform the target task.
  • the target task may be an image processing task, such as target detection, image segmentation, instance segmentation, image denoising, image super-resolution, and the like.
  • the target task may be a speech processing task, eg, speech recognition.
  • the target task can be a text processing task such as text recognition or text translation.
  • Table 1 shows the comparison of the experimental results of the scheme of the embodiment of the present application and the existing scheme on the classification data of CIFAR-10.
  • the operation method of the present application can improve the operation efficiency, but will lead to a decrease in the accuracy of the model.
  • using the adjusted transformation matrix can improve the accuracy of the model.
  • using the training method of the present application to train the model can further improve the accuracy of the model, which is close to the accuracy of AdderNet.
  • Table 2 shows the comparison of the experimental results of the solutions of the embodiments of the present application and the existing solutions on the underlying visual task.
  • the computing method of the embodiment of the present application can achieve a higher index than AdderNet.
  • the computing method of the embodiment of the present application can achieve a visual effect close to that of AdderNet.
  • FIG. 9 is a schematic block diagram of an apparatus for training a neural network model according to an embodiment of the present application.
  • the training apparatus 3000 of the neural network model shown in FIG. 9 includes an acquisition unit 3010 and a processing unit 3020 .
  • the acquiring unit 3010 and the processing unit 3020 may be used to execute the training method of the neural network model of the embodiment of the present application, and specifically, may be used to execute the method 800 .
  • the acquiring unit 3010 is used for acquiring training data.
  • the processing unit 3020 is configured to perform the following operations on at least one feature extraction layer of the neural network model:
  • perform input data transformation on the input data matrix of the training data through the winograd algorithm to obtain a transformed input data matrix; perform feature extraction on the transformed input data matrix by using the transformed weight matrix to obtain an intermediate matrix, wherein the transformed weight matrix is obtained by performing weight transformation on the weight matrix of the at least one feature extraction layer through the winograd algorithm, and each element in the intermediate matrix is determined according to the Lp distance between the transformed input data matrix and the element at the corresponding position in the transformed weight matrix; perform output data transformation on the intermediate matrix through the winograd algorithm to obtain an output data matrix; calculate the value of the loss function according to the processing result of the training data determined from the output data matrix; and, according to the value of the loss function, the model is trained; during the m-th iteration of training the neural network model, p is 2, and during the n-th iteration of training the neural network model, p is 1, where m and n are positive integers and m is less than n.
  • the initial value of p is 2, and the value of p decreases as the number of iterations increases.
  • training the neural network model according to the value of the loss function includes: adjusting the weights in the first weight matrix according to the partial derivatives of the loss function with respect to the weights in the first weight matrix, where the first weight matrix includes the weight matrix before transformation or the transformed weight matrix.
  • the partial derivative of the loss function to the weight in the first weight matrix satisfies the following formula:
  • p is the norm of the calculation
  • w represents the weight
  • x represents the data in the input data matrix before transformation or the data in the input data matrix after transformation
  • L represents the loss function
  • i represents the number of layers of the neural network model
  • sign() represents the sign function.
  • FIG. 10 is a schematic block diagram of a computing device 4000 for a neural network model provided by an embodiment of the present application.
  • the apparatus 4000 shown in FIG. 10 includes an acquisition unit 4010 and a processing unit 4020 .
  • the acquiring unit 4010 and the processing unit 4020 may be used to execute the operation method of the neural network model of the embodiment of the present application, for example, may be used to execute the method 700 .
  • the acquiring unit 4010 is configured to acquire data to be processed, and the data to be processed includes image data, voice data or text data.
  • the processing unit 4020 is configured to perform the following operations on at least one feature extraction layer of the neural network model: transform the input data matrix of the data to be processed through the input transformation matrix of the winograd algorithm to obtain the transformed input data matrix; use the transformed weight matrix to perform feature extraction on the transformed input data matrix to obtain an intermediate matrix, wherein the transformed weight matrix is obtained by performing weight transformation on the weight matrix of the at least one feature extraction layer through the weight transformation matrix of the winograd algorithm, and each element of the intermediate matrix is determined according to the L1 distance between the transformed input data matrix and the element at the corresponding position in the transformed weight matrix; and transform the intermediate matrix through the output transformation matrix of the winograd algorithm to obtain the output data matrix.
  • the output data matrix satisfies the following formula: Y = A^T (-|GgG^T - B^T dB|)A, where the minus sign between GgG^T and B^T dB indicates element-wise subtraction, and |·| indicates taking the absolute value of each element.
  • Y represents the output data matrix
  • A represents the output transformation matrix
  • a T represents the transpose matrix of A
  • G represents the weight transformation matrix
  • G T represents the transpose matrix of G
  • g represents the weight matrix before transformation
  • B represents the input transformation matrix
  • B T represents the transposed matrix of B
  • d represents the input data matrix before transformation.
  • the value of an element in the output transformation matrix is any one of 0, -1 or 1.
  • the output transformation matrix is:
  • c 0 , c 1 and c 2 are any of 0, -1, and 1, respectively.
  • the element of at least one row in the output transformation matrix is the inverse of the element in the position corresponding to the at least one row in the first matrix, and the elements of other rows in the output transformation matrix are the same as the elements in the first matrix and The elements of the positions corresponding to other rows are the same, and the first matrix is:
  • A' represents the first matrix
  • c 0 , c 1 and c 2 are any of 0, -1, and 1, respectively.
  • the weight transformation matrix is:
  • c 0 , c 1 and c 2 are any of 0, -1, and 1, respectively.
  • the number of positive numbers in each column element in the output transformation matrix is the same, and the number of negative numbers in each column element is the same.
  • training apparatus 3000 and apparatus 4000 are embodied in the form of functional units.
  • unit here can be implemented in the form of software and/or hardware, which is not specifically limited.
  • a "unit” may be a software program, a hardware circuit, or a combination of the two that realizes the above-mentioned functions.
  • for example, the hardware circuits may include application specific integrated circuits (ASICs), electronic circuits, processors (e.g., shared processors, dedicated processors, or group processors) and memory for executing one or more software or firmware programs, merged logic circuits, and/or other suitable components that support the described functions.
  • the units of each example described in the embodiments of the present application can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.
  • FIG. 11 is a schematic diagram of a hardware structure of a training apparatus for a neural network model provided by an embodiment of the present application.
  • the apparatus 5000 for training a neural network model shown in FIG. 11 includes a memory 5001 , a processor 5002 , a communication interface 5003 and a bus 5004 .
  • the memory 5001 , the processor 5002 , and the communication interface 5003 are connected to each other through the bus 5004 for communication.
  • the memory 5001 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • the memory 5001 may store a program, and when the program stored in the memory 5001 is executed by the processor 5002, the processor 5002 is configured to execute each step of the neural network model training method of the embodiment of the present application. Specifically, the processor 5002 can execute the method 800 shown in FIG. 8 above.
  • the processor 5002 can be a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is used to execute a relevant program to implement the training method of the neural network model in the method embodiments of the present application.
  • the processor 5002 may also be an integrated circuit chip with signal processing capability, for example, the chip shown in FIG. 4 .
  • each step of the training method of the neural network model of the present application can be completed by an integrated logic circuit of hardware in the processor 5002 or instructions in the form of software.
  • the above-mentioned processor 5002 can also be a general-purpose processor, a digital signal processor (digital signal processing, DSP), an application-specific integrated circuit (ASIC), an off-the-shelf programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, Discrete gate or transistor logic devices, discrete hardware components.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 5001, and the processor 5002 reads the information in the memory 5001 and, in combination with its hardware, completes the functions required to be performed by the units included in the training apparatus shown in FIG. 9 , or executes the training method of the neural network model shown in FIG. 8 .
  • the communication interface 5003 implements communication between the apparatus 5000 and other devices or a communication network using a transceiving device such as, but not limited to, a transceiver. For example, training data can be obtained through the communication interface 5003 .
  • the bus 5004 may include a pathway for communicating information between the various components of the device 5000 (eg, the memory 5001, the processor 5002, the communication interface 5003).
  • FIG. 12 is a schematic diagram of a hardware structure of a computing device of a neural network model according to an embodiment of the present application.
  • the data processing apparatus 6000 shown in FIG. 12 includes a memory 6001 , a processor 6002 , a communication interface 6003 , and a bus 6004 .
  • the memory 6001 , the processor 6002 , and the communication interface 6003 are connected to each other through the bus 6004 for communication.
  • the memory 6001 may be ROM, static storage device and RAM.
  • the memory 6001 can store programs. When the programs stored in the memory 6001 are executed by the processor 6002, the processor 6002 and the communication interface 6003 are used to execute various steps of the neural network model computing method of the embodiment of the present application. Specifically, the processor 6002 may perform steps S710 to S730 in the method shown in FIG. 7 above.
  • the processor 6002 may adopt a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is used to execute a related program so as to realize the functions required to be performed by the units in the computing apparatus of the neural network model of the embodiment of the present application, or to execute the operation method of the neural network model of the method embodiments of the present application.
  • the processor 6002 may also be an integrated circuit chip with signal processing capability, for example, the chip shown in FIG. 4 .
  • each step of the operation method of the neural network model of the embodiment of the present application may be completed by an integrated logic circuit of hardware in the processor 6002 or an instruction in the form of software.
  • the above-mentioned processor 6002 may also be a general-purpose processor, DSP, ASIC, FPGA or other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components.
  • the methods, steps, and logic block diagrams disclosed in the embodiments of this application can be implemented or executed.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 6001, and the processor 6002 reads the information in the memory 6001 and, in combination with its hardware, completes the functions required to be performed by the units included in the computing apparatus of the neural network model of the embodiment of the present application, or executes the operation method of the neural network model of the method embodiment of the present application.
  • the communication interface 6003 uses a transceiver device such as but not limited to a transceiver to implement communication between the device 6000 and other devices or a communication network.
  • the data to be processed can be acquired through the communication interface 6003 .
  • the bus 6004 may include a pathway for communicating information between the various components of the device 6000 (eg, the memory 6001, the processor 6002, the communication interface 6003).
  • it should be noted that although the apparatus 5000 and the apparatus 6000 only show a memory, a processor, and a communication interface, in a specific implementation process, those skilled in the art should understand that the apparatus 5000 and the apparatus 6000 may also include other devices necessary for normal operation. Meanwhile, according to specific needs, those skilled in the art should understand that the apparatus 5000 and the apparatus 6000 may further include hardware devices that implement other additional functions. In addition, those skilled in the art should understand that the apparatus 5000 and the apparatus 6000 may also only include the devices necessary for implementing the embodiments of the present application, and do not necessarily include all the devices shown in FIG. 11 and FIG. 12 .
  • the processor in the embodiment of the present application may be a central processing unit (central processing unit, CPU), and the processor may also be other general-purpose processors, digital signal processors (digital signal processors, DSP), application-specific integrated circuits (application specific integrated circuit, ASIC), off-the-shelf programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory in the embodiments of the present application may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically programmable Erase programmable read-only memory (electrically EPROM, EEPROM) or flash memory.
  • Volatile memory may be random access memory (RAM), which acts as an external cache.
  • many forms of RAM are available, for example, static random access memory (static RAM, SRAM), dynamic random access memory (dynamic RAM, DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchronous link dynamic random access memory (synchlink DRAM, SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM).
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • the above-described embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wire or wirelessly (for example, infrared, wireless, microwave, etc.).
  • the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, a data center, or the like that contains one or more sets of available media.
  • the usable media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media.
  • the semiconductor medium may be a solid state drive.
  • "At least one" means one or more, and "a plurality of" means two or more.
  • "At least one of the following items" or similar expressions refer to any combination of these items, including any combination of a single item or a plurality of items.
  • For example, at least one of a, b, or c can represent: a, b, c, a and b, a and c, b and c, or a, b and c, where a, b and c may each be single or multiple.
  • the size of the sequence numbers of the above-mentioned processes does not mean the sequence of execution, and the execution sequence of each process should be determined by its functions and internal logic, and should not be dealt with in the embodiments of the present application. implementation constitutes any limitation.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • Multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • The shown or discussed mutual coupling, direct coupling or communication connection may be implemented through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical or in other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • The technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product.
  • The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.


Abstract

The present application discloses an operation method, a training method and an apparatus for a neural network model in the field of artificial intelligence. In the operation method, feature extraction is performed on a winograd-transformed input data matrix by using a winograd-transformed weight matrix to obtain an intermediate matrix, where each element of the intermediate matrix is determined according to the L1 distance between the elements at corresponding positions in the transformed input data matrix and the transformed weight matrix; output data transformation is then performed on the intermediate matrix by means of the winograd algorithm to obtain an output data matrix. The solution of the present application replaces the element-wise multiplication in winograd with addition operations such as the computation of the L1 distance, which reduces the amount of computation in the feature extraction process, increases the running speed of the model, and reduces the operation overhead.

Description

神经网络模型的运算方法、训练方法及装置 技术领域
本申请涉及人工智能领域,并且更具体地,涉及神经网络模型的运算方法、训练方法及装置。
背景技术
人工智能(artificial intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。人工智能领域的研究包括机器人,自然语言处理,计算机视觉,决策与推理,人机交互,推荐与搜索,AI基础理论等。
神经网络模型中通常涉及大量的矩阵运算。以卷积运算为例,卷积操作涉及乘法运算,计算复杂度较高,运算过程的时延较长。矩阵运算通常占用神经网络模型运算的大部分时间,矩阵运算的时延会成为制约计算效率的主要因素,影响神经网络模型整体的处理效率,造成较大的功耗损失。
因此,如何降低神经网络模型的运算开销成为一个亟待解决的问题。
发明内容
本申请提供一种神经网络模型的运算方法、训练方法及装置,能够降低神经网络模型的运算开销,提高处理效率。
第一方面,提供了一种神经网络模型的运算方法,该方法包括:在神经网络模型的至少一个特征提取层中执行以下操作:通过winograd算法的输入变换矩阵对待处理数据的输入数据矩阵进行输入数据变换以得到变换后的输入数据矩阵,待处理数据包括图像数据、语音数据或者文本数据;利用变换后的权重矩阵对变换后的输入数据矩阵进行特征提取以得到中间矩阵,其中,变换后的权重矩阵是通过winograd算法的权重变换矩阵对至少一个特征提取层的权重矩阵进行权重变换得到的,中间矩阵中的每个元素是根据变换后的输入数据矩阵与变换后的权重矩阵中对应位置的元素之间的L1距离确定的;通过winograd算法的输出变换矩阵对中间矩阵进行输出数据变换以得到输出数据矩阵。
根据本申请实施例的方案,将winograd中的点乘操作替换为计算L1距离的操作等加法操作,减少了特征提取过程的计算量,提高了模型的运行速度,减少了运算开销。
待处理数据包括图像数据、语音数据或文本数据等。
待处理数据的类型与神经网络模型的任务有关。例如,神经网络模型用于图像处理任务,则该待处理数据可以为图像。具体地,图像处理任务包括图像分类、图像检测、图像 分割、图像识别或图像生成等。再如,神经网络模型用于文本处理任务,则该待处理数据可以为文本。具体地,文本处理任务包括文本识别或文本翻译等。再如,神经网络模型用于语音处理任务,则该待处理数据可以为语音数据。具体地,语音处理任务包括语音识别等。本申请实施例对待处理数据的类型不做限定。
待处理数据的输入数据矩阵指的是输入该至少一个特征提取层的数据矩阵。
输入数据矩阵可以为输入特征图的部分或全部。输入特征图可以为输入至神经网络模型中的数据自身,例如,待处理的图像。输入特征图也可以为经过神经网络模型中的部分特征提取层进行一次或多次特征提取后得到的特征图。
具体地,通过输入转换矩阵对输入数据矩阵进行winograd变换,得到变换后的输入数据矩阵
该权重矩阵为神经网络模型中该至少一个特征提取层中的一个或多个特征提取核的参数。
具体地,通过权重变换矩阵对权重矩阵进行winograd变换,可以得到变换后的权重矩阵。
具体地,通过输出转换矩阵对中间矩阵进行winograd变换,得到输出数据矩阵。
输出数据矩阵可以为输出特征图的部分或全部。
结合第一方面,在第一方面的某些实现方式中,变换后的权重矩阵是离线变换得到的。
变换后的权重矩阵是在执行神经网络模型的运算之前变换得到的。例如,变换后的权重矩阵可以是在神经网络模型部署之前变换得到的。
根据本申请实施例的方案,在神经网络模型的推理过程中,权重矩阵是不变的,通过离线计算得到变换后的权重矩阵能够进一步提高运算速度,降低运算开销。
结合第一方面,在第一方面的某些实现方式中,中间矩阵中的每个元素为变换后的输入数据矩阵与变换后的权重矩阵中对应位置的元素之间的L1距离的相反数。
结合第一方面,在第一方面的某些实现方式中,输出数据矩阵满足如下公式:
Y=A T[-|[GgG T]-[B TdB]|]A;
其中,Y表示输出数据矩阵,A表示输出变换矩阵,A T表示A的转置矩阵,G表示权重变换矩阵,G T表示G的转置矩阵,g表示变换前的权重矩阵,B表示输入变换矩阵,B T表示B的转置矩阵,d表示变换前的输入数据矩阵。
结合第一方面,在第一方面的某些实现方式中,输出变换矩阵中的元素的值为0,-1或1中的任一项。
根据本申请实施例的方案,输出变换矩阵中的元素的取值为0,1或-1,可以减少乘法操作,进一步减少计算次数,有利于减少模型的计算量。
结合第一方面,在第一方面的某些实现方式中,输出变换矩阵为:
Figure PCTCN2021091574-appb-000001
其中,c 0、c 1和c 2分别是0,-1,1中的任一项。
结合第一方面,在第一方面的某些实现方式中,输出变换矩阵中的至少一行的元素为 第一矩阵中与至少一行对应的位置的元素的相反数,输出变换矩阵中的其他行的元素和第一矩阵中与其他行对应的位置的元素相同,第一矩阵为:
Figure PCTCN2021091574-appb-000002
其中,c 0、c 1和c 2分别是0,-1,1中的任一项。
也就是说,A'的任一行均可以取反,而不影响winograd变换的最终结果。
例如,c 0为0,c 1为-1,c 2为1。再如,c 0为-1,c 1为1,c 2为0。
结合第一方面,在第一方面的某些实现方式中,权重变换矩阵为:
Figure PCTCN2021091574-appb-000003
根据本申请实施例的方案,上述输出变换矩阵和权重变换矩阵满足winograd算法的通解形式,能够保证通过winograd算法得到的卷积计算的结果与常规的卷积计算的结果相同。这样,在采用现有的输入变换矩阵的情况下,上述输出变换矩阵和权重变换矩阵仍能适用于卷积计算。
结合第一方面,在第一方面的某些实现方式中,输出变换矩阵中的各列元素中的正数的数量是相同的,各列元素中负数的数量是相同的。
根据本申请实施例的方案,输出变换矩阵中的各列元素中的正数的数量是相同的,各列元素中的负数的数量是相同的,例如,输出变换矩阵中的各列元素中的﹢1的数量是相同的,各列元素中的-1的数量是相同的,能够均衡输出数据矩阵各个位置的量级,即减轻特征值累加的不均衡性,有利于模型的训练。此外,有利于对输出数据矩阵执行后续的处理,例如,均一化(batchnorm)处理。
第二方面，提供了一种神经网络模型的训练方法，该方法包括：通过winograd算法的输入变换矩阵对训练数据的输入数据矩阵进行输入数据变换以得到变换后的输入数据矩阵；利用变换后的权重矩阵对变换后的输入数据矩阵进行特征提取以得到中间矩阵，其中，变换后的权重矩阵是通过winograd算法的权重变换矩阵对至少一个特征提取层的权重矩阵进行权重变换得到的，中间矩阵中的每个元素是根据变换后的输入数据矩阵与变换后的权重矩阵中对应位置的元素之间的Lp距离确定的；通过winograd算法的输出变换矩阵对中间矩阵进行输出数据变换以得到输出数据矩阵；根据输出数据矩阵确定损失函数的值；根据损失函数的值对神经网络模型进行训练；在对神经网络模型进行训练的第m次迭代的过程中，p为2，在对神经网络模型进行训练的第n次迭代的过程中，p为1，m和n为正整数，m小于n。
根据本申请实施例的方案,在训练前期,利用L2距离辅助训练,L2距离对winograd 算法更加友好,能够提高训练过程的收敛速度,进而提高模型的训练效果。在训练后期,基于L1距离进行训练,以便进一步提高采用L1距离的模型的训练效果,训练好的模型中采用L1距离,对硬件更加友好。
训练数据的类型与神经网络模型的任务有关。例如,神经网络模型用于图像处理任务,则该训练数据可以为图像。具体地,图像处理任务包括图像分类、图像检测、图像分割或图像生成等。再如,神经网络模型用于文本处理任务,则该训练数据可以为文本。具体地,文本处理任务包括文本识别或文本翻译等。再如,神经网络模型用于语音处理任务,则该训练数据可以为语音数据。具体地,语音处理任务包括语音识别等。本申请实施例对训练数据的类型不做限定。
具体地,输出数据矩阵为输出特征图的部分或全部,根据输出特征图可以确定训练数据的处理结果,根据训练数据的处理结果计算损失函数的值。
训练数据的处理结果与训练数据的类型以及神经网络模型的任务有关。
示例性地,训练数据为图像数据,图像处理可以包括图像超分处理、图像去噪处理、图像识别处理等,相应地,图像处理结果包括图像超分、图像去噪或图像类别等。本申请实施例对此不做限定。
示例性地,训练数据为语音数据,语音处理可以包括语音识别等,相应地,语音处理结果包括语音识别结果等。本申请实施例对此不作限定。
结合第二方面,在第二方面的某些实现方式中,在对神经网络模型进行训练的过程中,p的初值为2,p的值随着迭代次数的增加而减少。
也就是说,在训练过程中将p从2减少为1。
结合第二方面,在第二方面的某些实现方式中,根据损失函数的值对神经网络模型进行训练,包括:根据损失函数对第一权重矩阵中的权重的偏导数调整第一权重矩阵中的权重,第一权重矩阵包括变换前的权重矩阵或变换后的权重矩阵。
在训练过程中,可以计算变换前的权重矩阵的梯度,进而调整变换前的权重矩阵的值。或者,也可以计算变换后的权重矩阵的梯度,进而调整变换后的权重矩阵的值。
结合第二方面,在第二方面的某些实现方式中,损失函数对第一权重矩阵中的权重的偏导数满足以下公式:
Figure PCTCN2021091574-appb-000004
Figure PCTCN2021091574-appb-000005
其中,p为计算的范数,p∈[1,2],w表示权重,x表示变换前的输入数据矩阵中的数据或变换后的输入数据矩阵中的数据,L表示损失函数,i表示神经网络模型的层数,sign()表示符号函数。
当p为2时，上式即为基于L2距离得到的损失函数对第一权重矩阵的偏导数，也就是说基于L2距离执行反向传播。p为1时，上式即为基于L1距离得到的损失函数对第一权重矩阵的偏导数，也就是说基于L1距离执行反向传播。
第三方面,提供了一种神经网络模型的运算装置,该装置包括用于执行上述第一方面以及第一方面中的任意一种实现方式中的方法的模块或单元。
第四方面,提供了一种神经网络模型的训练装置,该装置包括用于执行上述第二方面 以及第二方面中的任意一种实现方式中的方法的模块或单元。
应理解,在上述第一方面中对相关内容的扩展、限定、解释和说明也适用于第二方面、第三方面和第四方面中相同的内容。
第五方面,提供了一种神经网络模型的运算装置,该装置包括:存储器,用于存储程序;处理器,用于执行所述存储器存储的程序,当所述存储器存储的程序被执行时,所述处理器用于执行第一方面以及第一方面中的任意一种实现方式中的方法。
上述第五方面中的处理器既可以是中央处理器(central processing unit,CPU),也可以是CPU与神经网络运算处理器的组合,这里的神经网络运算处理器可以包括图形处理器(graphics processing unit,GPU)、神经网络处理器(neural-network processing unit,NPU)和张量处理器(tensor processing unit,TPU)等等。其中,TPU是谷歌(google)为机器学习全定制的人工智能加速器专用集成电路。
第六方面,提供了一种神经网络模型的训练装置,该装置包括:存储器,用于存储程序;处理器,用于执行所述存储器存储的程序,当所述存储器存储的程序被执行时,所述处理器用于执行第二方面以及第二方面中的任意一种实现方式中的方法。
上述第六方面中的处理器既可以是中央处理器,也可以是CPU与神经网络运算处理器的组合,这里的神经网络运算处理器可以包括图形处理器、神经网络处理器和张量处理器等等。其中,TPU是谷歌为机器学习全定制的人工智能加速器专用集成电路。
第七方面,提供一种计算机可读存储介质,该计算机可读介质存储用于设备执行的程序代码,该程序代码包括用于执行第一方面或第二方面中的任意一种实现方式中的方法。
第八方面,提供一种包含指令的计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述第一方面或第二方面中的任意一种实现方式中的方法。
第九方面,提供一种芯片,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,执行上述第一方面或第二方面中的任意一种实现方式中的方法。
可选地,作为一种实现方式,所述芯片还可以包括存储器,所述存储器中存储有指令,所述处理器用于执行所述存储器上存储的指令,当所述指令被执行时,所述处理器用于执行第一方面或第二方面中的任意一种实现方式中的方法。
上述芯片具体可以是现场可编程门阵列(field-programmable gate array,FPGA)或者专用集成电路(application-specific integrated circuit,ASIC)。
附图说明
图1是本申请实施例提供的一种人工智能主体框架示意图;
图2为本申请实施例提供的一种系统架构的结构示意图;
图3为本申请实施例提供的一种卷积神经网络的结构示意图;
图4为本申请实施例提供的一种芯片的硬件结构示意图;
图5为本申请实施例提供的一种系统架构的示意图;
图6为本申请实施例提供的一种神经网络模型的运算装置的示意性框图;
图7为本申请实施例提供的一种神经网络模型的运算方法的示意性流程图;
图8为本申请实施例提供的一种神经网络模型的训练方法的示意性流程图;
图9是本申请实施例提供的一种神经网络模型的训练装置的示意性框图;
图10是本申请实施例提供的一种神经网络模型的运算装置的示意性框图;
图11是本申请实施例提供的另一种神经网络模型的训练装置的示意性框图;
图12是本申请实施例提供的另一种神经网络模型的运算装置的示意性框图。
具体实施方式
下面将结合附图,对本申请中的技术方案进行描述。
图1示出一种人工智能主体框架示意图,该主体框架描述了人工智能系统总体工作流程,适用于通用的人工智能领域需求。
下面从“智能信息链”(水平轴)和“信息技术(information technology,IT)价值链”(垂直轴)两个维度对上述人工智能主题框架进行详细的阐述。
“智能信息链”反映从数据的获取到处理的一列过程。举例来说,可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中,数据经历了“数据—信息—知识—智慧”的凝练过程。
“IT价值链”从人工智能的底层基础设施、信息（提供和处理技术实现）到系统的产业生态过程，反映人工智能为信息技术产业带来的价值。
(1)基础设施:
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。
基础设施可以通过传感器与外部沟通,基础设施的计算能力可以由智能芯片提供。
这里的智能芯片可以是中央处理器(central processing unit,CPU)、神经网络处理器(neural-network processing unit,NPU)、图形处理器(graphics processing unit,GPU)、专门应用的集成电路(application specific integrated circuit,ASIC)以及现场可编程门阵列(field programmable gate array,FPGA)等硬件加速芯片。
基础设施的基础平台可以包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。
例如,对于基础设施来说,可以通过传感器和外部沟通获取数据,然后将这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据:
基础设施的上一层的数据用于表示人工智能领域的数据来源。该数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理:
上述数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等处理方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力:
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别,图像的识别等等。
(5)智能产品及行业应用:
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶,平安城市,智能终端等。
本申请实施例可以应用在人工智能中的很多领域,例如,智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶,平安城市等领域。
具体地,本申请实施例可以具体应用在自动驾驶、图像分类、图像检索、图像语义分割、图像质量增强、图像超分辨率和自然语言处理等需要使用(深度)神经网络的领域。
下面对图片分类和监控这两种应用场景进行简单的介绍。
图片分类:
当用户在终端设备(例如,手机)或者云盘上存储了大量的图片时,通过对相册中图像进行识别可以方便用户或者系统对相册进行分类管理,提升用户体验。
利用本申请实施例的神经网络模型的运算方法,能够降低硬件开销,对终端设备更友好。此外,能够提高利用该神经网络对图片进行分类的速度,有利于实时为不同的类别的图片打上标签,便于用户查看和查找。另外,这些图片的分类标签也可以提供给相册管理系统进行分类管理,节省用户的管理时间,提高相册管理的效率,提升用户体验。
监控:
监控场景包括:智慧城市、野外监控、室内监控、室外监控、车内监控等。其中,智慧城市场景下,需要进行多种属性识别,例如行人属性识别和骑行属性识别,深度神经网络凭借着其强大的能力在多种属性识别中发挥着重要的作用。
通过采用本申请实施例的神经网络模型的运算方法,能够提高神经网络模型的处理效率,有利于对输入的道路画面进行实时处理,更快地识别出道路画面中的不同的属性信息,同时降低功耗。
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对本申请实施例可能涉及的神经网络的相关术语和概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以x s和截距1为输入的运算单元,该运算单元的输出可以为:
h_{W,b}(x)=f(W^T x)=f(∑_{s=1}^{n} W_s·x_s + b)；
其中,s=1、2、……n,n为大于1的自然数,W s为x s的权重,b为神经单元的偏置。
f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号变换为输出信号。该激活函数的输出信号可以作为下一层的输入。例如,激活函数可以是ReLU,tanh或sigmoid函数。
神经网络是将多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输 出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
(2)深度神经网络
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有多层隐含层的神经网络。按照不同层的位置对DNN进行划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。
虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:
y=α(Wx+b)；
其中，x是输入向量，y是输出向量，b是偏移向量，W是权重矩阵（也称系数），α()是激活函数。每一层仅仅是对输入向量x经过如此简单的操作得到输出向量。由于DNN层数多，系数W和偏移向量b的数量也比较多。这些参数在DNN中的定义如下所述：以系数W为例：假设在一个三层的DNN中，第二层的第4个神经元到第三层的第2个神经元的线性系数定义为W^3_{24}，上标3代表系数W所在的层数，而下标对应的是输出的第三层索引2和输入的第二层索引4。
综上，第L-1层的第k个神经元到第L层的第j个神经元的系数定义为W^L_{jk}。
需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。
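As an informal illustration of the per-layer relation y = α(Wx + b) described above, the following Python sketch stacks two such layers; the layer sizes, the ReLU activation and the random weights are assumptions chosen for the example, not taken from this application:

```python
import numpy as np

def dense_layer(x, W, b):
    """Single DNN layer: y = alpha(W x + b), here with ReLU as the activation alpha."""
    return np.maximum(W @ x + b, 0.0)

# Toy example: two layers built from the same per-layer operation.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)                        # input vector
W1, b1 = rng.standard_normal((16, 8)), np.zeros(16)
W2, b2 = rng.standard_normal((4, 16)), np.zeros(4)
y = dense_layer(dense_layer(x, W1, b1), W2, b2)   # output vector of the second layer
print(y.shape)                                    # (4,)
```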
(3)卷积神经网络
卷积神经网络（convolutional neuron network，CNN）是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器，该特征抽取器可以看作是滤波器。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中，一个神经元可以只与部分邻层神经元连接。一个卷积层中，通常包含若干个特征平面，每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重，这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。卷积核可以以随机大小的矩阵的形式初始化，在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外，共享权重带来的直接好处是减少卷积神经网络各层之间的连接，同时又降低了过拟合的风险。
(4)损失函数
在训练深度神经网络的过程中，因为希望深度神经网络的输出尽可能的接近真正想要预测的值，所以可以通过比较当前网络的预测值和真正想要的目标值，再根据两者之间的差异情况来更新每一层神经网络的权重向量（当然，在第一次更新之前通常会有初始化的过程，即为深度神经网络中的各层预先配置参数），比如，如果网络的预测值高了，就调整权重向量让它预测低一些，不断地调整，直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此，就需要预先定义“如何比较预测值和目标值之间的差异”，这便是损失函数（loss function）或目标函数（objective function），它们是用于衡量预测值和目标值的差异的重要方程。其中，以损失函数举例，损失函数的输出值（loss）越高表示差异越大，那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。通常地，loss越小，该深度神经网络的训练质量越高，loss越大，深度神经网络的训练质量越低。类似的，loss波动越小，训练越稳定；loss波动越大，训练越不稳定。
(5)反向传播算法
神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正的神经网络模型中参数的大小,使得神经网络模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新的神经网络模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的神经网络模型的参数,例如权重矩阵。
例如,神经网络模型每次训练产生的loss值在神经网络模型中从后向前逐层传递。传递到每一层时,同时计算出该层参数的更新量(偏导运算),这个更新量与梯度(gradient)相关。
(6)加法器神经网络(adder neural network,AdderNet)
AdderNet的结构与CNN的结构类似。CNN中的卷积层可以用于对输入数据进行特征提取处理或者说滤波处理。CNN中的卷积层可以作为CNN中的特征提取层。AdderNet中的加法器层也可以用于对输入数据进行特征提取处理或者说滤波处理。AdderNet中的加法器层可以作为AdderNet中的特征提取层。CNN中的卷积核的参数可以理解为AdderNet中的加法器的参数。CNN中的卷积核和AdderNet中的加法器均可以理解为滤波器(filter)。
CNN中的各个卷积层通过卷积操作从输入数据中提取特征信息,AdderNet采用l1距离计算输出特征。具体地,AdderNet中的加法器层通过加法操作(或减法操作)以及取绝对值操作从输入数据中提取特征信息。
由于加法运算的计算复杂度远小于乘法运算,AdderNet的运算功耗远小于与其性能相当的CNN的运算功耗。例如,将CNN中的卷积操作替换为加法操作(或减法操作)以及取绝对值操作,得到AdderNet。这样能够在保证性能的同时,大幅减少CNN的运算功耗。
AdderNet中的加法器层的输出特征图满足如下公式:
Y(m,n,t)=-∑_{i=0}^{d}∑_{j=0}^{d}∑_{k=0}^{c_in}|X(m+i,n+j,k)-F(i,j,k,t)|；
其中,Y(m,n,t)表示输出特征图中的第m行第n列第t页的元素,X(m+i,n+j,k)表示输入特征图中的第m+i行第n+j列第k页的元素,F(i,j,k,t)表示滤波器的第i行第j类第k页的元素,t表示该滤波器的通道数,d表示滤波器的行数,c in表示输入特征图的通道数。
AdderNet在前向计算过程中采用L1距离提取特征,即利用加法运算替代乘法运算,减小了计算复杂度,降低了功耗以及硬件面积。
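A minimal NumPy sketch of the adder-layer forward computation described above; the shapes, the stride-1/no-padding behaviour and the naive loop structure are assumptions made only for illustration:

```python
import numpy as np

def adder_layer(X, F):
    """Naive adder-layer forward: Y(m,n,t) = -sum_{i,j,k} |X(m+i,n+j,k) - F(i,j,k,t)|.

    X: input feature map of shape (H, W, c_in); F: filters of shape (d, d, c_in, c_out).
    Stride 1 and no padding are assumed, so the output has shape (H-d+1, W-d+1, c_out).
    """
    H, W, c_in = X.shape
    d, _, _, c_out = F.shape
    Y = np.zeros((H - d + 1, W - d + 1, c_out))
    for m in range(H - d + 1):
        for n in range(W - d + 1):
            patch = X[m:m + d, n:n + d, :]                         # (d, d, c_in) window
            for t in range(c_out):
                Y[m, n, t] = -np.abs(patch - F[:, :, :, t]).sum()  # negated L1 distance
    return Y

# Example: 6x6 single-channel input, two 3x3 filters.
X = np.arange(36, dtype=float).reshape(6, 6, 1)
F = np.ones((3, 3, 1, 2))
print(adder_layer(X, F).shape)   # (4, 4, 2)
```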
(7)winograd算法
Winograd算法是一种常用的卷积快速计算方法,能够在不影响计算结果的情况下大幅减少CNN的计算量,提高运算效率。
winograd算法满足以下公式:
Y=A^T[[GgG^T]⊙[B^TdB]]A;
其中,Y表示输出数据矩阵;g表示卷积核,即权重矩阵;d表示输入特征图的分片(title),即输入数据矩阵;⊙表示逐元素相乘(element-wise multiplication),即矩阵之间的点乘。A、G和B均表示变换矩阵,具体地,A表示输出变换矩阵,G表示权重变换矩阵,B表示输入变换矩阵。其中,在神经网络的推理过程中,g不会发生改变,因此,变换后的权重矩阵
GgG^T可以是预先计算的，即在前向计算开始前，预先计算变换后的权重矩阵。这样可以进一步减少神经网络模型的运算功耗。
形如F(m,r)的winograd算法用于快速计算卷积核的尺寸为r、输出特征图的尺寸为m的卷积运算。尺寸也可以理解为维数。
变换矩阵可以是根据卷积核的维数和步长(stride)确定的。或者说,B、G、A对于特定大小的卷积核与stride的组合都是固定的,可以根据winograd算法推导出来。
在实际应用中,winograd算法的常用形式为F(2×2,3×3)。
其中,输出数据矩阵Y为2×2的矩阵,权重矩阵g为3×3的矩阵。
对于3×3的权重矩阵和stride为1的组合,变换矩阵满足如下公式:
B^T=[[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]；
G=[[1, 0, 0], [1/2, 1/2, 1/2], [1/2, -1/2, 1/2], [0, 0, 1]]；
A^T=[[1, 1, 1, 0], [0, 1, -1, -1]]。
下面对F(2×2,3×3)的winograd算法的计算流程进行举例说明。
1)通过4×4的输入变换矩阵B对4×4的输入数据矩阵d进行变换,得到变换后的4×4的输入数据矩阵
B^TdB。
例如,输入数据矩阵d满足如下公式。
Figure PCTCN2021091574-appb-000019
变换后的输入数据矩阵满足如下公式:
Figure PCTCN2021091574-appb-000020
2)通过4×3的权重变换矩阵G对3×3的权重矩阵g进行变换,得到变换后的4×4的权重矩阵
GgG^T。
例如,权重矩阵g满足如下公式。
Figure PCTCN2021091574-appb-000022
变换后的权重矩阵满足如下公式:
Figure PCTCN2021091574-appb-000023
3)将变换后的输入数据矩阵与变换后的权重矩阵进行逐元素相乘操作,即将两个矩阵中对应位置的元素相乘,得到4×4的中间矩阵
[GgG^T]⊙[B^TdB]。
4)通过4×2的输出变换矩阵A对中间矩阵进行变换,得到2×2的输出数据矩阵
Y=A^T[[GgG^T]⊙[B^TdB]]A。
该输出矩阵即为输入数据矩阵d与权重矩阵g进行卷积计算的结果。
若采用卷积操作得到输出数据矩阵中的4个结果,则每个结果需要9(3*3)次乘法操作,4个结果需要36次乘法操作。若采用winograd算法,除变换的开销之外,仅需要步骤3)中的16(4*4)次乘法操作,乘法次数的加速比达到2.25(36/16)倍。变换矩阵中的元素均为0、±1、±1/2的数值,可以通过变换符号位和移位操作等轻量的硬件操作完成,也就是说,变换的开销通常较小。
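The F(2×2, 3×3) flow above can be sketched in NumPy as follows, using the standard transform matrices for a 3×3 kernel with stride 1 and checking the result against a direct "valid" convolution on the same 4×4 tile; the random test data is an assumption for the example:

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd transform matrices (3x3 kernel, stride 1).
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(d, g):
    """Y = A^T [(G g G^T) * (B^T d B)] A for a 4x4 tile d and a 3x3 kernel g."""
    U = G @ g @ G.T               # transformed weight matrix, 4x4
    V = B_T @ d @ B_T.T           # transformed input tile, 4x4
    return A_T @ (U * V) @ A_T.T  # element-wise product, then output transform -> 2x2

def direct_conv_valid(d, g):
    """Reference 'valid' convolution (cross-correlation) of a 4x4 tile with a 3x3 kernel."""
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = (d[i:i + 3, j:j + 3] * g).sum()
    return out

rng = np.random.default_rng(1)
d, g = rng.standard_normal((4, 4)), rng.standard_normal((3, 3))
print(np.allclose(winograd_f2x2_3x3(d, g), direct_conv_valid(d, g)))  # True
```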
如图2所示,本申请实施例提供了一种系统架构100。在图2中,数据采集设备160用于采集训练数据。例如,针对本申请实施例的神经网络模型的训练方法来说,若训练数据为图像数据,则训练数据可以包括训练图像以及训练图像对应的处理结果。例如,训练图像对应的分类结果,训练图像的分类结果可以是人工预先标注的结果。
在采集到训练数据之后,数据采集设备160将这些训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标模型/规则101。
下面对训练设备120基于训练数据得到目标模型/规则101进行描述,训练设备120对输入的原始数据进行处理,将输出值与目标值进行对比,直到训练设备120输出的值与目标值的差值小于一定的阈值,从而完成目标模型/规则101的训练。
本申请实施例中的目标模型/规则101具体可以为神经网络模型。例如,卷积神经网络或残差网络。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备120训练得到的目标模型/规则101可以应用于不同的系统或设备中,如应用于图2所示的执行设备110,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(augmented reality,AR)AR/虚拟现实(virtual reality,VR),车载终端等,还可以是服务器或者云端等。在图2中,执行设备110配置输入/输出(input/output,I/O)接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,输入数据在本申请实施例中可以包括:客户设备输入的待处理的数据。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储系统150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。
最后,I/O接口112将处理结果,如上述得到的数据的处理结果返回给客户设备140,从而提供给用户。
值得说明的是,训练设备120可以针对不同的目标或不同的任务,基于不同的训练数据生成相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在图2中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,图2仅是本申请实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图2中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110中。
如图2所示,根据训练设备120训练得到目标模型/规则101,该目标模型/规则101在本申请实施例中可以是本申请中的神经网络,具体的,本申请实施例的神经网络可以为CNN或残差网络等。
CNN是一种非常常见的神经网络,下面结合图3重点对CNN的结构进行详细的介绍。如前文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。
如图3所示,卷积神经网络(CNN)200可以包括输入层210,卷积层/池化层220(其中池化层为可选的),以及全连接层(fully connected layer)230。
卷积层/池化层220:
卷积层:
如图3所示卷积层/池化层220可以包括如示例221-226层,举例来说:在一种实现中,221层为卷积层,222层为池化层,223层为卷积层,224层为池化层,225为卷积层,226为池化层;在另一种实现方式中,221、222为卷积层,223为池化层,224、225为卷积层,226为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
下面将以卷积层221为例,介绍一层卷积层的内部工作原理。
卷积层221可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输 入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用多个尺寸(行×列)相同的权重矩阵,即多个同型矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度,这里的维度可以理解为由上面所述的“多个”来决定。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化等。该多个权重矩阵尺寸(行×列)相同,经过该多个尺寸相同的权重矩阵提取后的特征图的尺寸也相同,再将提取到的多个尺寸相同的特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网络200进行正确的预测。
当卷积神经网络200有多个卷积层的时候,初始的卷积层(例如221)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络200深度的加深,越往后的卷积层(例如226)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,在如图3中220所示例的221-226各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值作为平均池化的结果。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像尺寸相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
全连接层230:
在经过卷积层/池化层220的处理后,卷积神经网络200还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层220只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络200需要利用全连接层230来生成一个或者一组所需要的类的数量的输出。因此,在全连接层230中可以包括多层隐含层(如图3所示的231、232至23n),该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等。
在全连接层230中的多层隐含层之后,也就是整个卷积神经网络200的最后层为输出 层240,该输出层240具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络200的前向传播(如图3由210至240方向的传播为前向传播)完成,反向传播(如图3由240至210方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络200的损失,及卷积神经网络200通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图3所示的卷积神经网络200仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在,例如,仅包括图3中所示的网络结构的一部分,比如,本申请实施例中所采用的卷积神经网络可以仅包括输入层210、卷积层/池化层220和输出层240。
图4为本申请实施例提供的一种芯片的硬件结构,该芯片包括神经网络处理器50。该芯片可以被设置在如图2所示的执行设备110中,用以完成计算模块111的计算工作。该芯片也可以被设置在如图2所示的训练设备120中,用以完成训练设备120的训练工作并输出目标模型/规则101。本申请实施例中的方法可在如图4所示的芯片中得以实现。
神经网络处理器50可以是神经网络处理器(neural-network processing unit,NPU),张量处理器(tensor processing unit,TPU),或者图形处理器(graphics processing unit,GPU)等一切适合用于大规模异或运算处理的处理器。以NPU为例:神经网络处理器NPU50作为协处理器挂载到主中央处理器(central processing unit,CPU)(host CPU)上,由主CPU分配任务。NPU的核心部分为运算电路503,控制器504控制运算电路503提取存储器(权重存储器或输入存储器)中的数据并进行运算。其中,TPU是谷歌(google)为机器学习全定制的人工智能加速器专用集成电路。
在一些实现中,运算电路503内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路503是二维脉动阵列。运算电路503还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路503是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器502中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器501中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)508中。
向量计算单元507可以对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元507可以用于神经网络中非卷积/非FC层的网络计算,如池化(pooling),批归一化(batch normalization,BN),局部响应归一化(local response normalization)等。
在一些实现中，向量计算单元507将经处理的输出的向量存储到统一缓存器506。例如，向量计算单元507可以将非线性函数应用到运算电路503的输出，例如累加值的向量，用以生成激活值。在一些实现中，向量计算单元507生成归一化的值、合并值，或二者均有。在一些实现中，处理过的输出的向量能够用作到运算电路503的激活输入，例如用于在神经网络中的后续层中的使用。
统一存储器506用于存放输入数据以及输出数据。
权重数据直接通过存储单元访问控制器505(direct memory access controller,DMAC) 将外部存储器中的输入数据搬运到输入存储器501和/或统一存储器506、将外部存储器中的权重数据存入权重存储器502,以及将统一存储器506中的数据存入外部存储器。
总线接口单元(bus interface unit,BIU)510,用于通过总线实现主CPU、DMAC和取指存储器509之间进行交互。
与控制器504连接的取指存储器(instruction fetch buffer)509,用于存储控制器504使用的指令;
控制器504，用于调用取指存储器509中缓存的指令，实现控制该运算加速器的工作过程。
一般地,统一存储器506,输入存储器501,权重存储器502以及取指存储器509均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
上文中介绍的图2中的执行设备110或图4中的芯片能够执行本申请实施例的神经网络模型的运算方法的各个步骤。上文中介绍的图2中的训练设备120或图4中的芯片能够执行本申请实施例的神经网络模型的训练方法的各个步骤。
如图5所示,本申请实施例提供了一种系统架构300。该系统架构包括本地设备301、本地设备302以及执行设备310和数据存储系统350,其中,本地设备301和本地设备302通过通信网络与执行设备310连接。
执行设备310可以由一个或多个服务器实现。可选的,执行设备310可以与其它计算设备配合使用,例如:数据存储器、路由器、负载均衡器等设备。执行设备310可以布置在一个物理站点上,或者分布在多个物理站点上。执行设备310可以使用数据存储系统350中的数据,或者调用数据存储系统350中的程序代码来实现本申请实施例的神经网络模型的运算方法或神经网络模型的训练方法。
具体地,在一种实现方式中,执行设备110可以执行以下过程:
在神经网络模型的至少一个特征提取层中执行以下操作:通过winograd算法对待处理数据的输入数据矩阵进行输入数据变换,得到变换后的输入数据矩阵;利用变换后的权重矩阵对变换后的输入数据矩阵进行特征提取,得到中间矩阵,其中,变换后的权重矩阵是通过winograd算法对至少一个特征提取层的权重矩阵进行权重变换得到的,中间矩阵中的每个元素是根据变换后的输入数据矩阵与变换后的权重矩阵中对应位置的元素之间的L1距离确定的;通过winograd算法对中间矩阵进行输出数据变换,得到输出数据矩阵。
用户可以操作各自的用户设备(例如本地设备301和本地设备302)与执行设备310进行交互。每个本地设备可以表示任何计算设备,例如个人计算机、计算机工作站、智能手机、平板电脑、智能摄像头、智能汽车或其他类型蜂窝电话、媒体消费设备、可穿戴设备、机顶盒、游戏机等。
每个用户的本地设备可以通过任何通信机制/通信标准的通信网络与执行设备310进行交互,通信网络可以是广域网、局域网、点对点连接等方式,或它们的任意组合。
在一种实现方式中,本地设备301、本地设备302从执行设备310获取到神经网络的相关参数,将神经网络部署在本地设备301、本地设备302上,利用该神经网络进行图像 分类、进行图像处理、语音处理或者文本处理等等。
在另一种实现中,执行设备310上可以直接部署神经网络,执行设备310通过从本地设备301和本地设备302获取待处理数据,并采用神经网络模型对待处理数据进行处理。
上述执行设备310也可以为云端设备,此时,执行设备310可以部署在云端;或者,上述执行设备310也可以为终端设备,此时,执行设备310可以部署在用户终端侧,本申请实施例对此并不限定。
神经网络模型中通常涉及大量的矩阵运算,例如,卷积运算,矩阵运算的时延会成为制约计算效率的主要因素,影响神经网络模型整体的处理效率,使得神经网络模型难以部署到对实时性要求较高的场景中。同时,当神经网络模型的计算量较大时,对硬件的算力要求较高,导致神经网络模型难以部署到算力较低的硬件设备,例如,手机等终端设备上。
因此,如何降低神经网络模型的运算开销成为一个亟待解决的问题。
本申请实施例提供了一种神经网络模型的运算方法,将winograd算法中的element-wise multiplication的操作替换为加法操作,进一步减少了神经网络模型的计算量,提高了神经网络模型的运算速度,降低了运算开销。
为了更好地说明本申请实施例的方法,下面结合附图对本申请实施例的神经网络模型的运算装置进行说明。
图6示出了本申请实施例提供的一种神经网络模型的运算装置,如图6所示,装置600包括输入数据前处理模块610、权重前处理模块620和加速模块630。
其中,输入数据前处理模块610通过winograd算法对输入数据矩阵进行输入数据变换,得到变换后的输入数据矩阵。输入数据矩阵可以为输入特征图的部分或全部。例如,输入特征图的维数为8*8,输入数据矩阵可以为4*4,输入数据矩阵也可以为8*8。
权重前处理模块620通过winograd算法对权重矩阵进行权重变换,得到变换后的权重矩阵。
需要说明的是,权重前处理模块620为可选模块。
示例性地,变换后的权重矩阵可以预先存储在装置600中,或者,变换后的权重矩阵也可以是由其他装置发送至装置600中的。
也就是说,变换后的权重矩阵可以是执行神经网络模型的运算之前计算得到的,即离线计算得到的,也可以是执行神经网络模型的运算时计算得到的,即在线计算得到的。在神经网络模型的推理过程中,权重矩阵是不变的,通过离线计算得到变换后的权重矩阵能够进一步提高运算速度,降低运算开销。
加速模块630用于利用变换后的权重矩阵对变换后的输入数据矩阵进行特征提取,得到中间矩阵;对中间矩阵进行winograd变换,得到输出数据矩阵。中间矩阵中的每个元素是根据变换后的输入数据矩阵与所述变换后的权重矩阵中对应位置的元素之间的L1距离确定的
输出数据矩阵可以为输出特征图的部分或全部。
具体地,加速模块可以通过对变换后的输入数据矩阵和变换后的权重矩阵执行减法运算、计算绝对值的运算和取反运算得到中间矩阵。
具体过程详见后文中的方法700,此处不再赘述。
示例性地,加速模块630可以包括矩阵运算模块和后处理模块。矩阵运算模块用于对 利用变换后的权重矩阵对变换后的输入数据矩阵进行特征提取,得到中间矩阵。后处理模块用于通过winograd算法对中间矩阵进行输出数据变换,得到输出数据矩阵。
本申请实施例的方法可以由能够执行winograd算法的装置执行。具体地,图6中的各个模块中的运算可以由软件执行,也可以由硬件执行,还可以由软件和硬件共同执行。本申请实施例对此不做限定。
示例性地,图6中的各个模块的运算可以由图4中的芯片执行。
具体地,图4中的外部存储器可以用于存储输入数据矩阵和计算结果等。神经网络处理器50可以从外部存储器中获取输入数据矩阵。
输入数据前处理模块610中的运算可以由向量计算单元507执行。或者,向量计算单元507中包括输入数据前处理模块610。
进一步地,外部存储器可以用于存储离线计算得到的变换后的权重矩阵。在该情况下,神经网络处理器50可以从外部存储器中获取输入数据矩阵和变换后的权重矩阵。
若外部存储器中没有存储离线计算得到的变换后的权重矩阵,权重前处理模块620中的运算可以由向量计算模块507执行。或者,向量计算模块507中包括权重前处理模块620。
加速模块630中的矩阵运算模块的运算可以由运算电路503执行。或者,运算电路503包括该矩阵运算模块。
加速模块630中的后处理模块的运算可以由向量计算模块507执行。或者,向量计算模块507包括该后处理模块。
可替换地,加速模块630也可以是在图4所示的芯片的基础上设置的专用模块,相对于采用运算电路和向量计算模块执行加速模块630中的操作,设置专用模块能够进一步提升加速模块630的处理速度。
下面结合图7对本申请实施例中的神经网络模型的运算方法进行详细的描述。
图7示出了本申请实施例提供的神经网络模型的运算方法700。图7所示的方法可以由神经网络模型的执行装置来执行,该装置可以是云服务设备,也可以是终端设备,例如,电脑、服务器等运算能力足以用来执行神经网络模型运算的装置,也可以是由云服务设备和终端设备构成的系统。示例性地,方法700可以由图2中的执行设备110、图4中的神经网络处理器50或图5中的执行设备310或本地设备执行。或者,方法700也可以由提供AutoML服务的设备执行。示例性地,提供AutoML服务的设备可以是云服务设备。
例如,方法700具体可以由如图2所示的执行设备110执行,方法700中的待处理数据可以是如图2所示的客户设备140给出的输入数据。
方法700包括步骤S710至步骤S730。下面对步骤S710至步骤S730进行详细介绍。
在神经网络模型的至少一个特征提取层中执行以下步骤:
S710,通过winograd算法对待处理数据的输入数据矩阵进行输入数据变换,得到变换后的输入数据矩阵。
待处理数据包括图像数据、语音数据或文本数据等。
待处理数据的类型与神经网络模型的任务有关。例如,神经网络模型用于图像处理任务,则该待处理数据可以为图像。具体地,图像处理任务包括图像分类、图像检测、图像分割、图像识别或图像生成等。再如,神经网络模型用于文本处理任务,则该待处理数据 可以为文本。具体地,文本处理任务包括文本识别或文本翻译等。再如,神经网络模型用于语音处理任务,则该待处理数据可以为语音数据。具体地,语音处理任务包括语音识别等。本申请实施例对待处理数据的类型不做限定。
示例性地,待处理数据为图像,待处理图像可以是终端设备(或者电脑、服务器等其他装置或设备)通过摄像头拍摄到的图像,或者,该待处理图像还可以是从终端设备(或者电脑、服务器等其他装置或设备)内部获得的图像(例如,终端设备的相册中存储的图像,或者终端设备从云端获取的图像),本申请实施例对此并不限定。
本申请实施例中的神经网络模型可以是现有的神经网络模型,例如,残差网络。或者,该神经网络模型也可以是自行构建的其他结构的神经网络模型。本申请实施例对此不作限定。
待处理数据的输入数据矩阵指的是输入该至少一个特征提取层的数据矩阵。
示例性地,特征提取层包括加法器层,即以加法操作作为滤波器进行特征提取。具体描述可以参见前文中的AdderNet中的特征提取层,此处不再赘述。
可替换地,特征提取层包括卷积层,即以卷积核作为滤波器进行特征提取。具体描述可以参见前文中的卷积神经网络,此处不再赘述。在该情况下,可以将该卷积层替换为加法器层,然后对该特征提取层执行方法700。
具体地,通过输入转换矩阵对输入数据矩阵进行winograd变换,得到变换后的输入数据矩阵。
Winograd变换包括通过winograd算法执行的输入数据变换、权重变换和输出数据变换。也就是说上述三种变换均可以理解为winograd变换。
示例性地,变换后的输入数据矩阵
d̂满足如下公式：
d̂=B^TdB；
其中,B表示输入变换矩阵,B T表示B的转置矩阵,d表示变换前的输入数据矩阵。
示例性地,输入转换矩阵可以为现有的winograd输入变换矩阵。
例如,变换前的输入数据矩阵d为4×4的矩阵,变换前的权重矩阵g为3×3的矩阵,输入变换矩阵可以为:
B=[[1, 0, 0, 0], [0, 1, -1, 1], [-1, 1, 1, 0], [0, 0, 0, -1]]。
输入数据矩阵可以为输入特征图的部分或全部。例如,输入特征图的维数为8*8,输入数据矩阵可以为输入特征图中4*4的矩阵,或者,输入数据矩阵也可以为该输入特征图。
需要说明的是,输入特征图可以为输入至神经网络模型中的数据自身,例如,待处理的图像。输入特征图也可以为经过神经网络模型中的部分特征提取层进行一次或多次特征提取后得到的特征图,例如,对待处理的图像进行一次或多次特征提取后得到的特征图。特征提取的方式可以采用现有方案,例如,卷积层处理。也就是说,本申请实施例中的输入特征图可以是对待处理的图像进行一次或多次卷积处理之后得到的特征图。或者,特征提取的方式也可以采用本申请实施例中的方法700。也就是说,本申请实施例中的输入特征图可以是通过本申请实施例的方法700对待处理的图像进行处理后得到的特征图。本申 请实施例对输入特征图的获取方式不做限定。
示例性地,步骤S710可以由图6中的输入数据前处理模块610执行。
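For instance, when the input feature map is 8×8 and F(2×2, 3×3) tiles are used, the 4×4 input data matrices can be taken as overlapping tiles whose top-left corners are 2 pixels apart, so that their 2×2 output tiles cover the output feature map without gaps. The sketch below illustrates this tiling; the even-size and no-padding assumptions are made only for the example:

```python
import numpy as np

def extract_tiles_4x4(feature_map):
    """Split an HxW feature map into overlapping 4x4 input tiles for F(2x2, 3x3).

    Adjacent tiles start 2 pixels apart (assumes H and W are even and >= 4, no padding).
    """
    H, W = feature_map.shape
    tiles = []
    for i in range(0, H - 2, 2):
        for j in range(0, W - 2, 2):
            tiles.append(feature_map[i:i + 4, j:j + 4])
    return tiles

fm = np.arange(64, dtype=float).reshape(8, 8)
print(len(extract_tiles_4x4(fm)))   # 9 tiles of size 4x4 for an 8x8 feature map
```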
S720,利用变换后的权重矩阵对变换后的输入数据矩阵进行特征提取,得到中间矩阵。其中,变换后的权重矩阵是通过winograd算法对该至少一个特征提取层的权重矩阵进行权重变换得到的。中间矩阵中的每个元素是根据变换后的输入数据矩阵与变换后的权重矩阵中对应位置的元素之间的L1距离确定的。
L1距离也可以称为L1正则距离、L1范数距离、曼哈顿距离或出租车距离。
通过权重变换矩阵对权重矩阵进行winograd变换,可以得到变换后的权重矩阵。
具体地,该权重矩阵为神经网络模型中该至少一个特征提取层中的一个或多个特征提取核的参数。一个特征提取层中可以包括一个或多个特征提取核,即滤波器。特征提取核用于对输入神经网络模型的数据进行特征提取。
示例性地,变换后的权重矩阵
ĝ满足如下公式：
ĝ=GgG^T；
其中,G表示权重变换矩阵,G T表示G的转置矩阵,g表示变换前的权重矩阵。
示例性地,权重转换矩阵可以为现有的winograd权重变换矩阵。
例如,变换前的输入数据矩阵d为4×4的矩阵,变换前的权重矩阵g为3×3的矩阵,权重变换矩阵可以为:
G=[[1, 0, 0], [1/2, 1/2, 1/2], [1/2, -1/2, 1/2], [0, 0, 1]]。
变换后的权重矩阵可以是离线变换得到的,也可以是在线变换得到的。
变换后的权重矩阵是离线变换得到的,指的是,变换后的权重矩阵是在执行神经网络模型的运算之前变换得到的。例如,变换后的权重矩阵可以是在神经网络模型部署之前变换得到的。在神经网络模型的推理过程中,权重矩阵是不变的,通过离线计算得到变换后的权重矩阵能够进一步提高运算速度,降低运算开销。
变换后的权重矩阵是在线变换得到的,指的是,变换后的权重矩阵是在执行神经网络模型的运算的过程中变换得到的。
具体地,中间矩阵中的每个元素为变换后的输入数据矩阵与变换后的权重矩阵中对应位置的元素之间的L1距离的相反数。
示例性地,中间矩阵X满足如下公式:
X=-|[GgG^T]-[B^TdB]|;
本申请实施例中,上式中的两项之间的减号表示逐元素相减(element-wise minus),|·|表示对矩阵中的每个元素计算绝对值。
步骤S720可以通过对变换后的输入数据矩阵和变换后的权重矩阵执行减法运算、计算绝对值的运算和取反运算实现。
具体地,步骤S720包括步骤S721至步骤S723。
S721,对变换后的输入数据矩阵和变换后的权重矩阵进行逐元素相减运算,得到差值矩阵。
S722,计算该差值矩阵的绝对值,得到绝对值矩阵。
S723,计算该绝对值矩阵的相反数,得到中间矩阵。
需要说明的是,加法操作和减法操作是实质上是相同的,减法运算的结果也可以通过加法操作得到,为了描述简便,本申请实施例中,将加法操作和减法操作统称为加法操作。
示例性地,步骤S720可以由图6中的加速模块630执行。
需要说明的是,神经网络模型中可以包括多个特征提取层,每个特征提取层可以包括一个或多个特征提取核,即权重矩阵。相应地,在每个特征提取层中可以执行一次或多次特征提取处理,该多次特征提取中的一次或多次可以采用方法700执行相应的运算过程。也就是说,对神经网络模型的一个特征提取层执行方法700可以包括对该特征提取层中的一个特征提取核执行方法700。
S730,通过winograd算法对中间矩阵进行输出数据变换,得到输出数据矩阵。
具体地,通过输出转换矩阵对中间矩阵进行winograd变换,得到输出数据矩阵。
示例性地,输出数据矩阵Y满足如下公式:
Y=A^TXA;
具体地，输出数据矩阵Y满足如下公式：
Y=A^T[-|[GgG^T]-[B^TdB]|]A;
其中，A表示输出变换矩阵，A^T表示A的转置矩阵。
示例性地,输出转换矩阵可以为现有的winograd输入变换矩阵。
例如,变换前的输入数据矩阵d为4×4的矩阵,变换前的权重矩阵g为3×3的矩阵,输出变换矩阵可以为:
A=[[1, 0], [1, 1], [1, -1], [0, -1]]。
其中,输出数据矩阵可以为输出特征图的部分或全部。
示例性地,步骤S730可以由图6中的加速模块630执行。
根据本申请实施例的方案,将winograd中的点乘操作替换为计算L1距离的操作等加法操作,减少了特征提取过程的计算量,提高了模型的运行速度,减少了运算开销。
本申请实施例的方案可以理解为将winograd算法与AdderNet合并的方案。或者,也可以理解为利用winograd算法对AdderNet的加法器层进行优化的方案。具体来说,从AdderNet的角度而言,该方案减少了AdderNet的加法器层中的加法运算的次数,减少了模型的计算量;从winograd算法的角度而言,该方案利用加法操作替换乘法操作,减少了模型的计算量。
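An illustrative per-tile sketch of steps S710 to S730, i.e. Y = A^T[-|GgG^T - B^TdB|]A; the matrices used here are the standard F(2×2, 3×3) transforms rather than the adjusted output and weight transform matrices proposed later in this application, and the random tile and kernel are assumptions for the example:

```python
import numpy as np

B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

def adder_winograd_tile(d, g):
    """One F(2x2, 3x3) tile of the adder-Winograd forward pass (steps S710-S730).

    S710: input transform V = B^T d B
    (offline) weight transform U = G g G^T
    S720: intermediate matrix X = -|U - V|  (element-wise L1 distance, negated)
    S730: output transform Y = A^T X A
    """
    U = G @ g @ G.T        # transformed 4x4 weight matrix (can be precomputed offline)
    V = B_T @ d @ B_T.T    # transformed 4x4 input data matrix
    X = -np.abs(U - V)     # replaces the element-wise multiplication of plain Winograd
    return A_T @ X @ A_T.T # 2x2 output data matrix

rng = np.random.default_rng(2)
d, g = rng.standard_normal((4, 4)), rng.standard_normal((3, 3))
print(adder_winograd_tile(d, g))
```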
可选地,输出变换矩阵中的元素的取值为以下任一项:0,-1,1。
这样,输出变换矩阵中的元素的取值为0,1或-1,可以减少乘法操作,进一步减少计算次数,有利于减少模型的计算量。
可选地,在变换前的权重矩阵的维数为3×3,输出数据矩阵Y的维数为2×2的情况下, 输出变换矩阵可以为:
Figure PCTCN2021091574-appb-000033
可选地,输出变换矩阵的至少一行的元素为第一矩阵中与该至少一行对应的位置的元素的相反数,输出变换矩阵的其他行的元素和第一矩阵中与其他行对应的位置的元素相同,第一矩阵可以为:
Figure PCTCN2021091574-appb-000034
其中,c 0、c 1和c 2分别是0,-1,1中的任一项。
也就是说,A'的任一行均可以取反,而不影响winograd变换的最终结果。
其中,c 0、c 1和c 2分别是0,-1,1中的任一项。
例如,c 0为0,c 1为-1,c 2为1。再如,c 0为-1,c 1为1,c 2为0。
可选地,在变换前的权重矩阵的维数为3×3,输出数据矩阵Y的维数为2×2的情况下,权重变换矩阵可以为:
Figure PCTCN2021091574-appb-000035
上述输出变换矩阵和权重变换矩阵满足winograd算法的通解形式,能够保证通过winograd算法得到的卷积计算的结果与常规的卷积计算的结果相同。这样,在采用现有的输入变换矩阵的情况下,上述输出变换矩阵和权重变换矩阵仍能适用于卷积计算。
需要说明的是,winograd算法中的变换矩阵可以包括多种形式,只要满足winograd的通解形式即可保证通过winograd算法得到的卷积计算的结果与常规的卷积计算的结果相同。上述输出变换矩阵和权重变换矩阵仅为采用现有的winograd算法的输入变换矩阵的情况下的示例。在输入数据矩阵调整的情况下,输出变换矩阵和权重变换矩阵还可以采用其他形式。
然而,输出数据矩阵中的各个位置的元素可能存在不均衡的情况,导致训练过程中损失函数下降速度较慢,训练得到的模型的准确率较低。此外,该情况还会影响输出数据矩阵的后续处理过程,例如,均一化(batchnorm)处理,影响模型的性能。
下面以中间矩阵X为4×4的矩阵,输出数据矩阵Y为2×2的矩阵为例进行说明。
中间矩阵X可以表示为如下形式:
X=[[x0, x1, x2, x3], [x4, x5, x6, x7], [x8, x9, x10, x11], [x12, x13, x14, x15]]；
其中，x0、x1、…、x15表示中间矩阵X中的元素。
输出数据矩阵可以表示为如下形式:
Y=[[y0, y1], [y2, y3]]；
其中,y 0、y 1、y 2和y 3表示输出数据矩阵Y中的元素。
由Y=A TXA可以得到,若采用现有winograd算法中的输出变换矩阵,即:
A=[[1, 0], [1, 1], [1, -1], [0, -1]]，
Y中的元素满足如下公式:
y 0=x 0+x 1+x 2+x 4+x 5+x 6+x 8+x 9+x 10
y 1=x 1-x 2-x 3+x 5-x 6-x 7+x 9-x 10-x 11
y 2=x 4+x 5+x 6-x 8-x 9-x 10-x 12-x 13-x 14
y 3=x 5-x 6-x 7-x 9+x 10+x 11-x 13+x 14+x 15
由上式可以看出,Y中的各个元素对应的等式中的加法运算的数量是不同的。相应地,Y中的各个元素对应的等式中的减法运算的数量是不同的。或者说,Y中的各个元素对应的等式中的元素的正负号的数量是不同的。例如,y 0对应的等式中包括9个加法运算,而y 1对应的等式中包括3个加法运算,y 0对应的等式和y 1对应的等式中的加法运算的数量不同。X中的各个元素的量级通常是相同的,X中的每个元素恒为非正数,由于Y中的各个元素对应的等式中的加法运算和减法运算的数量不同,导致输出数据矩阵Y中的各个元素的量级不一致,即输出数据矩阵中各个位置的元素存在不均衡的情况,会影响进而影响模型的性能。
Y中的各个元素对应的等式中的加法运算和减法运算的数量是由输出变换矩阵中的元素的正负号决定的,也就是说,输出变换矩阵中的元素的正负号会影响输出数据矩阵中的特征值的分布。
因此,本申请实施例还提供了一种输出变换矩阵,能够在提高神经网络模型的运算速度的同时,避免神经网络模型的性能下降。
可选地,输出变换矩阵中的各列元素中的正数的数量是相同的,各列元素中的负数的数量是相同的。
这样可以保证输出数据矩阵中各个位置的元素的均衡性。
如前所述,输出变换矩阵中的元素的值可以为0,1或-1中的任一项,在该情况下,输出变换矩阵中的各列元素中的+1的数量是相同的,各列元素中的-1的数量是相同的。
例如,c 0为0,c 1为-1,c 2为1。在该情况下,输出变换矩阵可以包括以下任一项:
Figure PCTCN2021091574-appb-000039
其中,A 0、A 1、A 2、A 3分别表示4种输出变换矩阵。应理解,该4种输出变换矩阵仅为示例,在c 0、c 1和c 2的取值为0、1和-1的其他组合的情况下,也就是4种输出变换矩阵的各行之间互相交换的情况下,可以得到其他形式的输出变换矩阵,此处不一一列举。
在采用输出变换矩阵A 0的情况下,由Y=A TXA可以得到,Y中的元素满足如下公式:
y 0=x 0-x 1-x 2-x 4+x 5+x 6-x 8+x 9+x 10
y 1=-x 1+x 2-x 3+x 5-x 6+x 7+x 9-x 10+x 11
y 2=x 4+x 5+x 6+x 8-x 9-x 10-x 12+x 13+x 14
y 3=x 5-x 6+x 7-x 9+x 10-x 11+x 13-x 14+x 15
由上式可以看出,Y中的各个元素对应的等式中的加法运算的数量是相同的。相应地,Y中的各个元素对应的等式中的减法运算的数量是相同的。或者说,Y中的各个元素对应的等式中的中间矩阵中的元素中的正负号的数量是相同的。例如,y 0对应的等式中包括5个正号,而y 1对应的等式中包括5个正号,y 0对应的等式和y 1对应的等式中的正号的数量相同,保证了输出数据矩阵中各个位置的元素的均衡性。应理解,此处仅以A 0作为示例,A 1、A 2、A 3均可以达到同样的效果。
例如,c 0为0,c 1为-1,c 2为1。在该情况下,权重变换矩阵可以包括以下任一项:
Figure PCTCN2021091574-appb-000040
G 0、G 1、G 2、G 3分别表示4种权重变换矩阵。应理解,该4种权重变换矩阵仅为示例,在c 0、c 1和c 2的取值为0、1和-1的其他组合的情况下,也就是该4种权重变换矩阵的各行之间互相交换的情况下,可以得到其他形式的权重变换矩阵,此处不一一列举。
权重变换矩阵G 0、G 1、G 2、G 3与输出变换矩阵A 0、A 1、A 2、A 3是一一对应的,具有对应关系的权重变换矩阵和输出变换矩阵配合使用。例如,G 0和A 0配合使用。在采用该4种权重变换矩阵和输出变换矩阵的情况下,输入变换矩阵可以采用现有的winograd中的输入变换矩阵。
在输入数据矩阵采用现有的winograd的情况下,输出变换矩阵和权重变换矩阵可以采用上述四种形式。在输入数据矩阵调整的情况下,输出变换矩阵和权重变换矩阵也可以采用其他形式。也就是说,只要变换矩阵能够满足winograd的通解形式,且能够保证输出数据矩阵中各个位置的元素的均衡性即可。
根据本申请实施例的方案,输出变换矩阵中的各列元素中的正数的数量是相同的,各列元素中的负数的数量是相同的,例如,输出变换矩阵中的各列元素中的﹢1的数量是相同的,各列元素中的-1的数量是相同的,能够均衡输出数据矩阵各个位置的量级,即减轻 特征值累加的不均衡性,有利于模型的训练。此外,有利于对输出数据矩阵执行后续的处理,例如,均一化(batchnorm)处理。
此外,输出变换矩阵与权重变换矩阵满足winograd的通解形式,能够保证在进行卷积运算时,输出结果与卷积运算的实际结果一致,即适用于模型中的卷积运算。
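Whether a candidate output transformation matrix has this per-column balance property can be checked mechanically, as in the sketch below; A_std is the standard Winograd output transform (not balanced), while A_bal is a hypothetical {0, -1, 1} matrix that satisfies the balance condition and is shown only to exercise the check, it is not one of the matrices A_0 to A_3 of this application:

```python
import numpy as np

def column_balanced(A):
    """Return True if every column of A has the same number of positive entries
    and the same number of negative entries (entries assumed to be in {0, -1, 1})."""
    pos = (A > 0).sum(axis=0)
    neg = (A < 0).sum(axis=0)
    return bool(np.all(pos == pos[0]) and np.all(neg == neg[0]))

A_std = np.array([[1, 0], [1, 1], [1, -1], [0, -1]])   # standard F(2x2, 3x3) output transform
A_bal = np.array([[1, 0], [-1, 1], [1, -1], [0, 1]])   # hypothetical balanced example
print(column_balanced(A_std))   # False: columns have 3/1 positives and 0/2 negatives
print(column_balanced(A_bal))   # True: each column has two positives and one negative
```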
如前所述,采用winograd算法对卷积操作进行加速,不会影响计算结果。而方法700中采用了取绝对值的操作,乘法分配律不再适用。也就是说,若采用winograd算法对加法器层进行加速,计算结果与原本的计算结果有一定差距。将winograd算法与加法器层采用上述方式合并会导致神经网络模型的性能下降。
本申请实施例还提供了一种神经网络模型的训练方法,能够提高神经网络模型的性能。
下面结合图8对本申请实施例中的神经网络模型的训练方法进行详细的描述。
图8示出了本申请实施例提供的神经网络模型的训练方法800。图8所示的方法可以由神经网络模型的训练装置来执行,该装置可以是云服务设备,也可以是终端设备,例如,电脑、服务器等运算能力足以用来执行神经网络模型训练的装置,也可以是由云服务设备和终端设备构成的系统。示例性地,方法800可以由图2中的训练设备120、图4中的神经网络处理器50或图5中的执行设备310执行。
图8的训练方法800在执行神经网络模型的前向传播过程中的运算方法与图7中的方法一致。将方法700中的“待处理数据”替换为“训练数据”即可。方法800中的前向传播过程的具体实现方式可以参照前述方法700,为了避免不必要的重复,下面在介绍方法800时适当省略重复的描述。
方法800包括步骤S810至步骤S850。下面对步骤S810至步骤S850进行详细介绍。
对神经网络模型中的至少一个特征提取层执行以下操作:
S810,通过winograd算法对训练数据的输入数据矩阵进行输入数据变换,得到变换后的输入数据矩阵。
训练数据的类型与神经网络模型的任务有关。例如,神经网络模型用于图像处理任务,则该训练数据可以为图像。具体地,图像处理任务包括图像分类、图像检测、图像分割或图像生成等。再如,神经网络模型用于文本处理任务,则该训练数据可以为文本。具体地,文本处理任务包括文本识别或文本翻译等。再如,神经网络模型用于语音处理任务,则该训练数据可以为语音数据。具体地,语音处理任务包括语音识别等。本申请实施例对训练数据的类型不做限定。
示例性地,训练数据可以是预先存储的。例如,该训练数据可以是图2所示的数据库130中维护的训练数据。
输入数据矩阵指的是输入该至少一个特征提取层的数据矩阵。
S820,利用变换后的权重矩阵对变换后的输入数据矩阵进行特征提取,得到中间矩阵。其中,变换后的权重矩阵是通过winograd算法对该至少一个特征提取层的权重矩阵进行权重变换得到的,中间矩阵中的每个元素是根据变换后的输入数据矩阵与变换后的权重矩阵中对应位置的元素之间的Lp距离确定的。
Lp距离也可以称为闵可夫斯基距离(Minkowski distance),是一组距离的定义。p为参数。
当p为1时,中间矩阵中的每个元素是根据变换后的输入数据矩阵与变换后的权重矩阵中对应位置的元素之间的L1距离确定的。
示例性地,中间矩阵X满足如下公式:
X=-|[GgG T]-[B TdB]|;
当p为2时,中间矩阵中的每个元素是根据变换后的输入数据矩阵与变换后的权重矩阵中对应位置的元素之间的L2距离确定的。
示例性地,中间矩阵X满足如下公式:
X=-([GgG T]-[B TdB]) 2
L2距离也可以称为欧式距离。
S830,通过winograd算法对中间矩阵进行输出数据变换,得到输出数据矩阵。
S840,根据输出数据矩阵确定损失函数的值。
具体地,输出数据矩阵为输出特征图的部分或全部,根据输出特征图可以确定训练数据的处理结果,根据训练数据的处理结果计算损失函数的值。
训练数据的处理结果与训练数据的类型以及神经网络模型的任务有关。
示例性地,训练数据为图像数据,图像处理可以包括图像超分处理、图像去噪处理、图像识别处理等,相应地,图像处理结果包括图像超分、图像去噪或图像类别等。本申请实施例对此不做限定。
示例性地,训练数据为语音数据,语音处理可以包括语音识别等,相应地,语音处理结果包括语音识别结果等。本申请实施例对此不作限定。
输出特征图还可以进一步进过其他处理,例如,经过激活函数的处理等,进而得到训练数据的处理结果。
S850,根据损失函数的值对神经网络模型进行训练。
在对神经网络模型进行训练的第m次迭代的过程中,p为2;在对神经网络模型进行训练的第n次迭代的过程中,p为1,其中,m和n为正整数,m小于n。
在第m次迭代的过程中,在前向计算的过程中采用L2距离计算中间矩阵,在反向传播的过程中基于L2距离计算损失函数对第一权重矩阵中的权重的偏导数。具体地,根据变换后的输入数据矩阵与变换后的权重矩阵中对应位置的元素之间的L2距离执行前向计算,根据变换后的输入数据矩阵与变换后的权重矩阵中对应位置的元素之间的L2距离执行反向传播。或者,可以理解为,基于L2距离执行前向计算和反向计算。
在第n次迭代的过程中,在前向计算的过程中采用L1距离计算中间矩阵,在反向传播的过程中基于L1距离计算损失函数对第一权重矩阵中的权重的偏导数。具体地,根据变换后的输入数据矩阵与变换后的权重矩阵中对应位置的元素之间的L1距离执行前向计算,根据变换后的输入数据矩阵与变换后的权重矩阵中对应位置的元素之间的L1距离执行反向传播。或者,可以理解为,基于L1距离执行前向计算和反向计算。
第一权重矩阵包括变换前的权重矩阵或变换后的权重矩阵。即第一权重矩阵可以为变换前的权重矩阵g,或者,变换后的权重矩阵
GgG^T。
在训练过程中,可以计算变换前的权重矩阵的梯度,进而调整变换前的权重矩阵的值。或者,也可以计算变换后的权重矩阵的梯度,进而调整变换后的权重矩阵的值。
也就是说,在神经网络模型训练的前期,基于L2距离执行前向计算和反向传播,在神经网络模型训练的后期,采用L1距离执行前向计算和反向传播,即采用L2距离逼近L1距离。
若仅基于L1距离进行训练,会导致winograd算法难以优化,可能导致网络训练无法收敛。本申请实施例在训练前期,利用L2距离辅助训练,L2距离对winograd算法更加友好,能够提高训练过程的收敛速度,进而提高模型的训练效果。在训练后期,基于L1距离进行训练,以便进一步提高采用L1距离的模型的训练效果,训练好的模型中采用L1距离,对硬件更加友好。
可选地,损失函数对第一权重矩阵中的权重的偏导数满足以下公式:
Figure PCTCN2021091574-appb-000042
Figure PCTCN2021091574-appb-000043
其中,p为计算的范数,p∈[1,2],w表示第一权重矩阵中的权重,w可以为变换前的权重矩阵g中的权重,也可以为变换后的权重矩阵中的权重,x表示变换前的输入数据矩阵中的数据或变换后的输入数据矩阵中的数据,变换前的输入数据矩阵中的数据即为输入特征图中的特征值,L表示损失函数,i表示神经网络模型的层数,i为整数。
∂L/∂x_i表示损失函数对第i层的特征值的偏导数，∂L/∂x_{i+1}表示损失函数对第i+1层的特征值的偏导数，∂L/∂w_i表示损失函数对第i层第一权重矩阵中的权重的偏导数，sign()表示符号函数。
当p为2时，上式即为基于L2距离得到的损失函数对第一权重矩阵的偏导数，也就是说基于L2距离执行反向传播。p为1时，上式即为基于L1距离得到的损失函数对第一权重矩阵的偏导数，也就是说基于L1距离执行反向传播。
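The exact formulas referenced above are given in the application's figures; as a hedged illustration only, the local element-wise derivative of the forward term -(|x - w|^p) with respect to w can be sketched as follows (the chaining with the loss gradient of the next layer's features is omitted):

```python
import numpy as np

def local_grad_wrt_w(x, w, p):
    """Element-wise derivative of -(|x - w|**p) with respect to w:
    d/dw[-|x - w|**p] = p * |x - w|**(p - 1) * sign(x - w).

    For p = 2 this reduces to 2*(x - w); for p = 1 it reduces to sign(x - w).
    """
    return p * np.abs(x - w) ** (p - 1) * np.sign(x - w)

x, w = np.array([0.5, -1.0, 2.0]), np.zeros(3)
print(local_grad_wrt_w(x, w, 2.0))   # [ 1. -2.  4.]
print(local_grad_wrt_w(x, w, 1.0))   # [ 1. -1.  1.]
```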
可选地,p的值是根据训练过程的迭代次数确定的。
进一步地,p的初始值为2,p的值随着迭代次数的增加而减少。
也就是说,在训练过程中将p从2减少为1。
p的值可以是在每一次迭代时减小一次,也可以是每隔几次迭代减小一次。
示例性地,在训练过程中,每次迭代时,p的值减少a,a的值可以根据需要设定。a可以是固定的,也就是说,每次迭代时,p的值的减少量是不变的。例如,a可以为0.05,即每次迭代时,p减少0.05。或者,a也可以是变化的。例如,随着迭代次数的增加,p的值的减少量逐渐增大。例如,第1次迭代时,p为2,第2次迭代时,a为0.01,则p为1.99,第3次迭代时,a为0.02,则p为1.98。
或者,在训练过程中,每迭代k次,p的值减少a。
或者,采用l2距离进行训练直至收敛,减少p的值,重新训练。
需要说明的是,以上仅为示例,还可以采用其他方式将p从2减少为1,本申请实施例对此不作限定。
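A minimal sketch of one possible schedule for reducing p from 2 to 1 over the training iterations; the linear decay and its hyper-parameters are assumptions made for illustration, since the embodiments only require p = 2 at some earlier iteration m and p = 1 at some later iteration n:

```python
def p_schedule(iteration, total_iterations, p_start=2.0, p_end=1.0):
    """Illustrative schedule: decay the norm p linearly from 2 (early iterations,
    L2-assisted training) to 1 (late iterations, matching the L1-based inference operator)."""
    frac = min(iteration / max(total_iterations - 1, 1), 1.0)
    return p_start + (p_end - p_start) * frac

for it in range(0, 101, 25):
    print(it, round(p_schedule(it, 101), 2))   # 2.0, 1.75, 1.5, 1.25, 1.0
```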
训练好的神经网络模型可以用于执行目标任务。示例性地,目标任务可以为图像处理任务,例如,目标检测,图像分割,实例分割,图像去噪,图像超分辨率等。或者,目标任务可以为语音处理任务,例如,语音识别等。或者,目标任务可以为文本处理任务,例 如,文本识别或文本翻译等。
表1示出了本申请实施例的方案以及现有方案在CIFAR-10的分类数据上的实验结果的对比。
表1
方法 准确率
AdderNet 91.84
本申请的运算方法(采用现有的变换矩阵) 86.13
本申请的运算方法(采用调整后的变换矩阵) 88.60
采用本申请的训练方法得到的模型 91.47
如表1所述,在采用现有的变换矩阵的情况下,本申请的运算方法能够提高运算效率,但会导致模型的准确率下降。相对于采用现有的变换矩阵,采用调整后的变换矩阵能够提高模型的准确率。而且,相对于采用现有的变换矩阵,在采用调整后的变换矩阵的情况下,采用本申请的训练方法对模型进行训练能够进一步提高模型的准确率,接近AdderNet的准确率。
表2示出了本申请实施例的方案以及现有方案在底层视觉任务上的实验结果的对比。
表2
方法 PSNR
卷积神经网络 57.31
AdderNet 57.22
本申请的运算方法(采用调整后的变换矩阵) 57.27
如表2所示,本申请实施例的运算方法能够达到比AdderNet更高的指标。此外,本申请实施例的运算方法能够达到与AdderNet接近的视觉效果。
下面结合图9至图12对本申请实施例的装置进行说明。应理解,下面描述的装置能够执行前述本申请实施例的方法,为了避免不必要的重复,下面在介绍本申请实施例的装置时适当省略重复的描述。
图9是本申请实施例的神经网络模型的训练装置的示意性框图。图9所示的神经网络模型的训练装置3000包括获取单元3010和处理单元3020。
获取单元3010和处理单元3020可以用于执行本申请实施例的神经网络模型的训练方法,具体地,可以用于执行方法800。
获取单元3010用于获取训练数据。
处理单元3020用于对神经网络模型的至少一个特征提取层执行以下操作:
通过winograd算法的输入变换矩阵对训练数据的输入数据矩阵进行输入数据变换以得到变换后的输入数据矩阵;利用变换后的权重矩阵对变换后的输入数据矩阵进行特征提取以得到中间矩阵,其中,变换后的权重矩阵是通过winograd算法的权重变换矩阵对至少一个特征提取层的权重矩阵进行权重变换得到的,中间矩阵中的每个元素是根据变换后的输入数据矩阵与变换后的权重矩阵中对应位置的元素之间的Lp距离确定的;通过winograd算法的输出变换矩阵对中间矩阵进行输出数据变换以得到输出数据矩阵;根据输出数据矩阵确定损失函数的值;根据损失函数的值对神经网络模型进行训练;在对神经网络模型进行训练的第m次迭代的过程中,p为2,在对神经网络模型进行训练的第n次迭代的过程中,p为1,m和n为正整数,m小于n。
可选地,作为一个实施例,在对神经网络模型进行训练的过程中,p的初值为2,p的值随着迭代次数的增加而减少。
可选地,作为一个实施例,根据损失函数的值对神经网络模型进行训练,包括:根据损失函数对第一权重矩阵中的权重的偏导数调整第一权重矩阵中的权重,第一权重矩阵包括变换前的权重矩阵或变换后的权重矩阵。
可选地,作为一个实施例,损失函数对第一权重矩阵中的权重的偏导数满足以下公式:
Figure PCTCN2021091574-appb-000047
Figure PCTCN2021091574-appb-000048
其中,p为计算的范数,p∈[1,2],w表示权重,x表示变换前的输入数据矩阵中的数据或变换后的输入数据矩阵中的数据,L表示损失函数,i表示神经网络模型的层数,sign()表示符号函数。
图10是本申请实施例提供的神经网络模型的运算装置4000的示意性框图。图10所示的装置4000包括获取单元4010和处理单元4020。
获取单元4010和处理单元4020可以用于执行本申请实施例的神经网络模型的运算方法,例如,可以用于执行方法700。
获取单元4010用于获取待处理的数据,待处理数据包括图像数据、语音数据或者文本数据。
处理单元4020用于对神经网络模型的至少一个特征提取层执行以下操作:通过winograd算法的输入变换矩阵对待处理数据的输入数据矩阵进行输入数据变换以得到变换后的输入数据矩阵;利用变换后的权重矩阵对变换后的输入数据矩阵进行特征提取以得到中间矩阵,其中,变换后的权重矩阵是通过winograd算法的权重变换矩阵对至少一个特征提取层的权重矩阵进行权重变换得到的,中间矩阵中的每个元素是根据变换后的输入数据矩阵与变换后的权重矩阵中对应位置的元素之间的L1距离确定的;通过winograd算法的输出变换矩阵对中间矩阵进行输出数据变换以得到输出数据矩阵。
可选地,作为一个实施例,输出数据矩阵满足如下公式:
Y=A T[-|[GgG T]-[B TdB]|]A;
其中,Y表示输出数据矩阵,A表示输出变换矩阵,A T表示A的转置矩阵,G表示权重变换矩阵,G T表示G的转置矩阵,g表示变换前的权重矩阵,B表示输入变换矩阵,B T表示B的转置矩阵,d表示变换前的输入数据矩阵。
可选地,作为一个实施例,输出变换矩阵中的元素的值为0,-1或1中的任一项。
可选地,作为一个实施例,输出变换矩阵为:
Figure PCTCN2021091574-appb-000049
其中,c 0、c 1和c 2分别是0,-1,1中的任一项。
可选地,作为一个实施例,输出变换矩阵中的至少一行的元素为第一矩阵中与至少一行对应的位置的元素的相反数,输出变换矩阵中的其他行的元素和第一矩阵中与其他行对 应的位置的元素相同,第一矩阵为:
Figure PCTCN2021091574-appb-000050
其中,A'表示第一矩阵,c 0、c 1和c 2分别是0,-1,1中的任一项。
可选地,作为一个实施例,权重变换矩阵为:
Figure PCTCN2021091574-appb-000051
其中,c 0、c 1和c 2分别是0,-1,1中的任一项。
可选地,作为一个实施例,输出变换矩阵中的各列元素中的正数的数量是相同的,各列元素中负数的数量是相同的。
需要说明的是,上述训练装置3000以及装置4000以功能单元的形式体现。这里的术语“单元”可以通过软件和/或硬件形式实现,对此不作具体限定。
例如,“单元”可以是实现上述功能的软件程序、硬件电路或二者结合。所述硬件电路可能包括应用特有集成电路(application specific integrated circuit,ASIC)、电子电路、用于执行一个或多个软件或固件程序的处理器(例如共享处理器、专有处理器或组处理器等)和存储器、合并逻辑电路和/或其它支持所描述的功能的合适组件。
因此,在本申请的实施例中描述的各示例的单元,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
图11是本申请实施例提供的神经网络模型的训练装置的硬件结构示意图。图11所示的神经网络模型的训练装置5000(该装置5000具体可以是一种计算机设备)包括存储器5001、处理器5002、通信接口5003以及总线5004。其中,存储器5001、处理器5002、通信接口5003通过总线5004实现彼此之间的通信连接。
存储器5001可以是只读存储器(read only memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(random access memory,RAM)。存储器5001可以存储程序,当存储器5001中存储的程序被处理器5002执行时,处理器5002用于执行本申请实施例的神经网络模型的训练方法的各个步骤。具体地,处理器5002可以执行上文中图8所示的方法800。
处理器5002可以采用通用的中央处理器(central processing unit,CPU),微处理器,应用专用集成电路(application specific integrated circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请方法 实施例的神经网络模型的训练方法。
处理器5002还可以是一种集成电路芯片,具有信号的处理能力,例如,可以是图4所示的芯片。在实现过程中,本申请的神经网络模型的训练方法的各个步骤可以通过处理器5002中的硬件的集成逻辑电路或者软件形式的指令完成。
上述处理器5002还可以是通用处理器、数字信号处理器(digital signal processing,DSP)、专用集成电路(ASIC)、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器5001,处理器5002读取存储器5001中的信息,结合其硬件完成图9所示的训练装置中包括的单元所需执行的功能,或者,执行本申请方法实施例的图8所示的神经网络模型的训练方法。
通信接口5003使用例如但不限于收发器一类的收发装置,来实现装置5000与其他设备或通信网络之间的通信。例如,可以通过通信接口5003获取训练数据。
总线5004可包括在装置5000各个部件(例如,存储器5001、处理器5002、通信接口5003)之间传送信息的通路。
图12是本申请实施例的神经网络模型的运算装置的硬件结构示意图。图12所示的数据处理装置6000包括存储器6001、处理器6002、通信接口6003以及总线6004。其中,存储器6001、处理器6002、通信接口6003通过总线6004实现彼此之间的通信连接。
存储器6001可以是ROM,静态存储设备和RAM。存储器6001可以存储程序,当存储器6001中存储的程序被处理器6002执行时,处理器6002和通信接口6003用于执行本申请实施例的神经网络模型的运算方法的各个步骤。具体地,处理器6002可以执行上文中图7所示的方法中的步骤S710至S730。
处理器6002可以采用通用的,CPU,微处理器,ASIC,GPU或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的神经网络模型的运算装置中的单元所需执行的功能,或者执行本申请方法实施例的神经网络模型的运算方法。
处理器6002还可以是一种集成电路芯片,具有信号的处理能力,例如,可以是图4所示的芯片。在实现过程中,本申请实施例的神经网络模型的运算方法的各个步骤可以通过处理器6002中的硬件的集成逻辑电路或者软件形式的指令完成。
上述处理器6002还可以是通用处理器、DSP、ASIC、FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器6001,处理器6002读取存储器6001中的信息,结合其硬件完成本申请实施例的神经网络模型的运算装置中包括的单元所需执行的 功能,或者执行本申请方法实施例的神经网络模型的运算方法。
通信接口6003使用例如但不限于收发器一类的收发装置,来实现装置6000与其他设备或通信网络之间的通信。例如,可以通过通信接口6003获取待处理的数据。
总线6004可包括在装置6000各个部件(例如,存储器6001、处理器6002、通信接口6003)之间传送信息的通路。
应注意,尽管上述装置5000和装置6000仅仅示出了存储器、处理器、通信接口,但是在具体实现过程中,本领域的技术人员应当理解,装置5000和装置6000还可以包括实现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当理解,装置5000和装置6000还可包括实现其他附加功能的硬件器件。此外,本领域的技术人员应当理解,装置5000和装置6000也可仅仅包括实现本申请实施例所必须的器件,而不必包括图11和图12中所示的全部器件。
应理解,本申请实施例中的处理器可以为中央处理单元(central processing unit,CPU),该处理器还可以是其他通用处理器、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
还应理解,本申请实施例中的存储器可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的随机存取存储器(random access memory,RAM)可用,例如静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM,DR RAM)。
上述实施例,可以全部或部分地通过软件、硬件、固件或其他任意组合来实现。当使用软件实现时,上述实施例可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令或计算机程序。在计算机上加载或执行所述计算机指令或计算机程序时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以为通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集合的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介 质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质。半导体介质可以是固态硬盘。
应理解,本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况,其中A,B可以是单数或者复数。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系,但也可能表示的是一种“和/或”的关系,具体可参考前后文进行理解。
本申请中,“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“以下至少一项(个)”或其类似表达,是指的这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b,或c中的至少一项(个),可以表示:a,b,c,a-b,a-c,b-c,或a-b-c,其中a,b,c可以是单个,也可以是多个。
应理解,在本申请的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的 介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (27)

  1. 一种神经网络模型的运算方法,其特征在于,在神经网络模型的至少一个特征提取层中执行以下操作:
    通过winograd算法的输入变换矩阵对待处理数据的输入数据矩阵进行输入数据变换以得到变换后的输入数据矩阵,所述待处理数据包括图像数据、语音数据或者文本数据;
    利用变换后的权重矩阵对所述变换后的输入数据矩阵进行特征提取以得到中间矩阵,其中,所述变换后的权重矩阵是通过所述winograd算法的权重变换矩阵对所述至少一个特征提取层的权重矩阵进行权重变换得到的,所述中间矩阵中的每个元素是根据所述变换后的输入数据矩阵与所述变换后的权重矩阵中对应位置的元素之间的L1距离确定的;
    通过所述winograd算法的输出变换矩阵对所述中间矩阵进行输出数据变换以得到输出数据矩阵。
  2. 根据权利要求1所述的运算方法,其特征在于,所述输出数据矩阵满足如下公式:
    Y=A T[-|[GgG T]-[B TdB]|]A;
    其中,Y表示所述输出数据矩阵,A表示所述输出变换矩阵,A T表示A的转置矩阵,G表示所述权重变换矩阵,G T表示G的转置矩阵,g表示所述变换前的权重矩阵,B表示所述输入变换矩阵,B T表示B的转置矩阵,d表示所述变换前的输入数据矩阵。
  3. 根据权利要求1或2所述的运算方法,其特征在于,所述输出变换矩阵中的元素的值为0,-1或1中的任一项。
  4. 根据权利要求3所述的运算方法,其特征在于,所述输出变换矩阵为:
    Figure PCTCN2021091574-appb-100001
    其中,c 0、c 1和c 2分别是0,-1,1中的任一项。
  5. 根据权利要求3所述的运算方法,其特征在于,所述输出变换矩阵中的至少一行的元素为第一矩阵中与所述至少一行对应的位置的元素的相反数,所述输出变换矩阵中的其他行的元素和所述第一矩阵中与所述其他行对应的位置的元素相同,所述第一矩阵为:
    Figure PCTCN2021091574-appb-100002
    其中,A'表示所述第一矩阵,c 0、c 1和c 2分别是0,-1,1中的任一项。
  6. 根据权利要求2至5中任一项所述的运算方法,其特征在于,所述权重变换矩阵为:
    Figure PCTCN2021091574-appb-100003
    其中,c 0、c 1和c 2分别是0,-1,1中的任一项。
  7. 根据权利要求1至6中任一项所述的运算方法,其特征在于,所述输出变换矩阵中的各列元素中的正数的数量是相同的,所述各列元素中负数的数量是相同的。
  8. 一种神经网络模型的训练方法,其特征在于,包括:
    通过winograd算法的输入变换矩阵对训练数据的输入数据矩阵进行输入数据变换以得到变换后的输入数据矩阵;
    利用变换后的权重矩阵对所述变换后的输入数据矩阵进行特征提取以得到中间矩阵,其中,所述变换后的权重矩阵是通过所述winograd算法的权重变换矩阵对所述至少一个特征提取层的权重矩阵进行权重变换得到的,所述中间矩阵中的每个元素是根据所述变换后的输入数据矩阵与所述变换后的权重矩阵中对应位置的元素之间的Lp距离确定的;
    通过所述winograd算法的输出变换矩阵对所述中间矩阵进行输出数据变换以得到输出数据矩阵;
    根据所述输出数据矩阵确定损失函数的值;
    根据所述损失函数的值对所述神经网络模型进行训练;
    在对所述神经网络模型进行训练的第m次迭代的过程中,p为2,在对所述神经网络模型进行训练的第n次迭代的过程中,p为1,m和n为正整数,m小于n。
  9. 根据权利要求8所述的训练方法,其特征在于,在对所述神经网络模型进行训练的过程中,p的初值为2,p的值随着迭代次数的增加而减少。
  10. 根据权利要求8或9所述的训练方法,其特征在于,所述根据损失函数的值对所述神经网络模型进行训练,包括:
    根据所述损失函数对第一权重矩阵中的权重的偏导数调整所述第一权重矩阵中的权重,所述第一权重矩阵包括所述变换前的权重矩阵或所述变换后的权重矩阵。
  11. 根据权利要求10所述的训练方法,其特征在于,所述损失函数对第一权重矩阵中的权重的偏导数满足以下公式:
    Figure PCTCN2021091574-appb-100004
    Figure PCTCN2021091574-appb-100005
    其中,p∈[1,2],w表示所述权重,x表示所述变换前的输入数据矩阵中的数据或所述变换后的输入数据矩阵中的数据,L表示所述损失函数,i表示所述神经网络模型的层数,sign()表示符号函数。
  12. 一种神经网络模型的运算装置,其特征在于,包括:
    获取单元,用于获取待处理数据,所述待处理数据包括图像数据、语音数据或者文本数据;
    处理单元,用于对神经网络模型的至少一个特征提取层执行以下操作:
    通过winograd算法的输入数据矩阵对待处理数据的输入数据矩阵进行输入数据变换以得到变换后的输入数据矩阵;
    利用变换后的权重矩阵对所述变换后的输入数据矩阵进行特征提取以得到中间矩阵,其中,所述变换后的权重矩阵是通过所述winograd算法的权重变换矩阵对所述至少一个特征提取层的权重矩阵进行权重变换得到的,所述中间矩阵中的每个元素是根据所述变换后的输入数据矩阵与所述变换后的权重矩阵中对应位置的元素之间的L1距离确定的;
    通过所述winograd算法的输出变换矩阵对所述中间矩阵进行输出数据变换以得到输出数据矩阵。
  13. 根据权利要求12所述的运算装置,其特征在于,所述输出数据矩阵满足如下公式:
    Y=AT[-|[GgG T]-[B TdB]|]A;
    其中,Y表示所述输出数据矩阵,A表示所述输出变换矩阵,A T表示A的转置矩阵,G表示所述权重变换矩阵,G T表示G的转置矩阵,g表示所述变换前的权重矩阵,B表示所述输入变换矩阵,B T表示B的转置矩阵,d表示所述变换前的输入数据矩阵。
  14. 根据权利要求12或13所述的运算装置,其特征在于,所述输出变换矩阵中的元素的值为0,-1或1中的任一项。
  15. 根据权利要求14所述的运算装置,其特征在于,所述输出变换矩阵为:
    Figure PCTCN2021091574-appb-100006
    其中,c 0、c 1和c 2分别是0,-1,1中的任一项。
  16. 根据权利要求15所述的运算装置,其特征在于,所述输出变换矩阵中的至少一行的元素为第一矩阵中与所述至少一行对应的位置的元素的相反数,所述输出变换矩阵中的其他行的元素和所述第一矩阵中与所述其他行对应的位置的元素相同,所述第一矩阵为:
    Figure PCTCN2021091574-appb-100007
    其中,A'表示所述第一矩阵,c 0、c 1和c 2分别是0,-1,1中的任一项。
  17. 根据权利要求13至16中任一项所述的运算装置,其特征在于,所述权重变换矩阵为:
    Figure PCTCN2021091574-appb-100008
    其中,c 0、c 1和c 2分别是0,-1,1中的任一项。
  18. 根据权利要求12至17中任一项所述的运算装置,其特征在于,所述输出变换矩阵中的各列元素中的正数的数量是相同的,所述各列元素中负数的数量是相同的。
  19. 一种神经网络模型的训练装置,其特征在于,包括:
    获取单元,用于获取训练数据;
    处理单元,用于:
    通过winograd算法的输入变换矩阵对训练数据的输入数据矩阵进行输入数据变换以得到变换后的输入数据矩阵;
    利用变换后的权重矩阵对所述变换后的输入数据矩阵进行特征提取以得到中间矩阵,其中,所述变换后的权重矩阵是通过所述winograd算法的权重变换矩阵对所述至少一个特征提取层的权重矩阵进行权重变换得到的,所述中间矩阵中的每个元素是根据所述变换后的输入数据矩阵与所述变换后的权重矩阵中对应位置的元素之间的Lp距离确定的;
    通过所述winograd算法的输出变换矩阵对所述中间矩阵进行输出数据变换以得到输出数据矩阵;
    根据所述输出数据矩阵确定损失函数的值;
    根据所述损失函数的值对所述神经网络模型进行训练;
    在对所述神经网络模型进行训练的第m次迭代的过程中,p为2,在对所述神经网络模型进行训练的第n次迭代的过程中,p为1,m和n为正整数,m小于n。
  20. 根据权利要求19所述的训练装置,其特征在于,在对所述神经网络模型进行训练的过程中,p的初值为2,p的值随着迭代次数的增加而减少。
  21. 根据权利要求19或20所述的训练装置,其特征在于,所述根据损失函数的值对所述神经网络模型进行训练,包括:根据所述损失函数对第一权重矩阵中的权重的偏导数调整所述第一权重矩阵中的权重,所述第一权重矩阵包括所述变换前的权重矩阵或所述变换后的权重矩阵。
  22. 根据权利要求21所述的训练装置,其特征在于,所述损失函数对第一权重矩阵中的权重的偏导数满足以下公式:
    Figure PCTCN2021091574-appb-100009
    Figure PCTCN2021091574-appb-100010
    其中,p∈[1,2],w表示所述权重,x表示所述变换前的输入数据矩阵中的数据或所述变换后的输入数据矩阵中的数据,L表示所述损失函数,i表示所述神经网络模型的层 数,sign()表示符号函数。
  23. 一种神经网络模型的运算装置,其特征在于,包括处理器和存储器,所述存储器用于存储程序指令,所述处理器用于调用所述程序指令以执行如权利要求1至7中任一项所述的方法。
  24. 一种神经网络模型的训练装置,其特征在于,包括处理器和存储器,所述存储器用于存储程序指令,所述处理器用于调用所述程序指令以执行如权利要求8至11中任一项所述的方法。
  25. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质用于存储设备执行的程序代码,所述程序代码包括用于执行如权利要求1至7或权利要求8至11中任一项所述的方法。
  26. 一种包含指令的计算机程序产品,其特征在于,当所述计算机程序产品在计算机上运行时,使得所述计算机执行如权利要求1至7或权利要求8至11中任一项所述的方法。
  27. 一种芯片,其特征在于,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令以执行如权利要求1至7或权利要求8至11中任一项所述的方法。
PCT/CN2021/091574 2021-04-30 2021-04-30 神经网络模型的运算方法、训练方法及装置 WO2022227024A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/091574 WO2022227024A1 (zh) 2021-04-30 2021-04-30 神经网络模型的运算方法、训练方法及装置
CN202180094093.7A CN116888605A (zh) 2021-04-30 2021-04-30 神经网络模型的运算方法、训练方法及装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/091574 WO2022227024A1 (zh) 2021-04-30 2021-04-30 神经网络模型的运算方法、训练方法及装置

Publications (1)

Publication Number Publication Date
WO2022227024A1 true WO2022227024A1 (zh) 2022-11-03

Family

ID=83847731

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/091574 WO2022227024A1 (zh) 2021-04-30 2021-04-30 神经网络模型的运算方法、训练方法及装置

Country Status (2)

Country Link
CN (1) CN116888605A (zh)
WO (1) WO2022227024A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117788843A (zh) * 2024-02-27 2024-03-29 青岛超瑞纳米新材料科技有限公司 一种基于神经网络算法的碳纳米管图像处理方法

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018107383A1 (zh) * 2016-12-14 2018-06-21 上海寒武纪信息科技有限公司 神经网络的卷积运算方法、装置及计算机可读存储介质
CN109190756A (zh) * 2018-09-10 2019-01-11 中国科学院计算技术研究所 基于Winograd卷积的运算装置及包含该装置的神经网络处理器
CN111382854A (zh) * 2018-12-28 2020-07-07 广州市百果园信息技术有限公司 一种卷积神经网络处理方法、装置、设备及存储介质

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018107383A1 (zh) * 2016-12-14 2018-06-21 上海寒武纪信息科技有限公司 神经网络的卷积运算方法、装置及计算机可读存储介质
CN109190756A (zh) * 2018-09-10 2019-01-11 中国科学院计算技术研究所 基于Winograd卷积的运算装置及包含该装置的神经网络处理器
CN111382854A (zh) * 2018-12-28 2020-07-07 广州市百果园信息技术有限公司 一种卷积神经网络处理方法、装置、设备及存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117788843A (zh) * 2024-02-27 2024-03-29 青岛超瑞纳米新材料科技有限公司 一种基于神经网络算法的碳纳米管图像处理方法
CN117788843B (zh) * 2024-02-27 2024-04-30 青岛超瑞纳米新材料科技有限公司 一种基于神经网络算法的碳纳米管图像处理方法

Also Published As

Publication number Publication date
CN116888605A (zh) 2023-10-13

Similar Documents

Publication Publication Date Title
WO2021120719A1 (zh) 神经网络模型更新方法、图像处理方法及装置
WO2020216227A9 (zh) 图像分类方法、数据处理方法和装置
WO2022083536A1 (zh) 一种神经网络构建方法以及装置
WO2022052601A1 (zh) 神经网络模型的训练方法、图像处理方法及装置
WO2022042713A1 (zh) 一种用于计算设备的深度学习训练方法和装置
WO2022001805A1 (zh) 一种神经网络蒸馏方法及装置
WO2021057056A1 (zh) 神经网络架构搜索方法、图像处理方法、装置和存储介质
WO2021218517A1 (zh) 获取神经网络模型的方法、图像处理方法及装置
WO2021164750A1 (zh) 一种卷积层量化方法及其装置
WO2021018245A1 (zh) 图像分类方法及装置
WO2022001372A1 (zh) 训练神经网络的方法、图像处理方法及装置
WO2021008206A1 (zh) 神经网络结构的搜索方法、图像处理方法和装置
CN110222718B (zh) 图像处理的方法及装置
WO2022111617A1 (zh) 一种模型训练方法及装置
WO2022100165A1 (zh) 神经网络模型的训练方法、图像处理方法及装置
WO2022012668A1 (zh) 一种训练集处理方法和装置
WO2021136058A1 (zh) 一种处理视频的方法及装置
WO2022156475A1 (zh) 神经网络模型的训练方法、数据处理方法及装置
WO2020062299A1 (zh) 一种神经网络处理器、数据处理方法及相关设备
CN113627163A (zh) 一种注意力模型、特征提取方法及相关装置
WO2022179606A1 (zh) 一种图像处理方法及相关装置
CN111652349A (zh) 一种神经网络的处理方法及相关设备
WO2022227024A1 (zh) 神经网络模型的运算方法、训练方法及装置
CN114298289A (zh) 一种数据处理的方法、数据处理设备及存储介质
WO2023071658A1 (zh) Ai模型的处理方法、运算方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21938498

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180094093.7

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21938498

Country of ref document: EP

Kind code of ref document: A1