CN116888605A - Operation method, training method and device of neural network model - Google Patents

Operation method, training method and device of neural network model

Info

Publication number
CN116888605A
CN116888605A (application number CN202180094093.7A)
Authority
CN
China
Prior art keywords
matrix
transformation
weight
data
neural network
Prior art date
Legal status
Pending
Application number
CN202180094093.7A
Other languages
Chinese (zh)
Inventor
李文硕
王云鹤
伍玮翔
辛晨
王璇
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN116888605A publication Critical patent/CN116888605A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an operation method, a training method, and an apparatus for a neural network model in the field of artificial intelligence. In the operation method, a weight matrix after winograd transformation is used to perform feature extraction on an input data matrix after winograd transformation to obtain an intermediate matrix, where each element in the intermediate matrix is determined according to the L1 distance between the elements at corresponding positions in the transformed input data matrix and the transformed weight matrix, and output data transformation is then performed on the intermediate matrix through the winograd algorithm to obtain an output data matrix. In this scheme, the element-wise (dot) multiplication in winograd is replaced by addition-type operations such as computing the L1 distance, which reduces the amount of computation in the feature extraction process, increases the running speed of the model, and reduces the operation overhead.

Description

Operation method, training method and device of neural network model
Technical Field
The present application relates to the field of artificial intelligence, and more particularly, to a method of operation, a training method, and an apparatus for neural network models.
Background
Artificial intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-machine interaction, recommendation and search, basic AI theory, and the like.
A large number of matrix operations are typically involved in a neural network model. Taking the convolution operation as an example, it involves multiplication, has high computational complexity, and introduces a long delay into the operation process. Matrix operations generally occupy most of the running time of a neural network model, so the delay of matrix operations becomes a major factor restricting computational efficiency, affects the overall processing efficiency of the neural network model, and causes large power-consumption losses.
Therefore, how to reduce the operation overhead of a neural network model is an urgent problem to be solved.
Disclosure of Invention
The application provides an operation method, a training method and a training device of a neural network model, which can reduce the operation cost of the neural network model and improve the processing efficiency.
In a first aspect, a method for operating a neural network model is provided, the method including performing the following operations in at least one feature extraction layer of the neural network model: performing input data transformation on an input data matrix of data to be processed through an input transformation matrix of a winograd algorithm to obtain a transformed input data matrix, where the data to be processed includes image data, voice data, or text data; performing feature extraction on the transformed input data matrix by using a transformed weight matrix to obtain an intermediate matrix, where the transformed weight matrix is obtained by performing weight transformation on the weight matrix of the at least one feature extraction layer through a weight transformation matrix of the winograd algorithm, and each element in the intermediate matrix is determined according to the L1 distance between the elements at corresponding positions in the transformed input data matrix and the transformed weight matrix; and performing output data transformation on the intermediate matrix through an output transformation matrix of the winograd algorithm to obtain an output data matrix.
According to the scheme of the embodiment of the application, the dot multiplication operation in winograd is replaced by addition-type operations such as computing the L1 distance, which reduces the amount of computation in the feature extraction process, increases the running speed of the model, and reduces the operation overhead.
The data to be processed includes image data, voice data, text data, or the like.
The type of data to be processed is related to the task of the neural network model. For example, if the neural network model is used for an image processing task, the data to be processed may be an image. Specifically, the image processing tasks include image classification, image detection, image segmentation, image recognition, image generation, or the like. For another example, if the neural network model is used for a text processing task, the data to be processed may be text. In particular, text processing tasks include text recognition or text translation, and the like. For another example, if the neural network model is used for a voice processing task, the data to be processed may be voice data. Specifically, the speech processing tasks include speech recognition and the like. The embodiment of the application does not limit the type of the data to be processed.
The input data matrix of the data to be processed refers to the data matrix input to the at least one feature extraction layer.
The input data matrix may be part or all of the input feature map. The input feature map may be the data itself input into the neural network model, for example, an image to be processed. The input feature map may also be a feature map obtained by performing one or more feature extractions through a part of feature extraction layers in the neural network model.
Specifically, the input data matrix is subjected to Winograd transformation through the input transformation matrix to obtain a transformed input data matrix
The weight matrix is a parameter of one or more feature extraction kernels in the at least one feature extraction layer in the neural network model.
Specifically, the weight matrix is subjected to Winograd transformation through the weight transformation matrix, so that a transformed weight matrix can be obtained.
Specifically, the intermediate matrix is subjected to Winograd transformation through the output transformation matrix, and an output data matrix is obtained.
The output data matrix may be part or all of the output profile.
With reference to the first aspect, in certain implementations of the first aspect, the transformed weight matrix is obtained by offline transformation.
The transformed weight matrix is obtained by transformation before the operation of the neural network model is performed. For example, the weight matrix may be transformed before the neural network model is deployed.
According to the scheme of the embodiment of the application, the weight matrix is unchanged in the reasoning process of the neural network model, and the calculation speed can be further improved and the calculation cost can be reduced by obtaining the transformed weight matrix through offline calculation.
With reference to the first aspect, in certain implementations of the first aspect, each element in the intermediate matrix is the opposite number (negative) of the L1 distance between the elements at corresponding positions in the transformed input data matrix and the transformed weight matrix.
With reference to the first aspect, in certain implementations of the first aspect, the output data matrix satisfies the following formula:
Y = A^T [ -| [G g G^T] - [B^T d B] | ] A;
where Y represents the output data matrix, A represents the output transformation matrix, A^T denotes the transpose of A, G denotes the weight transformation matrix, G^T denotes the transpose of G, g denotes the weight matrix before transformation, B denotes the input transformation matrix, B^T denotes the transpose of B, and d denotes the input data matrix before transformation.
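For illustration only (not code from the application), the following Python/NumPy sketch shows how the above formula could be evaluated for a single tile, assuming the F(2×2, 3×3) case with a 4×4 input tile d, a 3×3 weight matrix g, and transformation matrices B (4×4), G (4×3), and A (4×2); it is a minimal sketch under these assumptions, not a definitive implementation.

import numpy as np

def adder_winograd_tile(d, g, B, G, A):
    """Minimal sketch of Y = A^T [ -| GgG^T - B^T d B | ] A for one tile.

    d: 4x4 input data matrix (tile), g: 3x3 weight matrix,
    B (4x4), G (4x3), A (4x2): winograd transformation matrices.
    """
    d_t = B.T @ d @ B          # transformed input data matrix
    g_t = G @ g @ G.T          # transformed weight matrix (can be precomputed offline)
    u = -np.abs(g_t - d_t)     # intermediate matrix: negative L1 distance per element
    return A.T @ u @ A         # output data transformation -> 2x2 output data matrix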
With reference to the first aspect, in certain implementations of the first aspect, the values of the elements in the output transform matrix are any one of 0, -1 or 1.
According to the scheme of the embodiment of the application, the values of the elements in the output transformation matrix are 0, 1, or -1, so that multiplication operations can be reduced, the number of computations can be further reduced, and the amount of computation of the model can be reduced.
With reference to the first aspect, in certain implementations of the first aspect, the output transform matrix is:
where c_0, c_1, and c_2 are each one of 0, -1, or 1.
With reference to the first aspect, in some implementations of the first aspect, the elements of at least one row in the output transformation matrix are the opposite numbers of the elements at the corresponding positions of that row in a first matrix, the elements of the other rows in the output transformation matrix are the same as the elements at the corresponding positions of those rows in the first matrix, and the first matrix is:
where c_0, c_1, and c_2 are each one of 0, -1, or 1.
That is, the elements of any row of A' may be replaced by their opposite numbers without affecting the final result of the winograd transformation.
For example, c_0 is 0, c_1 is -1, and c_2 is 1. For another example, c_0 is -1, c_1 is 1, and c_2 is 0.
With reference to the first aspect, in certain implementations of the first aspect, the weight transformation matrix is:
according to the scheme of the embodiment of the application, the output transformation matrix and the weight transformation matrix meet the general solution form of the winograd algorithm, and the convolution calculation result obtained by the winograd algorithm can be ensured to be the same as the conventional convolution calculation result. Thus, in the case of using the existing input transform matrix, the output transform matrix and the weight transform matrix described above can be applied to convolution calculation.
With reference to the first aspect, in certain implementations of the first aspect, the number of positive numbers in each column element in the output transform matrix is the same, and the number of negative numbers in each column element is the same.
According to the scheme of the embodiment of the application, the number of positive numbers in each column of the output transformation matrix is the same and the number of negative numbers in each column is the same; for example, the number of +1 elements in each column of the output transformation matrix is the same and the number of -1 elements in each column is the same. This balances the magnitudes at the positions of the output data matrix, that is, it alleviates the imbalance of accumulated feature values, which is beneficial to training the model. In addition, it is beneficial to subsequent processing of the output data matrix, such as batch normalization (batch norm).
In a second aspect, a method for training a neural network model is provided, the method including: performing input data transformation on an input data matrix of training data through an input transformation matrix of a winograd algorithm to obtain a transformed input data matrix; performing feature extraction on the transformed input data matrix by using a transformed weight matrix to obtain an intermediate matrix, where the transformed weight matrix is obtained by performing weight transformation on the weight matrix of at least one feature extraction layer through a weight transformation matrix of the winograd algorithm, and each element in the intermediate matrix is determined according to the Lp distance between the elements at corresponding positions in the transformed input data matrix and the transformed weight matrix; performing output data transformation on the intermediate matrix through an output transformation matrix of the winograd algorithm to obtain an output data matrix; determining the value of a loss function according to the output data matrix; and training the neural network model according to the value of the loss function, where p is 2 during the mth iteration of training the neural network model, p is 1 during the nth iteration of training the neural network model, m and n are positive integers, and m is smaller than n.
According to the scheme provided by the embodiment of the application, in the early stage of training, the L2 distance is used to assist training; the L2 distance is more friendly to the winograd algorithm, which can increase the convergence speed of the training process and further improve the training effect of the model. In the later stage of training, training is performed based on the L1 distance so as to further improve the training effect of a model that uses the L1 distance, and the trained model uses the L1 distance, which is more hardware-friendly.
The type of training data is related to the task of the neural network model. For example, the neural network model is used for image processing tasks, and the training data may be images. Specifically, the image processing tasks include image classification, image detection, image segmentation, image generation, or the like. For another example, the neural network model is used for text processing tasks, and the training data may be text. In particular, text processing tasks include text recognition or text translation, and the like. For another example, the neural network model is used for speech processing tasks, and the training data may be speech data. Specifically, the speech processing tasks include speech recognition and the like. The embodiment of the application does not limit the type of the training data.
Specifically, the output data matrix is part or all of the output feature map, the processing result of the training data can be determined according to the output feature map, and the value of the loss function is calculated according to the processing result of the training data.
The processing result of the training data is related to the type of the training data and the task of the neural network model.
Illustratively, the training data is image data, and the image processing may include image super-resolution processing, image denoising processing, image recognition processing, and the like; accordingly, the image processing result includes a super-resolution image, a denoised image, an image category, or the like. The embodiment of the present application is not limited thereto.
Illustratively, the training data is speech data, the speech processing may include speech recognition or the like, and the speech processing results include speech recognition results or the like, accordingly. The embodiment of the present application is not limited thereto.
With reference to the second aspect, in some implementations of the second aspect, during training of the neural network model, an initial value of p is 2, and a value of p decreases with an increase in the number of iterations.
That is, p is reduced from 2 to 1 during training.
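For illustration only (not part of the application), the following Python sketch shows one possible way to realize the iteration-dependent choice of p and the corresponding Lp-based intermediate matrix; the linear decay schedule and the element-wise form -|·|^p are assumptions, since the text only states that p starts at 2 and decreases to 1 with the number of iterations.

import numpy as np

def p_schedule(iteration, total_iterations):
    """Hypothetical schedule: p starts at 2 and decays linearly to 1.

    The application only states that p starts at 2 and decreases with the
    number of iterations; a linear decay is just one possible choice.
    """
    frac = min(iteration / max(total_iterations - 1, 1), 1.0)
    return 2.0 - frac  # 2.0 at the first iteration, 1.0 at the last

def lp_intermediate(d_t, g_t, p):
    """Assumed element-wise Lp form of the intermediate matrix during training:
    -|g_t - d_t|^p, which reduces to the L1 case of the first aspect when p = 1."""
    return -np.abs(g_t - d_t) ** p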
With reference to the second aspect, in certain implementations of the second aspect, training the neural network model according to the value of the loss function includes: and adjusting the weights in the first weight matrix according to the partial derivative of the loss function on the weights in the first weight matrix, wherein the first weight matrix comprises a weight matrix before transformation or a weight matrix after transformation.
In the training process, the gradient of the weight matrix before transformation can be calculated, and then the value of the weight matrix before transformation is adjusted. Alternatively, the gradient of the transformed weight matrix may be calculated, and the values of the transformed weight matrix may be adjusted.
With reference to the second aspect, in certain implementations of the second aspect, the partial derivative of the loss function with respect to the weights in the first weight matrix satisfies the following formula:
where p is the norm used, p ∈ [1,2]; w represents a weight; x represents data in the input data matrix before transformation or data in the input data matrix after transformation; L represents the loss function; i represents the layer index of the neural network model; and sign(·) represents the sign function.
When p is 2, the above formula is the partial derivative of the loss function with respect to the first weight matrix based on the L2 distance, that is, back propagation is performed based on the L2 distance. When p is 1, the above formula is the partial derivative of the loss function with respect to the first weight matrix based on the L1 distance, that is, back propagation is performed based on the L1 distance.
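The exact partial-derivative formula is not reproduced in this text. The following Python sketch only illustrates, under the assumption that an intermediate element contributes -|x - w|^p to the output, what such a weight gradient would look like; the precise formula of the application may differ.

import numpy as np

def dLp_dw(x, w, p, upstream_grad):
    """Hypothetical sketch of the weight gradient for an Lp-distance element,
    assuming the element contributes -|x - w|^p.

    d(-|x - w|^p)/dw = p * sign(x - w) * |x - w|**(p - 1)
    For p = 2 this reduces to 2 * (x - w); for p = 1 it reduces to sign(x - w).
    """
    return upstream_grad * p * np.sign(x - w) * np.abs(x - w) ** (p - 1)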
In a third aspect, an operation device of a neural network model is provided, the device comprising means or units for performing the method of any one of the implementations of the first aspect and the first aspect.
In a fourth aspect, a training apparatus of a neural network model is provided, the apparatus comprising means or units for performing the method of any one of the implementations of the second aspect and the above-described second aspect.
It should be appreciated that the extensions, limitations, explanations and illustrations of the relevant content in the first aspect described above also apply to the same content in the second, third and fourth aspects.
In a fifth aspect, there is provided an operation device of a neural network model, the device comprising: a memory for storing a program; a processor for executing the program stored in the memory, the processor being configured to perform the method of the first aspect and any implementation manner of the first aspect when the program stored in the memory is executed.
The processor in the fifth aspect may be a central processing unit (CPU), or a combination of a CPU and a neural network operation processor, where the neural network operation processor may include a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), and the like. The TPU is an artificial intelligence accelerator application-specific integrated circuit fully customized by Google for machine learning.
In a sixth aspect, there is provided a training apparatus for a neural network model, the apparatus comprising: a memory for storing a program; a processor for executing the program stored in the memory, the processor being configured to perform the method of the second aspect and any implementation manner of the second aspect when the program stored in the memory is executed.
The processor in the sixth aspect may be a central processing unit, or a combination of a CPU and a neural network operation processor, where the neural network operation processor may include a graphics processor, a neural network processor, a tensor processor, and the like. The TPU is an artificial intelligence accelerator application-specific integrated circuit fully customized by Google for machine learning.
In a seventh aspect, a computer readable storage medium is provided, the computer readable medium storing program code for execution by a device, the program code comprising instructions for performing the method in any one of the implementations of the first or second aspects.
In an eighth aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of the implementations of the first or second aspects described above.
In a ninth aspect, a chip is provided, the chip including a processor and a data interface, the processor reading instructions stored on a memory through the data interface, and executing the method in any implementation manner of the first aspect or the second aspect.
Optionally, as an implementation manner, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored on the memory, where the instructions, when executed, are configured to perform the method in any implementation manner of the first aspect or the second aspect.
The chip may be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence subject framework provided by an embodiment of the present application;
fig. 2 is a schematic structural diagram of a system architecture according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present application;
fig. 4 is a schematic diagram of a hardware structure of a chip according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 6 is a schematic block diagram of an operation device of a neural network model according to an embodiment of the present application;
FIG. 7 is a schematic flowchart of an operation method of a neural network model according to an embodiment of the present application;
FIG. 8 is a schematic flow chart of a training method of a neural network model according to an embodiment of the present application;
FIG. 9 is a schematic block diagram of a training apparatus for neural network models provided by an embodiment of the present application;
FIG. 10 is a schematic block diagram of an computing device of a neural network model according to an embodiment of the present application;
FIG. 11 is a schematic block diagram of another training apparatus for neural network models provided by an embodiment of the present application;
fig. 12 is a schematic block diagram of an operation device of another neural network model according to an embodiment of the present application.
Detailed Description
The technical scheme of the application will be described below with reference to the accompanying drawings.
FIG. 1 illustrates a schematic diagram of an artificial intelligence framework that describes the overall workflow of an artificial intelligence system, applicable to general artificial intelligence field requirements.
The above-described artificial intelligence topic framework is described in detail below from two dimensions, the "Smart information chain" (horizontal axis) and the "information technology (information technology, IT) value chain" (vertical axis).
The "intelligent information chain" reflects a series of processes from data acquisition to data processing. For example, it may be a general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a condensation process from "data" to "information" to "knowledge" to "wisdom".
The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of human intelligence and information (providing and processing technology implementations) to the industrial ecology of the system.
(1) Infrastructure:
the infrastructure provides computing capability support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the base platform.
The infrastructure may communicate with the outside through sensors, and the computing power of the infrastructure may be provided by the smart chip.
The smart chip may be a hardware acceleration chip such as a central processing unit (central processing unit, CPU), a neural network processor (neural-network processing unit, NPU), a graphics processor (graphics processing unit, GPU), an application-specific integrated circuit (application specific integrated circuit, ASIC), or a field programmable gate array (field programmable gate array, FPGA).
The basic platform of the infrastructure can comprise a distributed computing framework, network and other relevant platform guarantees and supports, and can comprise cloud storage, computing, interconnection network and the like.
For example, for an infrastructure, data may be obtained through sensor and external communication and then provided to a smart chip in a distributed computing system provided by the base platform for computation.
(2) Data:
the data of the upper layer of the infrastructure is used to represent the data source in the field of artificial intelligence. The data relate to graphics, images, voice and text, and also relate to internet of things data of traditional equipment, wherein the data comprise service data of an existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) And (3) data processing:
such data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Wherein machine learning and deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning modes in a computer or an intelligent system, and carrying out machine thinking and problem solving by using formal information according to a reasoning control strategy, and typical functions are searching and matching.
Decision making refers to the process of making decisions after intelligent information is inferred, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capabilities:
after the data has been processed, some general-purpose capabilities can be formed based on the result of the data processing, such as algorithms or a general-purpose system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
(5) Intelligent product and industry application:
the intelligent products and industry applications refer to the products and applications of the artificial intelligence system in various fields; they are the encapsulation of the overall artificial intelligence solution, turning intelligent information decision-making into products and realizing practical applications. The application fields mainly include: intelligent manufacturing, intelligent transportation, smart home, intelligent medical care, intelligent security, automatic driving, safe city, intelligent terminal, and the like.
The embodiment of the application can be applied to various fields in artificial intelligence, such as intelligent manufacturing, intelligent transportation, intelligent home, intelligent medical treatment, intelligent security, automatic driving, safe city and the like.
Specifically, the embodiment of the application can be particularly applied to the fields of automatic driving, image classification, image retrieval, image semantic segmentation, image quality enhancement, image super-resolution, natural language processing and the like which need to use (depth) neural networks.
The following briefly describes two application scenarios: picture classification and monitoring.
Classifying pictures:
when a user stores a large number of pictures on terminal equipment (for example, a mobile phone) or a cloud disk, the user or the system can conveniently manage the album in a classified mode by identifying the images in the album, and user experience is improved.
By utilizing the operation method of the neural network model, hardware cost can be reduced, and the neural network model is more friendly to terminal equipment. In addition, the method can improve the speed of classifying the pictures by using the neural network, is beneficial to labeling the pictures in different categories in real time, and is convenient for users to check and search. In addition, the classification labels of the pictures can also be provided for an album management system to carry out classification management, so that the management time of a user is saved, the album management efficiency is improved, and the user experience is improved.
And (3) monitoring:
the monitoring scene comprises: smart city, field monitoring, indoor monitoring, outdoor monitoring, in-car monitoring, etc. In the smart city scenario, various attribute identifications, such as pedestrian attribute identification and riding attribute identification, are required, and the deep neural network plays an important role in various attribute identifications by virtue of the strong capability of the deep neural network.
By adopting the operation method of the neural network model, the processing efficiency of the neural network model can be improved, the real-time processing of the input road picture is facilitated, different attribute information in the road picture can be recognized more quickly, and meanwhile, the power consumption is reduced.
Since embodiments of the present application relate to a large number of applications of neural networks, for ease of understanding, the following description will first discuss the terms and concepts related to neural networks that may be involved in embodiments of the present application.
(1) Neural network
A neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes x_s and an intercept of 1 as inputs, and the output of the arithmetic unit may be:
f( Σ_{s=1}^{n} W_s · x_s + b );
where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit.
f is an activation function (activation functions) of the neural unit for introducing a nonlinear characteristic into the neural network to transform an input signal in the neural unit into an output signal. The output signal of the activation function may be used as an input to the next layer. For example, the activation function may be a ReLU, tanh, or sigmoid function.
A neural network is a network formed by joining together a plurality of the above-described single neural units, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to a local receptive field of a previous layer to extract features of the local receptive field, which may be an area composed of several neural units.
(2) Deep neural network
Deep neural networks (deep neural network, DNN), also known as multi-layer neural networks, can be understood as neural networks with multiple hidden layers. The DNNs are divided according to the positions of different layers, and the neural networks inside the DNNs can be divided into three types: input layer, hidden layer, output layer. Typically the first layer is the input layer, the last layer is the output layer, and the intermediate layers are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
Although DNN appears to be complex, the work of each layer is not complex; it is simply the following linear relational expression: y = α(W·x + b), where x is the input vector, y is the output vector, b is the offset (bias) vector, W is the weight matrix (also called coefficients), and α() is the activation function. Each layer simply performs such an operation on the input vector x to obtain the output vector y. Since a DNN has many layers, the number of coefficient matrices W and offset vectors b is also large. These parameters are defined in the DNN as follows. Taking the coefficient W as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_{24}, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4.
In summary, the coefficient from the kth neuron of the (L-1)th layer to the jth neuron of the Lth layer is defined as W^L_{jk}.
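As a simple illustration of the layer expression above (not taken from the application), one fully connected layer can be written as follows; the ReLU activation is merely an example choice.

import numpy as np

def dense_layer(x, W, b, alpha=lambda z: np.maximum(z, 0.0)):
    """One fully connected layer: y = alpha(W @ x + b).

    W[j, k] corresponds to the coefficient from neuron k of the previous layer
    to neuron j of the current layer; alpha defaults to ReLU purely as an example.
    """
    return alpha(W @ x + b)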
It should be noted that the input layer is devoid of W parameters. In deep neural networks, more hidden layers make the network more capable of characterizing complex situations in the real world. Theoretically, the more parameters the higher the model complexity, the greater the "capacity", meaning that it can accomplish more complex learning tasks. The process of training the deep neural network, i.e. learning the weight matrix, has the final objective of obtaining a weight matrix (a weight matrix formed by a number of layers of vectors W) for all layers of the trained deep neural network.
(3) Convolutional neural network
A convolutional neural network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network contains a feature extractor composed of convolutional layers and sub-sampling layers, and the feature extractor can be regarded as a filter. A convolutional layer is a neuron layer in the convolutional neural network that performs convolution processing on the input signal. In a convolutional layer of a convolutional neural network, a neuron may be connected to only some of the neurons in adjacent layers. A convolutional layer usually contains several feature planes, and each feature plane may be composed of some rectangularly arranged neural units. Neural units in the same feature plane share weights, and the shared weights are the convolution kernel. Sharing weights can be understood as meaning that the way image information is extracted is independent of location. The convolution kernel may be initialized in the form of a matrix of random size, and reasonable weights can be obtained through learning during the training of the convolutional neural network. In addition, a direct benefit of sharing weights is reducing the connections between the layers of the convolutional neural network while reducing the risk of overfitting.
(4) Loss function
In the process of training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is actually desired, the weight vector of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the actually desired target value (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer of the deep neural network); for example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the actually desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible. Generally, the smaller the loss, the higher the training quality of the deep neural network; the larger the loss, the lower the training quality. Similarly, the smaller the loss fluctuation, the more stable the training; the larger the loss fluctuation, the less stable the training.
(5) Back propagation algorithm
A neural network can use the back propagation (BP) algorithm to correct the values of the parameters in the neural network model during training, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the input signal is propagated forward until the output, producing an error loss, and the parameters in the neural network model are updated by propagating the error loss information backward, so that the error loss converges. The back propagation algorithm is a back propagation process dominated by the error loss, and aims to obtain the optimal parameters of the neural network model, such as the weight matrices.
For example, the loss value generated by each training pass of the neural network model is passed layer by layer from back to front in the neural network model. When it is passed to each layer, the update amount of that layer's parameters (a partial-derivative operation) is calculated at the same time, and this update amount is related to the gradient.
(6) Adder neural networks (adder neural network, adderNet)
The structure of AdderNet is similar to that of a CNN. A convolution layer in a CNN may be used to perform feature extraction or filtering on the input data, and may act as a feature extraction layer in the CNN. Likewise, an adder layer in AdderNet may be used to perform feature extraction or filtering on the input data, and may act as a feature extraction layer in AdderNet. The parameters of a convolution kernel in a CNN can be understood as corresponding to the parameters of an adder in AdderNet. Both the convolution kernel in a CNN and the adder in AdderNet can be understood as filters.
Each convolution layer in a CNN extracts feature information from the input data through convolution operations, whereas AdderNet computes the output features using the L1 distance. Specifically, the adder layer in AdderNet extracts feature information from the input data through addition operations (or subtraction operations) and absolute-value operations.
Since the computational complexity of addition is much lower than that of multiplication, the operation power consumption of AdderNet is much lower than that of a CNN with comparable performance. For example, AdderNet is obtained by replacing the convolution operations in a CNN with addition operations (or subtraction operations) and absolute-value operations. Therefore, the operation power consumption of the CNN can be greatly reduced while the performance is maintained.
The output feature map of an adder layer in AdderNet satisfies the following formula:
Y(m, n, t) = -Σ_{i=0}^{d-1} Σ_{j=0}^{d-1} Σ_{k=0}^{c_in-1} | X(m+i, n+j, k) - F(i, j, k, t) |;
where Y(m, n, t) represents the element in the mth row, nth column, and tth page (channel) of the output feature map; X(m+i, n+j, k) represents the element in the (m+i)th row, (n+j)th column, and kth page of the input feature map; F(i, j, k, t) represents the element in the ith row, jth column, and kth page of the tth filter; t indexes the filters (the channels of the output feature map); d represents the number of rows of the filter; and c_in represents the number of channels of the input feature map.
In the forward calculation process, the AdderNet adopts L1 distance to extract the characteristics, namely, the addition operation is utilized to replace the multiplication operation, so that the calculation complexity is reduced, and the power consumption and the hardware area are reduced.
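The following Python/NumPy sketch illustrates the adder-layer computation described above; it is a minimal reference implementation written for this text, not code from the application.

import numpy as np

def adder_layer_output(X, F):
    """Minimal sketch of the AdderNet adder layer described above.

    X: input feature map of shape (H, W, c_in)
    F: filters of shape (d, d, c_in, T), T = number of filters
    Returns Y of shape (H - d + 1, W - d + 1, T) with
    Y[m, n, t] = -sum_{i,j,k} |X[m+i, n+j, k] - F[i, j, k, t]|.
    """
    H, W, c_in = X.shape
    d = F.shape[0]
    T = F.shape[3]
    Y = np.zeros((H - d + 1, W - d + 1, T))
    for m in range(H - d + 1):
        for n in range(W - d + 1):
            patch = X[m:m + d, n:n + d, :]            # (d, d, c_in)
            diff = np.abs(patch[..., None] - F)       # (d, d, c_in, T)
            Y[m, n, :] = -diff.sum(axis=(0, 1, 2))    # negative L1 distance
    return Y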
(7) Winograd algorithm
The Winograd algorithm is a common convolution rapid calculation method, and can greatly reduce the calculation amount of CNN and improve the calculation efficiency under the condition of not affecting the calculation result.
The winograd algorithm satisfies the following formula:
Y = A^T [ [G g G^T] ⊙ [B^T d B] ] A;
where Y represents the output data matrix; g represents the convolution kernel, that is, the weight matrix; d represents a tile of the input feature map, that is, the input data matrix; ⊙ denotes element-wise multiplication, that is, the dot product between the matrices. A, G, and B each represent a transformation matrix; specifically, A represents the output transformation matrix, G represents the weight transformation matrix, and B represents the input transformation matrix. Since g does not change during inference of the neural network, the transformed weight matrix G g G^T may be pre-computed, that is, computed before the forward computation starts. This can further reduce the operation power consumption of the neural network model.
A winograd algorithm of the form F(m, r) is used to quickly compute a convolution in which the size of the convolution kernel is r and the size of the output feature map is m. Size can also be understood as dimension.
The transformation matrices may be determined according to the dimension of the convolution kernel and the step size (stride). In other words, for a combination of a convolution kernel of a specific size and a specific stride, B, G, and A are fixed and can be derived from the winograd algorithm.
In practical applications, a commonly used form of the winograd algorithm is F(2×2, 3×3).
Wherein the output data matrix Y is a 2×2 matrix, and the weight matrix g is a 3×3 matrix.
For a combination of a weight matrix of 3×3 and stride of 1, the transform matrix satisfies the following formula:
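The formula itself is not reproduced in this text. For reference, the widely used general-solution matrices of F(2×2, 3×3) with stride 1 are listed below in a Python sketch; they are consistent with the 0, ±1, ±1/2 element values mentioned later, but they are provided as an assumption rather than copied from the application.

import numpy as np

# Widely used winograd F(2x2, 3x3) transformation matrices (stride 1),
# given here for reference only.
B = np.array([[1,  0,  0,  0],
              [0,  1, -1,  1],
              [-1, 1,  1,  0],
              [0,  0,  0, -1]], dtype=float)   # input transformation matrix (4x4)
G = np.array([[1,    0,   0  ],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0,    0,   1  ]], dtype=float)   # weight transformation matrix (4x3)
A = np.array([[1,  0],
              [1,  1],
              [1, -1],
              [0, -1]], dtype=float)            # output transformation matrix (4x2)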
the calculation flow of the wingrad algorithm of F (2×2,3×3) is exemplified below.
1) Transforming the 4×4 input data matrix d by the 4×4 input transformation matrix B to obtain the transformed 4×4 input data matrix B^T d B.
2) Transforming the 3×3 weight matrix g by the 4×3 weight transformation matrix G to obtain the transformed 4×4 weight matrix G g G^T.
3) Multiplying the transformed input data matrix and the transformed weight matrix element by element, that is, multiplying the elements at corresponding positions in the two matrices, to obtain the 4×4 intermediate matrix [G g G^T] ⊙ [B^T d B].
4) Transforming the intermediate matrix by the 4×2 output transformation matrix A to obtain the 2×2 output data matrix Y = A^T [ [G g G^T] ⊙ [B^T d B] ] A. The output data matrix is the result of the convolution calculation of the input data matrix d and the weight matrix g.
If a conventional convolution operation is used to obtain the 4 results in the output data matrix, each result requires 9 (3×3) multiplications, so the 4 results require 36 multiplications. If the winograd algorithm is used, apart from the overhead of the transformations, only the 16 (4×4) multiplications in step 3) are needed, so the speed-up ratio in terms of the number of multiplications is 2.25 (36/16). The elements in the transformation matrices are all 0, ±1, or ±1/2, which can be implemented by lightweight hardware operations such as sign-bit changes and shift operations; that is, the transformation overhead is usually small.
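The following Python/NumPy sketch walks through steps 1) to 4) above for an arbitrary example tile and kernel and checks the result against direct convolution; it uses the commonly cited F(2×2, 3×3) transformation matrices from the earlier sketch and is provided for illustration only.

import numpy as np

# Commonly used F(2x2, 3x3) transformation matrices (same as in the earlier sketch).
B = np.array([[1, 0, 0, 0], [0, 1, -1, 1], [-1, 1, 1, 0], [0, 0, 0, -1]], dtype=float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)
A = np.array([[1, 0], [1, 1], [1, -1], [0, -1]], dtype=float)

def winograd_f2x2_3x3(d, g):
    """Steps 1) to 4) above for one 4x4 tile d and one 3x3 kernel g."""
    d_t = B.T @ d @ B        # 1) transformed 4x4 input data matrix
    g_t = G @ g @ G.T        # 2) transformed 4x4 weight matrix
    u = g_t * d_t            # 3) element-wise multiplication: 16 multiplications
    return A.T @ u @ A       # 4) output data transformation -> 2x2 output

def direct_conv_2x2(d, g):
    """Reference result: the same 2x2 output by direct sliding-window convolution."""
    out = np.zeros((2, 2))
    for m in range(2):
        for n in range(2):
            out[m, n] = np.sum(d[m:m + 3, n:n + 3] * g)
    return out

d = np.arange(16, dtype=float).reshape(4, 4)   # arbitrary example input tile
g = np.arange(9, dtype=float).reshape(3, 3)    # arbitrary example weight matrix
assert np.allclose(winograd_f2x2_3x3(d, g), direct_conv_2x2(d, g))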
As shown in fig. 2, an embodiment of the present application provides a system architecture 100. In fig. 2, a data acquisition device 160 is used to acquire training data. For example, for the training method of the neural network model according to the embodiment of the present application, if the training data is image data, the training data may include a training image and a processing result corresponding to the training image. For example, the classification result corresponding to the training image may be a manually pre-labeled result.
After the training data is collected, the data collection device 160 stores the training data in the database 130 and the training device 120 trains the target model/rule 101 based on the training data maintained in the database 130.
The training device 120 obtains the target model/rule 101 based on the training data, and the training device 120 processes the input raw data and compares the output value with the target value until the difference between the value output by the training device 120 and the target value is smaller than a certain threshold value, thereby completing the training of the target model/rule 101.
The target model/rule 101 in the embodiment of the present application may be specifically a neural network model. Such as convolutional neural networks or residual networks. In practical applications, the training data maintained in the database 130 is not necessarily collected by the data collecting device 160, but may be received from other devices. It should be noted that the training device 120 is not necessarily completely based on the training data maintained by the database 130 to perform training of the target model/rule 101, and it is also possible to obtain the training data from the cloud or other places to perform model training, which should not be taken as a limitation of the embodiments of the present application.
The target model/rule 101 obtained through training by the training device 120 may be applied to different systems or devices, such as the execution device 110 shown in fig. 2. The execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, or may be a server, a cloud, or the like. In fig. 2, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices, and a user may input data to the I/O interface 112 through the client device 140; in an embodiment of the present application, the input data may include data to be processed input by the client device.
In preprocessing input data by the execution device 110, or in performing processing related to computation or the like by the computation module 111 of the execution device 110, the execution device 110 may call data, codes or the like in the data storage system 150 for corresponding processing, or may store data, instructions or the like obtained by corresponding processing in the data storage system 150.
Finally, the I/O interface 112 returns the processing results, such as the processing results of the data obtained as described above, to the client device 140, thereby providing the processing results to the user.
It should be noted that the training device 120 may generate, based on different training data, a corresponding target model/rule 101 for different targets or different tasks, where the corresponding target model/rule 101 may be used to achieve the targets or to complete the tasks, thereby providing the user with the desired result.
In the case shown in FIG. 2, the user may manually give input data that may be manipulated through an interface provided by the I/O interface 112. In another case, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data requiring the user's authorization, the user may set the corresponding permissions in the client device 140. The user may view the results output by the execution device 110 at the client device 140, and the specific presentation may be in the form of a display, a sound, an action, or the like. The client device 140 may also be used as a data collection terminal to collect input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data as shown in the figure, and store the new sample data in the database 130. Of course, instead of being collected by the client device 140, the I/O interface 112 may directly store the input data input to the I/O interface 112 and the output result output from the I/O interface 112 as new sample data into the database 130.
It should be noted that fig. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among devices, apparatuses, modules, etc. shown in the drawing is not limited in any way, for example, in fig. 2, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may be disposed in the execution device 110.
As shown in fig. 2, the training device 120 obtains the target model/rule 101 through training, where the target model/rule 101 may be the neural network in the embodiment of the present application; specifically, the neural network in the embodiment of the present application may be a CNN or a residual network.
CNN is a very common neural network, and the structure of CNN is described in detail below with reference to fig. 3. As described in the foregoing description of the basic concept, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning architecture, where the deep learning architecture refers to learning at multiple levels at different abstraction levels through machine learning algorithms. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to an image input thereto.
As shown in fig. 3, convolutional Neural Network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a fully-connected layer (fully connected layer) 230.
Convolution layer/pooling layer 220:
convolution layer:
the convolution/pooling layer 220 as shown in fig. 3 may include layers as examples 221-226, for example: in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, 221, 222 are convolutional layers, 223 are pooling layers, 224, 225 are convolutional layers, and 226 are pooling layers. I.e. the output of the convolution layer may be used as input to a subsequent pooling layer or as input to another convolution layer to continue the convolution operation.
The internal principle of operation of one convolution layer will be described below using the convolution layer 221 as an example.
The convolutional layer 221 may include many convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. In the process of performing the convolution operation on an image, the weight matrix is usually processed on the input image pixel by pixel (or two pixels by two pixels, and so on, depending on the value of the stride) along the horizontal direction, so as to complete the task of extracting a specific feature from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends to the entire depth of the input image during the convolution operation. Therefore, convolving with a single weight matrix produces a convolution output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same shape, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where the dimension can be understood as being determined by the "multiple" mentioned above. Different weight matrices may be used to extract different features from the image; for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a specific color of the image, and yet another weight matrix is used to blur unwanted noise in the image. The multiple weight matrices have the same size (rows × columns), the feature maps extracted by the multiple weight matrices of the same size also have the same size, and the extracted feature maps of the same size are combined to form the output of the convolution operation.
The weight values in the weight matrices are required to be obtained through a large amount of training in practical application, and each weight matrix formed by the weight values obtained through training can be used for extracting information from an input image, so that the convolutional neural network 200 can perform correct prediction.
When convolutional neural network 200 has multiple convolutional layers, the initial convolutional layer (e.g., 221) tends to extract more general features, which may also be referred to as low-level features; as the depth of the convolutional neural network 200 increases, features extracted by the later convolutional layers (e.g., 226) become more complex, such as features of high level semantics, which are more suitable for the problem to be solved.
Pooling layer:
Since it is often desirable to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. In the layers 221-226 illustrated at 220 in FIG. 3, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain an image of a smaller size, as shown in the sketch below. The average pooling operator may compute the pixel values in the image within a particular range to produce an average value as the result of average pooling. The max pooling operator may take the pixel with the largest value within a particular range as the result of max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average value or maximum value of the corresponding sub-region of the image input to the pooling layer.
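As a simple illustration of the pooling operators described above (not taken from the application), the following sketch shows 2×2 max pooling and average pooling with a stride of 2; the kernel size and stride are example choices.

import numpy as np

def pool2x2(x, mode="max"):
    """2x2 pooling with stride 2 over a (H, W) feature map; H and W are
    assumed to be even. mode is "max" or "avg"."""
    H, W = x.shape
    blocks = x.reshape(H // 2, 2, W // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))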
Full connection layer 230:
after processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet sufficient to output the desired output information. Because, as previously described, the convolution/pooling layer 220 will only extract features and reduce the parameters imposed by the input image. However, in order to generate the final output information (the required class information or other relevant information), convolutional neural network 200 needs to utilize fully-connected layer 230 to generate the output of the required number of classes or groups. Thus, multiple hidden layers (231, 232 to 23n shown in fig. 3) may be included in the fully connected layer 230, and parameters included in the multiple hidden layers may be pre-trained according to relevant training data of a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
After the hidden layers in the fully connected layer 230, that is, as the final layer of the overall convolutional neural network 200, the output layer 240 is provided. The output layer 240 has a loss function similar to categorical cross entropy, specifically used for calculating the prediction error. Once the forward propagation of the overall convolutional neural network 200 (e.g., propagation from 210 to 240 in fig. 3) is completed, the backward propagation (e.g., propagation from 240 to 210 in fig. 3) starts to update the weights and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
It should be noted that the convolutional neural network 200 shown in fig. 3 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, for example, only includes a part of the network structure shown in fig. 3, for example, the convolutional neural network used in the embodiment of the present application may include only the input layer 210, the convolutional layer/pooling layer 220, and the output layer 240.
Fig. 4 is a hardware structure of a chip according to an embodiment of the present application, where the chip includes a neural network processor 50. The chip may be provided in an execution device 110 as shown in fig. 2 for performing the calculation of the calculation module 111. The chip may also be provided in the training device 120 as shown in fig. 2 for completing the training work of the training device 120 and outputting the target model/rule 101. The method of the embodiment of the present application may be implemented in a chip as shown in fig. 4.
The neural network processor 50 may be a neural-network processing unit (NPU), a tensor processor (tensor processing unit, TPU), a graphics processor (graphics processing unit, GPU), or any other processor suitable for large-scale exclusive-or operation processing. Taking the NPU as an example: the neural network processor NPU 50 is mounted as a coprocessor to a main central processing unit (central processing unit, CPU) (host CPU), and tasks are distributed by the main CPU. The core part of the NPU is the arithmetic circuit 503, and the controller 504 controls the arithmetic circuit 503 to extract data from a memory (a weight memory or an input memory) and perform operations. The TPU is an artificial intelligence accelerator application-specific integrated circuit fully customized by Google for machine learning.
In some implementations, the arithmetic circuitry 503 internally includes a plurality of processing units (PEs). In some implementations, the operational circuitry 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 503 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 502 and buffers the data on each PE in the arithmetic circuit. The arithmetic circuit takes the matrix A data from the input memory 501, performs a matrix operation with the matrix B, and stores the obtained partial result or final result of the matrix in an accumulator (accumulator) 508.
The vector calculation unit 507 may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 507 may be used for network calculations of non-convolutional/non-FC layers in a neural network, such as pooling (pooling), batch normalization (batch normalization, BN), local response normalization (local response normalization), and the like.
In some implementations, the vector computation unit 507 can store the vector of processed outputs to the unified buffer 506. For example, the vector calculation unit 507 may apply a nonlinear function to an output of the operation circuit 503, such as a vector of accumulated values, to generate an activation value. In some implementations, the vector calculation unit 507 generates a normalized value, a combined value, or both. In some implementations, the vector of processed outputs can be used as an activation input to the operational circuitry 503, for example for use in subsequent layers in a neural network.
The unified memory 506 is used for storing input data and output data.
The storage unit access controller 505 (direct memory access controller, DMAC) transfers the input data in the external memory to the input memory 501 and/or the unified memory 506, stores the weight data in the external memory into the weight memory 502, and stores the data in the unified memory 506 into the external memory.
A bus interface unit (bus interface unit, BIU) 510 is used for interaction between the main CPU, the DMAC, and the instruction fetch memory 509 via a bus.
An instruction fetch memory (instruction fetch buffer) 509 connected to the controller 504 for storing instructions used by the controller 504;
And a controller 504 for calling the instruction cached in the instruction memory 509 to control the operation of the operation accelerator.
Typically, the unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch memory 509 are all on-chip memories, and the external memory is a memory external to the NPU, which may be a double data rate synchronous dynamic random access memory (double data rate synchronous dynamic random access memory, DDR SDRAM), a high bandwidth memory (high bandwidth memory, HBM), or another readable and writable memory.
The execution device 110 in fig. 2 or the chip in fig. 4 described above can execute the steps of the operation method of the neural network model of the embodiment of the present application. The training device 120 in fig. 2 or the chip in fig. 4 described above is capable of performing the respective steps of the training method of the neural network model of the embodiment of the present application.
As shown in fig. 5, an embodiment of the present application provides a system architecture 300. The system architecture comprises a local device 301, a local device 302, and an executing device 310 and a data storage system 350, wherein the local device 301 and the local device 302 are connected to the executing device 310 through a communication network.
The execution device 310 may be implemented by one or more servers. Alternatively, the execution device 310 may be used with other computing devices, such as: data storage, routers, load balancers, etc. The execution device 310 may be disposed on one physical site or distributed across multiple physical sites. The execution device 310 may implement the operation method of the neural network model or the training method of the neural network model according to the embodiment of the present application using data in the data storage system 350 or invoking program codes in the data storage system 350.
Specifically, in one implementation, the execution device 310 may perform the following process:
the following operations are performed in at least one feature extraction layer of the neural network model: performing input data transformation on an input data matrix of data to be processed through a winograd algorithm to obtain a transformed input data matrix; performing feature extraction on the transformed input data matrix by using the transformed weight matrix to obtain an intermediate matrix, wherein the transformed weight matrix is obtained by performing weight transformation on the weight matrix of at least one feature extraction layer through a Winograd algorithm, and each element in the intermediate matrix is determined according to the L1 distance between the transformed input data matrix and the element at the corresponding position in the transformed weight matrix; and carrying out output data transformation on the intermediate matrix through a winograd algorithm to obtain an output data matrix.
The user may operate respective user devices (e.g., local device 301 and local device 302) to interact with execution device 310. Each local device may represent any computing device, such as a personal computer, computer workstation, smart phone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set top box, game console, etc.
The local device of each user may interact with the performing device 310 through a communication network of any communication mechanism/communication standard, which may be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
In one implementation, the local device 301 or the local device 302 obtains the relevant parameters of the neural network from the execution device 310, deploys the neural network on the local device 301 or the local device 302, and uses the neural network to perform image classification, image processing, speech processing, text processing, or the like.
In another implementation, the neural network may be deployed directly on the execution device 310, where the execution device 310 processes the data to be processed by obtaining the data to be processed from the local device 301 and the local device 302, and using a neural network model.
The executing device 310 may also be a cloud device, where the executing device 310 may be deployed at the cloud; alternatively, the executing device 310 may be a terminal device, and in this case, the executing device 310 may be disposed on the user terminal side, which is not limited in the embodiment of the present application.
In the neural network model, a large number of matrix operations, for example, convolution operations, are generally involved, and the time delay of the matrix operations becomes a main factor limiting the calculation efficiency, so that the overall processing efficiency of the neural network model is affected, and the neural network model is difficult to deploy in a scene with high requirement on real-time performance. Meanwhile, when the calculated amount of the neural network model is large, the calculation force requirement on hardware is high, so that the neural network model is difficult to deploy to hardware equipment with low calculation force, such as terminal equipment of a mobile phone and the like.
Therefore, how to reduce the operation overhead of the neural network model is an urgent problem to be solved.
The embodiment of the application provides an operation method of a neural network model, which replaces the operation of element-wise multiplication in a Winograd algorithm with addition operation, further reduces the calculated amount of the neural network model, improves the operation speed of the neural network model and reduces the operation cost.
In order to better explain the method according to the embodiment of the present application, an operation device of the neural network model according to the embodiment of the present application is described below with reference to the accompanying drawings.
Fig. 6 shows an operation device of a neural network model according to an embodiment of the present application, and as shown in fig. 6, an apparatus 600 includes an input data preprocessing module 610, a weight preprocessing module 620, and an acceleration module 630.
The input data preprocessing module 610 performs input data transformation on the input data matrix through the winograd algorithm to obtain a transformed input data matrix. The input data matrix may be part or all of the input feature map. For example, if the dimension of the input feature map is 8×8, the input data matrix may be a 4×4 matrix in the input feature map, or the input data matrix may be the entire 8×8 input feature map.
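As an illustration of the relationship between the input feature map and the input data matrix, the sketch below splits an assumed 8×8 feature map into 4×4 input data matrices; the stride of 2 (overlapping tiles) is an assumption corresponding to the common F(2×2, 3×3) winograd tiling and is not mandated by the embodiment:

```python
import numpy as np

def extract_tiles(feature_map, tile=4, stride=2):
    """Split a feature map into tile x tile input data matrices (overlapping tiles assumed)."""
    h, w = feature_map.shape
    tiles = []
    for i in range(0, h - tile + 1, stride):
        for j in range(0, w - tile + 1, stride):
            tiles.append(feature_map[i:i + tile, j:j + tile])
    return tiles

fmap = np.random.rand(8, 8)          # input feature map of dimension 8x8
tiles = extract_tiles(fmap)          # each element is one 4x4 input data matrix d
print(len(tiles), tiles[0].shape)    # 9 tiles of shape (4, 4)
```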
The weight preprocessing module 620 performs weight transformation on the weight matrix through the winograd algorithm to obtain a transformed weight matrix.
It should be noted that, the weight preprocessing module 620 is an optional module.
Illustratively, the transformed weight matrix may be pre-stored in the apparatus 600, or the transformed weight matrix may be transmitted by other apparatuses into the apparatus 600.
That is, the transformed weight matrix may be calculated before the operation of the neural network model is performed, that is, calculated offline, or may be calculated when the operation of the neural network model is performed, that is, calculated online. In the reasoning process of the neural network model, the weight matrix is unchanged, and the calculation speed can be further improved and the calculation cost can be reduced by obtaining the transformed weight matrix through offline calculation.
The acceleration module 630 is configured to perform feature extraction on the transformed input data matrix by using the transformed weight matrix to obtain an intermediate matrix, and to perform output data transformation on the intermediate matrix through the winograd algorithm to obtain an output data matrix. Each element in the intermediate matrix is determined according to the L1 distance between the transformed input data matrix and the element at the corresponding position in the transformed weight matrix.
The output data matrix may be part or all of the output profile.
Specifically, the acceleration module may obtain the intermediate matrix by performing a subtraction operation, an absolute-value operation, and a negation operation (taking the opposite number) on the transformed input data matrix and the transformed weight matrix.
The specific process is detailed in method 700 hereinafter, and is not described herein.
Illustratively, the acceleration module 630 may include a matrix operation module and a post-processing module. The matrix operation module is used for extracting characteristics of the transformed input data matrix by using the transformed weight matrix to obtain an intermediate matrix. The post-processing module is used for carrying out output data transformation on the intermediate matrix through a winograd algorithm to obtain an output data matrix.
The method of the embodiments of the present application may be performed by a device capable of executing the winograd algorithm. Specifically, the operations in the respective modules in fig. 6 may be performed by software, by hardware, or by a combination of software and hardware. The embodiment of the present application is not limited thereto.
The operations of the various modules in fig. 6 may be performed by the chip in fig. 4, for example.
Specifically, the external memory in fig. 4 may be used to store an input data matrix, a calculation result, and the like. The neural network processor 50 may retrieve the input data matrix from an external memory.
The operations in the input data preprocessing module 610 may be performed by the vector calculation unit 507. Alternatively, the vector calculation unit 507 includes an input data preprocessing module 610.
Further, an external memory may be used to store the transformed weight matrix computed offline. In this case, the neural network processor 50 may acquire the input data matrix and the transformed weight matrix from the external memory.
If the transformed weight matrix obtained by offline computation is not stored in the external memory, the operation in the weight preprocessing module 620 may be performed by the vector calculation unit 507. Alternatively, the vector calculation unit 507 includes the weight preprocessing module 620.
The operations of the matrix operation module in the acceleration module 630 may be performed by the operation circuit 503. Alternatively, the arithmetic circuit 503 includes the matrix arithmetic module.
The operations of the post-processing module in the acceleration module 630 may be performed by the vector calculation unit 507. Alternatively, the vector calculation unit 507 includes the post-processing module.
Alternatively, the acceleration module 630 may be a dedicated module provided in addition to the chip shown in fig. 4; compared with performing the operations of the acceleration module 630 using the arithmetic circuit and the vector calculation unit, providing such a dedicated module can further increase the processing speed.
The following describes the operation method of the neural network model in the embodiment of the present application in detail with reference to fig. 7.
Fig. 7 illustrates an operation method 700 of the neural network model according to an embodiment of the present application. The method shown in fig. 7 may be performed by an execution device of the neural network model, and the device may be a cloud service device, or may be a terminal device, for example, a device having an operation capability sufficient for executing the neural network model operation, such as a computer or a server, or may be a system formed by the cloud service device and the terminal device. Illustratively, the method 700 may be performed by the executing device 110 of fig. 2, the neural network processor 50 of fig. 4, or the executing device 310 or the local device of fig. 5. Alternatively, the method 700 may be performed by a device providing an AutoML service. For example, the device providing the AutoML service may be a cloud service device.
For example, the method 700 may be specifically performed by the execution device 110 as shown in fig. 2, and the data to be processed in the method 700 may be input data given by the client device 140 as shown in fig. 2.
The method 700 includes steps S710 to S730. Step S710 to step S730 are described in detail below.
The following steps are performed in at least one feature extraction layer of the neural network model:
s710, carrying out input data transformation on an input data matrix of the data to be processed through a winograd algorithm to obtain a transformed input data matrix.
The data to be processed includes image data, voice data, text data, or the like.
The type of data to be processed is related to the task of the neural network model. For example, if the neural network model is used for an image processing task, the data to be processed may be an image. Specifically, the image processing tasks include image classification, image detection, image segmentation, image recognition, image generation, or the like. For another example, if the neural network model is used for a text processing task, the data to be processed may be text. In particular, text processing tasks include text recognition or text translation, and the like. For another example, if the neural network model is used for a voice processing task, the data to be processed may be voice data. Specifically, the speech processing tasks include speech recognition and the like. The embodiment of the application does not limit the type of the data to be processed.
Taking the case in which the data to be processed is an image as an example, the image to be processed may be an image captured by a terminal device (or another apparatus or device such as a computer or a server) through a camera, or the image to be processed may be an image obtained from inside the terminal device (or another apparatus or device such as a computer or a server), for example, an image stored in an album of the terminal device, or an image obtained by the terminal device from the cloud, which is not limited by the embodiment of the present application.
The neural network model in the embodiments of the present application may be an existing neural network model, for example, a residual network. Alternatively, the neural network model may be a neural network model of other structure that is self-building. The embodiment of the present application is not limited thereto.
The input data matrix of the data to be processed refers to the data matrix input to the at least one feature extraction layer.
Illustratively, the feature extraction layer comprises an adder layer, i.e., feature extraction is performed with an addition operation as the filter. For a specific description, reference may be made to the feature extraction layer in AdderNet described above, and details are not repeated here.
Alternatively, the feature extraction layer includes a convolution layer, i.e., feature extraction is performed with a convolution kernel as a filter. For a specific description, reference may be made to the convolutional neural network in the foregoing, and no further description is given here. In this case, the convolutional layer may be replaced with an adder layer, and then the method 700 is performed on the feature extraction layer.
Specifically, the input data matrix is subjected to Winograd transformation through the input transformation matrix, and the transformed input data matrix is obtained.
Winograd transforms include input data transforms, weight transforms, and output data transforms performed by the Winograd algorithm. That is, all three transforms are understood to be Winograd transforms.
Illustratively, the transformed input data matrix satisfies the following formula:

transformed input data matrix = B^T dB;

wherein B represents the input transformation matrix, B^T represents the transposed matrix of B, and d represents the input data matrix before transformation.
The input transformation matrix may be, for example, an existing winograd input transformation matrix. For example, if the input data matrix d before transformation is a 4×4 matrix and the weight matrix g before transformation is a 3×3 matrix, the input transformation matrix may be the commonly used winograd input transformation matrix, whose transpose is B^T = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]].
the input data matrix may be part or all of the input feature map. For example, the dimension of the input feature map is 8×8, and the input data matrix may be a matrix of 4*4 in the input feature map, or the input data matrix may be the input feature map.
It should be noted that the input feature map may be the data itself input into the neural network model, for example, an image to be processed. The input feature map may also be a feature map obtained after one or more feature extractions performed on a part of the feature extraction layers in the neural network model, for example, a feature map obtained after one or more feature extractions performed on an image to be processed. The manner of feature extraction may be in accordance with existing schemes, such as convolutional layer processing. That is, the input feature map in the embodiment of the present application may be a feature map obtained after performing one or more convolution processes on an image to be processed. Alternatively, the feature extraction method 700 in the embodiment of the present application may also be used. That is, the input feature map in the embodiment of the present application may be a feature map obtained after processing an image to be processed by the method 700 in the embodiment of the present application. The embodiment of the application does not limit the acquisition mode of the input feature map.
Illustratively, step S710 may be performed by the input data preprocessing module 610 of fig. 6.
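A minimal sketch of the input data transformation of step S710; the matrix stored below is the transpose B^T of the commonly used winograd F(2×2, 3×3) input transformation matrix and is an assumption made for illustration:

```python
import numpy as np

# Transpose B^T of the commonly used winograd F(2x2, 3x3) input transformation matrix B
# (assumed for illustration).
B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=float)

def transform_input(d):
    """Input data transformation of step S710: B^T d B for a 4x4 input data matrix d."""
    return B_T @ d @ B_T.T

d = np.random.rand(4, 4)        # input data matrix before transformation
d_tilde = transform_input(d)    # transformed input data matrix, also 4x4
print(d_tilde.shape)
```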
S720, extracting features of the transformed input data matrix by using the transformed weight matrix to obtain an intermediate matrix. The transformed weight matrix is obtained by performing weight transformation on the weight matrix of the at least one feature extraction layer through the winograd algorithm. Each element in the intermediate matrix is determined from the L1 distance between the transformed input data matrix and the element at the corresponding position in the transformed weight matrix.
The L1 distance may also be referred to as the L1-norm distance, the Manhattan distance, or the taxicab distance.
Winograd transformation is performed on the weight matrix through the weight transformation matrix to obtain the transformed weight matrix.
Specifically, the weight matrix is a parameter of one or more feature extraction kernels in the at least one feature extraction layer in the neural network model. One feature extraction layer may include one or more feature extraction kernels, i.e., filters. The feature extraction core is used for extracting features of data input into the neural network model.
Illustratively, the transformed weight matrix satisfies the following formula:

transformed weight matrix = GgG^T;

wherein G represents the weight transformation matrix, G^T represents the transposed matrix of G, and g represents the weight matrix before transformation.
The weight transformation matrix may be, for example, an existing winograd weight transformation matrix. For example, if the input data matrix d before transformation is a 4×4 matrix and the weight matrix g before transformation is a 3×3 matrix, the weight transformation matrix may be the commonly used winograd weight transformation matrix G = [[1, 0, 0], [1/2, 1/2, 1/2], [1/2, -1/2, 1/2], [0, 0, 1]].
the transformed weight matrix can be obtained through offline transformation or online transformation.
The transformed weight matrix is obtained by offline transformation, which means that the transformed weight matrix is obtained by transformation before the operation of the neural network model is performed. For example, the transformed weight matrix may be transformed prior to deployment of the neural network model. In the reasoning process of the neural network model, the weight matrix is unchanged, and the calculation speed can be further improved and the calculation cost can be reduced by obtaining the transformed weight matrix through offline calculation.
The transformed weight matrix is obtained by online transformation, which means that the transformed weight matrix is obtained by transformation in the process of executing the operation of the neural network model.
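A minimal sketch of the weight transformation; the matrix G used below is the commonly used winograd F(2×2, 3×3) weight transformation matrix and is an assumption made for illustration. Because the weight matrix does not change during inference, the transformed weight matrix can be computed once offline and reused:

```python
import numpy as np

# Commonly used winograd F(2x2, 3x3) weight transformation matrix G (assumed for illustration).
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])

def transform_weight(g):
    """Weight transformation: G g G^T for a 3x3 weight matrix g."""
    return G @ g @ G.T

g = np.random.rand(3, 3)        # weight matrix before transformation
g_tilde = transform_weight(g)   # 4x4 transformed weight matrix
# g does not change during inference, so g_tilde can be computed once offline and stored.
print(g_tilde.shape)
```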
Specifically, each element in the intermediate matrix is the opposite number of L1 distances between the transformed input data matrix and the elements at the corresponding positions in the transformed weight matrix.
Illustratively, the intermediate matrix X satisfies the following formula:
X = -|[GgG^T] - [B^T dB]|;
in the embodiment of the application, the minus sign between two terms in the above formula represents element-by-element subtraction (element-wise minus), and |·| represents calculating an absolute value for each element in the matrix.
Step S720 may be implemented by performing a subtraction operation, an absolute-value operation, and a negation operation on the transformed input data matrix and the transformed weight matrix.
Specifically, step S720 includes steps S721 to S723.
S721, performing element-by-element subtraction operation on the transformed input data matrix and the transformed weight matrix to obtain a difference matrix.
S722, calculating the absolute value of the difference matrix to obtain an absolute value matrix.
And S723, calculating the opposite number of the absolute value matrix to obtain an intermediate matrix.
It should be noted that the addition operation and the subtraction operation are substantially the same, and the result of the subtraction operation may also be obtained by the addition operation.
Illustratively, step S720 may be performed by the acceleration module 630 of fig. 6.
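A minimal sketch of steps S721 to S723; the transformed input data matrix and the transformed weight matrix are filled with random placeholder values for illustration:

```python
import numpy as np

def intermediate_matrix(d_tilde, g_tilde):
    """Steps S721-S723: element-wise subtraction, absolute value, then negation."""
    diff = d_tilde - g_tilde     # S721: difference matrix
    abs_diff = np.abs(diff)      # S722: absolute value matrix
    return -abs_diff             # S723: opposite number, i.e. X = -|d_tilde - g_tilde|

d_tilde = np.random.rand(4, 4)   # transformed input data matrix (placeholder values)
g_tilde = np.random.rand(4, 4)   # transformed weight matrix (placeholder values)
X = intermediate_matrix(d_tilde, g_tilde)
print((X <= 0).all())            # every element of X is a non-positive number
```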
It should be noted that, the neural network model may include a plurality of feature extraction layers, and each feature extraction layer may include one or more feature extraction kernels, i.e., a weight matrix. Accordingly, one or more feature extraction processes may be performed in each feature extraction layer, one or more of which may perform a corresponding operation using method 700. That is, performing method 700 on one feature extraction layer of the neural network model may include performing method 700 on one feature extraction core in the feature extraction layer.
And S730, performing output data transformation on the intermediate matrix through a winograd algorithm to obtain an output data matrix.
Specifically, the intermediate matrix is subjected to Winograd transformation through the output transformation matrix, and an output data matrix is obtained.
Illustratively, the output data matrix Y satisfies the following formula:
Y = A^T XA;
specifically, the output data matrix Y satisfies the following formula:
Y = A^T [-|[GgG^T] - [B^T dB]|] A;

wherein A represents the output transformation matrix, and A^T represents the transposed matrix of A.
The output transformation matrix may be, for example, an existing winograd output transformation matrix.
For example, if the input data matrix d before transformation is a 4×4 matrix and the weight matrix g before transformation is a 3×3 matrix, the output transformation matrix may be the commonly used winograd output transformation matrix, whose transpose is A^T = [[1, 1, 1, 0], [0, 1, -1, -1]].
wherein the output data matrix may be part or all of the output feature map.
Illustratively, step S730 may be performed by the acceleration module 630 of fig. 6.
According to the scheme of the embodiment of the application, the dot multiplication operation in the winograd algorithm is replaced by addition operations, such as the operation of calculating the L1 distance, so that the calculation amount of the feature extraction process is reduced, the running speed of the model is improved, and the operation cost is reduced.
The scheme of the embodiment of the application can be understood as a scheme combining the winograd algorithm with AdderNet. Alternatively, it can be understood that the adder layer of AdderNet is optimized by using the winograd algorithm. Specifically, from the perspective of AdderNet, the scheme reduces the number of addition operations in the adder layer of AdderNet and reduces the calculation amount of the model; from the perspective of the winograd algorithm, the scheme replaces the multiplication operation with an addition operation, which reduces the calculation amount of the model.
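Putting the three transformations together, a minimal end-to-end sketch of one feature extraction with the method 700 is given below; the matrices B^T, G and A^T are the commonly used winograd F(2×2, 3×3) transformation matrices and are assumptions made for illustration:

```python
import numpy as np

# Commonly used winograd F(2x2, 3x3) transformation matrices (assumed for illustration).
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

def winograd_adder_tile(d, g):
    """4x4 input data matrix d and 3x3 weight matrix g -> 2x2 output data matrix."""
    d_tilde = B_T @ d @ B_T.T          # input data transformation
    g_tilde = G @ g @ G.T              # weight transformation (can be computed offline)
    X = -np.abs(g_tilde - d_tilde)     # intermediate matrix: negative L1 distance per element
    return A_T @ X @ A_T.T             # output data transformation

d = np.random.rand(4, 4)
g = np.random.rand(3, 3)
print(winograd_adder_tile(d, g))       # 2x2 output data matrix
# The element-wise multiplication of the standard winograd algorithm is replaced here by
# subtraction, absolute value and negation, i.e. by addition-type operations.
```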
Optionally, the value of each element in the output transformation matrix is any one of the following: 0, -1, or 1.
Therefore, when the values of the elements in the output transformation matrix are 0, 1 or -1, multiplication operations can be reduced, the number of calculations is further reduced, and the calculation amount of the model is reduced.
Alternatively, in the case where the dimension of the weight matrix before transformation is 3×3 and the dimension of the output data matrix Y is 2×2, the output transformation matrix may be:
optionally, the elements of at least one row of the output transformation matrix are the opposite numbers of the elements of the first matrix at the positions corresponding to the at least one row, the elements of other rows of the output transformation matrix are the same as the elements of the first matrix at the positions corresponding to the other rows, and the first matrix may be:
wherein the values of c0, c1 and c2 are 0, -1 and 1, respectively, in any order.
That is, the sign of any row of A' may be flipped without affecting the final result of the winograd transformation.
Wherein the values of c0, c1 and c2 are 0, -1 and 1, respectively, in any order.
For example, c0 is 0, c1 is -1, and c2 is 1. For another example, c0 is -1, c1 is 1, and c2 is 0.
Alternatively, in the case where the dimension of the weight matrix before transformation is 3×3 and the dimension of the output data matrix Y is 2×2, the weight transformation matrix may be:
The output transformation matrix and the weight transformation matrix satisfy the general solution form of the winograd algorithm, which ensures that the convolution calculation result obtained by the winograd algorithm is the same as the conventional convolution calculation result. Thus, in the case of using the existing input transformation matrix, the output transformation matrix and the weight transformation matrix described above can be applied to convolution calculation.
It should be noted that the transformation matrices in the winograd algorithm may take various forms; as long as the general solution form of the winograd algorithm is satisfied, it can be ensured that the convolution calculation result obtained by the winograd algorithm is the same as the conventional convolution calculation result. The above output transformation matrix and weight transformation matrix are only examples for the case where the input transformation matrix of the existing winograd algorithm is used. In the case where the input transformation matrix is adjusted, the output transformation matrix and the weight transformation matrix may also take other forms.
However, the elements at the respective positions in the output data matrix may be unbalanced, resulting in a slow decrease of the loss function during training and low accuracy of the trained model. In addition, this situation also affects subsequent processing of the output data matrix, for example batch normalization (batch norm), and thus affects the performance of the model.
The following description takes as an example the case in which the intermediate matrix X is a 4×4 matrix and the output data matrix Y is a 2×2 matrix.
The intermediate matrix X may be expressed in the following form:

X = [[x0, x1, x2, x3], [x4, x5, x6, x7], [x8, x9, x10, x11], [x12, x13, x14, x15]];

wherein x0, x1, ..., x15 represent the elements in the intermediate matrix X.

The output data matrix Y may be expressed in the following form:

Y = [[y0, y1], [y2, y3]];

wherein y0, y1, y2 and y3 represent the elements in the output data matrix Y.
From Y = A^T XA, if the output transformation matrix of the existing winograd algorithm is used, namely the matrix A whose transpose is A^T = [[1, 1, 1, 0], [0, 1, -1, -1]], the elements in Y satisfy the following formulas:
y0 = x0 + x1 + x2 + x4 + x5 + x6 + x8 + x9 + x10;

y1 = x1 - x2 - x3 + x5 - x6 - x7 + x9 - x10 - x11;

y2 = x4 + x5 + x6 - x8 - x9 - x10 - x12 - x13 - x14;

y3 = x5 - x6 - x7 - x9 + x10 + x11 - x13 + x14 + x15.
As can be seen from the above formulas, the number of addition operations in the formula corresponding to each element in Y is different; correspondingly, the number of subtraction operations in the formula corresponding to each element in Y is also different. In other words, the numbers of positive and negative signs of the elements in the formulas corresponding to the respective elements in Y are different. For example, the formula corresponding to y0 includes 9 addition operations, while the formula corresponding to y1 includes 3 addition operations, so the numbers of additions in the formulas corresponding to y0 and y1 are different. The magnitudes of the elements in X are generally of the same order, and each element in X is always a non-positive number; because the numbers of addition and subtraction operations in the formulas corresponding to the elements in Y are different, the magnitudes of the elements in the output data matrix Y are inconsistent, that is, the elements at the respective positions in the output data matrix are unbalanced, which affects the performance of the model.
The numbers of additions and subtractions in the formulas corresponding to the individual elements in Y are determined by the signs of the elements in the output transformation matrix; that is, the signs of the elements in the output transformation matrix affect the distribution of the feature values in the output data matrix.
Therefore, the embodiment of the application also provides an output transformation matrix, which can improve the operation speed of the neural network model and avoid the performance degradation of the neural network model.
Optionally, the number of positive numbers in each column element in the output transformation matrix is the same, and the number of negative numbers in each column element is the same.
This ensures the equalization of the elements at each position in the output data matrix.
As previously described, the value of an element in the output transformation matrix may be any one of 0, 1 or -1; in this case, the number of +1 entries in each column of the output transformation matrix is the same, and the number of -1 entries in each column is the same.
For example, c0 is 0, c1 is -1, and c2 is 1. In this case, the output transformation matrix may be any one of the following:
wherein A0, A1, A2 and A3 respectively represent the 4 output transformation matrices. It should be understood that the 4 output transformation matrices are only examples; in the case of other combinations of the values 0, 1 and -1 for c0, c1 and c2, that is, in the case where the rows of the 4 output transformation matrices are exchanged with each other, output transformation matrices of other forms may be obtained, which are not listed here.
In the case of using the output transformation matrix A0, from Y = A^T XA, the elements in Y satisfy the following formulas:
y0 = x0 - x1 - x2 - x4 + x5 + x6 - x8 + x9 + x10;

y1 = -x1 + x2 - x3 + x5 - x6 + x7 + x9 - x10 + x11;

y2 = x4 + x5 + x6 + x8 - x9 - x10 - x12 + x13 + x14;

y3 = x5 - x6 + x7 - x9 + x10 - x11 + x13 - x14 + x15.
As can be seen from the above formulas, the number of addition operations in the formula corresponding to each element in Y is the same; correspondingly, the number of subtraction operations in the formula corresponding to each element in Y is the same. In other words, the number of positive signs of the elements of the intermediate matrix in the formula corresponding to each element in Y is the same. For example, the formula corresponding to y0 includes 5 positive signs and the formula corresponding to y1 also includes 5 positive signs, so the numbers of positive signs in the formulas corresponding to y0 and y1 are the same, which ensures the equalization of the elements at each position in the output data matrix. It should be understood that only A0 is used here as an example; A1, A2 and A3 can achieve the same effect.
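The equalization property can be checked numerically by counting, for each element of Y = A^T XA, how many elements of X enter its formula with a positive sign and how many with a negative sign. The matrix used in the sketch below is an assumed example of an output transformation matrix in which each column contains the same number of +1 entries and the same number of -1 entries; it is not necessarily the A0 of the embodiment:

```python
import numpy as np

# Assumed example of an output transformation matrix whose columns each contain the same
# number of +1 entries and the same number of -1 entries (element values in {0, 1, -1}).
A = np.array([[1, 0],
              [-1, -1],
              [-1, 1],
              [0, -1]], dtype=float)

def sign_counts(A):
    """For each element Y[k, l] of Y = A^T X A, count the +/- coefficients of the elements of X."""
    counts = {}
    for k in range(A.shape[1]):
        for l in range(A.shape[1]):
            coeff = np.outer(A[:, k], A[:, l])   # coefficient of X[i, j] in Y[k, l]
            counts[(k, l)] = (int((coeff > 0).sum()), int((coeff < 0).sum()))
    return counts

print(sign_counts(A))   # every output element has the same number of + terms and of - terms
```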
For example, c0 is 0, c1 is -1, and c2 is 1. In this case, the weight transformation matrix may be any one of the following:
wherein G0, G1, G2 and G3 respectively represent the 4 weight transformation matrices. It should be understood that the 4 weight transformation matrices are only examples; in the case of other combinations of the values 0, 1 and -1 for c0, c1 and c2, that is, in the case where the rows of the 4 weight transformation matrices are exchanged with each other, weight transformation matrices of other forms may be obtained, which are not listed here.
The weight transformation matrices G0, G1, G2 and G3 correspond one-to-one to the output transformation matrices A0, A1, A2 and A3, and a weight transformation matrix is used together with the output transformation matrix corresponding to it. For example, G0 is used together with A0. In the case of using the above 4 weight transformation matrices and output transformation matrices, the input transformation matrix may be the input transformation matrix of the existing winograd algorithm.
In the case where the input transformation matrix of the existing winograd algorithm is used, the output transformation matrix and the weight transformation matrix may take the above four forms. In the case where the input transformation matrix is adjusted, the output transformation matrix and the weight transformation matrix may also take other forms. That is, any transformation matrices can be used, as long as they satisfy the general solution form of the winograd algorithm and ensure the equalization of the elements at each position in the output data matrix.
According to the scheme of the embodiment of the application, the number of positive numbers in each column of elements in the output transformation matrix is the same and the number of negative numbers in each column of elements is the same; for example, the number of +1 entries in each column of the output transformation matrix is the same, and the number of -1 entries in each column is the same. In this way, the magnitudes at the respective positions of the output data matrix can be balanced, that is, the imbalance in the accumulation of the feature values is relieved, which is beneficial to the training of the model. In addition, it is beneficial to subsequent processing of the output data matrix, for example, batch normalization (batch norm).
In addition, the output transformation matrix and the weight transformation matrix satisfy the general solution form of the winograd algorithm, so that when a convolution operation is performed, the output result is consistent with the actual result of the convolution operation; that is, the method is applicable to the convolution operation in the model.
As described above, when the winograd algorithm is adopted to accelerate a convolution operation, the calculation result is not affected. However, an absolute value is taken in the method 700, so the distributive law of multiplication no longer applies. That is, if the adder layer is accelerated by the winograd algorithm in this way, there is a certain difference between the calculation result and the original calculation result. Combining the winograd algorithm with the adder layer in the manner described above may therefore reduce the performance of the neural network model.
The embodiment of the application also provides a training method of the neural network model, which can improve the performance of the neural network model.
The following describes the training method of the neural network model in the embodiment of the present application in detail with reference to fig. 8.
Fig. 8 illustrates a training method 800 of a neural network model provided by an embodiment of the present application. The method shown in fig. 8 may be performed by a training apparatus of the neural network model, and the apparatus may be a cloud service device, or may be a terminal device, for example, a computer, a server, or the like, which has sufficient computing power to perform training of the neural network model, or may be a system formed by the cloud service device and the terminal device. Illustratively, the method 800 may be performed by the training device 120 of fig. 2, the neural network processor 50 of fig. 4, or the execution device 310 of fig. 5.
The method of operation of training method 800 of fig. 8 in performing forward propagation of a neural network model is consistent with the method of fig. 7. The "data to be processed" in method 700 may be replaced with "training data". For a specific implementation of the forward propagation procedure in the method 800, reference may be made to the foregoing method 700, and in order to avoid unnecessary repetition, the repeated description is omitted below when describing the method 800.
The method 800 includes steps S810 to S850. The following describes step S810 to step S850 in detail.
Performing the following operations on at least one feature extraction layer in the neural network model:
s810, carrying out input data transformation on an input data matrix of training data through a winograd algorithm to obtain a transformed input data matrix.
The type of training data is related to the task of the neural network model. For example, the neural network model is used for image processing tasks, and the training data may be images. Specifically, the image processing tasks include image classification, image detection, image segmentation, image generation, or the like. For another example, the neural network model is used for text processing tasks, and the training data may be text. In particular, text processing tasks include text recognition or text translation, and the like. For another example, the neural network model is used for speech processing tasks, and the training data may be speech data. Specifically, the speech processing tasks include speech recognition and the like. The embodiment of the application does not limit the type of the training data.
For example, the training data may be pre-stored. For example, the training data may be training data maintained in database 130 shown in FIG. 2.
The input data matrix refers to a data matrix input to the at least one feature extraction layer.
S820, extracting features of the transformed input data matrix by using the transformed weight matrix to obtain an intermediate matrix. The transformed weight matrix is obtained by performing weight transformation on the weight matrix of the at least one feature extraction layer through the winograd algorithm, and each element in the intermediate matrix is determined according to the Lp distance between the transformed input data matrix and the element at the corresponding position in the transformed weight matrix.
The Lp distance, which may also be referred to as the Minkowski distance, is a family of distance definitions, where p is a parameter.
When p is 1, each element in the intermediate matrix is determined from the L1 distance between the transformed input data matrix and the element at the corresponding position in the transformed weight matrix.
Illustratively, the intermediate matrix X satisfies the following formula:
X = -|[GgG^T] - [B^T dB]|;
when p is 2, each element in the intermediate matrix is determined from the L2 distance between the transformed input data matrix and the element at the corresponding position in the transformed weight matrix.
Illustratively, the intermediate matrix X satisfies the following formula:
X = -([GgG^T] - [B^T dB])^2, where the square is applied element by element;
The L2 distance may also be referred to as the Euclidean distance.
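A minimal sketch of the intermediate matrix used during training, written for a general p so that p = 1 gives the L1 case and p = 2 gives the L2 case; the input values are random placeholders:

```python
import numpy as np

def intermediate_matrix(d_tilde, g_tilde, p=1.0):
    """Negative Lp distance per element: p = 1 gives -|.|, p = 2 gives -(.)^2."""
    return -np.abs(g_tilde - d_tilde) ** p

d_tilde, g_tilde = np.random.rand(4, 4), np.random.rand(4, 4)   # placeholder values
print(intermediate_matrix(d_tilde, g_tilde, p=2))               # L2 case used early in training
print(intermediate_matrix(d_tilde, g_tilde, p=1))               # L1 case used later in training
```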
And S830, performing output data transformation on the intermediate matrix through a winograd algorithm to obtain an output data matrix.
S840, determining the value of the loss function according to the output data matrix.
Specifically, the output data matrix is part or all of the output feature map, the processing result of the training data can be determined according to the output feature map, and the value of the loss function is calculated according to the processing result of the training data.
The processing result of the training data is related to the type of the training data and the task of the neural network model.
Illustratively, if the training data is image data, the image processing may include image super-resolution processing, image denoising processing, image recognition processing, and the like; accordingly, the image processing result includes a super-resolution image, a denoised image, an image category, or the like. The embodiment of the present application is not limited thereto.
Illustratively, the training data is speech data, the speech processing may include speech recognition or the like, and the speech processing results include speech recognition results or the like, accordingly. The embodiment of the present application is not limited thereto.
The output feature map may further undergo other processing, for example, processing by an activation function, and the like, to further obtain a processing result of the training data.
And S850, training the neural network model according to the value of the loss function.
In the process of the mth iteration of training the neural network model, p is 2; in the process of the nth iteration of training the neural network model, p is 1, wherein m and n are positive integers, and m is smaller than n.
In the m-th iteration process, an intermediate matrix is calculated by adopting an L2 distance in the forward calculation process, and the partial derivative of the loss function on the weight in the first weight matrix is calculated based on the L2 distance in the back propagation process. Specifically, forward computation is performed according to L2 distances between elements at corresponding positions in the transformed input data matrix and the transformed weight matrix, and backward propagation is performed according to L2 distances between elements at corresponding positions in the transformed input data matrix and the transformed weight matrix. Alternatively, it can be understood that the forward computation and the backward computation are performed based on the L2 distance.
In the n-th iteration process, an intermediate matrix is calculated by adopting an L1 distance in the forward calculation process, and the partial derivative of the loss function on the weight in the first weight matrix is calculated based on the L1 distance in the back propagation process. Specifically, forward computation is performed according to L1 distances between elements at corresponding positions in the transformed input data matrix and the transformed weight matrix, and backward propagation is performed according to L1 distances between elements at corresponding positions in the transformed input data matrix and the transformed weight matrix. Alternatively, it can be understood that the forward computation and the backward computation are performed based on the L1 distance.
The first weight matrix includes the weight matrix before transformation or the transformed weight matrix. That is, the first weight matrix may be the weight matrix g before transformation, or the transformed weight matrix. In the training process, the gradient of the weight matrix before transformation may be calculated, and then the values of the weight matrix before transformation are adjusted; alternatively, the gradient of the transformed weight matrix may be calculated, and the values of the transformed weight matrix are adjusted.
That is, in the early stage of the neural network model training, forward computation and backward propagation are performed based on the L2 distance, and in the later stage of the neural network model training, forward computation and backward propagation are performed using the L1 distance, that is, the L1 distance is approximated using the L2 distance.
If training is performed based on the L1 distance only, the winograd algorithm is difficult to optimize, and the network training may fail to converge. In the embodiment of the application, in the early stage of training, the L2 distance is used to assist the training; the L2 distance is more friendly to the winograd algorithm, which can improve the convergence speed of the training process and thus improve the training effect of the model. In the later stage of training, training is performed based on the L1 distance so as to further improve the training effect of the model that adopts the L1 distance, and the trained model adopts the L1 distance, which is more hardware-friendly.
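A minimal sketch of the training-time forward computation with the Lp distance and of moving p from 2 towards 1 over the iterations; the transformation matrices, the decay step of 0.01, the number of iterations and the placeholder data are assumptions made for illustration, and the weight update itself would in practice be done by back propagation in a training framework:

```python
import numpy as np

# Commonly used winograd F(2x2, 3x3) transformation matrices (assumed for illustration).
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

def forward_tile(d, g, p):
    """Training-time forward computation of one tile with the Lp distance (p in [1, 2])."""
    X = -np.abs(G @ g @ G.T - B_T @ d @ B_T.T) ** p   # intermediate matrix based on the Lp distance
    return A_T @ X @ A_T.T                            # output data transformation

p = 2.0                                   # early stage of training: L2 distance
for _ in range(100):                      # assumed number of iterations
    d = np.random.rand(4, 4)              # one training tile (placeholder data)
    g = np.random.rand(3, 3)              # current weight matrix (placeholder)
    y = forward_tile(d, g, p)             # forward computation based on the Lp distance
    # ... compute the loss from y and update g by back propagation here ...
    p = max(1.0, p - 0.01)                # later stage: p decays towards the L1 distance
```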
Optionally, the partial derivative of the loss function with respect to the weights in the first weight matrix satisfies the following formula:
wherein p is the norm used in the calculation, p ∈ [1,2]; w represents a weight in the first weight matrix, and w may be a weight in the weight matrix g before transformation or a weight in the transformed weight matrix; x represents data in the input data matrix before transformation or data in the transformed input data matrix, and the data in the input data matrix before transformation are feature values in the input feature map; L represents the loss function; i represents the layer index of the neural network model, and i is an integer. ∂L/∂x_i represents the partial derivative of the loss function with respect to the feature values of the i-th layer, ∂L/∂x_{i+1} represents the partial derivative of the loss function with respect to the feature values of the (i+1)-th layer, ∂L/∂w_i represents the partial derivative of the loss function with respect to the weights in the first weight matrix of the i-th layer, and sign(·) represents the sign function.
When p is 2, the above formula is the partial derivative of the loss function with respect to the first weight matrix obtained based on the L2 distance, that is, back propagation is performed based on the L2 distance. When p is 1, the above formula is the partial derivative of the loss function with respect to the first weight matrix obtained based on the L1 distance, that is, back propagation is performed based on the L1 distance.
Optionally, the value of p is determined according to the number of iterations of the training process.
Further, the initial value of p is 2, and the value of p decreases as the number of iterations increases.
That is, p is reduced from 2 to 1 during training.
The value of p may be reduced once per iteration or once every few iterations.
Illustratively, during the training process, the value of p is reduced by a at each iteration, where a may be set as required. a may be fixed, that is, the decrease in the value of p is the same at each iteration; for example, a may be 0.05, i.e., p decreases by 0.05 at each iteration. Alternatively, a may vary; for example, as the number of iterations increases, the decrease in the value of p gradually increases. For example, p is 2 in the 1st iteration; in the 2nd iteration, a is 0.01 and p is 1.99; in the 3rd iteration, a is 0.02 and p is 1.98.
Alternatively, during the training process, the value of p is reduced by a once every k iterations.
Alternatively, training may be performed using the L2 distance until convergence, and then the value of p is reduced and training is performed again.
It should be noted that the foregoing is merely an example, and p may be reduced from 2 to 1 in other manners, which is not limited in the embodiment of the present application.
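The schedules described above can be written as a small helper; the values of a, k and the bounds below are assumed, tunable values:

```python
def p_value(iteration, a=0.05, k=1, p_init=2.0, p_min=1.0):
    """Assumed schedule: reduce p by a once every k iterations, from p_init down to p_min."""
    return max(p_min, p_init - a * (iteration // k))

# e.g. p_value(0) == 2.0, p_value(10) == 1.5, p_value(40) == 1.0
```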
The trained neural network model may be used to perform the target task. Illustratively, the target task may be an image processing task, such as target detection, image segmentation, instance segmentation, image denoising, image super resolution, and the like. Alternatively, the target task may be a speech processing task, such as speech recognition, or the like. Alternatively, the target task may be a text processing task, such as text recognition or text translation, or the like.
Table 1 shows a comparison of the experimental results of the scheme of the embodiment of the present application and the existing scheme on the CIFAR-10 classification dataset.
TABLE 1
Method | Accuracy (%)
AdderNet | 91.84
The operation method of the application (using the existing transformation matrix) | 86.13
The operation method of the application (using the adjusted transformation matrix) | 88.60
The model obtained by the training method of the application | 91.47
As shown in Table 1, when the existing transformation matrix is used, the operation method of the application can improve the calculation efficiency, but the accuracy of the model is reduced. Compared with the existing transformation matrix, using the adjusted transformation matrix can improve the accuracy of the model. Moreover, on the basis of the adjusted transformation matrix, training the model with the training method of the application can further improve the accuracy of the model, making it close to the accuracy of AdderNet.
Table 2 shows a comparison of the experimental results of the scheme of the embodiment of the present application and the existing schemes on a low-level vision task.
TABLE 2
Method | PSNR
Convolutional neural network | 57.31
AdderNet | 57.22
The operation method of the application (using the adjusted transformation matrix) | 57.27
As shown in table 2, the operation method of the embodiment of the present application can reach a higher index than adernet. In addition, the operation method of the embodiment of the application can achieve the visual effect close to AdderNet.
An apparatus according to an embodiment of the present application will be described with reference to fig. 9 to 12. It should be understood that the apparatus described below is capable of performing the method of the foregoing embodiments of the present application, and in order to avoid unnecessary repetition, the repeated description is appropriately omitted when describing the apparatus of the embodiments of the present application.
Fig. 9 is a schematic block diagram of a training apparatus of a neural network model of an embodiment of the present application. The training apparatus 3000 of the neural network model shown in fig. 9 includes an acquisition unit 3010 and a processing unit 3020.
The acquisition unit 3010 and the processing unit 3020 may be used to perform the training method of the neural network model according to the embodiment of the present application, and in particular, may be used to perform the method 800.
The acquisition unit 3010 is used to acquire training data.
The processing unit 3020 is configured to perform the following operations on at least one feature extraction layer of the neural network model:
Performing input data transformation on an input data matrix of the training data through an input transformation matrix of a winograd algorithm to obtain a transformed input data matrix; performing feature extraction on the transformed input data matrix by using a transformed weight matrix to obtain an intermediate matrix, wherein the transformed weight matrix is obtained by performing weight transformation on the weight matrix of the at least one feature extraction layer by using a weight transformation matrix of the winograd algorithm, and each element in the intermediate matrix is determined according to the Lp distance between the transformed input data matrix and the element at the corresponding position in the transformed weight matrix; performing output data transformation on the intermediate matrix through an output transformation matrix of the winograd algorithm to obtain an output data matrix; determining a value of a loss function from the output data matrix; and training the neural network model according to the value of the loss function. In the process of the m-th iteration of training the neural network model, p is 2; in the process of the n-th iteration of training the neural network model, p is 1; m and n are positive integers, and m is smaller than n.
Optionally, as an embodiment, during the training of the neural network model, the initial value of p is 2, and the value of p decreases with the increase of the iteration number.
Optionally, as an embodiment, training the neural network model according to the value of the loss function includes: adjusting the weights in a first weight matrix according to the partial derivative of the loss function with respect to the weights in the first weight matrix, wherein the first weight matrix includes the weight matrix before transformation or the transformed weight matrix.
Optionally, as an embodiment, the partial derivative of the loss function with respect to the weights in the first weight matrix satisfies the following formula:
where p is the order of the norm, p ∈ [1,2], w represents the weight, x represents the data in the input data matrix before transformation or the data in the transformed input data matrix, L represents the loss function, i represents the layer index of the neural network model, and sign() represents the sign function.
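The partial-derivative formula itself is not reproduced in this text. Purely as a hedged illustration, the sketch below implements one element-wise gradient form that is consistent with the surrounding description (an order p ∈ [1,2], a sign() factor, data x and weight w), namely the derivative of -|x - w|^p with respect to w; it is an assumption for illustration, not the exact expression of the present application.

```python
# Illustrative assumption: gradient of the per-element term -|x - w|**p with respect to w.
import numpy as np

def lp_term_grad_wrt_w(x, w, p):
    """Element-wise d/dw of -|x - w|**p, i.e. p * |x - w|**(p - 1) * sign(x - w)."""
    diff = x - w
    return p * np.abs(diff) ** (p - 1) * np.sign(diff)

# For p = 2 this reduces to 2 * (x - w); for p = 1 it reduces to sign(x - w).
```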
Fig. 10 is a schematic block diagram of an operation device 4000 of the neural network model provided in the embodiment of the present application. The apparatus 4000 shown in fig. 10 includes an acquisition unit 4010 and a processing unit 4020.
The acquisition unit 4010 and the processing unit 4020 may be configured to execute the operation method of the neural network model according to the embodiment of the application, for example, may be configured to execute the method 700.
The acquisition unit 4010 is used for acquiring data to be processed including image data, voice data, or text data.
The processing unit 4020 is configured to perform the following operations on at least one feature extraction layer of the neural network model: performing input data transformation on an input data matrix of the data to be processed through an input transformation matrix of a winograd algorithm to obtain a transformed input data matrix; performing feature extraction on the transformed input data matrix by using a transformed weight matrix to obtain an intermediate matrix, wherein the transformed weight matrix is obtained by performing weight transformation on the weight matrix of the at least one feature extraction layer by using a weight transformation matrix of the winograd algorithm, and each element in the intermediate matrix is determined according to the L1 distance between the transformed input data matrix and the element at the corresponding position in the transformed weight matrix; and performing output data transformation on the intermediate matrix through an output transformation matrix of the winograd algorithm to obtain an output data matrix.
Alternatively, as one embodiment, the output data matrix satisfies the following formula:
Y = A^T [-|[GgG^T] - [B^T dB]|] A;
where Y represents the output data matrix, A represents the output transformation matrix, A^T represents the transpose of A, G represents the weight transformation matrix, G^T represents the transpose of G, g represents the weight matrix before transformation, B represents the input transformation matrix, B^T represents the transpose of B, and d represents the input data matrix before transformation.
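As an illustrative sketch only, the NumPy code below evaluates this formula for a single F(2×2, 3×3) tile. It uses the standard (existing) Winograd transformation matrices B, G and A rather than the adjusted matrices proposed in the present application, and the tile size, function name and per-tile scope are assumptions introduced for the sketch.

```python
# Sketch of Y = A^T [ -|GgG^T - B^T d B| ] A for one F(2x2, 3x3) Winograd tile,
# using the standard (prior) transformation matrices, not the adjusted ones.
import numpy as np

B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float32)   # transposed input transformation matrix
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)    # weight transformation matrix
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float32)    # transposed output transformation matrix

def winograd_l1_tile(d, g):
    """d: 4x4 input data tile, g: 3x3 weight matrix -> 2x2 output tile."""
    V = B_T @ d @ B_T.T        # transformed input data matrix  (B^T d B)
    U = G @ g @ G.T            # transformed weight matrix      (G g G^T)
    M = -np.abs(U - V)         # intermediate matrix: negative element-wise L1 distance
    return A_T @ M @ A_T.T     # output data transformation     (A^T M A)

d = np.random.randn(4, 4).astype(np.float32)
g = np.random.randn(3, 3).astype(np.float32)
print(winograd_l1_tile(d, g))  # 2x2 output data matrix
```

In a full feature extraction layer, the same per-tile computation is applied over all tiles of the input feature map and accumulated over input channels; that bookkeeping is omitted here.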
Alternatively, as an embodiment, the value of an element in the output transform matrix is any one of 0, -1 or 1.
Optionally, as an embodiment, the output transform matrix is:
where c_0, c_1 and c_2 each take a value of 0, -1 or 1.
Optionally, as an embodiment, the elements of at least one row of the output transformation matrix are the negatives of the elements at the corresponding positions of a first matrix, the elements of the other rows of the output transformation matrix are the same as the elements at the corresponding positions of the first matrix, and the first matrix is:
where A' represents the first matrix, and c_0, c_1 and c_2 each take a value of 0, -1 or 1.
Optionally, as an embodiment, the weight transformation matrix is:
where c_0, c_1 and c_2 each take a value of 0, -1 or 1.
Optionally, as an embodiment, the number of positive elements is the same in every column of the output transformation matrix, and the number of negative elements is the same in every column.
The training device 3000 and the device 4000 are embodied as functional units. The term "unit" herein may be implemented in software and/or hardware, without specific limitation.
For example, a "unit" may be a software program, a hardware circuit or a combination of both that implements the functions described above. The hardware circuitry may include application specific integrated circuits (application specific integrated circuit, ASICs), electronic circuits, processors (e.g., shared, proprietary, or group processors, etc.) and memory for executing one or more software or firmware programs, merged logic circuits, and/or other suitable components that support the described functions.
Thus, the elements of the examples described in the embodiments of the present application can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Fig. 11 is a schematic hardware structure of a training device for a neural network model according to an embodiment of the present application. The training apparatus 5000 of the neural network model shown in fig. 11 (the apparatus 5000 may be a computer device in particular) includes a memory 5001, a processor 5002, a communication interface 5003, and a bus 5004. The memory 5001, the processor 5002, and the communication interface 5003 are communicatively connected to each other via a bus 5004.
The memory 5001 may be a read-only memory (read only memory, ROM), a static storage device, a dynamic storage device, or a random access memory (random access memory, RAM). The memory 5001 may store a program; when the program stored in the memory 5001 is executed by the processor 5002, the processor 5002 performs the steps of the training method of the neural network model of the embodiment of the present application. In particular, the processor 5002 can perform the method 800 illustrated in FIG. 8 above.
The processor 5002 may employ a general-purpose central processing unit (central processing unit, CPU), microprocessor, application specific integrated circuit (application specific integrated circuit, ASIC), graphics processor (graphics processing unit, GPU) or one or more integrated circuits for executing associated programs to implement the neural network model training method of the present method embodiments.
The processor 5002 may also be an integrated circuit chip having signal processing capabilities, for example, the chip illustrated in fig. 4. In implementation, the various steps of the neural network model training method of the present application may be performed by instructions in the form of integrated logic circuits or software in hardware in the processor 5002.
The processor 5002 may also be a general-purpose processor, a digital signal processor (digital signal processing, DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (field programmable gate array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logical blocks disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed with reference to the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 5001; the processor 5002 reads the information in the memory 5001 and, in combination with its hardware, performs the functions required by the units included in the training device shown in fig. 9, or performs the training method of the neural network model shown in fig. 8 of the method embodiment of the present application.
The communication interface 5003 enables communication between the apparatus 5000 and other devices or communication networks using transceiving means such as, but not limited to, a transceiver. For example, training data may be obtained through the communication interface 5003.
Bus 5004 may include a path for transferring information between various components of device 5000 (e.g., memory 5001, processor 5002, communications interface 5003).
Fig. 12 is a schematic hardware configuration of an arithmetic device of a neural network model according to an embodiment of the present application. The data processing apparatus 6000 shown in fig. 12 includes a memory 6001, a processor 6002, a communication interface 6003 and a bus 6004. The memory 6001, the processor 6002, and the communication interface 6003 are connected to each other by a bus 6004.
The memory 6001 may be a ROM, a static storage device, and a RAM. The memory 6001 may store a program, and when the program stored in the memory 6001 is executed by the processor 6002, the processor 6002 and the communication interface 6003 are configured to execute respective steps of an operation method of a neural network model of an embodiment of the present application. Specifically, the processor 6002 may perform steps S710 to S730 in the method shown in fig. 7 above.
The processor 6002 may be a general-purpose CPU, microprocessor, ASIC, GPU, or one or more integrated circuits for executing related programs to implement the functions required to be performed by the units in the computing device of the neural network model of the method embodiment of the present application, or to perform the method of computing the neural network model of the method embodiment of the present application.
The processor 6002 may also be an integrated circuit chip with signal processing capabilities, for example, the chip shown in fig. 4. In the implementation process, each step of the operation method of the neural network model according to the embodiment of the present application may be completed by an integrated logic circuit of hardware or an instruction in a software form in the processor 6002.
The processor 6002 may also be a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logical blocks disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed with reference to the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 6001; the processor 6002 reads the information in the memory 6001 and, in combination with its hardware, performs the functions required by the units included in the operation device of the neural network model of the embodiment of the present application, or performs the operation method of the neural network model of the method embodiment of the present application.
The communication interface 6003 enables communication between the apparatus 6000 and other devices or communication networks using transceiving means such as, but not limited to, a transceiver. For example, data to be processed can be acquired through the communication interface 6003.
Bus 6004 may include a path to transfer information between components of device 6000 (e.g., memory 6001, processor 6002, communication interface 6003).
It should be noted that although the above-described apparatus 5000 and apparatus 6000 only show memory, processors, communication interfaces, in a particular implementation, those skilled in the art will appreciate that the apparatus 5000 and apparatus 6000 may also include other devices necessary to achieve proper operation. Also, as will be appreciated by those skilled in the art, the apparatus 5000 and the apparatus 6000 may also include hardware devices that perform other additional functions, as desired. Furthermore, it will be appreciated by those skilled in the art that the apparatus 5000 and the apparatus 6000 may also include only the devices necessary to implement the embodiments of the present application, and not all of the devices shown in fig. 11 and 12.
It is to be appreciated that the processor in the embodiments of the present application may be a central processing unit (central processing unit, CPU), or may be another general-purpose processor, a digital signal processor (digital signal processor, DSP), an application-specific integrated circuit (application specific integrated circuit, ASIC), a field programmable gate array (field programmable gate array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
It should also be appreciated that the memory in embodiments of the present application may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. The volatile memory may be random access memory (random access memory, RAM) which acts as an external cache. By way of example but not limitation, many forms of random access memory (random access memory, RAM) are available, such as Static RAM (SRAM), dynamic Random Access Memory (DRAM), synchronous Dynamic Random Access Memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced Synchronous Dynamic Random Access Memory (ESDRAM), synchronous Link DRAM (SLDRAM), and direct memory bus RAM (DR RAM).
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used for implementation, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or the computer program are loaded or executed on a computer, the procedures or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner or in a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that contains one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state disk.
It should be understood that the term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean: there are three cases, a alone, a and B together, and B alone, wherein a, B may be singular or plural. In addition, the character "/" herein generally indicates that the associated object is an "or" relationship, but may also indicate an "and/or" relationship, and may be understood by referring to the context.
In the present application, "at least one" means one or more, and "a plurality" means two or more. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (27)

  1. A method of operating a neural network model, characterized in that the following operations are performed in at least one feature extraction layer of the neural network model:
    performing input data transformation on an input data matrix of data to be processed through an input transformation matrix of a winograd algorithm to obtain a transformed input data matrix, wherein the data to be processed comprises image data, voice data or text data;
    performing feature extraction on the transformed input data matrix by using a transformed weight matrix to obtain an intermediate matrix, wherein the transformed weight matrix is obtained by performing weight transformation on the weight matrix of the at least one feature extraction layer by using a weight transformation matrix of the winograd algorithm, and each element in the intermediate matrix is determined according to an L1 distance between the transformed input data matrix and an element at a corresponding position in the transformed weight matrix; and
    performing output data transformation on the intermediate matrix through the output transformation matrix of the winograd algorithm to obtain an output data matrix.
  2. The method of claim 1, wherein the output data matrix satisfies the following formula:
    Y = A^T [-|[GgG^T] - [B^T dB]|] A;
    wherein Y represents the output data matrix, A represents the output transformation matrix, A^T represents the transpose of A, G represents the weight transformation matrix, G^T represents the transpose of G, g represents the weight matrix before the transformation, B represents the input transformation matrix, B^T represents the transpose of B, and d represents the input data matrix before the transformation.
  3. The operation method according to claim 1 or 2, wherein the value of an element in the output transform matrix is any one of 0, -1 or 1.
  4. The method of claim 3, wherein the output transform matrix is:
    wherein c_0, c_1 and c_2 each take a value of 0, -1 or 1.
  5. The operation method according to claim 3, wherein elements of at least one row of the output transformation matrix are the negatives of the elements at the positions corresponding to the at least one row in a first matrix, elements of the other rows of the output transformation matrix are the same as the elements at the positions corresponding to the other rows in the first matrix, and the first matrix is:
    wherein A' represents the first matrix, and c_0, c_1 and c_2 each take a value of 0, -1 or 1.
  6. The operation method according to any one of claims 2 to 5, wherein the weight transformation matrix is:
    wherein c_0, c_1 and c_2 each take a value of 0, -1 or 1.
  7. The operation method according to any one of claims 1 to 6, wherein the number of positive numbers in each column element in the output transform matrix is the same, and the number of negative numbers in each column element is the same.
  8. A method for training a neural network model, comprising:
    performing input data transformation on an input data matrix of training data through an input transformation matrix of a winograd algorithm to obtain a transformed input data matrix;
    performing feature extraction on the transformed input data matrix by using a transformed weight matrix to obtain an intermediate matrix, wherein the transformed weight matrix is obtained by performing weight transformation on the weight matrix of the at least one feature extraction layer by using a weight transformation matrix of the winograd algorithm, and each element in the intermediate matrix is determined according to an Lp distance between the transformed input data matrix and an element at a corresponding position in the transformed weight matrix;
    Performing output data transformation on the intermediate matrix through an output transformation matrix of the winograd algorithm to obtain an output data matrix;
    determining a value of a loss function from the output data matrix;
    training the neural network model according to the value of the loss function;
    in the process of the mth iteration of training the neural network model, p is 2, in the process of the nth iteration of training the neural network model, p is 1, m and n are positive integers, and m is smaller than n.
  9. The training method of claim 8, wherein during training of the neural network model, an initial value of p is 2 and the value of p decreases with increasing number of iterations.
  10. Training method according to claim 8 or 9, characterized in that the training of the neural network model according to the value of the loss function comprises:
    and adjusting the weights in a first weight matrix according to the partial derivative of the loss function on the weights in the first weight matrix, wherein the first weight matrix comprises a weight matrix before transformation or a weight matrix after transformation.
  11. Training method according to claim 10, characterized in that the partial derivative of the loss function with respect to the weights in the first weight matrix satisfies the following formula:
    wherein p ∈ [1,2], w represents the weight, x represents data in the input data matrix before transformation or data in the transformed input data matrix, L represents the loss function, i represents the layer number of the neural network model, and sign() represents a sign function.
  12. An arithmetic device of a neural network model, comprising:
    the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring data to be processed, and the data to be processed comprises image data, voice data or text data;
    a processing unit, configured to perform the following operations on at least one feature extraction layer of the neural network model:
    performing input data transformation on an input data matrix of the data to be processed through an input transformation matrix of a winograd algorithm to obtain a transformed input data matrix;
    performing feature extraction on the transformed input data matrix by using a transformed weight matrix to obtain an intermediate matrix, wherein the transformed weight matrix is obtained by performing weight transformation on the weight matrix of the at least one feature extraction layer by using a weight transformation matrix of the winograd algorithm, and each element in the intermediate matrix is determined according to an L1 distance between the transformed input data matrix and an element at a corresponding position in the transformed weight matrix; and
    performing output data transformation on the intermediate matrix through the output transformation matrix of the winograd algorithm to obtain an output data matrix.
  13. The computing device of claim 12, wherein the output data matrix satisfies the following formula:
    Y = A^T [-|[GgG^T] - [B^T dB]|] A;
    wherein Y represents the output data matrix, A represents the output transformation matrix, A^T represents the transpose of A, G represents the weight transformation matrix, G^T represents the transpose of G, g represents the weight matrix before the transformation, B represents the input transformation matrix, B^T represents the transpose of B, and d represents the input data matrix before the transformation.
  14. The arithmetic device according to claim 12 or 13, characterized in that the value of an element in the output transform matrix is any one of 0, -1 or 1.
  15. The computing device of claim 14, wherein the output transform matrix is:
    wherein c_0, c_1 and c_2 each take a value of 0, -1 or 1.
  16. The computing device of claim 15, wherein elements of at least one row of the output transformation matrix are the negatives of the elements at the positions corresponding to the at least one row in a first matrix, elements of the other rows of the output transformation matrix are the same as the elements at the positions corresponding to the other rows in the first matrix, the first matrix being:
    wherein A' represents the first matrix, and c_0, c_1 and c_2 each take a value of 0, -1 or 1.
  17. The computing device according to any one of claims 13 to 16, wherein the weight transformation matrix is:
    wherein c_0, c_1 and c_2 each take a value of 0, -1 or 1.
  18. The arithmetic device according to any one of claims 12 to 17, characterized in that the number of positive numbers in each column element in the output transform matrix is the same, and the number of negative numbers in each column element is the same.
  19. A training device for a neural network model, comprising:
    an acquisition unit configured to acquire training data;
    a processing unit for:
    performing input data transformation on an input data matrix of training data through an input transformation matrix of a winograd algorithm to obtain a transformed input data matrix;
    performing feature extraction on the transformed input data matrix by using a transformed weight matrix to obtain an intermediate matrix, wherein the transformed weight matrix is obtained by performing weight transformation on the weight matrix of the at least one feature extraction layer by using a weight transformation matrix of the winograd algorithm, and each element in the intermediate matrix is determined according to an Lp distance between the transformed input data matrix and an element at a corresponding position in the transformed weight matrix;
    Performing output data transformation on the intermediate matrix through an output transformation matrix of the winograd algorithm to obtain an output data matrix;
    determining a value of a loss function from the output data matrix;
    training the neural network model according to the value of the loss function;
    in the process of the mth iteration of training the neural network model, p is 2, in the process of the nth iteration of training the neural network model, p is 1, m and n are positive integers, and m is smaller than n.
  20. The training device of claim 19, wherein during training of the neural network model, an initial value of p is 2 and the value of p decreases with increasing number of iterations.
  21. Training device according to claim 19 or 20, characterized in that the training of the neural network model according to the value of the loss function comprises: and adjusting the weights in a first weight matrix according to the partial derivative of the loss function on the weights in the first weight matrix, wherein the first weight matrix comprises a weight matrix before transformation or a weight matrix after transformation.
  22. Training device according to claim 21, characterized in that the partial derivative of the loss function with respect to the weights in the first weight matrix satisfies the following formula:
    wherein p ∈ [1,2], w represents the weight, x represents the data in the input data matrix before transformation or the data in the transformed input data matrix, L represents the loss function, i represents the layer number of the neural network model, and sign() represents a sign function.
  23. An arithmetic device of a neural network model, comprising a processor and a memory, the memory for storing program instructions, the processor for invoking the program instructions to perform the method of any of claims 1-7.
  24. A training device for a neural network model, comprising a processor and a memory, the memory for storing program instructions, the processor for invoking the program instructions to perform the method of any of claims 8-11.
  25. A computer readable storage medium for storing program code for execution by a device, the program code comprising instructions for performing the method of any one of claims 1 to 7 or 8 to 11.
  26. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any of claims 1 to 7 or 8 to 11.
  27. A chip comprising a processor and a data interface, the processor reading instructions stored on a memory via the data interface to perform the method of any one of claims 1 to 7 or 8 to 11.
CN202180094093.7A 2021-04-30 2021-04-30 Operation method, training method and device of neural network model Pending CN116888605A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/091574 WO2022227024A1 (en) 2021-04-30 2021-04-30 Operational method and apparatus for neural network model and training method and apparatus for neural network model

Publications (1)

Publication Number Publication Date
CN116888605A true CN116888605A (en) 2023-10-13

Family

ID=83847731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180094093.7A Pending CN116888605A (en) 2021-04-30 2021-04-30 Operation method, training method and device of neural network model

Country Status (2)

Country Link
CN (1) CN116888605A (en)
WO (1) WO2022227024A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117788843B (en) * 2024-02-27 2024-04-30 青岛超瑞纳米新材料科技有限公司 Carbon nanotube image processing method based on neural network algorithm

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018107383A1 (en) * 2016-12-14 2018-06-21 上海寒武纪信息科技有限公司 Neural network convolution computation method and device, and computer-readable storage medium
CN109190756B (en) * 2018-09-10 2022-02-18 中国科学院计算技术研究所 Arithmetic device based on Winograd convolution and neural network processor comprising same
CN111382854B (en) * 2018-12-28 2021-03-23 广州市百果园信息技术有限公司 Convolutional neural network processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2022227024A1 (en) 2022-11-03


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination