CN113627163A - Attention model, feature extraction method and related device

Attention model, feature extraction method and related device

Info

Publication number
CN113627163A
Authority
CN
China
Prior art keywords
attention
self
neural network
layer
parallel
Prior art date
Legal status
Pending
Application number
CN202110731775.6A
Other languages
Chinese (zh)
Inventor
唐业辉
韩凯
王云鹤
肖安
许春景
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202110731775.6A
Publication of CN113627163A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses an attention model and a feature extraction method, which are applied to the technical field of artificial intelligence. The attention model includes one or more serially connected self-attention networks, each comprising a self-attention module, a multi-layer perceptron, and a first neural network layer. The self-attention module comprises a plurality of parallel feature extraction layers and a fusion layer, and the fusion layer is connected to each of the parallel feature extraction layers. The multi-layer perceptron is connected in series with the self-attention module and comprises a plurality of serially connected first fully connected layers. The first neural network layer is connected in parallel with one or both of the self-attention module and the multi-layer perceptron, and is configured to perform a feature transformation. Based on this scheme, the diversity of the features extracted by the attention model can be increased and the expression capability of the features enhanced, thereby improving the performance of the attention model.

Description

Attention model, feature extraction method and related device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to an attention model, a feature extraction method and a related device.
Background
Artificial intelligence (AI) comprises theories, methods, techniques, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
In recent years, self-attention networks have been widely applied to many natural language processing (NLP) tasks, such as machine translation, sentiment analysis, and question answering. With this wide application, self-attention networks originating from the natural language processing field have also achieved high performance in tasks such as image classification, object detection, and image processing.
In a self-attention network, because of how the self-attention layers process features, the features of the input data easily become indistinguishable as the network deepens, and such indistinguishable features have weak representation capability. This phenomenon, in which features become indistinguishable as the network deepens, is commonly referred to as feature collapse.
At present, the feature collapse phenomenon can be alleviated by adding shortcuts to the self-attention network, which prevents the features from becoming completely indistinguishable. However, a shortcut added to the self-attention network simply copies the input features of a self-attention layer to its output; it cannot enhance the expressive power of the features, which results in limited performance of the self-attention network.
Disclosure of Invention
The application provides an attention model and a feature extraction method, which can increase the diversity of features extracted by the attention model and enhance the expression capability of the features, thereby improving the performance of the attention model.
A first aspect of the application provides an attention model comprising one or more serially connected self-attention networks, each of which comprises a self-attention module, a multi-layer perceptron, and a first neural network layer.
The self-attention module comprises a plurality of parallel feature extraction layers and a fusion layer, and the fusion layer is connected with the plurality of parallel feature extraction layers respectively. Wherein the self-attention module is a network employing a self-attention mechanism capable of correlating different positions of an input sequence to compute a representation of the same sequence.
The multi-layer perceptron is connected in series with the self-attention module and comprises a plurality of serially connected first fully connected layers. Specifically, the multi-layer perceptron may also be referred to as a fully connected network (FCN); it includes an input layer, one or more hidden layers, and an output layer, and every network layer in the multi-layer perceptron is a fully connected layer. That is, the input layer is fully connected to the hidden layer, and the hidden layer is fully connected to the output layer.
The first neural network layer is connected in parallel with one or both of the self-attention module and the multi-layer perceptron, and is configured to perform a feature transformation.
In this scheme, an additional parallel neural network layer is introduced alongside the self-attention module and the multi-layer perceptron, and this parallel layer performs a feature transformation on the input features to obtain transformed features. The transformed features are added to the output features of the self-attention module and/or the multi-layer perceptron, which increases the diversity of the features output by the intermediate layers of the self-attention network, enhances the expression capability of the features, and thereby improves the performance of the attention model.
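For illustration only, the following is a minimal PyTorch sketch of one such self-attention network block. The class name AugmentedBlock, the dimensions, and the choice of a linear-plus-ReLU branch for the first neural network layer are all invented here for the sake of a runnable example, not taken from the patent.

```python
import torch
import torch.nn as nn

class AugmentedBlock(nn.Module):
    """Hypothetical sketch of one self-attention network: a self-attention
    module and a multi-layer perceptron in series, with an extra parallel
    feature-transform branch whose output is added to the module outputs."""
    def __init__(self, dim, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(                      # serially connected fully connected layers
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )
        # the parallel "first neural network layer": a feature transformation
        self.feature_transform = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, x):                              # x: (batch, seq_len, dim)
        attn_out, _ = self.attn(x, x, x)
        x = attn_out + self.feature_transform(x) + x   # transformed features added to the
        x = self.mlp(x) + x                            # module output, plus a shortcut
        return x

out = AugmentedBlock(dim=64)(torch.randn(2, 10, 64))   # out.shape == (2, 10, 64)
```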
In one possible implementation, the self-attention network further includes a second neural network layer configured to perform a feature transformation. The first neural network layer is connected in parallel with the self-attention module and the second neural network layer is connected in parallel with the multi-layer perceptron; alternatively, the second neural network layer is connected in parallel with the self-attention module and the first neural network layer is connected in parallel with the multi-layer perceptron.
In this scheme, a first neural network layer and a second neural network layer are introduced in parallel with the self-attention module and the multi-layer perceptron respectively, and the two parallel neural network layers perform feature transformations on their input features to obtain transformed features. The transformed features are added to the output features of the self-attention module and of the multi-layer perceptron respectively, which increases the diversity of the features output by the intermediate layers of the self-attention network, enhances the expression capability of the features, and further improves the performance of the attention model.
In a possible implementation manner, the first neural network layer comprises a weight matrix and an activation function; the weight matrix multiplies the input features of the first neural network layer, and the activation function processes the product of the input features and the weight matrix. The activation function may be, for example, a nonlinear function such as the sigmoid, tanh, or ReLU function.
In one possible implementation, the weight matrix includes a plurality of sub-matrices, and each sub-matrix is a circulant matrix. In short, the weight matrix may be composed of a plurality of sub-matrices, and each sub-matrix constituting the weight matrix is a circulant matrix.
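To make the circulant structure concrete, here is a small NumPy sketch; the helper names and sizes are hypothetical. Each circulant sub-matrix is fully determined by its first row (every other row is a cyclic shift of it), so a matrix built from r x c blocks of size b stores only r*c*b parameters instead of r*c*b*b.

```python
import numpy as np

def circulant(first_row):
    """Circulant matrix: row i is the first row cyclically shifted right by i."""
    n = len(first_row)
    return np.stack([np.roll(first_row, i) for i in range(n)])

def block_circulant(block_rows, block_cols, block_size, rng):
    """Hypothetical sketch: a weight matrix composed of circulant sub-matrices."""
    blocks = [[circulant(rng.standard_normal(block_size))
               for _ in range(block_cols)]
              for _ in range(block_rows)]
    return np.block(blocks)

rng = np.random.default_rng(0)
W = block_circulant(2, 2, 4, rng)   # 8 x 8 weight matrix from four 4 x 4 circulant blocks
x = rng.standard_normal(8)
y = np.maximum(W @ x, 0.0)          # multiply by the weight matrix, then a ReLU activation
print(W.shape, y.shape)             # (8, 8) (8,)
```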
In one possible implementation, the plurality of parallel feature extraction layers respectively include different weight matrices.
In one possible implementation, the self-attention module is a multi-head self-attention module, which includes a plurality of parallel self-attention units and a second fully connected layer. The second fully connected layer is connected to each of the plurality of parallel self-attention units, and each of the parallel self-attention units includes a plurality of parallel feature extraction layers and a fusion layer.
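A compact sketch of this multi-head structure, assuming the widely used scaled dot-product formulation for the fusion step; the patent text above fixes only the overall wiring, and the class and parameter names below are invented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionUnit(nn.Module):
    """One self-attention unit: parallel Q/K/V feature extraction layers
    plus a fusion step (here assumed to be scaled dot-product attention)."""
    def __init__(self, dim, head_dim):
        super().__init__()
        self.q = nn.Linear(dim, head_dim)   # parallel feature extraction layers,
        self.k = nn.Linear(dim, head_dim)   # each with its own weight matrix
        self.v = nn.Linear(dim, head_dim)

    def forward(self, x):                   # x: (batch, seq_len, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ v   # fusion layer

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        head_dim = dim // num_heads
        self.units = nn.ModuleList([SelfAttentionUnit(dim, head_dim)
                                    for _ in range(num_heads)])
        self.fc = nn.Linear(dim, dim)       # the "second fully connected layer"

    def forward(self, x):
        return self.fc(torch.cat([u(x) for u in self.units], dim=-1))

y = MultiHeadSelfAttention(dim=64, num_heads=4)(torch.randn(2, 10, 64))
```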
In one possible implementation, the self-attention module and/or the multi-layer perceptron is additionally provided with a parallel shortcut, whose output features are identical to its input features.
Compared with the first neural network layer, a shortcut does not change the input features, i.e., it performs no feature transformation on them; a shortcut can thus be regarded as a special kind of feature processing. Introducing parallel shortcuts on top of the first neural network layer can further enhance the diversity of the features obtained by the self-attention network and increase the expression capability of the features.
In one possible implementation, the model comprises a computer vision model or a natural language processing model.
A second aspect of the present application provides a feature extraction method, including: acquiring data to be processed; and inputting the data to be processed into one or more serially connected self-attention networks to obtain the features of the data to be processed. The self-attention network comprises a self-attention module, a multi-layer perceptron, and a first neural network layer. The self-attention module comprises a plurality of parallel feature extraction layers and a fusion layer, the fusion layer being connected to each of the parallel feature extraction layers. The multi-layer perceptron is connected in series with the self-attention module and comprises a plurality of serially connected first fully connected layers. The first neural network layer is connected in parallel with one or both of the self-attention module and the multi-layer perceptron and is configured to perform a feature transformation.
In one possible implementation, the self-attention network further includes a second neural network layer configured to perform a feature transformation. The first neural network layer is connected in parallel with the self-attention module and the second neural network layer is connected in parallel with the multi-layer perceptron; alternatively, the second neural network layer is connected in parallel with the self-attention module and the first neural network layer is connected in parallel with the multi-layer perceptron.
In a possible implementation manner, the first neural network layer comprises a weight matrix and an activation function; the weight matrix multiplies the input features of the first neural network layer, and the activation function processes the product of the input features and the weight matrix.
In one possible implementation, the weight matrix includes a plurality of sub-matrices, and each sub-matrix is a circulant matrix.
In one possible implementation, the plurality of parallel feature extraction layers respectively include different weight matrices.
In one possible implementation, the self-attention module is a multi-head self-attention module, which includes a plurality of parallel self-attention units and a second fully connected layer. The second fully connected layer is connected to each of the plurality of parallel self-attention units, and each of the parallel self-attention units includes a plurality of parallel feature extraction layers and a fusion layer.
In one possible implementation, the self-attention module and/or the multi-layer perceptron is additionally provided with a parallel shortcut, whose output features are identical to its input features.
In one possible implementation, the method is applied to a computer vision task or a natural language processing task.
A third aspect of the present application provides an image processing method, including: acquiring an image to be processed; inputting the image to be processed into an image processing model to extract image features through an attention model in the image processing model, wherein the attention model is the attention model described in the first aspect or any implementation manner of the first aspect; and processing the image to be processed according to the image features.
In a possible implementation manner, the processing of the image to be processed according to the image features includes performing one or more of the following tasks on the image to be processed according to the image features: image recognition, object detection, semantic segmentation, and image generation.
A fourth aspect of the present application provides a natural language processing method, including: acquiring a text to be processed; inputting the text to be processed into a natural language processing model to extract text features through an attention model in the natural language processing model, wherein the attention model is the attention model described in the first aspect or any implementation manner of the first aspect; and processing the text to be processed according to the text features.
In a possible implementation manner, the processing of the text to be processed according to the text features includes performing one or more of the following tasks on the text to be processed according to the text features: machine translation, public opinion monitoring, automatic summary generation, viewpoint extraction, text classification, question answering, and text semantic comparison.
A fifth aspect of the present application provides a feature extraction apparatus, including an acquisition unit and a processing unit. The acquisition unit is used for acquiring data to be processed. The processing unit is used for inputting the data to be processed into one or more serially connected self-attention networks to obtain the features of the data to be processed. The self-attention network comprises a self-attention module, a multi-layer perceptron, and a first neural network layer. The self-attention module comprises a plurality of parallel feature extraction layers and a fusion layer, the fusion layer being connected to each of the parallel feature extraction layers. The multi-layer perceptron is connected in series with the self-attention module and comprises a plurality of serially connected first fully connected layers. The first neural network layer is connected in parallel with one or both of the self-attention module and the multi-layer perceptron and is configured to perform a feature transformation.
In one possible implementation, the self-attention network further includes a second neural network layer configured to perform a feature transformation. The first neural network layer is connected in parallel with the self-attention module and the second neural network layer is connected in parallel with the multi-layer perceptron; alternatively, the second neural network layer is connected in parallel with the self-attention module and the first neural network layer is connected in parallel with the multi-layer perceptron.
In a possible implementation manner, the first neural network layer comprises a weight matrix and an activation function; the weight matrix multiplies the input features of the first neural network layer, and the activation function processes the product of the input features and the weight matrix.
In one possible implementation, the weight matrix includes a plurality of sub-matrices, and each sub-matrix is a circulant matrix.
In one possible implementation, the self-attention module is a multi-head self-attention module, which includes a plurality of parallel self-attention units and a second fully connected layer. The second fully connected layer is connected to each of the plurality of parallel self-attention units, and each of the parallel self-attention units includes a plurality of parallel feature extraction layers and a fusion layer.
In one possible implementation, the self-attention module and/or the multi-layer perceptron is additionally provided with a parallel shortcut, whose output features are identical to its input features.
A sixth aspect of the present application provides an image processing apparatus comprising: an acquisition unit and a processing unit; the acquisition unit is used for acquiring an image to be processed; the processing unit is configured to input the image to be processed into an image processing model, so as to extract image features through an attention model in the image processing model, where the attention model is the attention model described in the first aspect or any implementation manner of the first aspect; the processing unit is further configured to process the image to be processed according to the image features.
In a possible implementation manner, the processing unit is further configured to perform one or more of the following tasks on the image to be processed according to the image features: image recognition, object detection, semantic segmentation, and image generation.
A seventh aspect of the present application provides a natural language processing apparatus comprising: an acquisition unit and a processing unit; the acquisition unit is used for acquiring a text to be processed; the processing unit is configured to input the text to be processed into a natural language processing model, so as to extract text features through an attention model in the natural language processing model, where the attention model is the attention model described in the first aspect or any implementation manner of the first aspect; the processing unit is further configured to process the text to be processed according to the text features.
In a possible implementation manner, the processing unit is further configured to perform one or more of the following tasks on the text to be processed according to the text feature: machine translation, public opinion monitoring, automatic summary generation, viewpoint extraction, text classification, question answering and text semantic comparison.
An eighth aspect of the present application provides an electronic device, which may include a processor coupled to a memory, the memory storing program instructions which, when executed by the processor, implement the method of the first or second aspect. For the steps performed by the processor in each possible implementation manner of the second aspect, reference may be made to the second aspect, and details are not described here.
A ninth aspect of the present application provides a server, which may include a processor coupled to a memory, the memory storing program instructions which, when executed by the processor, implement the method of the second aspect. For the steps performed by the processor in each possible implementation manner of the second aspect, reference may be made to the second aspect, and details are not described here.
A tenth aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform the method of the second aspect described above.
An eleventh aspect of the present application provides circuitry comprising processing circuitry configured to perform the method of the second aspect described above.
A twelfth aspect of the present application provides a computer program product which, when run on a computer, causes the computer to perform the method of the second aspect described above.
A thirteenth aspect of the present application provides a chip system, which includes a processor configured to support a server or a feature extraction apparatus in implementing the functions referred to in the above aspects, for example, sending or processing the data and/or information referred to in the above methods. In one possible design, the chip system further includes a memory for storing the program instructions and data necessary for the server or the apparatus. The chip system may consist of a chip, or may include a chip and other discrete devices.
Drawings
FIG. 1 is a schematic structural diagram of an artificial intelligence framework;
FIG. 2 is a schematic diagram of a convolutional neural network provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a convolutional neural network provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a system architecture provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an attention model provided in an embodiment of the present application;
FIG. 6a is a schematic structural diagram of an attention model provided in an embodiment of the present application;
FIG. 6b is a schematic structural diagram of an attention model provided in an embodiment of the present application;
FIG. 6c is a schematic structural diagram of an attention model provided in an embodiment of the present application;
FIG. 6d is a schematic structural diagram of an attention model provided in an embodiment of the present application;
FIG. 6e is a schematic structural diagram of a self-attention network provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of feature processing provided in an embodiment of the present application;
FIG. 8 is a schematic diagram comparing the performance of different models on the ImageNet dataset according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a feature extraction apparatus provided in an embodiment of the present application;
FIG. 10 is a schematic structural diagram of an execution device provided in an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a chip provided in an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings. Those skilled in the art will appreciate that, with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application are likewise applicable to similar technical problems.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The general workflow of an artificial intelligence system will be described first. Please refer to FIG. 1, which shows a schematic structural diagram of an artificial intelligence framework; the framework is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the general process from data acquisition onward, for example: intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement process. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (technologies for providing and processing information) up to the industrial ecology of the system.
(1) An infrastructure.
The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and is supported by a base platform. It communicates with the outside through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the base platform includes related platform guarantees and support such as distributed computing frameworks and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to acquire data, and the data is provided for computation to the intelligent chips in the distributed computing system provided by the base platform.
(2) And (4) data.
Data at the level above the infrastructure is used to indicate the data sources of the field of artificial intelligence. The data relates to graphs, images, speech, and text, as well as Internet-of-Things data from traditional devices, including service data of existing systems and sensing data such as force, displacement, liquid level, temperature, and humidity.
(3) And (6) data processing.
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference refers to the process of simulating the intelligent inference mode of human beings in a computer or intelligent system, using formalized information to think and solve problems by machines according to an inference control strategy; the typical functions are searching and matching.
Decision-making refers to the process of making decisions after intelligent information has been reasoned about, and generally provides functions such as classification, ranking, and prediction.
(4) Universal capability.
After the data processing mentioned above, some general capabilities may further be formed based on the results of the data processing, for example, algorithms or general systems such as translation, text analysis, computer vision processing, speech recognition, and image recognition.
(5) Intelligent products and industrial applications.
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution, productizing intelligent information decision-making and realizing practical applications. The main application fields include intelligent electronic devices, intelligent transportation, intelligent medical care, autonomous driving, smart cities, and the like.
The method provided by the application is described from the model training side and the model application side as follows:
the model training method provided by the embodiment of the application can be particularly applied to data processing methods such as data training, machine learning and deep learning, symbolic and formal intelligent information modeling, extraction, preprocessing, training and the like are carried out on training data, and a trained neural network model (such as a target neural network model in the embodiment of the application) is finally obtained; and the target neural network model can be used for model reasoning, and specifically, input data can be input into the target neural network model to obtain output data.
Since the embodiments of the present application relate to the application of a large number of neural networks, for the convenience of understanding, the related terms and related concepts such as neural networks related to the embodiments of the present application will be described below.
(1) A neural network.
The neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes $x_s$ (i.e., the input data) and an intercept of 1 as inputs, and the output of the arithmetic unit may be:

$$h_{W,b}(x) = f(W^T x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, which introduces a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be a region composed of several neural units.
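As a worked numeric example of this formula (the values are chosen arbitrarily), with f taken to be the sigmoid function:

```python
import numpy as np

def neural_unit(xs, ws, b):
    """Output of one neural unit: f(sum_s ws[s] * xs[s] + b), with f = sigmoid."""
    z = np.dot(ws, xs) + b
    return 1.0 / (1.0 + np.exp(-z))    # sigmoid activation function

xs = np.array([0.5, -1.0, 2.0])        # inputs x1..xn
ws = np.array([0.1, 0.4, -0.2])        # weights Ws of each xs
b = 0.3                                # bias of the neural unit
print(neural_unit(xs, ws, b))          # z = -0.45, so the output is about 0.389
```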
(2) Convolutional Neural Networks (CNNs) are a type of deep neural network with a convolutional structure. A convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be viewed as a filter, and the convolution process may be viewed as convolving an input image or a convolutional feature plane (feature map) with a trainable filter. A convolutional layer is a layer of neurons in a convolutional neural network that performs convolution processing on an input signal (for example, the first convolutional layer and the second convolutional layer in this embodiment). In a convolutional layer of a convolutional neural network, one neuron may be connected to only some of the neurons of the adjacent layer. A convolutional layer usually contains several feature planes, and each feature plane may be composed of several rectangularly arranged neural units. Neural units of the same feature plane share weights, and the shared weights here are the convolution kernel. Sharing weights may be understood as saying that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of other parts, which means that image information learned in one part can also be used in another part; the same learned image information can therefore be used for all locations on the image. In the same convolutional layer, a plurality of convolution kernels can be used to extract different image information; generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
Specifically, as shown in FIG. 2, a convolutional neural network (CNN) 100 may include an input layer 110, a convolutional layer/pooling layer 120 (where the pooling layer is optional), and a neural network layer 130.
The structure formed by the convolutional layer/pooling layer 120 and the neural network layer 130 may be the first convolutional layer and the second convolutional layer described in this application. The input layer 110 is connected to the convolutional layer/pooling layer 120, which is connected to the neural network layer 130. The output of the neural network layer 130 may be input to an activation layer, and the activation layer may perform nonlinear processing on the output of the neural network layer 130.
Convolutional layer/pooling layer 120. Convolutional layer: as shown in FIG. 2, the convolutional layer/pooling layer 120 may include, for example, layers 121-126. In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer; in another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
Taking convolutional layer 121 as an example, convolutional layer 121 may include a plurality of convolution operators, also called kernels. In image processing, a convolution operator acts as a filter that extracts specific information from an input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is usually moved over the input image pixel by pixel (or two pixels by two pixels, depending on the value of the stride) in the horizontal direction, so as to complete the task of extracting a specific feature from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends to the entire depth of the input image. Thus, convolving with a single weight matrix produces a convolved output with a single depth dimension, but in most cases a plurality of weight matrices of the same dimensions are applied rather than a single one, and the outputs of the individual weight matrices are stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features from the image: for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a specific color of the image, and yet another weight matrix is used to blur unwanted noise in the image. The multiple weight matrices have the same dimensions, so the feature maps they extract also have the same dimensions, and the extracted feature maps of the same dimensions are combined to form the output of the convolution operation.
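A brief sketch of this depth behavior, assuming PyTorch: each of the 16 kernels spans the full depth of the input, and their outputs are stacked to form the depth dimension of the result.

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 32, 32)          # one RGB image: depth 3, 32 x 32 pixels
conv = nn.Conv2d(in_channels=3,            # kernel depth matches the input depth
                 out_channels=16,          # 16 different weight matrices (kernels)
                 kernel_size=5,
                 stride=1,                 # move over the image pixel by pixel
                 padding=2)
features = conv(image)
print(features.shape)                      # torch.Size([1, 16, 32, 32]): the outputs of
                                           # the 16 kernels stacked along the depth dimension
```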
In practical applications, the weight values in these weight matrices need to be obtained through a large amount of training, and the weight matrices formed by the trained weight values can extract information from the input image, thereby helping the convolutional neural network 100 make correct predictions.
When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layers (e.g., layer 121) tend to extract more general features, which may also be referred to as low-level features. As the depth of the convolutional neural network 100 increases, the features extracted by the later convolutional layers (e.g., layer 126) become more complex, such as features with high-level semantics; features with higher-level semantics are more suitable for the problem to be solved.
Pooling layer: since it is often necessary to reduce the number of training parameters, pooling layers often need to be introduced periodically after convolutional layers. In the layers 121-126 illustrated as 120 in FIG. 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
Neural network layer 130: after processing by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet able to output the required output information, because, as described above, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 100 needs to use the neural network layer 130 to generate one output or a set of outputs whose number equals the number of required classes. Accordingly, the neural network layer 130 may include a plurality of hidden layers (131, 132, ..., 13n as shown in FIG. 2) and an output layer 140. The parameters contained in the hidden layers may be pre-trained on the relevant training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
After the hidden layers in the neural network layer 130 comes the output layer 140, i.e., the last layer of the whole convolutional neural network 100. The output layer 140 has a loss function similar to categorical cross entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the whole convolutional neural network 100 is completed (propagation from 110 to 140 in FIG. 2 is forward propagation), back propagation (propagation from 140 to 110 in FIG. 2 is back propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
It should be noted that the convolutional neural network 100 shown in FIG. 2 is only an example of a convolutional neural network. In specific applications, the convolutional neural network may also exist in the form of other network models; for example, as shown in FIG. 3, a plurality of convolutional layers/pooling layers may be parallel, and the separately extracted features are all input to the overall neural network layer 130 for processing.
(3) A deep neural network.
Deep neural networks (DNNs), also known as multi-layer neural networks, can be understood as neural networks having many hidden layers; "many" here has no particular metric. Based on the positions of the different layers, the layers inside a DNN can be divided into three categories: the input layer, the hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers. The layers are fully connected; that is, any neuron of the i-th layer is necessarily connected to any neuron of the (i+1)-th layer. Although a DNN appears complex, the work of each layer is actually not complex; it is simply the following linear relational expression:

$$\vec{y} = \alpha(W\vec{x} + \vec{b})$$

where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is an offset vector, $W$ is a weight matrix (also called coefficients), and $\alpha(\cdot)$ is an activation function. Each layer simply performs this operation on its input vector $\vec{x}$ to obtain its output vector $\vec{y}$. Because a DNN has a large number of layers, the numbers of coefficients $W$ and offset vectors $\vec{b}$ are also large. These parameters are defined in a DNN as follows, taking the coefficient $W$ as an example. Assume that, in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W_{24}^{3}$: the superscript 3 represents the layer in which the coefficient $W$ is located, while the subscripts correspond to the output third-layer index 2 and the input second-layer index 4.

In summary, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as $W_{jk}^{L}$.

Note that the input layer has no $W$ parameter. In deep neural networks, more hidden layers make the network better able to depict complex situations in the real world. Theoretically, a model with more parameters has higher complexity and a larger "capacity", which means that it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its final goal is to obtain the weight matrices of all the layers of the trained deep neural network (the weight matrices formed by the vectors $W$ of many layers).
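A minimal NumPy sketch of this layer-by-layer computation (the sizes are arbitrary), applying $\vec{y} = \alpha(W\vec{x} + \vec{b})$ at each layer with ReLU as the activation:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
# A three-layer DNN: input size 4, one hidden layer of size 8, output size 2.
# The input layer has no W parameter; weights belong to the layers after it.
layers = [(rng.standard_normal((8, 4)), rng.standard_normal(8)),   # W, b of layer 2
          (rng.standard_normal((2, 8)), rng.standard_normal(2))]   # W, b of layer 3

x = rng.standard_normal(4)
for W, b in layers:
    x = relu(W @ x + b)     # y = alpha(W x + b), applied layer by layer
print(x)                    # output vector of the network
```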
(4) A loss function.
In the process of training a deep neural network, because it is hoped that the output of the deep neural network is as close as possible to the value that is really desired to be predicted, the weight vector of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the really desired target value (of course, there is usually an initialization process before the first update, that is, parameters are preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the really desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the purpose of loss functions or objective functions, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes the process of reducing this loss as much as possible.
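For example, a mean squared error loss is one common choice (the text above does not fix a particular loss function); the higher its output, the larger the difference between prediction and target:

```python
import numpy as np

def mse_loss(predicted, target):
    """Mean squared error: a larger output value means a larger difference."""
    return np.mean((predicted - target) ** 2)

print(mse_loss(np.array([0.9, 0.2]), np.array([1.0, 0.0])))  # 0.025
```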
(5) A back propagation algorithm.
A convolutional neural network can use the back propagation (BP) algorithm during training to correct the values of the parameters in the initial super-resolution model, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, forward transmission of the input signal until the output produces an error loss, and the parameters in the initial super-resolution model are updated by propagating the error loss information backward, so that the error loss converges. The back propagation algorithm is a backward-propagating motion dominated by the error loss, and aims to obtain the optimal parameters of the super-resolution model, such as the weight matrices.
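A minimal sketch of one training step with back propagation, assuming PyTorch autograd; the linear model here is only an illustrative stand-in, not the super-resolution model mentioned above.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)                     # stand-in for the model being trained
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, target = torch.randn(8, 4), torch.randn(8, 1)

prediction = model(x)                       # forward propagation
loss = nn.functional.mse_loss(prediction, target)
optimizer.zero_grad()
loss.backward()                             # back propagation of the error loss
optimizer.step()                            # update parameters to reduce the loss
```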
(6) Linear operation.
Linearity refers to a proportional, straight-line relationship between quantities; mathematically, it can be understood as a function whose first derivative is a constant. Linear operations can be, but are not limited to, addition operations, null operations, identity operations, convolution operations, batch normalization (BN) operations, and pooling operations. A linear operation may also be referred to as a linear mapping; a linear mapping needs to satisfy two conditions, homogeneity and additivity, and if either condition is not satisfied, the mapping is nonlinear.
Here, homogeneity means f(ax) = a f(x), and additivity means f(x + y) = f(x) + f(y); for example, f(x) = ax is linear. It should be noted that x, a, and f(x) here are not necessarily scalars; they may be vectors or matrices forming a linear space of any dimension. If x and f(x) are n-dimensional vectors, homogeneity is equivalently satisfied when a is a constant, and additivity is equivalently satisfied when a is a matrix. In contrast, a function whose graph is a straight line does not necessarily correspond to a linear mapping; for example, f(x) = ax + b satisfies neither homogeneity nor additivity and thus belongs to the nonlinear mappings.
In the embodiments of the present application, a composite of multiple linear operations may also be referred to as a linear operation, and each of the linear operations included in that composite may also be referred to as a sub-linear operation.
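The two conditions can be checked numerically. The NumPy sketch below (values arbitrary) verifies that f(x) = Ax satisfies both homogeneity and additivity, while the straight-line mapping g(x) = Ax + b satisfies neither:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
b = rng.standard_normal(3)
x, y, a = rng.standard_normal(3), rng.standard_normal(3), 2.5

f = lambda v: A @ v          # linear mapping
g = lambda v: A @ v + b      # straight-line graph, but not a linear mapping

print(np.allclose(f(a * x), a * f(x)))        # True  (homogeneity)
print(np.allclose(f(x + y), f(x) + f(y)))     # True  (additivity)
print(np.allclose(g(a * x), a * g(x)))        # False (homogeneity fails)
print(np.allclose(g(x + y), g(x) + g(y)))     # False (additivity fails)
```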
(7) Attention model.
The attention model is a neural network to which an attention mechanism is applied. In deep learning, the attention mechanism can be broadly defined as a weight vector describing importance: an element is predicted or inferred by means of this weight vector. For example, for a certain pixel in an image or a certain word in a sentence, the relevance between the target element and other elements can be quantitatively estimated using the attention vector, and the weighted sum with the attention vector is used as an approximation of the target.
The attention mechanism in deep learning simulates the attention mechanism of the human brain. For example, when a human views a picture, although the eyes can see the whole picture, upon close observation the eyes focus on only a small part of it, and at that moment the brain mainly attends to that small patch. That is, when a human carefully observes an image, the attention the brain pays to different parts of the whole image is not balanced but weighted differently; this is the core idea of the attention mechanism.
In short, the human visual processing system tends to selectively focus on certain parts of an image and ignore other irrelevant information, which aids the perception of the human brain. Similarly, in the attention mechanism of deep learning, in some problems involving language, speech, or vision, certain parts of the input may be more relevant than others. Thus, through the attention mechanism in the attention model, the model can be made to dynamically focus only on the parts of the input that help it perform the task at hand efficiently.
(8) Self-attention networks.
A self-attention network is a neural network to which a self-attention mechanism is applied. The self-attention mechanism is an extension of the attention mechanism: it is an attention mechanism that relates different positions of a single sequence in order to compute a representation of the same sequence. The self-attention mechanism can play a key role in machine reading, summarization, and image description generation.
Taking the application of a self-attention network to natural language processing as an example, the self-attention network processes input data of any length and generates a new feature expression of the input data, and then converts the feature expression into a target word. The self-attention layers in the network use the attention mechanism to obtain the relationships between all other words, thereby generating a new feature expression for each word. An advantage of the self-attention network is that the attention mechanism can directly capture the relationships between all the words in a sentence, regardless of their positions.
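As an illustration of how each word obtains a new feature expression from all other words, here is a NumPy sketch of scaled dot-product self-attention (a standard formulation; the patent's own fusion layer is not necessarily identical):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Self-attention over a sequence X of shape (n, d): each output row is a
    weighted sum over all positions, so relationships between all elements
    are captured directly, regardless of position."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V

rng = np.random.default_rng(0)
n, d = 5, 8                                 # e.g. 5 words with 8-dimensional features
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 8): a new feature expression per word
```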
FIG. 4 is a schematic diagram of a system architecture provided in an embodiment of the present application. In FIG. 4, an execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices, and a user may input data to the I/O interface 112 through a client device 140.
When the execution device 110 preprocesses the input data, or when the computation module 111 of the execution device 110 performs computation-related processing (such as implementing the functions of the neural network in the present application), the execution device 110 may call data, code, and the like in the data storage system 150 for the corresponding processing, and may store the data, instructions, and the like obtained by the corresponding processing in the data storage system 150.
Finally, the I/O interface 112 returns the processing results to the client device 140 for presentation to the user.
Alternatively, the client device 140 may be, for example, a control unit in an automatic driving system or a functional algorithm module in a mobile phone, and the functional algorithm module may be used to implement related tasks.
It should be noted that the training device 120 may generate corresponding target models/rules (e.g., target neural network models in this embodiment) based on different training data for different targets or different tasks, and the corresponding target models/rules may be used to achieve the targets or complete the tasks, so as to provide the user with the required results.
In the case shown in FIG. 4, the user may manually give the input data, and this may be operated through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send input data to the I/O interface 112; if automatically sending the input data requires the user's authorization, the user may set the corresponding permissions in the client device 140. The user may view the results output by the execution device 110 on the client device 140, and the specific presentation form may be display, sound, action, and the like. The client device 140 may also serve as a data collection terminal, collecting the input data of the I/O interface 112 and the output results of the I/O interface 112 as new sample data and storing them in the database 130. Of course, instead of being collected by the client device 140, the input data of the I/O interface 112 and the output results of the I/O interface 112, as shown in the figure, may also be stored directly in the database 130 as new sample data by the I/O interface 112.
It should be noted that FIG. 4 is only a schematic diagram of a system architecture provided in an embodiment of the present application, and the positional relationships between the devices, modules, and the like shown in the figure do not constitute any limitation. For example, in FIG. 4 the data storage system 150 is an external memory with respect to the execution device 110; in other cases, the data storage system 150 may also be disposed in the execution device 110.
The attention model and the feature extraction method provided in the embodiments of the present application can be applied to electronic devices, in particular to electronic devices that need to perform data processing tasks based on a self-attention network. Illustratively, the electronic device may be, for example, a server, a smart phone (mobile phone), a personal computer (PC), a laptop, a tablet, a smart TV, a mobile internet device (MID), a wearable device, a virtual reality (VR) device, an augmented reality (AR) device, a wireless electronic device in industrial control, a wireless electronic device in self driving, a wireless electronic device in remote surgery, a wireless electronic device in a smart grid, a wireless electronic device in transportation safety, a wireless electronic device in a smart city, a wireless electronic device in a smart home, and the like.
The attention model and the feature extraction method provided in the embodiments of the present application are applied to the devices described above. The scenarios to which they apply are described below.
The attention model and the feature extraction method provided by the embodiment of the application can be applied to computer vision or natural language processing. That is, the electronic device can execute a computer vision task or a natural language processing task by the above-described attention model and feature extraction method.
Among them, natural language processing is an important direction in the fields of computer science and artificial intelligence. Natural language processing studies various theories and methods that enable efficient communication between humans and computers using natural language. Generally, natural language processing tasks mainly include tasks such as machine translation, public opinion monitoring, automatic summary generation, viewpoint extraction, text classification, question answering, text semantic comparison, and speech recognition.
Computer vision is a science of how to make a machine learn to see. More specifically, computer vision refers to machine vision such as identifying, tracking and measuring a target by using a camera and a computer instead of human eyes, and further performing image processing, so that an image obtained by processing becomes an image more suitable for human eye observation or transmitted to an instrument for detection. Generally, computer vision tasks include tasks such as Image recognition (Image Classification), Object Detection (Object Detection), Semantic Segmentation (Semantic Segmentation), and Image Generation (Image Generation).
Image recognition is a common classification problem, also commonly referred to as image classification. Specifically, in the image recognition task, the input of the neural network is image data, and the output is the probability that the current image data belongs to each category; the category with the highest probability is usually selected as the predicted category of the image data. Image recognition was one of the earliest successful applications of deep learning, and the classic network models include the VGG series, the Inception series, the ResNet series, and the like.
Object detection means automatically detecting the approximate positions of common objects in an image through an algorithm; the position of an object is usually represented by a bounding box, and the category of the object within the bounding box is classified.

Semantic segmentation refers to automatically segmenting and identifying the content in an image through an algorithm. It can be understood as a per-pixel classification problem, i.e., determining the category of the object to which each pixel belongs.

Image generation refers to obtaining generated images with high fidelity by learning the distribution of real images and sampling from the learned distribution. For example, a sharp image may be generated from a blurred image, or a defogged image from a foggy image.
The attention model and the application scenario of the feature extraction method provided in the embodiment of the present application are described above, and the specific structure of the model provided in the embodiment of the present application will be described below.
The attention model provided by the embodiments of the present application includes one self-attention network or a plurality of serially connected self-attention networks. When the attention model includes a plurality of serially connected self-attention networks, every self-attention network in the attention model has the same structure, but the weight parameters in different self-attention networks may differ. As shown in fig. 5, fig. 5 is a schematic structural diagram of an attention model provided in an embodiment of the present application. In fig. 5, the attention model includes N serially connected self-attention networks, namely self-attention network 1, self-attention network 2, …, self-attention network N. The structures of self-attention network 1 through self-attention network N may be the same, but their weight parameters may differ. The input of self-attention network 1 is the input data of the attention model, the input of self-attention network 2 is the output of self-attention network 1, and the input of self-attention network N is the output of self-attention network N-1.
Specifically, in the attention model, each self-attention network includes a self-attention module, a multi-layered perceptron, and a first neural network layer. The self-attention module comprises a plurality of parallel feature extraction layers and a fusion layer, the fusion layer is respectively connected with the plurality of parallel feature extraction layers, and the fusion layer is used for fusing features output by the plurality of parallel feature extraction layers. Wherein the self-attention module is a network employing a self-attention mechanism capable of correlating different positions of an input sequence to compute a representation of the same sequence.
The multilayer perceptron (MLP) is serially connected with the self-attention module and includes a plurality of serial first fully connected layers. In particular, the multilayer perceptron, which can also be referred to as a fully connected neural network, comprises an input layer, one or more hidden layers, and an output layer, and all network layers in the multilayer perceptron are fully connected layers. That is, the input layer and the hidden layer of the multilayer perceptron are fully connected, and the hidden layer and the output layer are likewise fully connected. A fully connected layer means that each neuron in the layer is connected with all neurons in the previous layer, so as to integrate the features extracted by the previous layer.
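For illustration only, such a multilayer perceptron might be sketched as follows in PyTorch; the two-layer structure, the hidden width, and the GELU activation are assumptions made for the example rather than requirements of this application.

```python
import torch.nn as nn

class MLP(nn.Module):
    """Multilayer perceptron: serial fully connected (first) layers."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)  # input layer -> hidden layer (fully connected)
        self.act = nn.GELU()                   # non-linearity between the layers
        self.fc2 = nn.Linear(hidden_dim, dim)  # hidden layer -> output layer (fully connected)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))
```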
The first neural network layer is connected in parallel with the self-attention module and one or more of the multi-layered perceptrons, the first neural network layer for performing feature transformation.
In the present embodiment, the input of the attention model is data in the form of a sequence. For example, the input data of the attention model may be a sentence sequence composed of a plurality of consecutive words; for another example, the input data of the attention model may be an image block sequence composed of a plurality of consecutive image blocks obtained by dividing one complete image.
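As a hedged sketch of the second example, an image can be turned into such a sequence by cutting it into fixed-size blocks and flattening each block; the 16 x 16 block size and 224 x 224 image below are assumed values, not values prescribed by this application.

```python
import torch

def image_to_patch_sequence(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split a (C, H, W) image into a sequence of N flattened image blocks.

    Returns an (N, C * patch * patch) matrix, one row per image block.
    """
    c, h, w = img.shape
    assert h % patch == 0 and w % patch == 0
    x = img.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, H/p, W/p, p, p)
    x = x.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
    return x

seq = image_to_patch_sequence(torch.randn(3, 224, 224))
print(seq.shape)  # torch.Size([196, 768]): N = 196 blocks, each of dimension d = 768
```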
For ease of understanding, various implementations of the self-attention network described above will be described in detail below with reference to the accompanying drawings.
In implementation 1, the self-attention network includes a self-attention module, a multilayer perceptron, and a first neural network layer, and the first neural network layer is connected in parallel with the self-attention module.
Referring to fig. 6a, fig. 6a is a schematic structural diagram of an attention model provided in an embodiment of the present application. As shown in fig. 6a, in the self-attention network, the self-attention module is connected in series with the multilayer perceptron, and the first neural network layer is connected in parallel with the self-attention module. In the working process of the self-attention network, the self-attention module and the first neural network layer process, in parallel, the to-be-processed data input into the self-attention network, and their outputs are added to obtain the input of the multilayer perceptron. The multilayer perceptron then continues processing the summed result to obtain the output data of the self-attention network.
It is to be understood that, when the attention model includes a plurality of serial self-attention networks and the self-attention network shown in fig. 6a is the first self-attention network in the attention model, the to-be-processed data input into the self-attention network in fig. 6a is raw sequence data, such as text data or image data to be processed. When the self-attention network shown in fig. 6a is not the first self-attention network in the attention model, the to-be-processed data input into it is the feature data output by the previous self-attention network.
In this scheme, a parallel first neural network layer is introduced alongside the self-attention module, and this parallel layer performs a feature transformation operation on the input features to obtain transformed features. The transformed features are added to the output features of the self-attention module, which increases the diversity of the features output by the intermediate layers of the self-attention network and enhances their expressive power, thereby improving the performance of the attention model.
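Assuming placeholder modules attn (the self-attention module), mlp (the multilayer perceptron), and aug (the parallel first neural network layer), the data flow of implementation 1 can be sketched as follows; this is an illustrative sketch, not the definitive implementation.

```python
import torch.nn as nn

class SelfAttentionNetworkV1(nn.Module):
    """Implementation 1: first neural network layer in parallel with the self-attention module."""
    def __init__(self, attn: nn.Module, mlp: nn.Module, aug: nn.Module):
        super().__init__()
        self.attn, self.mlp, self.aug = attn, mlp, aug

    def forward(self, x):
        hidden = self.attn(x) + self.aug(x)  # parallel branches process the same input, outputs added
        return self.mlp(hidden)              # multilayer perceptron continues the processing
```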
In implementation 2, the self-attention network includes a self-attention module, a multi-layer perceptron and a first neural network layer, and the first neural network layer is connected to the multi-layer perceptron in parallel.
Referring to fig. 6b, fig. 6b is a schematic structural diagram of an attention model provided in an embodiment of the present application. As shown in fig. 6b, in the self-attention network, the self-attention module is connected in series with the multilayer perceptron, and the first neural network layer is connected in parallel with the multilayer perceptron. In the working process of the self-attention network, the self-attention module processes the to-be-processed data input into the self-attention network, and its output serves simultaneously as the input of the multilayer perceptron and of the first neural network layer, which process it in parallel. Finally, the output of the multilayer perceptron and the output of the first neural network layer are added to obtain the output data of the self-attention network.
In this scheme, a parallel first neural network layer is introduced alongside the multilayer perceptron, and this parallel layer performs a feature transformation operation on the input features of the multilayer perceptron to obtain transformed features. The transformed features are added to the output features of the multilayer perceptron, which increases the diversity of the features output by the intermediate layers of the self-attention network and enhances their expressive power, thereby improving the performance of the attention model.
In implementation 3, the self-attention network includes a self-attention module, a multilayer perceptron, and a first neural network layer, and the first neural network layer is connected in parallel with the series combination of the self-attention module and the multilayer perceptron.
Referring to fig. 6c, fig. 6c is another schematic structural diagram of the attention model provided in the embodiment of the present application. As shown in fig. 6c, in the self-attention network, the self-attention module is connected in series with the multilayer perceptron, and the first neural network layer is connected in parallel with the series combination of the two. In the working process of the self-attention network, the self-attention module and the first neural network layer process, in parallel, the to-be-processed data input into the self-attention network, and the output of the self-attention module serves as the input of the multilayer perceptron. Finally, the output of the multilayer perceptron and the output of the first neural network layer are added to obtain the output data of the self-attention network.
In this scheme, a parallel first neural network layer is introduced alongside the self-attention module and the multilayer perceptron, and this parallel layer performs a feature transformation operation on the input features of the self-attention network to obtain transformed features. The transformed features are added to the output features of the multilayer perceptron, which increases the diversity of the features output by the intermediate layers of the self-attention network and enhances their expressive power, thereby improving the performance of the attention model.
In implementation 4, the self-attention network includes a self-attention module, a multilayer perceptron, a first neural network layer, and a second neural network layer. The first neural network layer is parallel to the self-attention module and the second neural network layer is parallel to the multilayer perceptron; alternatively, the second neural network layer is parallel to the self-attention module and the first neural network layer is parallel to the multilayer perceptron. The first neural network layer and the second neural network layer are both used to perform feature transformations. The network structures of the first and second neural network layers may be the same, but their weight parameters may differ.
Referring to fig. 6d, fig. 6d is another structural schematic diagram of the attention model provided in the embodiment of the present application. As shown in fig. 6d, in the self-attention network, the self-attention module is connected in series with the multilayer perceptron, the first neural network layer is connected in parallel with the self-attention module, and the second neural network layer is connected in parallel with the multilayer perceptron. In the working process of the self-attention network, the self-attention module and the first neural network layer process, in parallel, the to-be-processed data input into the self-attention network, and the sum of their outputs serves simultaneously as the input of the multilayer perceptron and of the second neural network layer, which process it in parallel. Finally, the output of the multilayer perceptron and the output of the second neural network layer are added to obtain the output data of the self-attention network.
In this scheme, a first neural network layer and a second neural network layer are introduced in parallel with the self-attention module and the multilayer perceptron respectively, and the two parallel neural network layers perform feature transformation operations on their input features to obtain transformed features. The transformed features are added to the output features of the self-attention module and of the multilayer perceptron respectively, which increases the diversity of the features output by the intermediate layers of the self-attention network and enhances their expressive power, thereby improving the performance of the attention model. A sketch of this data flow is given below.
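Under the same placeholder-module assumption as before, implementation 4 adds a second parallel layer around the multilayer perceptron; again an illustrative sketch, not the definitive implementation.

```python
import torch.nn as nn

class SelfAttentionNetworkV4(nn.Module):
    """Implementation 4: aug1 parallels the self-attention module, aug2 the perceptron."""
    def __init__(self, attn: nn.Module, mlp: nn.Module, aug1: nn.Module, aug2: nn.Module):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.aug1, self.aug2 = aug1, aug2

    def forward(self, x):
        hidden = self.attn(x) + self.aug1(x)         # first pair of parallel branches
        return self.mlp(hidden) + self.aug2(hidden)  # second pair of parallel branches
```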
Optionally, on the basis of the foregoing four implementation manners, in addition to introducing a neural network layer in parallel with the self-attention module and/or the multilayer perceptron, the self-attention network may further include a plurality of neural network layers in parallel with the foregoing first neural network layer and/or the foregoing second neural network layer.
For example, in the foregoing implementation 1, in the self-attention network, in addition to the first neural network layer connected in parallel with the self-attention module, a plurality of neural network layers may be included. The plurality of neural network layers are all connected with the first neural network layer in parallel, namely the self-attention module is simultaneously connected with the first neural network layer and the plurality of neural network layers in parallel. Wherein the plurality of neural network layers are each configured to perform a feature transformation. Also, the network structure of the plurality of neural network layers may be the same as that of the first neural network layer, but the weight parameters in the first neural network layer and the plurality of neural network layers may not be the same.
Taking the foregoing implementation 4 as an example, refer to fig. 6e, where fig. 6e is another schematic structural diagram of the self-attention network provided in the embodiment of the present application. As shown in fig. 6e, the self-attention network includes a self-attention module and a multilayer perceptron connected in series, n neural network layers connected in parallel with the self-attention module, and n neural network layers connected in parallel with the multilayer perceptron. Specifically, in the self-attention network, the self-attention module is connected in series with the multilayer perceptron. The self-attention module has n neural network layers in parallel, namely neural network layers A1, A2, …, An. The multilayer perceptron likewise has n neural network layers in parallel, namely neural network layers B1, B2, …, Bn.
In the scheme, the diversity of the characteristics output from the attention network can be further enriched by introducing a plurality of parallel neural network layers.
In a possible embodiment, on the basis of the foregoing four implementations, the self-attention module and/or the multilayer perceptron in the self-attention network are also provided with a parallel shortcut whose output features are identical to its input features. That is, the self-attention module and/or the multilayer perceptron have in parallel both a first neural network layer and a shortcut performing an identity map.
For example, taking implementation 1 as an example, reference may be made to fig. 7, where fig. 7 is a schematic diagram of feature processing provided in an embodiment of the present application. As shown in fig. 7, the self-attention module in the self-attention network has a shortcut and a first neural network layer in parallel. The self-attention module, the shortcut, and the first neural network layer each process the same input features. The output features of the self-attention module, of the shortcut, and of the first neural network layer are then added, and the result of the addition serves as the input of the subsequent multilayer perceptron. As can also be seen from fig. 7, the input features and the output features of the shortcut are identical, whereas after the feature transformation of the first neural network layer, the input features and the output features of the first neural network layer differ.
Compared with the first neural network layer, the shortcut does not change the input features, i.e., it does not perform a feature transformation on them. A shortcut can therefore also be regarded as a special form of feature processing. Introducing a parallel shortcut on top of the first neural network layer can further enhance the diversity of the features obtained by the self-attention network and increase their expressive power.
While various implementations of the self-attention network are described above, specific modules in the self-attention network are described in detail below.
In one possible embodiment, the first neural network layer includes a weight matrix for multiplying the input features of the first neural network layer and an activation function for processing the result of multiplying the input features by the weight matrix. That is to say, in the process of processing the input features through the first neural network layer, the input features are multiplied by the weight matrix in the first neural network layer to obtain a multiplication result; and then, processing the multiplication result through an activation function in the first neural network layer to obtain the output characteristic of the first neural network layer.
Illustratively, the process of processing the input features by the first neural network layer to obtain the output features can be represented by formula 1.
$$T_{l,i}(Z_l) = \sigma(Z_l \Theta_{l,i}), \quad i \in \{1, 2, \dots, T\}$$

where $T_{l,i}(Z_l)$ represents the output feature of the first neural network layer; $Z_l$ represents the input feature of the first neural network layer; $\Theta_{l,i}$ represents the weight matrix; $\sigma$ represents the activation function; and $i$ indexes the parallel neural network layers, $i \in \{1, 2, \dots, T\}$. The weight matrix can further be divided into a plurality of sub-matrices that are multiplied with the input features, as described below.
The activation function refers to a function that runs on a neuron of the neural network model and is responsible for mapping an input of the neuron to an output. The activation function plays an important role in learning and understanding very complex and nonlinear functions of the neural network model. The activation function can introduce non-linear characteristics into the neural network model. For example, in a neuron, the input features are weighted by a weight matrix and summed, and then a function is applied, which is an activation function. The activation function is introduced to increase the non-linearity of the neural network model. In the neural network model, each network layer without an activation function is equivalent to a matrix multiplication. Thus, without an activation function, the resulting output after superimposing several network layers is actually also the result of a matrix multiplication.
Briefly, if no activation function is employed in the neural network, the output of each neural network layer is a linear function of the upper layer input. The output is a linear combination of inputs, no matter how many layers there are in the neural network. If the neural network adopts the activation function, the activation function introduces a nonlinear factor to the neuron, so that the neural network can arbitrarily approximate any nonlinear function, and the neural network can be applied to a plurality of nonlinear models.
Illustratively, the activation function may be a nonlinear function such as a Sigmoid function, a Tanh function, or a ReLU function. The present embodiment does not specifically limit the specific type of the activation function.
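Putting the weight matrix and the activation function together, a dense (non-circulant) version of the first neural network layer could be sketched as below; ReLU as the default activation and the absence of a bias term are both assumptions of this sketch.

```python
import torch.nn as nn

class AugLayer(nn.Module):
    """First neural network layer: multiply by a weight matrix, then apply an activation."""
    def __init__(self, dim: int, activation: nn.Module = None):
        super().__init__()
        self.theta = nn.Linear(dim, dim, bias=False)           # weight matrix Theta
        self.sigma = activation if activation else nn.ReLU()   # activation function sigma

    def forward(self, z):
        return self.sigma(self.theta(z))  # sigma(Z @ Theta), as in formula 1
```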
It is understood that the second neural network layer has the same structure as the first neural network layer, i.e., the second neural network layer may also include a weight matrix and an activation function.
Specifically, when the self-attention module in the self-attention network has the first neural network layer and the shortcut in parallel, the output feature of the self-attention module, the output feature of the first neural network layer, and the output feature of the shortcut need to be added to obtain an added output feature. The summed output features may be used as inputs to a subsequent multi-tier perceptron.
Illustratively, the process of adding the output characteristic of the self-attention module, the output characteristic of the first neural network layer, and the output characteristic of the shortcut to obtain an added output characteristic may be represented by formula 2.
$$\mathrm{AugMSA}(Z_l) = \mathrm{MSA}(Z_l) + Z_l + \sum_{i=1}^{T} T_{l,i}(Z_l)$$

where $\mathrm{AugMSA}(Z_l)$ represents the summed output features; $Z_l$ represents the input features, which the shortcut passes through unchanged; $\mathrm{MSA}(Z_l)$ represents the features obtained through the processing of the self-attention module; $T_{l,i}(Z_l)$ represents the output of the i-th neural network layer parallel to the self-attention module; the number of neural network layers parallel to the self-attention module is T, and the first neural network layer is one of the neural network layers parallel to the self-attention module.
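For illustration, formula 2 could be realized as the following PyTorch sketch, assuming that the self-attention module and each parallel layer preserve the feature shape.

```python
import torch.nn as nn

class AugMSA(nn.Module):
    """Formula 2: self-attention output + identity shortcut + sum of T parallel layers."""
    def __init__(self, msa: nn.Module, aug_layers):
        super().__init__()
        self.msa = msa
        self.augs = nn.ModuleList(aug_layers)  # the T parallel neural network layers

    def forward(self, z):
        out = self.msa(z) + z          # MSA(Z_l) plus the unchanged shortcut Z_l
        for aug in self.augs:          # accumulate each parallel branch T_{l,i}(Z_l)
            out = out + aug(z)
        return out
```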
It will be appreciated that after the parallel first neural network layers are introduced, the processing of the features by the parallel first neural network layers may involve matrix multiplication operations. If the matrix multiplication operation is directly performed on the input features and the weight matrix, huge calculation overhead is brought. Therefore, in the embodiment, it is proposed to implement the computation of the first neural network layer by using the block circulant matrix, so as to reduce the computation overhead brought by introducing the parallel first neural network layers.
Specifically, in the first neural network layer in parallel with the self-attention module and/or the multi-layer perceptron, the weight matrix of the first neural network layer includes a plurality of sub-matrices, and each sub-matrix is a circulant matrix. In short, the weight matrix may be composed of a plurality of sub-matrices, and each sub-matrix constituting the weight matrix is a cyclic matrix.
Where a circulant matrix is a special form of matrix. Each element of the row vector of the circulant matrix is the result of shifting the elements of the previous row vector one position to the right in turn. The circulant matrix may enable efficient computation through fourier transforms, saving the computational overhead of performing feature transforms on features through parallel neural network layers.
Since each sub-matrix is a circulant matrix, in practical application, each sub-matrix can be represented by a set of vectors. For any group of vectors, each row in the cyclic matrix is obtained by sequentially shifting the elements in the group of vectors to the right, and the cyclic matrix corresponding to the group of vectors can be obtained. In this way, in the process of training the self-attention network, when the weight parameters of the weight matrix in the self-attention network are adjusted based on the loss function, the vectors corresponding to the sub-matrices in the weight matrix are actually adjusted. After the adjustment of the vector corresponding to the sub-matrix, an adjusted weight matrix may be obtained based on the adjusted vector.
Illustratively, the weight matrix may be represented by the following formula 3.
$$\Theta = \begin{bmatrix} C_{11} & C_{12} & \cdots & C_{1b} \\ C_{21} & C_{22} & \cdots & C_{2b} \\ \vdots & \vdots & \ddots & \vdots \\ C_{b1} & C_{b2} & \cdots & C_{bb} \end{bmatrix}$$

where $\Theta$ represents the weight matrix; the weight matrix $\Theta$ is divided into $b \times b$ sub-matrices $C_{ij}$, and each sub-matrix is a circulant matrix.
For any sub-matrix $C_{ij}$ in the weight matrix, $C_{ij}$ can be obtained by cyclically shifting the elements of a single vector. Specifically, writing that vector as $c_{ij} = [c_{ij}^1, c_{ij}^2, \dots, c_{ij}^{d/b}]$, the sub-matrix $C_{ij}$ can be expressed based on the following formula 4.

$$C_{ij} = \begin{bmatrix} c_{ij}^1 & c_{ij}^{d/b} & \cdots & c_{ij}^2 \\ c_{ij}^2 & c_{ij}^1 & \cdots & c_{ij}^3 \\ \vdots & \vdots & \ddots & \vdots \\ c_{ij}^{d/b} & c_{ij}^{d/b-1} & \cdots & c_{ij}^1 \end{bmatrix}$$

As can be seen from formula 4, each row of the sub-matrix $C_{ij}$ is obtained by cyclically shifting the elements of the vector $c_{ij}$ one position to the right relative to the row above it.
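To make the circulant structure concrete, the following sketch materializes a sub-matrix from its defining vector; in practice only the vector itself would be stored and trained, as noted above.

```python
import torch

def circulant(c: torch.Tensor) -> torch.Tensor:
    """Build a circulant matrix: each row is the previous row shifted right by one."""
    return torch.stack([torch.roll(c, shifts=k) for k in range(c.numel())])

print(circulant(torch.tensor([1., 2., 3., 4.])))
# tensor([[1., 2., 3., 4.],
#         [4., 1., 2., 3.],
#         [3., 4., 1., 2.],
#         [2., 3., 4., 1.]])
```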
For example, when the weight matrix of the first neural network layer is divided into a plurality of sub-matrices and each sub-matrix is a circulant matrix, multiplying the input features of the first neural network layer by the weight matrix may specifically include the following steps.

Firstly, the electronic device performs a Fourier transform on the plurality of sub-matrices of the first neural network layer and on the input features of the first neural network layer, respectively, to obtain a plurality of transformed sub-matrices and transformed features.

Then, the electronic device performs an element-wise multiplication operation between the transformed features and each of the transformed sub-matrices, to obtain a plurality of sub-features. The element-wise multiplication operation multiplies, one by one, the elements at the same positions in two matrices.

Finally, the electronic device performs an inverse Fourier transform on the plurality of sub-features and accumulates the results, to obtain the processed features.
In this scheme, on the basis that the weight matrix can be divided into a plurality of circulant matrices, efficient computation between the input features and the weight matrix is achieved through the Fourier transform, thereby saving the computational cost of performing feature transformations through the parallel network layers.
Further, to facilitate parallel processing of the input features of the first neural network layer on the same hardware, the electronic device may also divide the input features into a plurality of feature matrices and perform the above-described processing on each feature matrix based on the plurality of sub-matrices. Finally, the resulting output features are spliced together to obtain the features processed based on the weight matrix in the first neural network layer.
For example, the process of processing the input features based on the weight matrix in the first neural network layer can be represented by formula 5 and formula 6.

$$\tilde{Z}_i = \sum_{j=1}^{b} \mathrm{IFFT}\big(\mathrm{FFT}(Z_j) \circ \mathrm{FFT}(c_{ij})\big) \quad \text{(formula 5)}$$

$$\tilde{Z} = [\tilde{Z}_1, \tilde{Z}_2, \dots, \tilde{Z}_b] \quad \text{(formula 6)}$$

where $\tilde{Z}$ represents the features after processing based on the weight matrix in the first neural network layer; $\tilde{Z}_i$ represents one part of the features after processing based on the weight matrix in the first neural network layer; $Z$ represents the input features, a matrix of size $N \times d$, where $N$ is the number of image blocks and $d$ is the dimension of each image block; $Z_j$ represents the j-th feature matrix of the input matrix $Z$, of size $N \times (d/b)$, where $b$ is the number of feature matrices into which the input features are divided; $c_{ij}$ represents a vector of length $d/b$; $\mathrm{FFT}$ represents the fast Fourier transform; $\mathrm{IFFT}$ represents the inverse fast Fourier transform; and $\circ$ represents element-wise multiplication.
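The following sketch mirrors the computation pattern of formulas 5 and 6 (by linearity, the inverse transform can be applied once after the sum over j). The tensor layout is an assumption of the sketch, and matching an exact block-circulant product would depend on the chosen circulant convention, so this illustrates the FFT pattern rather than providing a drop-in implementation.

```python
import torch

def block_circulant_matmul(Z: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Process Z with a block-circulant weight matrix via FFT (formulas 5 and 6).

    Z: (N, d) input features; c: (b, b, d // b) defining vectors c_ij.
    """
    b = c.shape[0]
    n, d = Z.shape
    Zs = Z.reshape(n, b, d // b)                # split Z into b feature matrices Z_j
    Zf = torch.fft.fft(Zs, dim=-1)              # FFT of each Z_j
    cf = torch.fft.fft(c, dim=-1)               # FFT of each defining vector c_ij
    out = torch.einsum('njd,ijd->nid', Zf, cf)  # element-wise products, summed over j
    out = torch.fft.ifft(out, dim=-1).real      # one IFFT in place of b accumulated IFFTs
    return out.reshape(n, d)                    # splice the b parts into the output

Z = torch.randn(4, 8)
c = torch.randn(2, 2, 4)  # b = 2 blocks, each sub-matrix defined by a length-4 vector
print(block_circulant_matmul(Z, c).shape)  # torch.Size([4, 8])
```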
The above describes the process of processing the features by the first neural network layer provided in the embodiment of the present application, and the following will describe in detail the process of processing the features by the self-attention module provided in the embodiment of the present application.
In one possible embodiment, the self-attention module includes a plurality of parallel feature extraction layers and a fusion layer, the fusion layer is connected with the plurality of parallel feature extraction layers respectively, and the plurality of parallel feature extraction layers respectively include different weight matrices. Specifically, for input features of a self-attention module, a plurality of parallel feature extraction layers respectively process the input features to obtain a plurality of extracted features; then, the plurality of extracted features are subjected to fusion processing by a fusion layer in the self-attention module, and output features of the self-attention module are obtained.
Illustratively, three parallel feature extraction layers may be included in the self-attention module, the three layers respectively containing a weight matrix W_Q, a weight matrix W_K, and a weight matrix W_V. For the input features of the self-attention module, which comprise a plurality of sub-features, each sub-feature may first be multiplied with the weight matrices W_Q, W_K, and W_V respectively, to obtain the feature Q, the feature K, and the feature V. The features Q, K, and V are then input into the fusion layer of the self-attention module, which continues the processing.
Specifically, the fusion layer performs a dot-product operation on the feature Q and the feature K to obtain a score value. The fusion layer then processes the score through a softmax activation function to obtain normalized attention weights. Next, the fusion layer multiplies the softmax weights with the feature V to obtain the weighted value v corresponding to each sub-feature. Finally, the weighted values v corresponding to the sub-features are added to obtain the output features of the self-attention module.
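The computation just described corresponds to dot-product attention; a minimal sketch follows, in which the 1/sqrt(d_k) scaling applied before the softmax is a common assumption not spelled out above.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (N, d) input features; w_q, w_k, w_v: (d, d_k) feature extraction weights."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                   # features Q, K and V
    score = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # dot product of Q and K
    weights = F.softmax(score, dim=-1)                    # softmax over the score values
    return weights @ v                                    # weighted sum over feature V

x = torch.randn(196, 768)
w_q, w_k, w_v = (torch.randn(768, 64) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([196, 64])
```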
Optionally, the self-attention module may be a multi-head self-attention module, where the multi-head self-attention module includes a plurality of parallel self-attention units and a second full connection layer, the second full connection layer is respectively connected to the plurality of parallel self-attention units, and each of the plurality of parallel self-attention units includes a plurality of parallel feature extraction layers and a fusion layer.
For example, assuming that the multi-head self-attention module includes 8 parallel self-attention units, the input features of the multi-head self-attention module may be input into the 8 parallel self-attention units respectively, to obtain the feature matrices Z_i, i ∈ {1, 2, …, 8}, output by the 8 parallel self-attention units. The 8 feature matrices Z_i are then spliced by columns into one total feature matrix, which is processed through the second fully connected layer to obtain the output features of the multi-head self-attention module.
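A sketch of this multi-head variant with 8 parallel self-attention units follows; the model dimension of 768 and the per-unit dimension of 768/8 are assumed values for the example.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """8 parallel self-attention units; outputs spliced by columns, then a linear layer."""
    def __init__(self, dim: int = 768, units: int = 8):
        super().__init__()
        self.qkv = nn.ModuleList(
            nn.Linear(dim, 3 * dim // units, bias=False) for _ in range(units)
        )
        self.fc = nn.Linear(dim, dim)  # the second fully connected layer

    def forward(self, x):
        outs = []
        for proj in self.qkv:  # each unit: its own feature extraction and fusion
            q, k, v = proj(x).chunk(3, dim=-1)
            w = (q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5).softmax(dim=-1)
            outs.append(w @ v)                    # fused output Z_i of one unit
        return self.fc(torch.cat(outs, dim=-1))  # splice Z_1..Z_8 by columns, then FC
```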
In order to verify the advantages of the attention model provided by the embodiments of the present application in data processing, the following will explain the beneficial effects of the attention model provided by the embodiments of the present application based on specific experiments.
In the embodiments of the present application, extensive experiments were performed on a large-scale data set to empirically study the proposed method, comparing existing network models with the attention model provided in the embodiments of the present application. The large-scale data set may be the ImageNet (ILSVRC-2012) data set, which contains 1.28 million training images and validation images from 1000 categories.
Referring to fig. 8, fig. 8 is a schematic diagram comparing the performance of different models on the ImageNet data set according to an embodiment of the present application. In fig. 8, Aug-ViT, Aug-PVT, and Aug-T2T denote image processing models using the attention model provided by the embodiments of the present application. Resolution denotes the image resolution; Top-1 Accuracy denotes the accuracy; Params denotes the number of parameters; FLOPs denotes the amount of computation (the number of multiply-add operations).
As can be seen from fig. 8, compared with the image processing model in the prior art, the image processing model using the attention model provided in the embodiment of the present application can achieve higher accuracy without substantially changing the amount of calculation and the amount of parameters.
The embodiment of the present application further provides a feature extraction method, including: the electronic equipment acquires data to be processed; the electronic equipment inputs the data to be processed into one or more self-attention networks connected in series to obtain the characteristics of the data to be processed; the self-attention network comprises a self-attention module, a multi-layer perceptron and a first neural network layer, wherein the self-attention module comprises a plurality of parallel feature extraction layers and a fusion layer, the fusion layer is respectively connected with the plurality of parallel feature extraction layers, the multi-layer perceptron is connected with the self-attention module in series, the multi-layer perceptron comprises a plurality of series first full-connection layers, the first neural network layer is connected with the self-attention module and one or more of the multi-layer perceptron in parallel, and the first neural network layer is used for executing feature transformation.
In one possible implementation, the self-attention network further includes: a second neural network layer to perform a feature transformation; the first neural network layer is connected with the self-attention module in parallel and the second neural network layer is connected with the multilayer perceptron in parallel, or the second neural network layer is connected with the self-attention module in parallel and the first neural network layer is connected with the multilayer perceptron in parallel.
In a possible implementation manner, the first neural network layer comprises a weight matrix and an activation function, the weight matrix is used for multiplying the input features of the first neural network layer, and the activation function is used for processing the multiplication result of the input features and the weight matrix.
In one possible implementation, the weight matrix includes a plurality of sub-matrices, and each sub-matrix is a cyclic matrix.
In one possible implementation, the plurality of parallel feature extraction layers respectively include different weight matrices.
In one possible implementation, the self-attention module is a multi-head self-attention module, and the multi-head self-attention module includes a plurality of parallel self-attention units and a second full-link layer, the second full-link layer is respectively connected to the plurality of parallel self-attention units, and each of the plurality of parallel self-attention units includes a plurality of parallel feature extraction layers and a fusion layer.
In one possible implementation, the self-attention module and/or the multi-layered perceptron are also provided with shortcuts in parallel, the input features of the shortcuts being identical to the output features.
In one possible implementation, the method is applied to a computer vision task or a natural language processing task.
An embodiment of the present application further provides an image processing method, including: acquiring an image to be processed; inputting the image to be processed into an image processing model to extract image features through an attention model in the image processing model, wherein the attention model is the attention model described in the foregoing embodiment; and processing the image to be processed according to the image characteristics.
In a possible implementation manner, the processing the image to be processed according to the image feature includes: according to the image characteristics, one or more of the following tasks are executed on the image to be processed: image recognition, object detection, semantic segmentation, and image generation.
An embodiment of the present application further provides a natural language processing method, including: acquiring a text to be processed; inputting the text to be processed into a natural language processing model to extract text features through an attention model in the natural language processing model, wherein the attention model is the attention model described in the foregoing embodiment; and processing the text to be processed according to the text characteristics.
In a possible implementation manner, the processing the text to be processed according to the text feature includes: according to the text characteristics, one or more of the following tasks are executed on the text to be processed: machine translation, public opinion monitoring, automatic summary generation, viewpoint extraction, text classification, question answering and text semantic comparison.
In one possible embodiment, the present application provides a feature extraction apparatus. Referring to fig. 9, fig. 9 is a schematic structural diagram of a feature extraction device according to an embodiment of the present application. The feature extraction device 900 includes: an acquisition unit 901 and a processing unit 902; the acquiring unit 901 is configured to acquire data to be processed; the processing unit 902 is configured to input the data to be processed into one or more serially connected self-attention networks to obtain characteristics of the data to be processed; the self-attention network comprises a self-attention module, a multi-layer perceptron and a first neural network layer, wherein the self-attention module comprises a plurality of parallel feature extraction layers and a fusion layer, the fusion layer is respectively connected with the plurality of parallel feature extraction layers, the multi-layer perceptron is connected with the self-attention module in series, the multi-layer perceptron comprises a plurality of series first full-connection layers, the first neural network layer is connected with the self-attention module and one or more of the multi-layer perceptron in parallel, and the first neural network layer is used for executing feature transformation.
In one possible implementation, the self-attention network further includes: a second neural network layer to perform a feature transformation;
the first neural network layer is connected in parallel with the self-attention module and the second neural network layer is connected in parallel with the multi-layered perceptron, or,
the second neural network layer is connected in parallel with the self-attention module, and the first neural network layer is connected in parallel with the multilayer perceptron.
In a possible implementation manner, the first neural network layer comprises a weight matrix and an activation function, the weight matrix is used for multiplying the input features of the first neural network layer, and the activation function is used for processing the multiplication result of the input features and the weight matrix.
In one possible implementation, the weight matrix includes a plurality of sub-matrices, and each sub-matrix is a cyclic matrix.
In one possible implementation, the self-attention module is a multi-head self-attention module, and the multi-head self-attention module includes a plurality of parallel self-attention units and a second full-link layer, the second full-link layer is respectively connected to the plurality of parallel self-attention units, and each of the plurality of parallel self-attention units includes a plurality of parallel feature extraction layers and a fusion layer.
In one possible implementation, the self-attention module and/or the multi-layered perceptron are also provided with shortcuts in parallel, the input features of the shortcuts being identical to the output features.
In one possible embodiment, the present application further provides an image processing apparatus, including: an acquisition unit and a processing unit; the acquisition unit is used for acquiring an image to be processed; the processing unit is configured to input the image to be processed into an image processing model, so as to extract image features through an attention model in the image processing model, where the attention model is the attention model described in the first aspect or any implementation manner of the first aspect; the processing unit is further configured to process the image to be processed according to the image features.
In a possible implementation manner, the processing unit is further configured to perform one or more of the following tasks on the image to be processed according to the image features: image recognition, object detection, semantic segmentation, and image generation.
In a possible embodiment, an embodiment of the present application further provides a natural language processing apparatus, including: an acquisition unit and a processing unit; the acquisition unit is used for acquiring a text to be processed; the processing unit is configured to input the text to be processed into a natural language processing model, so as to extract text features through an attention model in the natural language processing model, where the attention model is the attention model described in the first aspect or any implementation manner of the first aspect; the processing unit is further configured to process the text to be processed according to the text features.
In a possible implementation manner, the processing unit is further configured to perform one or more of the following tasks on the text to be processed according to the text feature: machine translation, public opinion monitoring, automatic summary generation, viewpoint extraction, text classification, question answering and text semantic comparison.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an execution device provided in the embodiment of the present application, and the execution device 1000 may be embodied as a mobile phone, a tablet, a notebook computer, an intelligent wearable device, a server, and the like, which is not limited herein. The feature extraction apparatus described in the embodiment corresponding to fig. 9 may be deployed on the execution device 1000 to implement the corresponding data processing functions. Specifically, the execution device 1000 includes: a receiver 1001, a transmitter 1002, a processor 1003, and a memory 1004 (the number of processors 1003 in the execution device 1000 may be one or more, and one processor is taken as an example in fig. 10), where the processor 1003 may include an application processor 10031 and a communication processor 10032. In some embodiments of the present application, the receiver 1001, the transmitter 1002, the processor 1003, and the memory 1004 may be connected by a bus or by other means.
The memory 1004 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1003. A portion of memory 1004 may also include non-volatile random access memory (NVRAM). The memory 1004 stores the processor and the operating instructions, executable modules or data structures, or a subset or an expanded set thereof, wherein the operating instructions may include various operating instructions for performing various operations.
The processor 1003 controls the operation of the execution apparatus. In a particular application, the various components of the execution device are coupled together by a bus system that may include a power bus, a control bus, a status signal bus, etc., in addition to a data bus. For clarity of illustration, the various buses are referred to in the figures as a bus system.
The method disclosed in the embodiment of the present application may be applied to the processor 1003 or implemented by the processor 1003. The processor 1003 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be implemented by integrated logic circuits of hardware or instructions in the form of software in the processor 1003. The processor 1003 may be a general-purpose processor, a Digital Signal Processor (DSP), a microprocessor or a microcontroller, and may further include an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components. The processor 1003 may implement or execute the methods, steps and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 1004, and the processor 1003 reads the information in the memory 1004, and completes the steps of the method in combination with the hardware thereof.
The receiver 1001 may be used to receive input numeric or character information and generate signal inputs related to performing relevant settings and function control of the device. The transmitter 1002 may be configured to output numeric or character information via a first interface; the transmitter 1002 may also be configured to send instructions to the disk group via the first interface to modify data in the disk group; the transmitter 1002 may also include a display device such as a display screen.
Embodiments of the present application also provide a computer program product, which when executed on a computer causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
Also provided in an embodiment of the present application is a computer-readable storage medium, in which a program for signal processing is stored, and when the program is run on a computer, the program causes the computer to execute the steps executed by the aforementioned execution device, or causes the computer to execute the steps executed by the aforementioned training device.
The execution device, the training device, or the electronic device provided in the embodiment of the present application may specifically be a chip, where the chip includes: a processing unit, which may be for example a processor, and a communication unit, which may be for example an input/output interface, a pin or a circuit, etc. The processing unit may execute the computer execution instructions stored by the storage unit to cause the chip in the execution device to execute the image processing method described in the above embodiment, or to cause the chip in the training device to execute the image processing method described in the above embodiment. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, and the like, and the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a Random Access Memory (RAM), and the like.
Specifically, referring to fig. 11, fig. 11 is a schematic structural diagram of a chip provided in the embodiment of the present application, where the chip may be represented as a neural network processor NPU 1100, and the NPU 1100 is mounted on a main CPU (Host CPU) as a coprocessor, and the Host CPU allocates tasks. The core portion of the NPU is an arithmetic circuit 1103, and the controller 1104 controls the arithmetic circuit 1103 to extract matrix data in the memory and perform multiplication.
In some implementations, the arithmetic circuit 1103 includes a plurality of processing units (PEs) inside. In some implementations, the arithmetic circuitry 1103 is a two-dimensional systolic array. The arithmetic circuit 1103 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 1103 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 1102 and buffers it in each PE in the arithmetic circuit. The arithmetic circuit takes the matrix A data from the input memory 1101, performs the matrix operation with the matrix B, and stores the partial or final result of the obtained matrix in an accumulator 1108.
The unified memory 1106 is used to store input data as well as output data. Weight data is transferred into the weight memory 1102 through the direct memory access controller (DMAC) 1105, and input data is likewise carried into the unified memory 1106 through the DMAC.
The bus interface unit (BIU) 1111 is used for the interaction of the AXI bus with the DMAC and with the instruction fetch buffer (IFB) 1109. Specifically, the bus interface unit 1111 fetches instructions from the instruction fetch memory 1109, and obtains, for the memory unit access controller 1105, the raw data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1106 or to transfer weight data into the weight memory 1102 or to transfer input data into the input memory 1101.
The vector calculation unit 1107 includes a plurality of operation processing units and, where necessary, further processes the output of the arithmetic circuit 1103, for example by vector multiplication, vector addition, exponential operation, logarithmic operation, or magnitude comparison. It is mainly used for network calculations in the neural network other than convolution/fully connected layers, such as batch normalization, pixel-wise summation, and up-sampling of feature planes.
In some implementations, the vector calculation unit 1107 can store the processed output vector to the unified memory 1106. For example, the vector calculation unit 1107 may apply a linear function, or a non-linear function, to the output of the arithmetic circuit 1103, such as performing linear interpolation on the feature planes extracted by the convolutional layers, or applying a non-linear function to a vector of accumulated values to generate activation values. In some implementations, the vector calculation unit 1107 generates normalized values, pixel-wise summed values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 1103, for example for use in subsequent layers of the neural network.
An instruction fetch buffer 1109 connected to the controller 1104, configured to store instructions used by the controller 1104;
the unified memory 1106, the input memory 1101, the weight memory 1102 and the instruction fetch memory 1109 are all On-Chip memories. The external memory is private to the NPU hardware architecture.
The processor mentioned in any of the above may be a general purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above programs.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, and may be implemented as one or more communication buses or signal lines.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus the necessary general-purpose hardware, and certainly also by special-purpose hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. Generally, functions performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function may vary, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software program implementation is usually preferable. Based on such understanding, the technical solutions of the present application may be substantially embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a USB drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a training device, or a network device) to execute the methods according to the embodiments of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a training device or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.

Claims (26)

1. An attention model, comprising: one or more serially connected self-attention networks, each comprising a self-attention module, a multi-layer perceptron, and a first neural network layer;
the self-attention module comprises a plurality of parallel feature extraction layers and a fusion layer, the fusion layer being connected to each of the plurality of parallel feature extraction layers;
the multi-layer perceptron is connected in series with the self-attention module and comprises a plurality of serially connected first fully connected layers; and
the first neural network layer is connected in parallel with one or more of the self-attention module and the multi-layer perceptron, wherein the first neural network layer is configured to perform feature transformation.
2. The model of claim 1, wherein the self-attention network further comprises a second neural network layer configured to perform feature transformation;
the first neural network layer is connected in parallel with the self-attention module and the second neural network layer is connected in parallel with the multi-layer perceptron;
or, alternatively,
the second neural network layer is connected in parallel with the self-attention module and the first neural network layer is connected in parallel with the multi-layer perceptron.
3. The model of claim 1 or 2, wherein the first neural network layer comprises a weight matrix and an activation function, the weight matrix being configured to be multiplied with the input features of the first neural network layer, and the activation function being configured to process the product of the input features and the weight matrix.
4. The model of claim 3, wherein the weight matrix comprises a plurality of sub-matrices, each sub-matrix being a circulant matrix.
5. The model of any one of claims 1 to 4, wherein the self-attention module is a multi-head self-attention module comprising a plurality of parallel self-attention units and a second fully connected layer, the second fully connected layer being connected to each of the plurality of parallel self-attention units, and each of the plurality of parallel self-attention units comprising a plurality of parallel feature extraction layers and a fusion layer.
6. The model of any one of claims 1 to 5, wherein the self-attention module and/or the multi-layer perceptron is further provided with a parallel shortcut connection, the output features of the shortcut being identical to its input features.
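For illustration only, and not as part of the claims: the architecture recited in claims 1 to 6 above can be sketched in PyTorch roughly as follows. The class names, dimensions, GELU activations, and the choice of standard multi-head attention as the parallel feature extraction branches are assumptions of this sketch, not details taken from the patent.

```python
# Illustrative sketch only; not the patented implementation. Class names,
# dimensions, and the use of standard multi-head attention as the parallel
# feature extraction branches are assumptions.
import torch
import torch.nn as nn


class SelfAttentionModule(nn.Module):
    """Self-attention module: parallel feature extraction branches whose
    outputs are combined by a fusion layer (claims 1 and 5)."""

    def __init__(self, dim: int, num_heads: int = 4, num_parallel: int = 2):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(num_parallel)]
        )
        # Fusion layer connected to every parallel branch.
        self.fuse = nn.Linear(num_parallel * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = [branch(x, x, x)[0] for branch in self.branches]
        return self.fuse(torch.cat(outs, dim=-1))


class FeatureTransformLayer(nn.Module):
    """First/second neural network layer (claims 2 and 3): a weight matrix
    multiplied with the input features, followed by an activation."""

    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim) * 0.02)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(x @ self.weight)


class SelfAttentionNetwork(nn.Module):
    """One serially connectable self-attention network (claim 1)."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.attn = SelfAttentionModule(dim)
        # Multi-layer perceptron: serially connected fully connected layers.
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))
        # Feature-transform layers placed in parallel with the
        # self-attention module and the multi-layer perceptron (claims 1-2).
        self.par_attn = FeatureTransformLayer(dim)
        self.par_mlp = FeatureTransformLayer(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Identity shortcut (claim 6) + parallel transform + module output.
        x = x + self.attn(x) + self.par_attn(x)
        return x + self.mlp(x) + self.par_mlp(x)
```

In this reading, the departure from a conventional Transformer block is the learned feature-transform path placed in parallel with the self-attention module and the multi-layer perceptron, alongside the identity shortcut, which is what the abstract credits with increasing the diversity of the extracted features.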
7. A method of feature extraction, comprising:
acquiring data to be processed;
inputting the data to be processed into one or more serially connected self-attention networks to obtain features of the data to be processed;
the self-attention network comprises a self-attention module, a multi-layer perceptron, and a first neural network layer, wherein the self-attention module comprises a plurality of parallel feature extraction layers and a fusion layer, the fusion layer is connected to each of the plurality of parallel feature extraction layers, the multi-layer perceptron is connected in series with the self-attention module and comprises a plurality of serially connected first fully connected layers, the first neural network layer is connected in parallel with one or more of the self-attention module and the multi-layer perceptron, and the first neural network layer is configured to perform feature transformation.
8. The method of claim 7, wherein the self-attention network further comprises a second neural network layer configured to perform feature transformation;
the first neural network layer is connected in parallel with the self-attention module and the second neural network layer is connected in parallel with the multi-layer perceptron;
or, alternatively,
the second neural network layer is connected in parallel with the self-attention module and the first neural network layer is connected in parallel with the multi-layer perceptron.
9. The method of claim 7 or 8, wherein the first neural network layer comprises a weight matrix and an activation function, the weight matrix being configured to be multiplied with the input features of the first neural network layer, and the activation function being configured to process the product of the input features and the weight matrix.
10. The method of claim 9, wherein the weight matrix comprises a plurality of sub-matrices, each sub-matrix being a circulant matrix.
11. The method of any one of claims 7 to 10, wherein the self-attention module is a multi-head self-attention module comprising a plurality of parallel self-attention units and a second fully connected layer, the second fully connected layer being connected to each of the plurality of parallel self-attention units, and each of the plurality of parallel self-attention units comprising a plurality of parallel feature extraction layers and a fusion layer.
12. The method of any one of claims 7 to 11, wherein the self-attention module and/or the multi-layer perceptron is further provided with a parallel shortcut connection, the output features of the shortcut being identical to its input features.
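Claims 3, 4, 9, and 10 recite a first neural network layer whose weight matrix is partitioned into circulant sub-matrices. A minimal sketch of such a layer follows, assuming a uniform block size and the hypothetical name BlockCirculantLinear (neither is specified by the claims):

```python
# Illustrative sketch only. The block size, the initialization, and the
# name BlockCirculantLinear are assumptions, not the patented design.
import torch
import torch.nn as nn


def circulant(v: torch.Tensor) -> torch.Tensor:
    """Return the n x n circulant matrix whose rows are cyclic shifts of v."""
    n = v.shape[0]
    j = torch.arange(n, device=v.device)
    idx = (j.unsqueeze(0) - j.unsqueeze(1)) % n  # idx[r, c] = (c - r) mod n
    return v[idx]


class BlockCirculantLinear(nn.Module):
    """Weight matrix partitioned into circulant sub-matrices (claims 4/10).
    Each b x b sub-matrix is defined by one length-b vector, so the layer
    stores dim*dim/b parameters instead of dim*dim."""

    def __init__(self, dim: int, block: int):
        super().__init__()
        assert dim % block == 0
        self.block = block
        k = dim // block  # number of sub-matrices along each side
        self.vecs = nn.Parameter(torch.randn(k, k, block) * 0.02)

    def weight(self) -> torch.Tensor:
        k = self.vecs.shape[0]
        rows = [torch.cat([circulant(self.vecs[r, c]) for c in range(k)], dim=1)
                for r in range(k)]
        return torch.cat(rows, dim=0)  # (dim, dim) block-circulant matrix

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight()
```

A common motivation for circulant structure is parameter sharing: each b x b sub-matrix is determined by b values rather than b*b. Multiplication by a circulant matrix can also be carried out with FFTs, although this sketch uses a plain matrix product for clarity.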
13. An image processing method, comprising:
acquiring an image to be processed;
inputting the image to be processed into an image processing model, so as to extract image features through an attention model in the image processing model, wherein the attention model is the attention model according to any one of claims 1 to 6; and
processing the image to be processed according to the image features.
14. A natural language processing method, comprising:
acquiring a text to be processed;
inputting the text to be processed into a natural language processing model, so as to extract text features through an attention model in the natural language processing model, wherein the attention model is the attention model according to any one of claims 1 to 6; and
processing the text to be processed according to the text features.
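Claims 13 and 14 apply the same attention backbone to images and to text. Reusing the hypothetical SelfAttentionNetwork class from the sketch following claim 6, with arbitrary dimensions and sequence lengths chosen only for demonstration:

```python
# Hypothetical usage of the SelfAttentionNetwork sketched after claim 6.
# All dimensions and sequence lengths are arbitrary choices for the demo.
import torch

backbone = torch.nn.Sequential(
    SelfAttentionNetwork(dim=192, hidden=384),
    SelfAttentionNetwork(dim=192, hidden=384),  # serially connected networks
)

patches = torch.randn(1, 196, 192)  # e.g. 14 x 14 image patch embeddings
tokens = torch.randn(1, 64, 192)    # e.g. a 64-token text embedding

img_feats = backbone(patches)  # image features (claim 13)
txt_feats = backbone(tokens)   # text features (claim 14)
print(img_feats.shape, txt_feats.shape)  # both keep shape (batch, seq, dim)
```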
15. A feature extraction apparatus comprising: an acquisition unit and a processing unit;
the acquisition unit is configured to acquire data to be processed;
the processing unit is configured to input the data to be processed into one or more serially connected self-attention networks to obtain features of the data to be processed;
the self-attention network comprises a self-attention module, a multi-layer perceptron, and a first neural network layer, wherein the self-attention module comprises a plurality of parallel feature extraction layers and a fusion layer, the fusion layer is connected to each of the plurality of parallel feature extraction layers, the multi-layer perceptron is connected in series with the self-attention module and comprises a plurality of serially connected first fully connected layers, the first neural network layer is connected in parallel with one or more of the self-attention module and the multi-layer perceptron, and the first neural network layer is configured to perform feature transformation.
16. The apparatus of claim 15, wherein the self-attention network further comprises a second neural network layer configured to perform feature transformation;
the first neural network layer is connected in parallel with the self-attention module and the second neural network layer is connected in parallel with the multi-layer perceptron;
or, alternatively,
the second neural network layer is connected in parallel with the self-attention module and the first neural network layer is connected in parallel with the multi-layer perceptron.
17. The apparatus of claim 15 or 16, wherein the first neural network layer comprises a weight matrix and an activation function, the weight matrix being configured to be multiplied with the input features of the first neural network layer, and the activation function being configured to process the product of the input features and the weight matrix.
18. The apparatus of claim 17, wherein the weight matrix comprises a plurality of sub-matrices, each sub-matrix being a circulant matrix.
19. The apparatus of any one of claims 15 to 18, wherein the self-attention module is a multi-head self-attention module comprising a plurality of parallel self-attention units and a second fully connected layer, the second fully connected layer being connected to each of the plurality of parallel self-attention units, and each of the plurality of parallel self-attention units comprising a plurality of parallel feature extraction layers and a fusion layer.
20. The apparatus of any one of claims 15 to 19, wherein the self-attention module and/or the multi-layer perceptron is further provided with a parallel shortcut connection, the output features of the shortcut being identical to its input features.
21. An image processing apparatus characterized by comprising: an acquisition unit and a processing unit;
the acquisition unit is configured to acquire an image to be processed;
the processing unit is configured to input the image to be processed into an image processing model, so as to extract image features through an attention model in the image processing model, wherein the attention model is the attention model according to any one of claims 1 to 6;
the processing unit is further configured to process the image to be processed according to the image features.
22. A natural language processing apparatus, comprising: an acquisition unit and a processing unit;
the acquisition unit is configured to acquire a text to be processed;
the processing unit is configured to input the text to be processed into a natural language processing model, so as to extract text features through an attention model in the natural language processing model, wherein the attention model is the attention model according to any one of claims 1 to 6;
the processing unit is further configured to process the text to be processed according to the text features.
23. An electronic device, comprising a memory and a processor, wherein the memory stores code, the processor is configured to execute the code, and when the code is executed, the electronic device performs the method of any one of claims 7 to 14.
24. The electronic device of claim 23, wherein the electronic device is a smart car, a smart phone, a smart television, a virtual reality device, an augmented reality device, a wearable device, or a server.
25. A computer storage medium storing instructions that, when executed by a computer, cause the computer to perform the method of any one of claims 7 to 14.
26. A computer program product having stored thereon instructions which, when executed by a computer, cause the computer to carry out the method of any one of claims 7 to 14.
CN202110731775.6A 2021-06-29 2021-06-29 Attention model, feature extraction method and related device Pending CN113627163A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110731775.6A CN113627163A (en) 2021-06-29 2021-06-29 Attention model, feature extraction method and related device


Publications (1)

Publication Number Publication Date
CN113627163A 2021-11-09



Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115056824A (en) * 2022-05-06 2022-09-16 北京和利时系统集成有限公司 Method and device for determining vehicle control parameters, computer storage medium and terminal
CN115056824B (en) * 2022-05-06 2023-11-28 北京和利时系统集成有限公司 Method and device for determining vehicle control parameters, computer storage medium and terminal
CN116206765A (en) * 2023-05-04 2023-06-02 首都医科大学附属北京友谊医院 Device and storage medium for assessing risk level of parkinson's disease
CN116206765B (en) * 2023-05-04 2024-05-24 首都医科大学附属北京友谊医院 Device and storage medium for assessing risk level of parkinson's disease
CN116228608A (en) * 2023-05-10 2023-06-06 耕宇牧星(北京)空间科技有限公司 Processing network for defogging remote sensing image and defogging method for remote sensing image

Similar Documents

Publication Title
WO2022083536A1 (en) Neural network construction method and apparatus
WO2022042713A1 (en) Deep learning training method and apparatus for use in computing device
WO2022001805A1 (en) Neural network distillation method and device
CN113705769A (en) Neural network training method and device
US20230095606A1 (en) Method for training classifier, and data processing method, system, and device
CN110222718B (en) Image processing method and device
CN112215332A (en) Searching method of neural network structure, image processing method and device
CN113240079A (en) Model training method and device
CN113536970A (en) Training method of video classification model and related device
WO2023165361A1 (en) Data processing method and related device
US20230401838A1 (en) Image processing method and related apparatus
CN113627163A (en) Attention model, feature extraction method and related device
CN113627422A (en) Image classification method and related equipment thereof
CN114925320B (en) Data processing method and related device
CN115238909A (en) Data value evaluation method based on federal learning and related equipment thereof
CN111652349A (en) Neural network processing method and related equipment
WO2023020185A1 (en) Image classification method and related device
CN113128285A (en) Method and device for processing video
WO2023273934A1 (en) Method for selecting hyper-parameter of model, and related apparatus
CN115641490A (en) Data processing method and device
WO2022227024A1 (en) Operational method and apparatus for neural network model and training method and apparatus for neural network model
CN115623242A (en) Video processing method and related equipment thereof
CN114841361A (en) Model training method and related equipment thereof
CN115146757A (en) Training method and device of neural network model
US20240135174A1 (en) Data processing method, and neural network model training method and apparatus

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination