WO2023226783A1 - Data processing method and apparatus - Google Patents

Data processing method and apparatus

Info

Publication number
WO2023226783A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
elements
vector
processed
attention
Prior art date
Application number
PCT/CN2023/093668
Other languages
French (fr)
Chinese (zh)
Inventor
蔡创坚
胡芝兰
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023226783A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a data processing method and device.
  • Self-attention networks have been widely applied in many natural language processing (NLP) tasks, such as machine translation, sentiment analysis, and question answering.
  • Self-attention networks, which originated in the field of natural language processing, have also achieved high performance in tasks such as image classification, object detection, and image processing.
  • The key to self-attention networks is to learn an alignment in which each element in the sequence learns to gather information from the other elements in the sequence.
  • The self-attention network differs from the general attention network in that it pays more attention to the internal correlations of the data or features and reduces the dependence on external information.
  • However, the currently used self-attention network obtains, for each element, relevant information from all other elements, resulting in high computational consumption.
  • Embodiments of the present application provide a data processing method and device to solve the problem of high computational consumption caused by calculating, for each element, its relevant information with respect to all other elements when using the current self-attention network.
  • Embodiments of the present application provide a data processing method, including: receiving data to be processed, where the data to be processed includes encoded data of multiple elements; performing feature extraction on the data to be processed through a neural network to obtain, for each of the multiple elements, a feature vector corresponding to each of multiple coordinate axes; and weighting the feature vectors of the multiple coordinate axes corresponding to each of the multiple elements to obtain the feature vector of the data to be processed.
  • the neural network is a self-attention network.
  • In this way, the attention calculation is no longer performed between one element and all other elements; instead, attention is calculated among the elements on the same coordinate axis, the calculation is performed separately for each of the coordinate axes, and the results are then weighted, which reduces computational consumption.
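As a rough illustration of the savings (not from the patent text): for a 2-D input of H × W elements, full self-attention forms H·W attention pairs for each of the H·W elements, while per-axis attention forms only H + W pairs per element:

```python
def full_attention_pairs(h: int, w: int) -> int:
    # every element attends to all h*w elements
    n = h * w
    return n * n

def axial_attention_pairs(h: int, w: int) -> int:
    # every element attends to the w elements of its row
    # and the h elements of its column
    return h * w * (h + w)

print(full_attention_pairs(32, 32))   # 1048576 pairs
print(axial_attention_pairs(32, 32))  # 65536 pairs
```

For a 32 × 32 grid the per-axis scheme computes 16× fewer attention pairs, and the gap widens as the grid grows.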
  • This application can be applied to computer vision or natural language processing, including machine translation, automatic summary generation, opinion extraction, text classification, question answering, text semantic comparison, speech recognition, image classification, object detection, semantic segmentation, and image generation.
  • The neural network may be a neural network used to classify images, to segment images, to detect images, to recognize images, to generate a specified image, to translate text, to paraphrase text, to generate specified text, to recognize speech, to translate speech, or to generate specified speech, etc.
  • the data to be processed can be audio data, video data, image data, text data, etc.
  • receiving the data to be processed includes receiving a service request from the user equipment, and the service request carries the data to be processed.
  • a service request is used to request completion of a specified processing task for the data to be processed.
  • the method further includes: completing the specified processing task according to the feature vector of the data to be processed to obtain a processing result, and sending the processing result to the user equipment.
  • For example, if the designated processing task is image classification, the attention network is an attention network used to classify images; after the feature vector is obtained, the image can be further classified according to the feature vector to obtain the classification result. For another example, if the designated processing task is image segmentation, the attention network is an attention network used to segment images; after the feature vector is obtained, the image can be further segmented based on the feature vector to obtain the segmentation result. For another example, if the designated processing task is image detection, the attention network is an attention network used to detect images; after the feature vector is obtained, image detection can be further performed based on the feature vector to obtain the detection result. For another example, if the designated processing task is speech recognition, the attention network is an attention network used to recognize speech.
  • speech recognition can be further performed based on the feature vector to obtain the recognition result.
  • the attention network is an attention network used to translate speech.
  • speech translation can be further performed based on the feature vector to obtain the translation result.
  • the feature vectors of elements corresponding to different coordinate axes are orthogonal to each other.
  • the feature vector corresponding to the first element on the first coordinate axis is used to represent the correlation between the first element and other elements in the first region where the first element is located;
  • the first element is any element among the plurality of elements; the other elements in the first region where the first element is located are elements whose positions, when mapped to the coordinate axes other than the first coordinate axis, are the same as and/or adjacent to the positions of the first element mapped to those other coordinate axes; the first coordinate axis is any coordinate axis among the plurality of coordinate axes.
  • For example, if the first coordinate axis is the horizontal coordinate axis, the elements whose coordinates in the vertical direction are the same as or adjacent to those of a given element participate in the calculation of that element's feature vector on the horizontal coordinate axis. Since all elements can participate in the calculations, global modeling can be achieved while the computational complexity is reduced.
  • Obtaining the feature vectors on the multiple coordinate axes corresponding to each of the plurality of elements includes: performing attention calculations between the first element and the other elements in the first region corresponding to the first element to obtain the attention values between the first element and the other elements, where the first element is any element among the plurality of elements; and performing weighting processing according to the attention values between the first element and the other elements to obtain the feature vector on the first coordinate axis corresponding to the first element.
  • The positions of two elements mapped to the other coordinate axes may be strictly adjacent, or the interval between the two elements' mapped positions may be within a set distance.
  • the data to be processed includes audio data
  • the audio data includes multiple audio points
  • each audio point is mapped to a time coordinate axis and a frequency coordinate axis;
  • the data to be processed includes image data, the image data includes a plurality of pixel points or image blocks, each pixel point or image block is mapped to a horizontal coordinate axis and a vertical coordinate axis; or,
  • the data to be processed includes video data.
  • the video data includes multiple video frames.
  • Each video frame includes multiple pixel points or image blocks.
  • Each pixel point or image block is mapped to a time coordinate axis and, in space, a horizontal coordinate axis and a vertical coordinate axis.
  • Performing feature extraction on the data to be processed through a neural network to obtain feature vectors corresponding to each of the multiple elements on multiple coordinate axes includes: the multiple coordinate axes include a first coordinate axis and a second coordinate axis; generating, through the neural network, a first query vector, a first key value vector and a first value vector based on the data to be processed; obtaining, according to the first query vector, the first key value vector and the first value vector, a feature vector corresponding to each element of the plurality of elements on the first coordinate axis; and obtaining, according to the first query vector, the first key value vector and the first value vector, a feature vector corresponding to each element of the plurality of elements on the second coordinate axis.
  • For 2-dimensional data to be processed that includes m rows and n columns:
  • q_(i,j) represents the Query of the element at position (i, j);
  • k_(i,j) represents the Key of the element at position (i, j);
  • v_(i,j) represents the Value of the element at position (i, j);
  • the value range of i is 0 to m-1, and the value range of j is 0 to n-1.
  • The feature vector corresponding to each element on the first coordinate axis can be determined using the following formula (the original equation image is not reproduced; the formula below is reconstructed in the standard scaled dot-product attention form that the surrounding text describes):

    h1_(i,j) = Σ_{j'=0..n-1} softmax_{j'}( q_(i,j) · k_(i,j') / sqrt(d_k) ) · v_(i,j')

    where d_k is the number of dimensions of the input data, and h1_(i,j) represents the feature vector of the element at position (i, j) corresponding to axis 1;
  • the feature vector corresponding to each element on the second coordinate axis can be determined analogously:

    h2_(i,j) = Σ_{i'=0..m-1} softmax_{i'}( q_(i,j) · k_(i',j) / sqrt(d_k) ) · v_(i',j)

  • the feature vector corresponding to the element on the first coordinate axis is weighted with the feature vector corresponding to the element on the second coordinate axis, determined using the following formula:

    h'_(i,j) = w1 · h1_(i,j) + w2 · h2_(i,j)

    where h'_(i,j) represents the feature vector of the element at position (i, j), w1 represents the weight of axis 1, and w2 represents the weight of axis 2.
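A minimal NumPy sketch of the two-axis attention described above (an illustration, not the patent's implementation; the tensor shapes, the softmax placement, and default weights w1 = w2 = 0.5 are assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(q, k, v, w1=0.5, w2=0.5):
    """q, k, v: arrays of shape (m, n, d_k) for an m-row, n-column grid."""
    d_k = q.shape[-1]
    scale = np.sqrt(d_k)
    # axis 1: each element attends only to the elements of its own row
    row_scores = np.einsum('ijd,ikd->ijk', q, k) / scale      # (m, n, n)
    h1 = np.einsum('ijk,ikd->ijd', softmax(row_scores), v)
    # axis 2: each element attends only to the elements of its own column
    col_scores = np.einsum('ijd,kjd->ijk', q, k) / scale      # (m, n, m)
    h2 = np.einsum('ijk,kjd->ijd', softmax(col_scores), v)
    # weight the two per-axis feature vectors into the final feature vector
    return w1 * h1 + w2 * h2
```

Because each softmax row sums to 1, feeding a constant value tensor through either axis returns that constant, so with w1 + w2 = 1 the output is the constant itself — a quick sanity check for the weighting step.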
  • The method further includes: generating a second query vector, a second key value vector and a second value vector based on the encoded data of at least one setting element; obtaining the feature vector corresponding to the at least one setting element according to the second query vector, the second key value vector and the second value vector, where the feature vector corresponding to the setting element is used to characterize the degree of correlation between the setting element and the multiple elements; and performing feature fusion on the feature vector of the data to be processed and the feature vector of the at least one setting element.
  • The encoded data of the at least one setting element may include classification bits and/or distillation bits, making the method applicable to classification scenarios as well.
  • the encoded data of the at least one setting element is obtained as a network parameter of the neural network through multiple rounds of adjustments during the process of training the neural network.
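The classification and distillation bits here play a role similar to the learnable class and distillation tokens in vision transformers. A hypothetical sketch of prepending such trained setting elements to the element encodings (the names, shapes, and concatenation layout are assumptions, not taken from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
# encoded data of the elements to be processed, e.g. 16 image patches
patches = rng.normal(size=(16, d_model))
# "setting element" encodings: one classification token and one
# distillation token, adjusted over many training rounds as network
# parameters rather than derived from the input
cls_token = rng.normal(size=(1, d_model))
dist_token = rng.normal(size=(1, d_model))
# processed together with the elements, so their output feature vectors
# characterize the degree of correlation with all elements
tokens = np.concatenate([cls_token, dist_token, patches], axis=0)
print(tokens.shape)  # (18, 8)
```

At inference, the output feature vector at the classification-token position would feed the classification head, and the distillation-token position the distillation head.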
  • embodiments of the present application provide a data processing device, including:
  • An input unit configured to receive data to be processed, where the data to be processed includes encoded data of multiple elements
  • a processing unit configured to perform feature extraction on the data to be processed through a neural network to obtain feature vectors corresponding to multiple coordinate axes for each of the multiple elements, and to weight the feature vectors of the multiple coordinate axes corresponding to each of the multiple elements to obtain the feature vector of the data to be processed.
  • the feature vector corresponding to the first element on the first coordinate axis is used to represent the correlation between the first element and other elements in the first region where the first element is located;
  • the first element is any element among the plurality of elements; the other elements in the first region where the first element is located are elements whose positions, when mapped to the coordinate axes other than the first coordinate axis, are the same as and/or adjacent to the positions of the first element mapped to those other coordinate axes; the first coordinate axis is any coordinate axis among the plurality of coordinate axes.
  • The processing unit is specifically configured to perform attention calculations between the first element and the other elements in the first region corresponding to the first element to obtain the attention values between the first element and the other elements.
  • the data to be processed includes audio data
  • the audio data includes multiple audio points
  • each audio point is mapped to a time coordinate axis and a frequency coordinate axis;
  • the data to be processed includes image data, the image data includes a plurality of pixel points or image blocks, each pixel point or image block is mapped to a horizontal coordinate axis and a vertical coordinate axis; or,
  • the data to be processed includes video data.
  • the video data includes multiple video frames.
  • Each video frame includes multiple pixel points or image blocks.
  • Each pixel point or image block is mapped to a time coordinate axis and, in space, a horizontal coordinate axis and a vertical coordinate axis.
  • The number of coordinate axes is equal to N; the linear module is used to generate the first query vector, the first key value vector and the first value vector based on the data to be processed; the i-th attention calculation module is configured to obtain, according to the first query vector, the first key value vector and the first value vector, the feature vector corresponding to the i-th coordinate axis for each of the plurality of elements; i is a positive integer less than or equal to N;
  • a weighting module is used to weight the feature vectors on N coordinate axes corresponding to each element in the plurality of elements.
  • the neural network also includes an (N+1)-th attention calculation module and a feature fusion module;
  • the linear module is configured to generate a second query vector, a second key value vector and a second value vector based on the encoded data of at least one setting element;
  • the (N+1)-th attention calculation module is used to obtain the feature vector corresponding to the at least one setting element according to the second query vector, the second key value vector, and the second value vector;
  • the corresponding feature vector of the setting element is used to represent the degree of association between the setting element and the multiple elements;
  • the feature fusion module is used to perform feature fusion on the feature vector of the data to be processed and the feature vector of the at least one setting element.
  • the encoded data of the at least one setting element is obtained as a network parameter of the neural network through multiple rounds of adjustments during the process of training the neural network.
  • this application provides a data processing system, including user equipment and cloud service equipment;
  • the user equipment is used to send a service request to the cloud service device, where the service request carries data to be processed, and the data to be processed includes encoded data of multiple elements; the service request is used to request the cloud service device to process the data to be processed to complete a designated processing task;
  • the cloud service device is used to perform feature extraction on the data to be processed through a neural network to obtain feature vectors corresponding to multiple coordinate axes for each of the multiple elements, weight the feature vectors of the multiple coordinate axes corresponding to each of the multiple elements to obtain the feature vector of the data to be processed, complete the designated processing task according to the feature vector of the data to be processed to obtain the processing result, and send the processing result to the user equipment;
  • the user equipment is also configured to receive the processing result from the cloud service equipment.
  • Embodiments of the present application provide an electronic device.
  • The electronic device includes at least one processor and a memory; instructions are stored in the memory; the at least one processor is configured to execute the instructions stored in the memory to implement the method described in the first aspect or any design of the first aspect.
  • the electronic device may also be called an execution device and is used to execute the data processing method provided by this application.
  • Embodiments of the present application provide a chip system.
  • The chip system includes at least one processor and a communication interface.
  • The communication interface and the at least one processor are interconnected through lines; the communication interface is used to receive data to be processed; the processor is configured to execute the method of the first aspect or any design of the first aspect on the data to be processed.
  • embodiments of the present application provide a computer-readable medium for storing a computer program, where the computer program includes instructions for executing the method in the first aspect or any optional implementation of the first aspect.
  • Embodiments of the present application provide a computer program product.
  • The computer program product stores instructions that, when executed by a computer, cause the computer to implement the method described in the first aspect or any optional design of the first aspect.
  • Figure 1 is a structural schematic diagram of the main framework of artificial intelligence
  • Figure 2 is a schematic diagram of a system architecture 200 provided by an embodiment of the present application.
  • Figure 3 is a schematic flow chart of a data processing method provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram of an axial attention calculation provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of another axial attention calculation provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of the processing flow of an independent superimposed attention network provided by an embodiment of the present application.
  • Figure 7 is a schematic diagram of the processing flow of another independent superimposed attention network provided by the embodiment of the present application.
  • Figure 8 is a schematic structural diagram of a transformer module illustrated in the embodiment of this application.
  • Figure 9 is a schematic structural diagram of another transformer module illustrated in the embodiment of the present application.
  • Figure 10 is a schematic structural diagram of a classification network model provided by an embodiment of the present application.
  • Figure 11 is a schematic diagram of the workflow of the classification network model provided by the embodiment of the present application.
  • Figure 12 is a schematic workflow diagram of an image segmentation network model provided by an embodiment of the present application.
  • Figure 13 is a schematic workflow diagram of a video classification network model provided by an embodiment of the present application.
  • Figure 14 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • Figure 15 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
  • Figure 16 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • Figure 1 shows a structural schematic diagram of the artificial intelligence main framework.
  • The artificial intelligence main framework is elaborated below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has gone through the condensation process of "data-information-knowledge-wisdom".
  • The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (providing and processing technology implementations) of artificial intelligence to the systematic industrial ecological process.
  • Infrastructure provides computing power support for artificial intelligence systems, enables communication with the external world, and supports it through basic platforms.
  • computing power is provided by smart chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, FPGA, etc.);
  • The basic platform includes distributed computing frameworks, networks, and other related platform guarantees and support, which can include cloud storage and computing, interconnection networks, etc.
  • sensors communicate with the outside world to obtain data, which are provided to smart chips in the distributed computing system provided by the basic platform for calculation.
  • The data at the layer above the infrastructure represents the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
  • machine learning and deep learning can perform symbolic and formal intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formal information to perform machine thinking and problem solving based on reasoning control strategies. Typical functions are search and matching.
  • Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
  • Some general capabilities can be formed based on the results of further data processing, such as algorithms or a general system, for example translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of overall artificial intelligence solutions, productizing intelligent information decision-making and realizing practical applications. The application fields mainly include: intelligent manufacturing, intelligent transportation, smart home, smart medical care, smart security, autonomous driving, smart city, smart terminal, etc.
  • the embodiments of the present application relate to the application of neural networks. To facilitate understanding, the relevant terms involved in the embodiments of the present application and related concepts such as neural networks are first introduced below.
  • The work of each layer in a neural network can be described mathematically as y = a(W·x + b). At the physical level, the work of each layer can be understood as completing a transformation from the input space (a collection of input vectors) to the output space (that is, from the row space to the column space of the matrix) through five operations on the input space: 1. raising/reducing the dimension; 2. zooming in/out; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are completed by W·x, operation 4 is completed by +b, and operation 5 is implemented by a(). The word "space" is used here because the object to be classified is not a single thing, but a class of things.
  • W is a weight vector, and each value in the vector represents the weight value of a neuron in the neural network of this layer.
  • This vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how to transform the space.
  • the purpose of training a neural network is to finally obtain the weight matrix of all layers of the trained neural network (a weight matrix formed by the vector W of many layers). Therefore, the training process of neural network is essentially to learn how to control spatial transformation, and more specifically, to learn the weight matrix.
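The per-layer transformation described above can be sketched directly; the matrix shapes and the tanh activation are illustrative choices, not fixed by the text:

```python
import numpy as np

def layer(x, W, b):
    # operations 1-3: W @ x raises/reduces dimension, scales and rotates;
    # operation 4: + b translates; operation 5: tanh "bends" the space
    return np.tanh(W @ x + b)

x = np.array([1.0, 2.0])
W = np.array([[0.5, -0.25],
              [1.0,  0.75],
              [-0.5, 0.5]])   # maps a 2-D input to a 3-D output
b = np.zeros(3)
y = layer(x, W, b)
print(y.shape)  # (3,)
```

Training adjusts W and b for every layer; stacking such layers, each with its own weight matrix, gives the overall weight matrix the passage refers to.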
  • loss function; objective function
  • The neural network uses the back propagation algorithm to adjust the values of the network parameters during training, making the reconstruction error loss of the neural network model smaller and smaller. Specifically, forward propagation of the input signal through to the output produces an error loss, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges.
  • the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrix.
  • Linear refers to a proportional, straight-line relationship between quantities; mathematically, it can be understood as a function whose first-order derivative is a constant. Linear operations can be, but are not limited to, addition operations, empty operations, identity operations, convolution operations, layer normalization (LN) operations, and pooling operations. Linear operations can also be called linear mappings. A linear mapping must satisfy two conditions: homogeneity and additivity; if either condition is not met, it is nonlinear.
  • Homogeneity means f(a·x) = a·f(x), and additivity means f(x + y) = f(x) + f(y). The x, a, and f(x) here are not necessarily scalars; they can be vectors or matrices, forming a linear space of any dimension.
  • the combination of multiple linear operations can be called a linear operation, and each linear operation included in the linear operation can also be called a sub-linear operation.
  • the attention model is a neural network that applies the attention mechanism.
  • the attention mechanism can be broadly defined as a weight vector describing the importance: using this weight vector to predict or infer an element. For example, for a certain pixel in an image or a word in a sentence, attention vectors can be used to quantitatively estimate the correlation between the target element and other elements, and the weighted sum of the attention vectors can be used as an approximation of the target.
  • The attention mechanism in deep learning simulates the attention mechanism of the human brain. For example, when humans look at a painting, although their eyes can see the entire painting, when they observe deeply and carefully, their eyes actually focus on only part of the pattern in the entire painting; at this time, the human brain mainly focuses on this small pattern. In other words, when humans carefully observe an image, the brain's attention to the entire image is not balanced but has a certain weight distinction. This is the core idea of the attention mechanism.
  • the human visual processing system tends to selectively focus on certain parts of the image and ignore other irrelevant information, thus contributing to the human brain's perception.
  • In some problems involving language, speech, or vision, some parts of the input may be more relevant than other parts. Through the attention mechanism, the attention model can dynamically focus only on the part of the input that helps it to effectively perform the task at hand.
  • the self-attention network is a neural network that applies the self-attention mechanism.
  • the self-attention mechanism is an extension of the attention mechanism.
  • the self-attention mechanism is actually an attention mechanism that associates different positions of a single sequence to calculate a representation of the same sequence.
  • Self-attention mechanism can play a key role in machine reading, abstract summary or image description generation. Taking the application of self-attention network to natural language processing as an example, the self-attention network processes input data of any length and generates new feature expressions of the input data, and then converts the feature expressions into target words.
  • The self-attention network layer in the self-attention network uses the attention mechanism to obtain the relationships between each word and all other words, thereby generating a new feature expression for each word.
  • the advantage of the self-attention network is that the attention mechanism can directly capture the relationship between all words in the sentence without considering the word position.
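A minimal sketch of the mechanism just described, where each word's new feature expression is a weighted sum over all words regardless of position (the projection matrices and sizes are illustrative, not the patent's):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) word encodings; each word attends to all words."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # new feature expression per word

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                   # e.g. a 5-word sentence
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
H = self_attention(X, Wq, Wk, Wv)
print(H.shape)  # (5, 8)
```

Note the contrast with the per-axis scheme above: here the score matrix covers every pair of positions, which is exactly the cost the patent's axial decomposition avoids.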
  • the data processing method provided by the embodiment of the present application can be executed by an execution device, or the attention model can be deployed in the execution device.
  • An execution device may be implemented by one or more computing devices.
  • Figure 2 shows a system architecture 200 provided by an embodiment of the present application. The system architecture 200 includes an execution device 210.
  • Execution device 210 may be implemented by one or more computing devices. Execution device 210 may be arranged on one physical site, or distributed across multiple physical sites.
  • System architecture 200 also includes data storage system 250.
  • the execution device 210 cooperates with other computing devices, such as data storage, routers, load balancers and other devices.
  • the execution device 210 can use the data in the data storage system 250, or call the program code in the data storage system 250 to implement the data processing method provided by this application.
  • One or more computing devices can be deployed in a cloud network.
  • the data processing method provided by the embodiment of the present application is deployed in one or more computing devices of the cloud network in the form of a service, and the user device accesses the cloud service through the network.
  • when the execution device is one or more computing devices of the cloud network, the execution device may also be called a cloud service device.
  • the data processing method provided by the embodiment of the present application can be deployed on one or more local computing devices in the form of a software tool.
  • Each local device can represent any computing device, such as a smartphone (mobile phone), personal computer (PC), laptop, tablet, smart TV, mobile internet device (MID), wearable device, smart camera, smart car, media consumption device, set-top box, game console, virtual reality (VR) device, augmented reality (AR) device, or a wireless electronic device in industrial control, self-driving, remote medical surgery, smart grid, transportation safety, smart city, or smart home.
  • Each user's local device can interact with the execution device 210 through a communication network of any communication mechanism/communication standard.
  • the communication network can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
  • one or more aspects of the execution device 210 may be implemented by each local device.
  • the local device 301 may provide local data or feedback prediction results to the execution device 210 .
  • the local device 301 implements the functions of the execution device 210 and provides services for its own users, or provides services for users of the local device 302 .
  • the local device 301 may be an electronic device.
  • the electronic device may be a server, smartphone (mobile phone), personal computer (PC), laptop, tablet, smart TV, mobile internet device (MID), wearable device, virtual reality (VR) device, augmented reality (AR) device, or a wireless electronic device in industrial control, self-driving, remote medical surgery, smart grid, transportation safety, smart city, or smart home, etc.
  • the data processing methods and attention models provided by the embodiments of this application can be applied to computer vision or natural language processing. That is, the electronic device or computing device can perform computer vision tasks or natural language processing tasks through the above-mentioned data processing method.
  • natural language processing is an important direction in the field of computer science and artificial intelligence.
  • Natural language processing research can realize various theories and methods for effective communication between humans and computers using natural language.
  • natural language processing tasks mainly include tasks such as machine translation, automatic summary generation, opinion extraction, text classification, question answering, text semantic comparison, and speech recognition.
  • Computer vision is the science of how to make machines "see". More specifically, computer vision refers to using cameras and computers instead of human eyes to identify, track, and measure targets, and further performing graphics processing so that the processed images are more suitable for human observation or for transmission to instruments for detection. Generally speaking, computer vision tasks include image recognition (image classification), object detection, semantic segmentation, and image generation.
  • Image recognition is a common classification problem, also commonly known as image classification. Specifically, in the image recognition task, the input of the neural network is image data, and the output value is the probability that the current image data belongs to each category. Usually, the category with the largest probability value is selected as the predicted category of image data. Image recognition is one of the earliest tasks to successfully apply deep learning. Typical network models include VGG series, Inception series, ResNet series, etc.
  • Target detection refers to automatically detecting the approximate location of common objects in images through algorithms. Bounding boxes are usually used to represent the approximate locations of objects, and the category information of objects in the bounding boxes is classified.
  • Semantic segmentation refers to automatically segmenting and identifying the content in images through algorithms. Semantic segmentation can be understood as the classification problem of each pixel, that is, analyzing the category of the object that each pixel belongs to.
  • Image generation refers to obtaining high-fidelity generated images by learning the distribution of real images and sampling from the learned distribution. For example, a clear image is generated based on a blurred image; a dehazed image is generated based on a hazy image.
  • the self-attention network is used to obtain relevant information for one element from all other elements, it will lead to high computational consumption.
  • One possible way is to use criss-cross attention. Considering only the correlation between elements in a cross-shaped area reduces the complexity compared to calculating attention over all pixels.
  • however, the data dimensions in the row direction and the column direction to which elements are mapped in the different cross directions are generally different, and in some cases the difference is very large. For example, for audio data, video data, etc., the dimension difference may be more than 10 times.
  • the use of cross attention will lead to excessive attention to the axis with the larger dimension, causing the calculation of the smaller dimension to be suppressed by the calculation of the larger dimension.
  • the neural network and data processing method provided by the embodiments of this application independently calculate the correlation of elements on each coordinate axis, that is, attention is calculated for each element along the tensor direction of each coordinate axis separately. During the weighted superposition, this prevents excessive attention to the axis with the higher dimension, which would otherwise suppress the axis with the lower dimension. Therefore, the neural network and data processing method provided by the embodiments of the present application can improve calculation efficiency while improving processing accuracy.
  • the neural network may use a convolutional neural network to implement correlation calculation between elements.
  • the neural network provided by the embodiments of the present application adopts a self-attention mechanism to implement correlation calculation between elements. In this case, the neural network may also be called an attention network.
  • the input of the attention network is data in the form of a sequence, that is, the input data of the attention network is sequence data.
  • the input data of the attention network can be a sequence of sentences composed of multiple consecutive words; for another example, the input data of the attention network can be a sequence of image blocks composed of multiple consecutive image blocks. Continuous image blocks are obtained by segmenting a complete image.
  • Sequence data can be understood as encoded data, such as the encodings of multiple consecutive words.
  • the encoded data of each element is obtained by performing embedding, such as convolution processing. Elements can also be called patches.
  • Each element in the input data can correspond to multiple coordinate axes. The coordinate axis mentioned here can be in terms of time, space or other dimensions.
  • An element can have parameter values mapped to multiple axes.
  • Input data can also be called data to be processed.
  • the data to be processed can be multimedia data, such as audio data, video data or image data.
  • the data to be processed is audio data, and each element in the audio data can be understood as an audio point.
  • Each audio point can be mapped to the time coordinate axis or the frequency coordinate axis.
  • the encoded data of each audio point may include a time parameter mapped to the time coordinate axis and a frequency parameter mapped to the frequency coordinate axis.
  • the data to be processed is image data, and the elements of the image data can be understood as pixels or image blocks. Each pixel or image block can be mapped to a horizontal coordinate axis and a vertical coordinate axis.
  • the data to be processed includes video data, and the video data can be mapped to three coordinate axes, such as the time coordinate axis, the horizontal coordinate axis, and the vertical coordinate axis.
  • Video data includes multiple video frames, and each video frame includes multiple pixels or image blocks.
  • the encoded data of each pixel or image block can be mapped to the time coordinate axis, with the time parameter of the time coordinate axis.
  • the encoded data of each pixel or image block can be mapped to the horizontal coordinate axis and the vertical coordinate axis in space, with the horizontal coordinate of the horizontal coordinate axis and the vertical coordinate of the vertical coordinate axis.
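The per-axis mapping described above can be illustrated with a toy NumPy tensor; the dimensions here are hypothetical examples, not values from the embodiment:

```python
import numpy as np

# Hypothetical toy dimensions: 4 frames, 6x8 pixels, 16-dim encoding per element.
T, H, W, E = 4, 6, 8, 16
video = np.zeros((T, H, W, E))

# An element at (t, h, w) is mapped to three coordinate axes at once:
t, h, w = 2, 3, 5
element = video[t, h, w]          # its encoded data, shape (E,)
# time-axis neighbours: same spatial position across all frames
time_line = video[:, h, w]        # shape (T, E)
# horizontal-axis neighbours: same frame and row, all columns
row_line = video[t, h, :]         # shape (W, E)
# vertical-axis neighbours: same frame and column, all rows
col_line = video[t, :, w]         # shape (H, E)
```

Each of the three slices is the set of elements the method would attend over when computing that element's feature vector for the corresponding axis.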
  • FIG. 3 shows a schematic flow chart of a data processing method provided by an embodiment of the present application, using a neural network; the attention network is taken as an example.
  • the data processing method may be executed by a service device, such as a cloud service device.
  • the user device can send a service request to the cloud service device, and the service request carries data to be processed.
  • the service request is used to request the cloud server to complete a specified processing task for the data to be processed.
  • the designated processing tasks can be natural language processing tasks, such as machine translation, automatic summary generation, opinion extraction, text classification, question answering, text semantic comparison, or speech recognition.
  • the specified processing task can be a computer vision task, such as image recognition, target detection, semantic segmentation, and image generation.
  • the data processing method may be executed by a local device, such as a local electronic device.
  • the data to be processed can be generated by the electronic device itself.
  • the eigenvectors of the axes are weighted to obtain the eigenvectors of the data to be processed.
  • the attention network can be an attention network used to classify images, segment images, detect images, recognize images, or generate specified images; or it can be an attention network used to translate text, paraphrase text, or generate specified text; or it can be an attention network used to recognize speech, translate speech, or generate specified speech, etc.
  • the specified processing task can be further completed according to the feature vector to obtain the processing result, and the processing result can be sent to the user device.
  • for example, if the designated processing task is image classification, the attention network is an attention network used to classify images; after obtaining the feature vector, the image can be further classified according to the feature vector to obtain the classification result. If the designated processing task is image segmentation, the attention network is an attention network used to segment images; after obtaining the feature vector, the image can be further segmented based on the feature vector to obtain the segmentation result. If the designated processing task is image detection, the attention network is an attention network used to detect images; after obtaining the feature vector, image detection can be further performed based on the feature vector to obtain the detection result. For another example, if the designated processing task is speech recognition, the attention network is an attention network used to recognize speech.
  • speech recognition can be further performed based on the feature vector to obtain the recognition result.
  • the attention network is an attention network used to translate speech.
  • speech translation can be further performed based on the feature vector to obtain the translation result.
  • For example, take the case where the elements included in the input data are mapped to N coordinate axes, which are axis 1 to axis N respectively.
  • the attention is calculated separately for the elements on each axis, and then the weighted sum of the calculation results for the elements on each axis is used as the output of the independent superimposed attention network.
  • the weights of different axes can be similar, such as a simple average.
  • After the data is input, the attention network performs feature extraction on the input data to obtain, for each of the multiple elements, the feature vectors corresponding to axis 1 to axis N; the N groups of feature vectors corresponding to axis 1 to axis N for each element are then weighted to obtain the feature vector corresponding to that element.
  • the first element is any element among multiple elements.
  • the feature vector corresponding to the first element on the first coordinate axis is used to represent the correlation between the first element and the other elements in the first region where the first element is located; the first element is any element among the multiple elements; the positions of the other elements in the first region, when mapped to coordinate axes other than the first coordinate axis, are the same as and/or adjacent to the positions of the first element mapped to those other axes.
  • the eigenvector corresponding to the first coordinate axis of the first element can be determined in the following way:
  • attention values between the first element and the other elements in the first region are first calculated, the first element being any element among the plurality of elements; weighting processing is then performed based on these attention values to obtain the feature vector of the first element corresponding to the first coordinate axis.
  • the adjacent positions of two elements mapped to other coordinate axes can be absolutely adjacent, or the interval between the positions mapped to other coordinate axes of the two elements can be within a set distance.
  • the adjacent positions of two elements mapped to other coordinate axes can be absolutely adjacent.
  • axis 1 is the horizontal-direction coordinate axis (referred to as the horizontal coordinate axis for short), and axis 2 is the vertical-direction coordinate axis (referred to as the vertical coordinate axis for short).
  • each row in the horizontal direction includes 10 elements
  • each column in the vertical direction includes 5 elements.
  • the attention between element 3-6 and the other elements in the same row (elements 3-1 to 3-5 and 3-7 to 3-10) can be calculated separately, and weighting processing is then performed based on the attention calculation results to obtain the feature vector of element 3-6 corresponding to the horizontal coordinate axis.
  • similarly, the attention between element 3-6 and the other elements in the same column (elements 1-6, 2-6, 4-6, and 5-6) can be calculated separately.
  • axis 1 is a horizontal coordinate axis (which may be referred to as a horizontal coordinate axis)
  • axis 2 is a vertical coordinate axis (which may be referred to as a vertical coordinate axis).
  • the attention calculation can be performed on the elements of the same row and one or more rows adjacent to the same row.
  • column attention calculations can be performed on elements in the same column and one or more columns adjacent to the same column.
  • the attention network provided by the embodiments of the present application can also be called an independent superimposed attention network or a self-independent superimposed attention network, and other names can also be used, which are not specifically limited in the embodiments of the present application.
  • the following description takes what is called an independent stacked attention network as an example.
  • the independent superimposed attention network determines the query vector (Query, Q), key vector (Key, K), and value vector (Value, V) based on the input data, then calculates attention in the directions of axis 1 to axis N based on Q, K, and V to obtain the feature vector of each element corresponding to each of axis 1 to axis N, and finally performs a weighted sum of the feature vectors of each element corresponding to axis 1 to axis N to obtain the feature vector of each element.
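A minimal NumPy sketch of this axis-independent attention for 2-dimensional data (N = 2) follows. It assumes the per-axis calculation is scaled dot-product attention restricted to the element's own row or column, which is consistent with the description above but is an illustration, not the patent's implementation; all dimensions and names are made up:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def independent_axis_attention(x, wq, wk, wv, w1=0.5, w2=0.5):
    """Axis-independent attention for 2-D data: each element at (i, j)
    attends only to its own row (axis 1) and, separately, its own column
    (axis 2); the two per-axis feature vectors are then combined with
    weights w1 and w2.

    x: (m, n, d) encoded elements in m rows and n columns.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    dk = q.shape[-1]
    # axis 1 (rows): within each row i, attend over the n columns
    row_scores = np.einsum('ijd,ikd->ijk', q, k) / np.sqrt(dk)
    h1 = np.einsum('ijk,ikd->ijd', softmax(row_scores), v)
    # axis 2 (columns): within each column j, attend over the m rows
    col_scores = np.einsum('ijd,kjd->ijk', q, k) / np.sqrt(dk)
    h2 = np.einsum('ijk,kjd->ijd', softmax(col_scores), v)
    return w1 * h1 + w2 * h2      # weighted superposition per element

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 10, 16))  # 5 rows x 10 columns, 16-dim encodings
w = [rng.normal(size=(16, 16)) for _ in range(3)]
out = independent_axis_attention(x, *w)
print(out.shape)                  # (5, 10, 16)
```

Because each axis is normalized by its own softmax before the weighted sum, a long axis cannot dominate a short one, which is the stated motivation for the independent superposition.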
  • the independent superimposed attention network provided by the embodiments of the present application can adopt a single-head attention mechanism or a multi-head attention mechanism, which is not specifically limited in the embodiments of the present application.
  • the independent stacked attention network groups the dimensions of the input data according to the number of heads. In each group, attention is calculated using the method provided by the embodiment of the present application, and then the results of multiple groups are spliced.
  • q (i, j) represents the Query of the element at position (i, j)
  • k (i, j) represents the Key of the element at position (i, j)
  • v (i, j) represents the Value of the element at position (i, j).
  • the value range of i is 0 to m-1.
  • the value range of j is 0 to n-1.
  • the 2-dimensional input data includes m rows and n columns.
  • the feature vector corresponding to each element on axis 1 can be determined using the following formula (2-1).
  • d_k is the number of dimensions of the input data; the result of formula (2-1) represents the feature vector of the element at position (i, j) corresponding to axis 1.
  • the feature vector corresponding to the element at position (i, j) on axis 2 is as shown in formula (2-2).
  • m represents the number of elements that each column of the data to be processed includes in the direction of the vertical coordinate axis, i.e., the number of rows.
  • h′ (i, j) represents the feature vector of the element at position (i, j).
  • w 1 represents the weight of axis 1
  • w 2 represents the weight of axis 2.
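The bodies of formulas (2-1) and (2-2) do not survive in this text. A plausible reconstruction from the surrounding definitions, treating each as scaled dot-product attention along a row and a column respectively (the symbols h¹, h², and the unnumbered combination formula are assumptions), is:

```latex
h^{(1)}_{(i,j)} = \sum_{j'=0}^{n-1}
  \operatorname{softmax}_{j'}\!\left(
    \frac{q_{(i,j)} \cdot k_{(i,j')}}{\sqrt{d_k}}
  \right) v_{(i,j')}
\tag{2-1}

h^{(2)}_{(i,j)} = \sum_{i'=0}^{m-1}
  \operatorname{softmax}_{i'}\!\left(
    \frac{q_{(i,j)} \cdot k_{(i',j)}}{\sqrt{d_k}}
  \right) v_{(i',j)}
\tag{2-2}

h'_{(i,j)} = w_1\, h^{(1)}_{(i,j)} + w_2\, h^{(2)}_{(i,j)}
```

Here the softmax in (2-1) is taken over the n elements of row i, and in (2-2) over the m elements of column j, matching the per-axis normalization described in the surrounding text.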
  • encoding data corresponding to at least one setting element can be added to the network parameters of the independent overlay attention network.
  • the encoded data corresponding to the at least one setting element is a learnable embedding input of the independent superposition attention network; that is, it can participate in training as a network parameter, and the encoded data corresponding to the at least one setting element can be adjusted each time the network parameters are adjusted during training.
  • At least one setting element may include a classification bit and/or a distillation bit.
  • the encoded data corresponding to the classification bits can also be called a class token, and the encoded data corresponding to the distillation bits can also be called a distillation token.
  • the student model can be trained using the knowledge distillation (KD) training mode with a teacher model. The student model can be understood as a smaller model compressed from the teacher model. Interactive learning with the teacher model is performed by adding distillation bits, and the output is finally passed through the distillation loss.
  • the Class token and distillation token are learnable embedding vectors.
  • the class token and distillation token model the global relationships between elements by performing attention operations with the encoded data of each element included in the input data; they fuse the information of all elements and are finally connected to the classifier for category prediction.
  • Figure 7 for a schematic diagram of the processing flow of another independent superimposed attention network.
  • Figure 7 also takes as an example that the elements included in the input data can be mapped to N coordinate axes, which are axis 1 to axis N respectively.
  • N coordinate axes which are axis 1 to axis N respectively.
  • the encoded data of the classification bit and the distillation bit are each subjected to attention-weighting calculation with the encoded data of all other elements, and the results are then feature-fused with the weighted sum over the N axes.
  • Feature fusion can use connection functions to connect features, such as the concat function.
  • the above formula (1) can be used to determine Q, K and V based on the input data.
  • Q, K and V corresponding to the classification bits can be determined by the following formula (4).
  • q c represents the Query of the classification bit
  • k c represents the Key of the classification bit
  • v c represents the Value of the classification bit
  • h c represents the coded data of the classification bit.
  • the weight matrices used to calculate Q, K, and V for the classification bit are the same as those used for the elements in formula (1).
  • the feature vector corresponding to the classification bit is determined through the following formula (5).
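A hedged sketch of the classification-bit calculation described by formulas (4) and (5): the token's Query attends over the Keys of the token itself and of every element, and its feature vector is the attention-weighted sum of the Values. All names and dimensions here are illustrative, not the patent's notation:

```python
import numpy as np

def class_token_feature(hc, elements, wq, wk, wv):
    """hc: (d,) encoded data of the classification bit; elements: (L, d)
    encoded data of all elements; wq/wk/wv: (d, d) shared projection
    matrices (formulas (1) and (4) use the same matrices)."""
    qc = hc @ wq                                   # Query of the classification bit
    keys = np.vstack([(hc @ wk)[None, :], elements @ wk])
    vals = np.vstack([(hc @ wv)[None, :], elements @ wv])
    scores = keys @ qc / np.sqrt(qc.shape[-1])     # one score per position
    a = np.exp(scores - scores.max())
    a /= a.sum()                                   # softmax over all positions
    return a @ vals                                # globally fused feature

rng = np.random.default_rng(2)
hc = rng.normal(size=(16,))                        # classification-bit embedding
elements = rng.normal(size=(50, 16))               # 50 element encodings
wq, wk, wv = (rng.normal(size=(16, 16)) for _ in range(3))
feat = class_token_feature(hc, elements, wq, wk, wv)
print(feat.shape)                                  # (16,)
```

Because the token attends over every element in a single pass, its feature vector aggregates global information, which is why it can be fed directly to the classifier.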
  • the independent stacked attention network can perform fully connected processing before performing attention calculation, and perform dimensionality enhancement processing on the input data. After completing the attention calculation, further full connection processing, such as dimensionality reduction processing, can be performed.
  • the dimensions of the input data of the independent stacked attention network are the same as the dimensions of the output data.
  • the solution provided by the embodiments of this application is applicable in multi-axis scenarios.
  • the computational complexity of the independent superimposed attention network is 0.1% of that of the conventional attention network.
  • the computational complexity of the independent superimposed attention network is 1.1×10⁻¹⁸ that of the conventional attention network.
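A back-of-envelope illustration of where the savings come from (this is my own arithmetic on one example grid, not the patent's reported figures, which will depend on the input sizes considered): full self-attention over an m×n grid scores every pair of elements, while axis-independent attention only scores within rows and columns.

```python
m, n = 99, 12                     # e.g. the audio patch grid from Scenario 1
full_pairs = (m * n) ** 2         # (m*n)^2 pairwise attention scores
axis_pairs = m * n * (m + n)      # each element scores n row + m column peers
ratio = axis_pairs / full_pairs
print(round(ratio, 3))            # fraction of the pairwise work remaining
```

The ratio shrinks as the grid grows, since it scales like (m+n)/(m·n), so larger inputs (and more axes) yield larger savings.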
  • the independent superimposed attention network provided by the embodiments of this application can be applied to the transformer module to process data, such as image classification, segmentation, and target positioning; video action classification, time positioning, and spatiotemporal positioning; audio and Music classification, sound source separation, etc.
  • FIG. 8 is a schematic structural diagram of a transformer module illustrated in an embodiment of the present application.
  • the transformer module may include the independent superimposed attention network provided by the embodiment of the present application, a linear layer, and a multi-layer perceptron.
  • Independent stacked attention networks are used to extract features from input data.
  • The linear layer can be a layer normalization (LN) layer.
  • LN is used to normalize the output of independent stacked attention networks.
  • Multilayer perceptron (MLP) is serially connected with independent superimposed attention network.
  • a multilayer perceptron can include multiple serial fully connected layers. Specifically, the multi-layer perceptron can also be called a fully connected neural network.
  • a multi-layer perceptron includes an input layer, a hidden layer and an output layer. The number of hidden layers can be one or more.
  • the network layers in the multi-layer perceptron are all fully connected layers. That is, the input layer and hidden layer of the multi-layer perceptron are fully connected, and the hidden layer and output layer of the multi-layer perceptron are also fully connected.
  • the fully connected layer means that each neuron in the fully connected layer is connected to all the neurons in the previous layer, which is used to synthesize the features extracted from the previous layer.
  • the transformer module may also include another linear layer for performing layer normalization; calculating normalized statistics through layer normalization can reduce calculation time. This layer is located at the input end of the independent stacked attention network (see Figure 9) and is used to first perform layer normalization on the data input to the transformer module, which reduces training costs.
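The block structure just described (LN at the input of the attention network, LN again before the MLP) can be sketched as follows; residual connections are assumed, as is conventional for transformer modules, and `attention`/`mlp` are stand-ins for the sub-networks:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each feature vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, attention, mlp):
    """Pre-norm transformer block: LN before the attention network and
    again before the MLP, with residual additions around each."""
    x = x + attention(layer_norm(x))
    return x + mlp(layer_norm(x))

rng = np.random.default_rng(4)
x = rng.normal(size=(10, 16))
attn_fn = lambda z: z * 0.1       # placeholder for the attention network
mlp_fn = lambda z: z * 0.1        # placeholder for the multi-layer perceptron
y = transformer_block(x, attn_fn, mlp_fn)
print(y.shape)                    # (10, 16)
```

The output keeps the input's dimensions, matching the statement above that the module's input and output dimensions are the same.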
  • Scenario 1 Take audio classification as an example.
  • FIG 10 is a schematic structural diagram of a classification network model.
  • the classification network model includes an embedding generation module, M1 transformer modules and a classification module.
  • M1 transformer modules can be deployed in series.
  • the transformer module adopts the structure shown in Figure 9.
  • the embedding generation module is used to extract local features from the input audio data, which can also be understood as generating the encoded data of the audio data. Audio data can be mapped to the time coordinate axis as well as the frequency coordinate axis, and can be divided into multiple audio points, for example into T*F audio points (patches), where T represents the time dimension and F represents the frequency dimension. For example, the input data is 10 s of audio at 32000 Hz.
  • the frequency dimension is 128 and the time dimension is 1000.
  • the number of patches into which audio data is divided is: 99 (time) * 12 (frequency).
  • each row includes 99 audio points and each column includes 12 audio points.
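The 99×12 patch grid can be reproduced with simple stride arithmetic. The patch size and stride below are illustrative values chosen to be consistent with the counts above (the text only says the stride is "about 10" and does not give the patch size):

```python
# A 10 s clip at 32000 Hz becomes a spectrogram of 1000 time steps x 128
# frequency bins; sliding a window over it with a stride of about 10 in
# each direction yields the 99 (time) x 12 (frequency) patch grid.
time_steps, freq_bins = 1000, 128
patch, stride = 16, 10            # illustrative values, not from the source
t_patches = (time_steps - patch) // stride + 1
f_patches = (freq_bins - patch) // stride + 1
print(t_patches, f_patches)       # 99 12
```

Any patch/stride pair satisfying the same sliding-window formula would produce the stated grid; the exact values depend on padding choices the text does not specify.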
  • the feature dimension of the local features extracted by the embedding generation module from the input audio data is represented by E1.
  • the independent stacked attention network in the transformer module can adopt a multi-head attention mechanism.
  • the classification module after averaging the results of classification bits and distillation bits, the predicted value of each class is obtained through the linear layer.
  • the embedding generation module includes a convolution layer, which is used to perform convolution processing on the input audio data (time spectrum) to generate an embedding representation, and output a time-frequency vector of (T ⁇ F, E1).
  • E1 represents the feature dimension.
  • the feature dimension of each patch is E1.
  • the embedding generation module can use a two-dimensional convolution with a larger convolution stride (for example, a stride of about 10), so that each generated time-frequency vector represents the local information of a patch with feature dimension E1.
  • the purpose is audio classification, and the classification bit and the distillation bit can be combined, for example, through the concat function.
  • position encoding can be added to help learn position information.
  • the method of position encoding is not specifically limited in the embodiment of this application.
  • the embedding vector output by the embedding generation module is input into the backbone network part formed by the series of Transformer blocks.
  • the embedding vector is input to the transformer module.
  • the layer-normalized data is input to the independent stacked attention network.
  • the independent superimposed attention network can perform dimension-raising processing on the layer-normalized data (for example, E1-dimensional data is raised to 3*E1-dimensional data), and further generate the Q, K, and V corresponding to each patch, the classification bit, and the distillation bit respectively.
  • the independent stacked attention network further performs multi-head splitting.
  • the independent stacked attention network performs attention weighting of the classification bit and the distillation bit with all other patches to obtain the feature vectors of the classification bit and the distillation bit, and performs row attention weighting calculation on the time coordinate axis and column attention weighting calculation on the frequency coordinate axis. The feature vector obtained by weighting the results of the row attention calculation and the column attention calculation is then connected with the feature vectors of the classification bit and the distillation bit. For example, the weights corresponding to the time coordinate axis and the frequency coordinate axis are the same, both 0.5.
  • the independent superimposed attention network then performs dimensionality reduction processing on the connected feature vectors, reducing the 3*E1-dimensional data to E1-dimensional data. Further, after being processed by the LN layer and MLP layer in the transformer module, the data is input to the classification module. In the classification module, after averaging the results of the classification bit and the distillation bit, the predicted value of each class is obtained through the linear layer.
  • the classification network model shown in Figure 10 is used to classify the following audio data data sets 1) and 2) respectively.
  • Audioset includes 632 audio event classes and a collection of over 2 million manually labeled 10-second sound clips extracted from videos. The categories cover a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.
  • the comparison results of classification accuracy, time, and system performance requirements between the solution provided by this application and the existing solution are shown in Table 2 and Table 3.
  • System performance is expressed, for example, in terms of the required number of floating-point operations (FLOPs).
  • Classification accuracy is expressed as Mean Average Precision (mAP) as an example.
  • Scenario 2 Take end-to-end image segmentation as an example. See Figure 12 for a schematic workflow diagram of an image segmentation network model.
  • the image segmentation network model includes an embedding generation module, M2 transformer modules, and a pixel reconstruction module.
  • M2 transformer modules can be deployed in series.
  • the transformer module adopts the structure shown in Figure 9.
  • the embedding generation module is used to extract local features from the input image data, which can also be understood as generating the encoded data of the image data.
  • Image data can be mapped to horizontal as well as vertical axes.
  • Image data can be divided into multiple image blocks. For example, it is divided into H*W image blocks (patches).
  • the feature dimension of the local features extracted by the embedding generation module from the input image data is represented by E2.
  • the independent stacked attention network in the transformer module can adopt a multi-head attention mechanism.
  • the intensity value of pixels of each image block is restored.
  • the embedding generation module includes a convolution layer, which is used to perform convolution processing on the input image data to generate an embedding representation, and outputs an image vector of (H*W, E2).
  • E2 represents the feature dimension.
  • the feature dimension of each element is E2.
  • position encoding of (H*W, E2) can be added to help learn position information.
  • the method of position encoding is not specifically limited in the embodiment of this application.
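A convolution whose stride equals the patch size is equivalent to a linear projection of flattened, non-overlapping patches; the sketch below uses that equivalence to produce the (H*W, E2) tokens and add a position encoding. The image size, patch size, E2 and the random projection are all assumed for illustration:

```python
import numpy as np

# Hypothetical sizes: a 32x32 single-channel image, 8x8 patches,
# so H*W = 4*4 = 16 tokens, each projected to E2 = 16 features.
P, E2 = 8, 16
rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
W_proj = rng.standard_normal((P * P, E2)) * 0.02  # stands in for the conv kernel

H, W = img.shape[0] // P, img.shape[1] // P
patches = img.reshape(H, P, W, P).swapaxes(1, 2).reshape(H * W, P * P)
tokens = patches @ W_proj                          # (H*W, E2) embedding vector

# Add a (H*W, E2) position encoding so the network can learn position info
pos = rng.standard_normal((H * W, E2)) * 0.02
tokens = tokens + pos
```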
  • the embedding vector output by the embedding generation module is input into the backbone network part formed by the series of Transformer blocks.
  • the embedding vector is input to the transformer module.
  • the layer-normalized data is input to the independent stacked attention network.
  • the independent superimposed attention network can perform dimensionality-raising processing on the layer-normalized data (for example, E2-dimensional data is raised to 3*E2-dimensional data), and further generate Q, K, and V corresponding to each patch.
  • the independent stacked attention network further performs multi-head splitting, and then performs row attention weighting calculations on the horizontal axis and column attention weighting calculations on the vertical axis.
  • the result of row attention weighting calculation and the result of column attention weighting calculation are weighted to obtain the feature vector of the image data.
  • the weights corresponding to the horizontal coordinate axis and the vertical coordinate axis are the same, both are 0.5.
  • the independent superimposed attention network can also perform dimensionality reduction processing on the obtained feature vectors of the image data, reducing the 3*E2-dimensional data to E2-dimensional data.
  • after being processed by the LN layer and MLP layer in the transformer module, the data is input to the pixel reconstruction module. In the pixel reconstruction module, the pixel intensity value of each image block is restored through layer normalization and fully connected layer processing.
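The benefit of splitting attention into row and column passes can be seen from a rough operation count: full self-attention over the H*W patch tokens costs on the order of (H*W)^2 * E multiply-adds for the score and value products, while the axial form costs H*W*(H+W)*E. A small sketch counting only those two matrix products (the grid sizes are illustrative):

```python
def full_attention_ops(n, e):
    # Q @ K^T and attn @ V over all n tokens: two n x n x e products
    return 2 * n * n * e

def axial_attention_ops(h, w, e):
    # row attention: h sequences of length w; column attention: w of length h
    return 2 * h * w * (h + w) * e

H, W, E = 64, 64, 128
ratio = full_attention_ops(H * W, E) / axial_attention_ops(H, W, E)
# the saving factor simplifies to n / (h + w) = 4096 / 128 = 32 for this grid
```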
  • Scenario 3: take video action classification as an example.
  • FIG 13 is a schematic workflow diagram of a video classification network model.
  • the classification network model includes an embedding generation module, M3 transformer modules and a classification module.
  • M3 transformer modules can be deployed in series.
  • the transformer module adopts the structure shown in Figure 9.
  • the embedding generation module is used to extract local features from the input video data, which can also be understood as generating the encoded data of the video data.
  • Video data can be mapped to the time axis, horizontal axis, and vertical axis.
  • Video data can be divided into image blocks. For example, it is divided into H*W*T image blocks (patches), where T represents the time coordinate axis dimension, H represents the horizontal coordinate axis dimension, and W represents the vertical coordinate axis dimension.
  • the embedding generation module includes a three-dimensional convolution layer, which is used to perform convolution processing on the input video data to generate an embedding representation, and output a video vector of (H*W*T, E3).
  • E3 represents the feature dimension.
  • the feature dimension of each patch is E3.
  • since the purpose is to classify video actions, a classification bit can be added.
  • the data of the classification bits can be connected with the video vector of (H*W*T, E3) through the concat function.
  • position encoding can be added to help learn position information.
  • the method of position encoding is not specifically limited in the embodiment of this application. After adding classification bits and superimposing position coding, the embedding generation module outputs an embedding vector with dimensions (H*W*T+1, E3).
  • the embedding vector output by the embedding generation module is input into the backbone network part formed by the series of Transformer blocks.
  • the embedding vector is input to the transformer module.
  • the layer-normalized data is input to the independent stacked attention network.
  • the independent superimposed attention network can perform dimensionality-raising processing on the layer-normalized data (for example, E3-dimensional data is raised to 3*E3-dimensional data), and further generates Q, K, and V corresponding to each patch and the classification bit. Taking the independent superimposed attention network using the multi-head attention mechanism as an example, it further performs multi-head splitting.
  • the independent superimposed attention network performs attention weighting between the classification bit and the other patches to obtain the feature vector of the classification bit, and performs the attention weighting calculation on the time axis, the row attention weighting calculation on the horizontal axis, and the column attention weighting calculation on the vertical axis. The feature vector obtained by weighting the results of the time-axis, row and column attention weighting calculations is concatenated with the feature vector of the classification bit. The independent superimposed attention network then performs dimensionality reduction processing on the concatenated feature vectors, reducing the 3*E3-dimensional data to E3-dimensional data. Further, after being processed by the LN layer and MLP layer in the transformer module, the data is input to the classification module. In the classification module, the classification information corresponding to the classification bit is obtained through the linear layer, and then processed through the fully connected layer to obtain the action classification prediction distribution.
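Generalizing the audio and image cases, the video scenario runs three independent axis attentions over a (T, H, W, E3) token grid and combines them. The sketch below uses equal 1/3 weights and the Q = K = V simplification as illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend_axis(x):
    # self-attention along the second-to-last axis, with Q = K = V = x
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

# Hypothetical sizes: T frames, an H x W patch grid, feature dim E3
T, H, W, E3 = 2, 3, 4, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((T, H, W, E3))

outs = []
for axis in range(3):                   # time, horizontal, vertical axes
    moved = np.moveaxis(x, axis, 2)     # bring the attended axis next to E3
    outs.append(np.moveaxis(attend_axis(moved), 2, axis))

fused = sum(outs) / 3.0                 # equal weighting of the three axes
```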
  • FIG. 14 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • the data processing device includes an input unit 1410 for receiving data to be processed, the data to be processed including encoded data of a plurality of elements.
  • the processing unit 1420 is configured to perform feature extraction on the data to be processed through a neural network to obtain, for each of the multiple elements, feature vectors corresponding to multiple coordinate axes, and to perform weighting processing on the feature vectors of each of the multiple elements corresponding to the multiple coordinate axes, to obtain the feature vector of the data to be processed.
  • the feature vector corresponding to the first element on the first coordinate axis is used to represent the correlation between the first element and other elements in the first region where the first element is located.
  • the first element is any element among the plurality of elements; the positions to which the other elements in the first area where the first element is located are mapped on coordinate axes other than the first coordinate axis are the same as and/or adjacent to the positions to which the first element is mapped on those other coordinate axes; the first coordinate axis is any coordinate axis among the plurality of coordinate axes.
  • the processing unit 1420 is specifically configured to: perform attention calculation between the first element and the other elements in the first area corresponding to the first element, to obtain the attention values between the first element and the other elements, where the first element is any element among the plurality of elements; and perform weighting processing according to the attention values between the first element and the other elements, to obtain the feature vector corresponding to the first element on the first coordinate axis.
  • the data to be processed includes audio data
  • the audio data includes multiple audio points
  • each audio point is mapped to a time coordinate axis and a frequency coordinate axis;
  • the data to be processed includes image data, the image data includes a plurality of pixel points or image blocks, each pixel point or image block is mapped to a horizontal coordinate axis and a vertical coordinate axis; or,
  • the data to be processed includes video data.
  • the video data includes multiple video frames.
  • Each video frame includes multiple pixel points or image blocks.
  • Each pixel point or image block is mapped to a time coordinate axis and, in space, a horizontal coordinate axis and a vertical coordinate axis.
  • the number of coordinate axes is equal to N; the neural network includes a linear module 1421, N attention calculation modules 1422 and a weighting module 1423.
  • the linear module 1421 is used to generate the first query vector, the first key value vector and the first value vector based on the data to be processed; the i-th attention calculation module 1422 is used to obtain, according to the first query vector, the first key value vector and the first value vector, a feature vector corresponding to each element of the plurality of elements on the i-th coordinate axis, where i is a positive integer less than or equal to N; the weighting module 1423 is used to weight the feature vectors of each of the plurality of elements corresponding to the N coordinate axes.
  • the neural network also includes an (N+1)-th attention calculation module 1424 and a feature fusion module 1425;
  • the linear module 1421 is further configured to generate a second query vector, a second key value vector and a second value vector based on the encoded data of at least one setting element;
  • the (N+1)-th attention calculation module 1424 is used to obtain, according to the second query vector, the second key value vector and the second value vector, the feature vector corresponding to the at least one setting element; the feature vector corresponding to the setting element is used to represent the correlation between the setting element and the multiple elements;
  • the feature fusion module 1425 is configured to perform feature fusion on the feature vector of the data to be processed and the feature vector of the at least one setting element.
  • the encoded data of the at least one setting element is obtained as a network parameter of the neural network through multiple rounds of adjustments during the process of training the neural network.
  • Figure 15 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
  • the execution device can be embodied as a mobile phone, tablet, notebook computer, smart wearable device, server, etc., which is not limited here.
  • the embodiment of the present application also provides another structure of the device.
  • the execution device 1500 may include a communication interface 1510 and a processor 1520.
  • the execution device 1500 may also include a memory 1530.
  • the memory 1530 may be provided inside the device or outside the device.
  • each of the units shown in FIG. 14 can be implemented by the processor 1520.
  • the function of the input unit is implemented by the communication interface 1510.
  • the functions of the processing unit 1420 are implemented by the processor 1520.
  • the processor 1520 receives the data to be processed through the communication interface 1510, and is used to implement the methods described in Figures 3, 6-13. During the implementation process, each step of the processing flow can complete the method described in FIG. 3 and FIG. 6 to FIG. 13 through the integrated logic circuit of hardware in the processor 1520 or instructions in the form of software.
  • the communication interface 1510 may be a circuit, bus, transceiver, or any other device that can be used for information exchange.
  • the other device may be a device connected to the execution device 1500.
  • the processor 1520 may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps and logical block diagrams disclosed in the embodiments of this application.
  • a general-purpose processor may be a microprocessor or any conventional processor, etc.
  • the steps of the methods disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware processor for execution, or can be executed by a combination of hardware and software units in the processor.
  • the program code executed by the processor 1520 to implement the above method may be stored in the memory 1530. Memory 1530 and processor 1520 are coupled.
  • the coupling in the embodiment of this application is an indirect coupling or communication connection between devices, units or modules, which may be in electrical, mechanical or other forms, and is used for information interaction between devices, units or modules.
  • the processor 1520 may cooperate with the memory 1530.
  • the memory 1530 may be a non-volatile memory, such as a hard disk drive (HDD) or solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM).
  • The memory 1530 may also be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • connection medium between the communication interface 1510, the processor 1520 and the memory 1530 is not limited in the embodiment of the present application.
  • the memory 1530, the processor 1520 and the communication interface 1510 are connected through a bus in Figure 15.
  • the bus is represented by a thick line in Figure 15.
  • the connection methods between other components are only schematically illustrated and are not limited thereto.
  • the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one thick line is used in Figure 15, but it does not mean that there is only one bus or one type of bus.
  • embodiments of the present application also provide a computer storage medium, which stores a software program.
  • when executed, the software program can implement the methods provided by any one or more of the above embodiments.
  • the computer storage media may include various media that can store program code, such as a USB flash drive, removable hard disk, read-only memory, random-access memory, magnetic disk, or optical disk.
  • embodiments of the present application also provide a chip, which includes a processor and is used to implement the functions involved in any one or more of the above embodiments, such as obtaining or processing the information or data involved in the above methods.
  • the chip further includes a memory, and the memory is used for necessary program instructions and data executed by the processor.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • Figure 16 is a structural schematic diagram of a chip provided by an embodiment of the present application.
  • the chip can be represented as a neural network processor NPU 1600.
  • the NPU 1600 serves as a co-processor and is mounted on the main CPU (Host CPU), which allocates tasks to it.
  • the core part of the NPU is the arithmetic circuit 1603.
  • the arithmetic circuit 1603 is controlled by the controller 1604 to extract the matrix data in the memory and perform multiplication operations.
  • the computing circuit 1603 includes multiple processing units (Process Engine, PE).
  • arithmetic circuit 1603 is a two-dimensional systolic array.
  • the arithmetic circuit 1603 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • arithmetic circuit 1603 is a general-purpose matrix processor.
  • the arithmetic circuit obtains the corresponding data of matrix B from the weight memory 1602 and caches it on each PE in the arithmetic circuit.
  • the operation circuit takes matrix A data from the input memory 1601 and performs matrix operations with matrix B, and the partial or final result of the matrix is stored in an accumulator 1608.
  • the unified memory 1606 is used to store input data and output data.
  • the weight data is transferred directly to the weight memory 1602 through the direct memory access controller (DMAC) 1605.
  • Input data is also transferred to unified memory 1606 via DMAC.
  • the bus interface unit (BIU) 1610 is used for the interaction between the AXI bus, the DMAC and the instruction fetch buffer (IFB) 1609.
  • the bus interface unit 1610 (BIU) is used for the instruction fetch buffer 1609 to obtain instructions from the external memory, and is also used for the storage unit access controller 1605 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1606 or the weight data to the weight memory 1602 or the input data to the input memory 1601 .
  • the vector calculation unit 1607 includes multiple arithmetic processing units, and if necessary, further processes the output of the arithmetic circuit 1603, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc.
  • vector calculation unit 1607 can store the processed output vectors to unified memory 1606 .
  • the vector calculation unit 1607 can apply a linear or nonlinear function to the output of the operation circuit 1603, such as performing linear interpolation on the feature planes extracted by the convolution layers, or accumulating vectors of values to generate activation values.
  • vector calculation unit 1607 generates normalized values, pixel-wise summed values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 1603, such as for use in a subsequent layer in a neural network.
  • the instruction fetch buffer 1609 connected to the controller 1604 is used to store instructions used by the controller 1604; the unified memory 1606, the input memory 1601, the weight memory 1602 and the instruction fetch memory 1609 are all On-Chip memories. External memory is private to the NPU hardware architecture.
  • the processor mentioned in any of the above places can be a general central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the above programs.
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units.
  • the physical unit can be located in one place, or it can be distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between modules indicates that there are communication connections between them, which can be specifically implemented as one or more communication buses or signal lines.
  • embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, etc.) having computer-usable program code embodied therein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

A data processing method and apparatus, which relate to the technical field of artificial intelligence and are used for reducing computational cost. In the method, when attention calculation is executed, it is executed among elements of the same coordinate axis rather than between one element and all the other elements, and weighting processing is executed after the attention calculation has been performed on each coordinate axis separately. For example, for a certain element, elements in the same row, or elements in the same row and in adjacent rows, participate in the calculation of the feature vector of that element. By decomposing the attention calculation in this way, not only can global modeling be realized, but the calculation complexity can also be reduced.

Description

A data processing method and device

Cross-reference to related applications

This application claims priority to the Chinese patent application filed with the Intellectual Property Office of the People's Republic of China on May 24, 2022, with application number 202210569598.0 and the invention title "A data processing method and device", the entire content of which is incorporated by reference in this application.

Technical field

This application relates to the field of artificial intelligence technology, and in particular to a data processing method and device.

Background
In recent years, self-attention networks have been widely applied in many natural language processing (NLP) tasks, such as machine translation, sentiment analysis and question answering. With their widespread application, self-attention networks originating from the field of natural language processing have also achieved high performance in tasks such as image classification, target detection and image processing.

The key to a self-attention network is to learn an alignment in which each element in the sequence learns to gather information from the other elements in the sequence. A self-attention network differs from a general attention network in that it pays more attention to the internal correlation of data or features and reduces the dependence on external information. However, the currently used self-attention network obtains, for one element, relevant information from all other elements, resulting in high computational consumption.
Summary of the invention

Embodiments of the present application provide a data processing method and device to solve the problem of high computational consumption caused by calculating the correlation of one element with all other elements when using the current self-attention network.
In a first aspect, embodiments of the present application provide a data processing method, including: receiving data to be processed, where the data to be processed includes encoded data of multiple elements; performing feature extraction on the data to be processed through a neural network to obtain, for each of the multiple elements, feature vectors corresponding to multiple coordinate axes; and performing weighting processing on the feature vectors of each of the multiple elements corresponding to the multiple coordinate axes, to obtain the feature vector of the data to be processed.

Illustratively, the neural network is a self-attention network. With the above solution, when obtaining the feature vectors, attention calculation is no longer performed between one element and all other elements; instead, attention calculation is performed among elements on the same coordinate axis, the calculation is performed separately for all coordinate axes, and weighting processing is then performed, which can reduce computational consumption.
Illustratively, this application can be applied to computer vision or natural language processing, for example machine translation, automatic summary generation, opinion extraction, text classification, question answering, text semantic comparison, speech recognition, image classification, object detection, semantic segmentation and image generation. The neural network may be a neural network used to classify, segment, detect or recognize images, or to generate a specified image; or a neural network used to translate text, paraphrase text or generate specified text; or a neural network used to recognize speech, translate speech or generate specified speech, etc.

The data to be processed can be audio data, video data, image data, text data, etc.
In one possible design, receiving the data to be processed includes receiving a service request from the user equipment, where the service request carries the data to be processed. The service request is used to request completion of a specified processing task for the data to be processed.

The method further includes: completing the specified processing task according to the feature vector of the data to be processed to obtain a processing result, and sending the processing result to the user equipment.

For example, if the specified processing task is image classification, the attention network is an attention network used to classify images, and after the feature vector is obtained, image classification can be further performed according to the feature vector to obtain the classification result. Likewise, if the specified processing task is image segmentation, image detection, speech recognition or speech translation, the attention network is an attention network used for the corresponding task, and after the feature vector is obtained, the segmentation, detection, recognition or translation result can be further obtained according to the feature vector.
In one example, the feature vectors of an element corresponding to different coordinate axes are orthogonal to each other.

In one possible design, the feature vector of the first element corresponding to the first coordinate axis is used to represent the correlation between the first element and the other elements in the first region where the first element is located; the first element is any element among the multiple elements; the positions to which the other elements in the first region are mapped on coordinate axes other than the first coordinate axis are the same as and/or adjacent to the positions to which the first element is mapped on those other coordinate axes; the first coordinate axis is any coordinate axis among the multiple coordinate axes.

For example, if the first coordinate axis is the horizontal coordinate axis, then for an element on the horizontal coordinate axis, the elements in the same row, or in the same row and adjacent rows, participate in the calculation of that element's feature vector. Since all elements can participate in the calculation, global modeling can be achieved and the computational complexity can be reduced.

In one possible design, obtaining the feature vectors of each of the multiple elements corresponding to the multiple coordinate axes includes: performing attention calculation between the first element and the other elements in the first region corresponding to the first element, to obtain the attention values between the first element and the other elements, where the first element is any element among the multiple elements; and performing weighting processing according to the attention values between the first element and the other elements, to obtain the feature vector of the first element corresponding to the first coordinate axis.

The positions of two elements mapped to other coordinate axes being adjacent may mean that they are strictly adjacent, or that the interval between the positions of the two elements mapped to the other coordinate axes is within a set distance.
In one possible design, the data to be processed includes audio data, the audio data comprising multiple audio points, each audio point mapping to a time axis and a frequency axis; or,
the data to be processed includes image data, the image data comprising multiple pixels or image patches, each pixel or image patch mapping to a horizontal axis and a vertical axis; or,
the data to be processed includes video data, the video data comprising multiple video frames, each frame comprising multiple pixels or image patches, each pixel or image patch mapping to a time axis and to horizontal and vertical spatial axes.
For multimedia data, the number of elements along different axes may differ, sometimes substantially. With the above scheme, elements on axes with more positions do not receive disproportionate attention; instead, attention is computed separately for each axis, preventing axes of larger dimension from suppressing smaller ones.
In one possible design, performing feature extraction on the data to be processed through the neural network to obtain the feature vectors of each of the plurality of elements on the multiple coordinate axes includes: the multiple coordinate axes include a first coordinate axis and a second coordinate axis; generating, through the neural network, a first query vector, a first key vector, and a first value vector based on the data to be processed; obtaining, based on the first query vector, the first key vector, and the first value vector, the feature vector of each of the plurality of elements on the first coordinate axis; and obtaining, based on the first query vector, the first key vector, and the first value vector, the feature vector of each of the plurality of elements on the second coordinate axis.
Because the same query, key, and value vectors are used when computing the feature vectors for different axes, the number of parameters to compute is reduced, further lowering the computational complexity.
In one possible design, with N = 2, the first query vector, first key vector, and first value vector can be generated from the data to be processed using the following formulas:

q_(i,j) = W_Q · h_(i,j),  k_(i,j) = W_K · h_(i,j),  v_(i,j) = W_V · h_(i,j)

where q_(i,j) denotes the Query of the element at position (i, j), k_(i,j) denotes the Key of the element at position (i, j), and v_(i,j) denotes the Value of the element at position (i, j). The value of i ranges from 0 to m−1, and the value of j ranges from 0 to n−1.
The two-dimensional data to be processed comprises m rows and n columns.
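As an illustrative sketch (the array shapes and NumPy usage are our own assumptions, not part of the application), the per-position Query/Key/Value projections above can be computed by applying the same three matrices to every element of an m×n grid of encoded vectors:

```python
import numpy as np

def make_qkv(h, W_Q, W_K, W_V):
    """Apply q_(i,j) = W_Q h_(i,j), k_(i,j) = W_K h_(i,j),
    v_(i,j) = W_V h_(i,j) at every position of an (m, n, d) grid."""
    # einsum applies the same (d_k, d) projection at every grid position
    q = np.einsum('kd,mnd->mnk', W_Q, h)
    k = np.einsum('kd,mnd->mnk', W_K, h)
    v = np.einsum('kd,mnd->mnk', W_V, h)
    return q, k, v

m, n, d, d_k = 4, 5, 8, 8
rng = np.random.default_rng(0)
h = rng.normal(size=(m, n, d))            # encoded data, m rows x n columns
W_Q, W_K, W_V = (rng.normal(size=(d_k, d)) for _ in range(3))
q, k, v = make_qkv(h, W_Q, W_K, W_V)
```

Note that a single set of projection matrices serves all axes, consistent with the parameter saving described above.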
In one possible design, the feature vector of each element on the first coordinate axis can be determined using the following formula:

h1_(i,j) = Σ_{j'=0…n−1} softmax_{j'}( q_(i,j) · k_(i,j') / √d_k ) v_(i,j')

where d_k is the number of dimensions of the input data, and h1_(i,j) denotes the feature vector of the element at position (i, j) corresponding to axis 1.
In one possible design, the feature vector of each element on the second coordinate axis can be determined using the following formula:

h2_(i,j) = Σ_{i'=0…m−1} softmax_{i'}( q_(i,j) · k_(i',j) / √d_k ) v_(i',j)

The feature vectors of the elements on the first coordinate axis and the feature vectors of the elements on the second coordinate axis are then weighted, as determined by the following formula:

h'_(i,j) = w1 · h1_(i,j) + w2 · h2_(i,j)

where h'_(i,j) denotes the feature vector of the element at position (i, j), w1 is the weight of axis 1, and w2 is the weight of axis 2.
In one possible design, w1 = w2 = 1/2.
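A hedged sketch of the two-axis attention and the weighted fusion above, assuming axis 1 mixes elements within a row and axis 2 mixes elements within a column (function names and shapes are illustrative, not from the application):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def axial_attention(q, k, v, w1=0.5, w2=0.5):
    """Axis-1 attention mixes positions within each row, axis-2 within
    each column; outputs are fused as h' = w1*h1 + w2*h2."""
    d_k = q.shape[-1]
    # axis 1 (rows): for each (i, j), scores over the row positions j'
    s1 = np.einsum('ijd,ipd->ijp', q, k) / np.sqrt(d_k)
    h1 = np.einsum('ijp,ipd->ijd', softmax(s1), v)
    # axis 2 (columns): for each (i, j), scores over the column positions i'
    s2 = np.einsum('ijd,pjd->ijp', q, k) / np.sqrt(d_k)
    h2 = np.einsum('ijp,pjd->ijd', softmax(s2), v)
    return w1 * h1 + w2 * h2

m, n, d = 3, 4, 2
rng = np.random.default_rng(1)
v = rng.normal(size=(m, n, d))
zeros = np.zeros((m, n, d))          # zero q, k -> uniform attention weights
out = axial_attention(zeros, zeros, v)
```

With zero queries and keys the attention is uniform, so each output reduces to half the row mean plus half the column mean of v, which makes the fusion easy to sanity-check.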
In one possible design, the method further includes: generating a second query vector, a second key vector, and a second value vector based on the encoded data of at least one set element; obtaining, based on the second query vector, the second key vector, and the second value vector, a feature vector corresponding to the at least one set element, where the feature vector of a set element characterizes the degree of association between the set element and the plurality of elements; and performing feature fusion on the feature vector of the data to be processed and the feature vector of the at least one set element.
The encoded data of the at least one set element may include a classification bit and/or a distillation bit, so the scheme is equally applicable to classification scenarios.
In one possible design, the encoded data of the at least one set element is obtained through multiple rounds of adjustment as a network parameter of the neural network during training.
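To illustrate the set-element design, the hypothetical sketch below lets a single learned token (e.g., a classification bit) attend over all element features to produce its own feature vector for later fusion; all names and shapes are assumptions for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def token_attention(token_emb, elem_feats, W_Q, W_K, W_V):
    """A learned set element attends over all element features; its
    output summarizes its association with the whole input."""
    q = W_Q @ token_emb                      # (d_k,) query of the set element
    K = elem_feats @ W_K.T                   # (num_elems, d_k) keys
    V = elem_feats @ W_V.T                   # (num_elems, d_k) values
    attn = softmax(K @ q / np.sqrt(len(q)))  # relevance to every element
    return attn @ V

d, num = 8, 20
rng = np.random.default_rng(2)
token_emb = rng.normal(size=d)               # adjusted during training, like a weight
elems = rng.normal(size=(num, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
token_feat = token_attention(token_emb, elems, W_Q, W_K, W_V)
```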
In a second aspect, an embodiment of this application provides a data processing apparatus, including:
an input unit, configured to receive data to be processed, where the data to be processed includes encoded data of multiple elements;
a processing unit, configured to perform feature extraction on the data to be processed through a neural network to obtain feature vectors of each of the multiple elements on multiple coordinate axes, and to perform weighting on the feature vectors of each of the multiple elements on the multiple coordinate axes to obtain the feature vector of the data to be processed.
In one possible design, the feature vector of a first element on a first coordinate axis is used to characterize the correlation between the first element and the other elements in a first region containing the first element; the first element is any one of the multiple elements; the other elements in the first region map to positions on coordinate axes other than the first coordinate axis that are the same as and/or adjacent to the positions to which the first element maps on those other axes; the first coordinate axis is any one of the multiple coordinate axes.
In one possible design, the processing unit is specifically configured to: perform attention computation between the first element and the other elements in the first region corresponding to the first element to obtain attention values between the first element and each of the other elements, where the first element is any one of the multiple elements; and perform weighting based on those attention values to obtain the feature vector of the first element on the first coordinate axis.
In one possible design, the data to be processed includes audio data, the audio data comprising multiple audio points, each audio point mapping to a time axis and a frequency axis; or,
the data to be processed includes image data, the image data comprising multiple pixels or image patches, each pixel or image patch mapping to a horizontal axis and a vertical axis; or,
the data to be processed includes video data, the video data comprising multiple video frames, each frame comprising multiple pixels or image patches, each pixel or image patch mapping to a time axis and to horizontal and vertical spatial axes.
In one possible design, the number of coordinate axes equals N; the linear module is configured to generate the first query vector, the first key vector, and the first value vector based on the data to be processed; the i-th attention computation module is configured to obtain, based on the first query vector, the first key vector, and the first value vector, the feature vector of each of the multiple elements on the i-th coordinate axis, where i is a positive integer less than or equal to N; and the weighting module is configured to weight the feature vectors of each of the multiple elements on the N coordinate axes.
In one possible design, the neural network further includes an (N+1)-th attention computation module and a feature fusion module;
the linear module is further configured to generate a second query vector, a second key vector, and a second value vector based on the encoded data of at least one set element;
the (N+1)-th attention computation module is configured to obtain, based on the second query vector, the second key vector, and the second value vector, a feature vector corresponding to the at least one set element, where the feature vector of a set element characterizes the degree of association between the set element and the multiple elements;
the feature fusion module is configured to perform feature fusion on the feature vector of the data to be processed and the feature vector of the at least one set element.
In one possible design, the encoded data of the at least one set element is obtained through multiple rounds of adjustment as a network parameter of the neural network during training.
In a third aspect, this application provides a data processing system, including a user device and a cloud service device.
The user device is configured to send a service request to the cloud service device, where the service request carries data to be processed, the data to be processed includes encoded data of multiple elements, and the service request requests the cloud service device to complete a specified processing task on the data to be processed.
The cloud service device is configured to: perform feature extraction on the data to be processed through a neural network to obtain feature vectors of each of the multiple elements on multiple coordinate axes; perform weighting on the feature vectors of each of the multiple elements on the multiple coordinate axes to obtain the feature vector of the data to be processed; complete the specified processing task based on the feature vector of the data to be processed to obtain a processing result; and send the processing result to the user device.
The user device is further configured to receive the processing result from the cloud service device.
In a fourth aspect, an embodiment of this application provides an electronic device. The electronic device includes at least one processor and a memory; the memory stores instructions; and the at least one processor is configured to execute the instructions stored in the memory to implement the method of the first aspect or any design of the first aspect. The electronic device may also be called an execution device and is used to execute the data processing method provided by this application.
In a fifth aspect, an embodiment of this application provides a chip system. The chip system includes at least one processor and a communication interface interconnected by a line. The communication interface is configured to receive data to be processed; the processor is configured to execute, on the data to be processed, the method of the first aspect or any design of the first aspect.
In a sixth aspect, an embodiment of this application provides a computer-readable medium for storing a computer program, where the computer program includes instructions for executing the method in the first aspect or any optional implementation of the first aspect.
In a seventh aspect, an embodiment of this application provides a computer program product storing instructions that, when executed by a computer, cause the computer to implement the method described in the first aspect or any optional design of the first aspect.
On the basis of the implementations provided in the above aspects, this application may be further combined to provide more implementations.
Description of the drawings
Figure 1 is a schematic structural diagram of the main framework of artificial intelligence;
Figure 2 is a schematic diagram of a system architecture 200 provided by an embodiment of this application;
Figure 3 is a schematic flowchart of a data processing method provided by an embodiment of this application;
Figure 4 is a schematic diagram of an axial attention computation provided by an embodiment of this application;
Figure 5 is a schematic diagram of another axial attention computation provided by an embodiment of this application;
Figure 6 is a schematic diagram of the processing flow of an independent superimposed attention network provided by an embodiment of this application;
Figure 7 is a schematic diagram of the processing flow of another independent superimposed attention network provided by an embodiment of this application;
Figure 8 is a schematic structural diagram of a transformer module illustrated in an embodiment of this application;
Figure 9 is a schematic structural diagram of another transformer module illustrated in an embodiment of this application;
Figure 10 is a schematic structural diagram of a classification network model provided by an embodiment of this application;
Figure 11 is a schematic workflow diagram of the classification network model provided by an embodiment of this application;
Figure 12 is a schematic workflow diagram of an image segmentation network model provided by an embodiment of this application;
Figure 13 is a schematic workflow diagram of a video classification network model provided by an embodiment of this application;
Figure 14 is a schematic structural diagram of a data processing apparatus provided by an embodiment of this application;
Figure 15 is a schematic structural diagram of an execution device provided by an embodiment of this application;
Figure 16 is a schematic structural diagram of a chip provided by an embodiment of this application.
Detailed description
The embodiments of this application are described below with reference to the accompanying drawings. Persons of ordinary skill in the art will appreciate that, as technology develops and new scenarios emerge, the technical solutions provided in the embodiments of this application apply equally to similar technical problems.
The terms "first", "second", and so on in the description, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that terms so used are interchangeable where appropriate; this is merely the way objects with the same attributes are distinguished when describing the embodiments of this application. Furthermore, the terms "include" and "have", and any variations thereof, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device comprising a series of units need not be limited to those units, but may include other units not explicitly listed or inherent to such a process, method, product, or device.
First, the overall workflow of the artificial intelligence system is described. Refer to Figure 1, which shows a schematic structural diagram of the main framework of artificial intelligence. The framework is elaborated below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the sequence of processes from data acquisition to processing, for example the general progression of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data undergoes a condensation from "data" to "information" to "knowledge" to "wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of machine intelligence and information (provision and processing technology) to the industrial ecology of systems.
(1) Infrastructure:
The infrastructure provides computing-power support for the artificial intelligence system, enables communication with the external world, and provides support through the base platform. The system communicates with the outside through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the base platform includes related platform guarantees and support such as a distributed computing framework and networking, and may include cloud storage and computing, and interconnection networks. For example, sensors communicate with the outside world to acquire data, and the data is provided to the smart chips in the distributed computing system offered by the base platform for computation.
(2) Data
The data at the layer above the infrastructure represents the data sources of the artificial intelligence field. The data involves graphics, images, speech, and text, as well as Internet-of-Things data from traditional devices, including business data from existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, and other methods.
Among them, machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and so on, on the data.
Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formalized information to perform machine thinking and problem solving based on reasoning control strategies; typical functions are search and matching.
Decision-making refers to the process of making decisions after intelligent information has been reasoned about, and usually provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the data has undergone the data processing described above, some general capabilities can further be formed based on the results of that processing, such as an algorithm or a general system, for example translation, text analysis, computer-vision processing, speech recognition, image recognition, and so on.
(5) Intelligent products and industry applications
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They encapsulate the overall artificial intelligence solution, productizing intelligent information decision-making and realizing practical applications. Application fields mainly include intelligent manufacturing, intelligent transportation, smart home, smart healthcare, smart security, autonomous driving, smart cities, smart terminals, and so on.
The embodiments of this application involve the application of neural networks. To facilitate understanding, the relevant terms and concepts such as neural networks involved in the embodiments of this application are first introduced below.
1) Neural network
The work of each layer in a neural network can be described by the mathematical expression y = a(W·x + b). At the physical level, the work of each layer can be understood as completing the transformation from the input space (the set of input vectors) to the output space (i.e., from the row space to the column space of the matrix) through five operations on the input space: 1. raising/reducing the dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are completed by W·x, operation 4 is completed by +b, and operation 5 is implemented by a(). The word "space" is used here because the object being classified is not a single thing but a class of things; space refers to the set of all individuals of that class. W is a weight vector, and each value in the vector represents the weight value of a neuron in that layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed.
The purpose of training a neural network is ultimately to obtain the weight matrices of all layers of the trained network (the weight matrix formed by the vectors W of many layers). Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrix.
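As a minimal illustration of the per-layer expression y = a(W·x + b), the sketch below uses ReLU as the nonlinearity a(); the choice of ReLU and the concrete numbers are our assumptions, not the application's:

```python
import numpy as np

def layer(x, W, b):
    """One neural-network layer: y = a(W.x + b). W scales/rotates
    (operations 1-3), +b translates (operation 4), and the ReLU
    nonlinearity a() 'bends' the space (operation 5)."""
    return np.maximum(0.0, W @ x + b)

W = np.array([[2.0, 0.0],
              [0.0, -1.0]])
b = np.array([1.0, 1.0])
y = layer(np.array([1.0, 3.0]), W, b)   # W.x + b = [3, -2], ReLU -> [3, 0]
```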
2) Loss function
In training a neural network, because we want the network's output to be as close as possible to the value we actually want to predict, we can compare the current network's predicted value with the truly desired target value and then update the weight vectors of each layer based on the difference between the two (of course, there is usually an initialization process before the first update, i.e., pre-configuring parameters for each layer of the network). For example, if the network's predicted value is too high, the weight vectors are adjusted to make it predict lower, and the adjustment continues until the neural network can predict the truly desired target value. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value". This is the loss function or objective function, an important equation used to measure the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) indicates a larger difference, so training the neural network becomes the process of shrinking this loss as much as possible.
3) Back-propagation algorithm
A neural network uses the back-propagation algorithm to correct the values of its network parameters during training, making the reconstruction error loss of the neural network model smaller and smaller. Specifically, forward-propagating the input signal to the output produces an error loss, and the parameters of the initial neural network model are updated by back-propagating the error-loss information, so that the error loss converges. The back-propagation algorithm is a back-propagation movement dominated by the error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrices.
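A minimal worked example of loss plus back-propagation, assuming a one-weight model y = w·x with squared-error loss (the model, input, and learning rate are illustrative):

```python
def train_step(w, x, target, lr=0.1):
    """One gradient-descent step on y = w*x with squared-error loss;
    the gradient of the loss is the back-propagated error signal."""
    pred = w * x
    loss = (pred - target) ** 2        # gap between prediction and target
    grad = 2.0 * (pred - target) * x   # dLoss/dw via the chain rule
    return w - lr * grad, loss

w, loss0 = train_step(1.0, x=2.0, target=6.0)   # predicts 2.0, target is 6.0
_, loss1 = train_step(w, x=2.0, target=6.0)     # loss shrinks after the update
```

The first step moves w from 1.0 toward the value 3.0 that would predict the target exactly, and the second evaluation shows a smaller loss, mirroring the convergence described above.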
4) Linear operation:
Linearity refers to a proportional, straight-line relationship between quantities; mathematically, it can be understood as a function whose first derivative is constant. Linear operations include, but are not limited to, the addition operation, the no-op, the identity operation, the convolution operation, the layer normalization (LN) operation, and the pooling operation. A linear operation may also be called a linear map; a linear map must satisfy two conditions, homogeneity and additivity, and if either condition is not satisfied, the map is nonlinear.
Homogeneity means f(ax) = a·f(x); additivity means f(x + y) = f(x) + f(y). For example, f(x) = ax is linear. Note that x, a, and f(x) here are not necessarily scalars; they can be vectors or matrices, forming a linear space of any dimension. If x and f(x) are n-dimensional vectors, then when a is a constant, homogeneity is equivalently satisfied, and when a is a matrix, additivity is equivalently satisfied. In contrast, a function whose graph is a straight line does not necessarily conform to a linear map; for example, f(x) = ax + b satisfies neither homogeneity nor additivity and is therefore a nonlinear map.
In the embodiments of this application, a composition of multiple linear operations may also be called a linear operation, and each linear operation included in such a composition may be called a sub-linear operation.
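The homogeneity and additivity conditions can be checked numerically; the sketch below (function names are illustrative) confirms that a matrix map passes both tests while the affine map f(x) = Ax + b fails:

```python
import numpy as np

def is_linear(f, dim, trials=50, seed=0):
    """Numerically check homogeneity f(a*x) == a*f(x) and additivity
    f(x + y) == f(x) + f(y) on random inputs."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.normal(size=dim), rng.normal(size=dim)
        a = rng.normal()
        if not np.allclose(f(a * x), a * f(x)):
            return False
        if not np.allclose(f(x + y), f(x) + f(y)):
            return False
    return True

A = np.array([[1.0, 2.0], [3.0, 4.0]])
linear_ok = is_linear(lambda x: A @ x, 2)         # matrix map: linear
affine_ok = is_linear(lambda x: A @ x + 1.0, 2)   # f(x) = Ax + b: nonlinear
```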
5)注意力模型。5) Attention model.
注意力模型是一种应用了注意力机制的神经网络。在深度学习中,注意力机制可以被广义地定义为一个描述重要性的权重向量:通过这个权重向量为了预测或者推断一个元素。比如,对于图像中的某个像素或句子中的某个单词,可以使用注意力向量定量地估计出目标元素与其他元素之间的相关性,并由注意力向量的加权和作为目标的近似值。The attention model is a neural network that applies the attention mechanism. In deep learning, the attention mechanism can be broadly defined as a weight vector describing the importance: using this weight vector to predict or infer an element. For example, for a certain pixel in an image or a word in a sentence, attention vectors can be used to quantitatively estimate the correlation between the target element and other elements, and the weighted sum of the attention vectors can be used as an approximation of the target.
深度学习中的注意力机制模拟的是人脑的注意力机制。举个例子来说,当人类观赏一幅画时,虽然人类的眼睛可以看到整幅画的全貌,但是在人类深入仔细地观察时,其实眼睛聚焦的只有整幅画中的一部分图案,这个时候人类的大脑主要关注在这一小块图案上。也就是说,在人类仔细观察图像时,人脑对整幅图像的关注并不是均衡的,是有一定的权重区分的,这就是注意力机制的核心思想。The attention mechanism in deep learning simulates the attention mechanism of the human brain. For example, when humans look at a painting, although their eyes can see the entire painting, when humans observe deeply and carefully, their eyes actually focus on only part of the pattern in the entire painting. This At this time, the human brain mainly focuses on this small pattern. In other words, when humans carefully observe an image, the human brain's attention to the entire image is not balanced, but has a certain weight distinction. This is the core idea of the attention mechanism.
Simply put, the human visual system tends to focus selectively on certain parts of an image while ignoring other irrelevant information, which aids the brain's perception. Similarly, in the attention mechanisms of deep learning, for some problems involving language, speech, or vision, certain parts of the input may be more relevant than others. The attention mechanism in an attention model therefore allows the model to dynamically focus only on the parts of the input that help it perform the task at hand effectively.
6) Self-attention network.
A self-attention network is a neural network that applies a self-attention mechanism. The self-attention mechanism is an extension of the attention mechanism: it is an attention mechanism that relates different positions of a single sequence to one another in order to compute a representation of that same sequence. Self-attention can play a key role in machine reading, abstractive summarization, and image caption generation. Taking the application of a self-attention network to natural language processing as an example, the network processes input data of arbitrary length, generates a new feature expression of the input data, and then converts that feature expression into the target words. The self-attention layer in the network uses the attention mechanism to capture the relationships between each word and all the other words, thereby generating a new feature expression for each word. The advantage of a self-attention network is that the attention mechanism can directly capture the relationships between all the words in a sentence regardless of their positions.
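The behaviour described above can be sketched as a minimal single-head self-attention computation. The sequence length, embedding dimension, and randomly initialised projection matrices below are illustrative assumptions, not values from the disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 4, 8                      # sequence length, embedding dimension (assumed)
H = rng.normal(size=(L, d))      # input sequence: one embedding per token

# learnable projections to query, key, value (randomly initialised here)
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))
Q, K, V = H @ W_q, H @ W_k, H @ W_v

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# every position attends to every position of the same sequence,
# regardless of word order
attn = softmax(Q @ K.T / np.sqrt(d))     # (L, L) relevance weights
out = attn @ V                           # new feature expression per token

assert out.shape == (L, d)
assert np.allclose(attn.sum(axis=1), 1.0)  # each row is a distribution
```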
The data processing method provided by the embodiments of this application may be performed by an execution device, or the attention model may be deployed on an execution device. An execution device may be implemented by one or more computing devices. For example, see Figure 2, which shows a system architecture 200 provided by an embodiment of this application. The system architecture 200 includes an execution device 210. The execution device 210 may be implemented by one or more computing devices, and may be arranged at one physical site or distributed across multiple physical sites. The system architecture 200 further includes a data storage system 250. Optionally, the execution device 210 cooperates with other computing devices, such as data storage, router, and load balancer devices. The execution device 210 may use the data in the data storage system 250, or call the program code in the data storage system 250, to implement the data processing method provided by this application. The one or more computing devices may be deployed in a cloud network. In one example, the data processing method provided by the embodiments of this application is deployed, in the form of a service, on one or more computing devices of a cloud network, and user equipment accesses the cloud service through the network. When the execution device is one or more computing devices of a cloud network, the execution device may also be called a cloud service device.
In another example, the data processing method provided by the embodiments of this application may be deployed, in the form of a software tool, on one or more local computing devices.
Users may operate their respective user devices (for example, local device 301 and local device 302) to interact with the execution device 210. Each local device may represent any computing device, such as a smartphone (mobile phone), a personal computer (PC), a laptop, a tablet, a smart TV, a mobile internet device (MID), a wearable device, a smart camera, a smart car or another type of cellular phone, a media consumption device, a set-top box, a game console, a virtual reality (VR) device, an augmented reality (AR) device, a wireless electronic device in industrial control, a wireless electronic device in self driving, a wireless electronic device in remote medical surgery, a wireless electronic device in a smart grid, a wireless electronic device in transportation safety, a wireless electronic device in a smart city, or a wireless electronic device in a smart home.
Each user's local device may interact with the execution device 210 through a communication network of any communication mechanism or communication standard. The communication network may be a wide area network, a local area network, a point-to-point connection, or the like, or any combination thereof.
In another implementation, one or more aspects of the execution device 210 may be implemented by each local device. For example, the local device 301 may provide local data to, or feed prediction results back to, the execution device 210.
It should be noted that all functions of the execution device 210 may also be implemented by a local device. For example, the local device 301 implements the functions of the execution device 210 and provides services for its own user, or provides services for a user of the local device 302. The local device 301 may be an electronic device; for example, the electronic device may be a server, a smartphone (mobile phone), a personal computer (PC), a laptop, a tablet, a smart TV, a mobile internet device (MID), a wearable device, a virtual reality (VR) device, an augmented reality (AR) device, a wireless electronic device in industrial control, a wireless electronic device in self driving, a wireless electronic device in remote medical surgery, a wireless electronic device in a smart grid, a wireless electronic device in transportation safety, a wireless electronic device in a smart city, a wireless electronic device in a smart home, or the like.
The data processing method and attention model provided by the embodiments of this application can be applied to computer vision or natural language processing. That is, through the above data processing method, an electronic device or computing device can perform computer vision tasks or natural language processing tasks.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. In general, natural language processing tasks mainly include machine translation, automatic summary generation, opinion extraction, text classification, question answering, text semantic comparison, speech recognition, and the like.
Computer vision is the science of how to make machines learn to see. More specifically, computer vision refers to machine vision in which cameras and computers replace human eyes to identify, track, and measure targets, with further graphics processing so that the processed images are better suited for human observation or for transmission to instruments for detection. Generally speaking, computer vision tasks include image classification, object detection, semantic segmentation, image generation, and the like.
Image recognition is a common classification problem and is usually also called image classification. Specifically, in an image recognition task, the input of the neural network is image data, and the output is the probability that the current image data belongs to each category; the category with the largest probability value is usually selected as the predicted category of the image data. Image recognition was one of the earliest tasks in which deep learning was successfully applied; classic network models include the VGG series, the Inception series, and the ResNet series.
Object detection refers to automatically detecting, by means of an algorithm, the approximate locations of common objects in an image. A bounding box is usually used to represent the approximate location of an object, and the category of the object within the bounding box is also classified.
Semantic segmentation refers to automatically segmenting and identifying the content in an image by means of an algorithm. Semantic segmentation can be understood as a per-pixel classification problem, that is, analyzing the category of the object to which each pixel belongs.
Image generation refers to learning the distribution of real images and sampling from the learned distribution to obtain generated images of high fidelity, for example, generating a sharp image from a blurred image, or generating a dehazed image from a hazy image.
As described in the background, if a self-attention network obtains, for each element, related information from all the other elements, the computational cost is high. In one possible approach, criss-cross attention is used: only the correlations between elements in a cross-shaped region are considered, which reduces complexity compared with computing over all pixels. However, the data dimensions along the row direction and the column direction of the cross are generally different, and in some cases the difference is large; for audio data, video data, and the like, the dimensions may differ by a factor of ten or more. With criss-cross attention, too much attention is paid to the data of the larger dimension, so that the computation along the smaller-dimension direction is suppressed by the computation along the larger-dimension direction.
On this basis, the neural network and data processing method provided by the embodiments of this application compute element correlations independently along each coordinate axis; that is, for each element, attention is computed separately along the tensor direction of each coordinate axis, and the results are then combined by weighted superposition. This prevents excessive attention to the high-dimension axis from suppressing the low-dimension axis. Therefore, the neural network and data processing method provided by the embodiments of this application can improve processing accuracy while improving computational efficiency. In some embodiments, the neural network may use a convolutional neural network to compute the correlations between elements. In other embodiments, the neural network provided by the embodiments of this application uses a self-attention mechanism to compute the correlations between elements; in this case, the neural network may also be called an attention network.
In this embodiment, the input of the attention network is data in the form of a sequence, that is, the input data of the attention network is sequence data. For example, the input data of the attention network may be a sentence sequence composed of multiple consecutive words; as another example, the input data of the attention network may be an image-block sequence composed of multiple consecutive image blocks obtained by partitioning a complete image. Sequence data can be understood as encoded data, for example data obtained by encoding multiple consecutive words. For instance, for some data to be processed, the encoded data of each element is obtained by performing embedding generation, for example convolution processing. An element may also be called a patch. Each element in the input data can correspond to multiple coordinate axes. The coordinate axes here may be temporal, spatial, or along other dimensions, and one element may have parameter values mapped onto multiple coordinate axes. The input data may also be called data to be processed. The data to be processed may be multimedia data, such as audio data, video data, or image data. For example, if the data to be processed is audio data, each element of the audio data can be understood as an audio point. Each audio point can be mapped onto a time coordinate axis and onto a frequency coordinate axis; for example, it may have a time parameter mapped onto the time axis and a frequency parameter mapped onto the frequency axis. As another example, if the data to be processed is image data, the elements of the image data can be understood as pixels or image blocks, and each pixel or image block can be mapped onto a horizontal coordinate axis and a vertical coordinate axis. As yet another example, the data to be processed includes video data, which can be mapped onto three coordinate axes, for example a time axis, a horizontal axis, and a vertical axis. Video data includes multiple video frames, and each video frame includes multiple pixels or image blocks. The encoded data of each pixel or image block can be mapped onto the time axis, having a time parameter on the time axis, and can be mapped onto the spatial horizontal and vertical axes, having a horizontal coordinate on the horizontal axis and a vertical coordinate on the vertical axis.
Referring to Figure 3, a schematic flowchart of a data processing method provided by an embodiment of this application is shown, taking as an example a neural network implemented as an attention network.
301. Obtain data to be processed, where the data to be processed includes encoded data of multiple elements.
In some embodiments, the data processing method may be performed by a service device, such as a cloud service device. The user equipment may send a service request to the cloud service device, the service request carrying the data to be processed. The service request is used to request the cloud server to complete a specified processing task on the data to be processed. The specified processing task may be a natural language processing task, such as machine translation, automatic summary generation, opinion extraction, text classification, question answering, text semantic comparison, or speech recognition, or it may be a computer vision task, such as image recognition, object detection, semantic segmentation, or image generation.
In other embodiments, the data processing method may be performed by a local device, such as a local electronic device. The data to be processed may be generated by the electronic device itself.
302. Perform feature extraction on the data to be processed through the attention network to obtain, for each of the multiple elements, feature vectors respectively corresponding to the multiple coordinate axes, and perform weighted processing on the feature vectors of each of the multiple elements respectively corresponding to the multiple coordinate axes to obtain the feature vector of the data to be processed.
The attention network may be an attention network for classifying images, an attention network for segmenting images, an attention network for performing detection on images, an attention network for recognizing images, an attention network for generating a specified image, an attention network for translating text, an attention network for paraphrasing text, an attention network for generating specified text, an attention network for recognizing speech, an attention network for translating speech, an attention network for generating specified speech, or the like.
In a possible implementation, after the feature vector of the data to be processed is obtained, the specified processing task may further be completed according to the feature vector to obtain a processing result, and the processing result may be sent to the user equipment.
As an example, if the specified processing task is image classification, the attention network is an attention network for classifying images; after the feature vector is obtained, image classification may further be performed according to the feature vector to obtain a classification result. If the specified processing task is image segmentation, the attention network is an attention network for segmenting images; after the feature vector is obtained, image segmentation may further be performed according to the feature vector to obtain a segmentation result. If the specified processing task is image detection, the attention network is an attention network for performing detection on images; after the feature vector is obtained, image detection may further be performed according to the feature vector to obtain a detection result. If the specified processing task is speech recognition, the attention network is an attention network for recognizing speech; after the feature vector is obtained, speech recognition may further be performed according to the feature vector to obtain a recognition result. If the specified processing task is speech translation, the attention network is an attention network for translating speech; after the feature vector is obtained, speech translation may further be performed according to the feature vector to obtain a translation result.
Take as an example that there are N coordinate axes, namely axis 1 to axis N, and that the elements included in the input data are mapped onto these N axes. Attention is computed separately for the elements along each axis, and the per-axis computation results for each element are then combined by a weighted sum as the output of the independent superimposed attention network. Illustratively, the weights of the different axes may be similar, for example a simple average. After the data is input, the attention network performs feature extraction on the input data to obtain, for each of the multiple elements, the feature vectors respectively corresponding to axis 1 to axis N, and then performs weighted processing on the N groups of feature vectors of each element corresponding to axis 1 to axis N, thereby obtaining the feature vector corresponding to each element.
Illustratively, take a first element as an example, where the first element is any one of the multiple elements. The feature vector of the first element corresponding to a first coordinate axis is used to characterize the correlation between the first element and the other elements in a first region in which the first element is located. The positions at which those other elements in the first region are mapped onto the coordinate axes other than the first coordinate axis are the same as and/or adjacent to the positions at which the first element is mapped onto those other coordinate axes.
The feature vector of the first element corresponding to the first coordinate axis may be determined as follows:
Attention computation is performed between the first element and each of the other elements in the first region corresponding to the first element, to obtain the attention values between the first element and those other elements, where the first element is any one of the multiple elements; weighted processing is then performed according to the attention values between the first element and the other elements to obtain the feature vector of the first element corresponding to the first coordinate axis.
Two elements being mapped onto adjacent positions on the other coordinate axes may mean that the positions are strictly adjacent, or that the interval between the positions at which the two elements are mapped onto the other coordinate axes is within a set distance.
In one case, the positions at which the two elements are mapped onto the other coordinate axes are strictly adjacent. Taking N=2 as an example, axis 1 is the horizontal-direction coordinate axis (referred to as the horizontal axis for short), and axis 2 is the vertical-direction coordinate axis (referred to as the vertical axis for short). When computing the feature vector of an element corresponding to the horizontal axis, attention computation is performed with the elements of the same row; when computing the feature vector of an element corresponding to the vertical axis, attention computation is performed with the elements of the same column.
As shown in Figure 4, each row in the horizontal direction includes 10 elements, and each column in the vertical direction includes 5 elements. As an example, take element 3-6. When computing the feature vector of element 3-6 corresponding to the horizontal axis, the attention computation results between element 3-6 and the other elements of the same row (elements 3-1 to 3-5 and 3-7 to 3-10) may be computed separately, and weighted processing is then performed according to those attention computation results to obtain the feature vector of element 3-6 corresponding to the horizontal axis. When computing the feature vector of element 3-6 corresponding to the vertical axis, the attention computation results between element 3-6 and the other elements of the same column (elements 1-6 to 2-6 and 4-6 to 5-6) may be computed separately, and weighted processing is then performed according to those attention computation results to obtain the feature vector of element 3-6 corresponding to the vertical axis.
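The element selection of the strictly-adjacent case above can be sketched as follows, using the 5-row by 10-column grid of Figure 4 and element 3-6 (1-indexed, as in the text); the listing of partner elements is illustrative only:

```python
rows, cols = 5, 10   # grid of Figure 4: 5 elements per column, 10 per row
i, j = 3, 6          # the example element 3-6 (row 3, column 6, 1-indexed)

# horizontal-axis attention partners: the other elements of the same row
row_partners = [(i, c) for c in range(1, cols + 1) if c != j]

# vertical-axis attention partners: the other elements of the same column
col_partners = [(r, j) for r in range(1, rows + 1) if r != i]

assert len(row_partners) == 9   # elements 3-1..3-5 and 3-7..3-10
assert len(col_partners) == 4   # elements 1-6..2-6 and 4-6..5-6
```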
In the other case, the interval between the positions at which the two elements are mapped onto the other coordinate axes is within a set distance. Taking N=2 as an example, axis 1 is the horizontal-direction coordinate axis (the horizontal axis for short), and axis 2 is the vertical-direction coordinate axis (the vertical axis for short). When computing the feature vector of an element corresponding to the horizontal axis, attention computation may be performed with the elements of the same row and of one or more rows adjacent to that row; when computing the feature vector of an element corresponding to the vertical axis, column attention computation may be performed with the elements of the same column and of one or more columns adjacent to that column. Referring to Figure 5 as an example, when computing the feature vector of element 3-6 corresponding to the horizontal axis, the attention computation results between element 3-6 and the other elements of the same row and the adjacent rows (elements 3-1 to 3-5, 3-7 to 3-10, 2-1 to 2-10, and 4-1 to 4-10) may be computed separately, and weighted processing is then performed according to those results to obtain the feature vector of element 3-6 corresponding to the horizontal axis. When computing the feature vector of element 3-6 corresponding to the vertical axis, the attention computation results between element 3-6 and the other elements of the same column and the adjacent columns (elements 1-6 to 2-6, 4-6 to 5-6, 1-5 to 5-5, and 1-7 to 5-7) may be computed separately, and weighted processing is then performed according to those results to obtain the feature vector of element 3-6 corresponding to the vertical axis.
It should be noted that the attention network provided by the embodiments of this application may also be called an independent superimposed attention network or a self-independent superimposed attention network; other names may also be used, which are not specifically limited in the embodiments of this application. In the following description, the name independent superimposed attention network is used as an example.
Specifically, as shown in Figure 6, the independent superimposed attention network determines a query vector (Query, Q), a key vector (Key, K), and a value vector (Value, V) from the input data, then performs attention computation along the axis-1 direction through the axis-N direction according to Q, K, and V to obtain, for each element, the feature vectors corresponding to axis 1 to axis N, and then performs a weighted sum on the feature vectors of each element corresponding to axis 1 to axis N to obtain the feature vector of each element.
It should be noted that the independent superimposed attention network provided by the embodiments of this application may use a single-head attention mechanism or a multi-head attention mechanism, which is not specifically limited in the embodiments of this application. When a multi-head attention mechanism is used, after receiving the input data, the independent superimposed attention network groups the dimensions of the input data according to the number of heads, performs the attention computation in each group in the manner provided by the embodiments of this application, and then splices the results of the multiple groups together.
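The grouping-then-splicing behaviour of the multi-head variant can be sketched as follows. For brevity the per-group attention here uses the group features directly as query, key, and value (the projection matrices of a full implementation are omitted), and all shapes are illustrative assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(H, heads):
    """Split the channel dimension into `heads` groups, attend per group,
    then splice the group results back together."""
    L, d = H.shape
    assert d % heads == 0, "channel dimension must divide evenly into heads"
    d_h = d // heads
    outputs = []
    for g in range(heads):
        Hg = H[:, g * d_h:(g + 1) * d_h]          # one group of channels
        attn = softmax(Hg @ Hg.T / np.sqrt(d_h))  # per-group attention weights
        outputs.append(attn @ Hg)
    return np.concatenate(outputs, axis=1)        # splice the group results

H = np.random.default_rng(1).normal(size=(6, 8))  # 6 elements, 8 channels
out = multi_head_attention(H, heads=2)
assert out.shape == (6, 8)
```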
For example, taking N=2 as an example, Q, K and V can be determined from the input data using the following formula (1):

q_(i,j) = W_Q·h_(i,j),  k_(i,j) = W_K·h_(i,j),  v_(i,j) = W_V·h_(i,j)        Formula (1).

Here, q_(i,j) denotes the Query of the element at position (i,j), k_(i,j) denotes the Key of the element at position (i,j), v_(i,j) denotes the Value of the element at position (i,j), and h_(i,j) denotes the encoded data of the element at position (i,j). The value of i ranges from 0 to m-1, and the value of j ranges from 0 to n-1; the 2-dimensional input data includes m rows and n columns.
Take the feature vector of each element corresponding to axis 1 as an example. The feature vector of each element corresponding to axis 1 can be determined using the following formula (2-1):

h^1_(i,j) = Σ_{j'=0}^{n-1} softmax_{j'}( q_(i,j)·k_(i,j') / √d_k ) · v_(i,j')        Formula (2-1).

Here, d_k is the number of dimensions of the input data, and h^1_(i,j) denotes the feature vector of the element at position (i,j) corresponding to axis 1.
Combining the above, the feature vector of the element at position (i,j) corresponding to axis 2 is determined as shown in formula (2-2):

h^2_(i,j) = Σ_{i'=0}^{m-1} softmax_{i'}( q_(i,j)·k_(i',j) / √d_k ) · v_(i',j)        Formula (2-2).

Here, m represents the number of elements included in each row of the data to be processed in the direction of the horizontal coordinate axis.
The feature vectors of each element corresponding to the two axes are then weighted, as determined by the following formula (3):

h′_(i,j) = w_1·h^1_(i,j) + w_2·h^2_(i,j)        Formula (3).

Here, h′_(i,j) denotes the feature vector of the element at position (i,j), w_1 denotes the weight of axis 1, and w_2 denotes the weight of axis 2.
Illustratively, w_1 = w_2 = 1/2.
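The two-axis computation described above (formulas (1) through (3)) can be sketched in NumPy. This is a minimal single-head illustration under the assumptions stated in the comments; the function and variable names are illustrative, not part of the embodiments.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention_2d(h, W_Q, W_K, W_V, w1=0.5, w2=0.5):
    """Independent superimposed attention over a 2-axis input.

    h: (m, n, d) encoded data. Q, K, V are obtained per formula (1);
    attention is computed separately along the two axes (formulas
    (2-1) and (2-2)) and combined by the weighted sum of formula (3).
    Single-head; the reconstructed per-axis formulas are assumptions.
    """
    d = h.shape[-1]
    q = h @ W_Q.T  # q_(i,j) = W_Q · h_(i,j)
    k = h @ W_K.T
    v = h @ W_V.T
    # attention along one axis: element (i, j) attends over (i, j')
    s1 = softmax(np.einsum('ijd,ild->ijl', q, k) / np.sqrt(d), axis=-1)
    h1 = np.einsum('ijl,ild->ijd', s1, v)
    # attention along the other axis: element (i, j) attends over (i', j)
    s2 = softmax(np.einsum('ijd,ljd->ijl', q, k) / np.sqrt(d), axis=-1)
    h2 = np.einsum('ijl,ljd->ijd', s2, v)
    return w1 * h1 + w2 * h2  # formula (3)
```

Note that each of the two einsum-based attention steps touches only one axis at a time, which is what yields the complexity advantage discussed later in the text.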
In some cases involving classification scenarios, encoded data corresponding to at least one setting element can be added to the network parameters of the independent superimposed attention network. The encoded data corresponding to the at least one setting element is a learnable embedded input of the independent superimposed attention network; that is, it can participate in training as a network parameter, and each time the network parameters are adjusted during training, the encoded data corresponding to the at least one setting element can be adjusted.
As an example, the at least one setting element may include a classification bit and/or a distillation bit. The encoded data corresponding to the classification bit may also be called a classification token (class token), and the encoded data corresponding to the distillation bit may also be called a distillation token. A student model can be trained using a knowledge distillation (KD) training mode with a teacher model; the student model can be understood as a smaller model compressed from the teacher model. The distillation bit is added to learn interactively with the teacher model, and the output is finally obtained through the distillation loss. The class token and the distillation token are learnable embedding vectors. By performing attention operations with the encoded data of each element included in the input data, the class token and the distillation token model the global relationships between elements and fuse the information of all elements, and are finally connected to a classifier for class prediction.
Refer to Figure 7, which is a schematic diagram of the processing flow of another independent superimposed attention network. Figure 7 also takes as an example that the elements included in the input data can be mapped to N coordinate axes, namely axis 1 through axis N. Attention computation is performed separately for the elements on each axis, and weighting is then performed on the computation results for each axis. The encoded data of the classification bit and the distillation bit are each attention-weighted against the encoded data of all other elements, and the results are then feature-fused with the weighted sum over the N axes. Feature fusion can use a connection function to connect features, for example the concat function.
For example, taking N=2 as an example, the above formula (1) can be used to determine Q, K and V from the input data. The Q, K and V corresponding to the classification bit can be determined by the following formula (4):

q_c = W_Q·h_c,  k_c = W_K·h_c,  v_c = W_V·h_c        Formula (4).

Here, q_c denotes the Query of the classification bit, k_c denotes the Key of the classification bit, v_c denotes the Value of the classification bit, and h_c denotes the encoded data of the classification bit.
It should be noted that the vector matrices W_Q, W_K and W_V used to compute the Q, K and V corresponding to each element in the input data are the same as the vector matrices used to compute the Q, K and V corresponding to the classification bit (and/or the distillation bit).
The feature vector corresponding to the classification bit is determined through the following formula (5):

h′_c = Σ_(i,j) softmax( q_c·k_(i,j) / √d_k ) · v_(i,j)        Formula (5).
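The attention of the classification bit over all element encodings (formulas (4) and (5)) can be sketched as follows. This is an illustrative single-head sketch; it assumes the class token attends over the element positions only, and all names are hypothetical.

```python
import numpy as np

def softmax_1d(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def class_token_attention(h_c, h, W_Q, W_K, W_V):
    """Feature vector of a classification bit (class token).

    h_c: (d,) learnable class-token embedding; h: (num_elements, d)
    flattened element encodings. The same projection matrices W_Q,
    W_K, W_V are used for the token and the elements, per the text.
    """
    d_k = h_c.shape[-1]
    q_c = W_Q @ h_c                         # formula (4)
    k = h @ W_K.T
    v = h @ W_V.T
    attn = softmax_1d(q_c @ k.T / np.sqrt(d_k))
    return attn @ v                         # attention-weighted sum of Values
```

The distillation bit would be handled identically with its own embedding h_d.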
In some possible implementations, the independent superimposed attention network may perform fully connected processing before the attention computation, to raise the dimensionality of the input data. After the attention computation is completed, further fully connected processing, such as dimensionality reduction, may be performed. The dimensionality of the input data of the independent superimposed attention network is the same as the dimensionality of its output data.
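The up-projection from E dimensions to 3*E dimensions followed by a split into Q, K and V, used repeatedly in the scenarios below, can be sketched with a hypothetical helper (the weight matrix W and function name are assumptions, not part of the embodiments):

```python
import numpy as np

def project_qkv(h, W):
    """Fully connected up-projection from E to 3*E, split into Q, K, V.

    h: (num_elements, E) encodings; W: (E, 3*E) assumed learnable weight.
    """
    q, k, v = np.split(h @ W, 3, axis=-1)
    return q, k, v

h = np.ones((7, 4))                        # 7 elements, E = 4
q, k, v = project_qkv(h, np.ones((4, 12)))
print(q.shape, k.shape, v.shape)           # (7, 4) (7, 4) (7, 4)
```

A matching (3*E, E) projection after the attention step would restore the original dimensionality, as the text describes.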
Table 1 provides a comparison of the computational complexity of a conventional attention network and the attention network provided by the embodiments of this application, taking two coordinate axes as an example, where m and n are the dimensions of the two axes and C is the feature dimension.

Table 1

Conventional attention network: Ω = 2(mn)^2·C
Independent superimposed attention network: Ω = 2mn(m+n)·C
For multiple axes, assuming that the dimension of the i-th axis is N_i, the computational complexity of the independent superimposed attention network is given by (6), the complexity of a conventional attention network is given by (7), and the ratio of the two is given by (8). It can be seen from formulas (6), (7) and (8) that the solution provided by the embodiments of this application also has the advantage of low complexity in scenarios where the dimensions of the axes are comparable.

Ω(independent superposition) = 2C(∏N_i)(∑N_i)        (6)

Ω(conventional) = 2C(∏N_i)^2        (7)

Ω(independent superposition)/Ω(conventional) = (∑N_i)/(∏N_i)        (8)
The solution provided by the embodiments of this application is applicable in multi-axis scenarios. For example, for video data, assuming a spatial size of 128×128 and a temporal size of 16, the computational complexity of the independent superimposed attention network is 0.1% of that of a conventional attention network. For example, with 10 coordinate axes and an element dimension of 128 on each axis, the computational complexity of the independent superimposed attention network is 1.1×10^-18 of that of a conventional attention network.
In some scenarios, the independent superimposed attention network provided by the embodiments of this application can be applied in a transformer module for processing data, for example image classification, segmentation and target localization; video action classification, temporal localization and spatio-temporal localization; audio and music classification, sound source separation, and so on. As an example, refer to Figure 8, which is a schematic structural diagram of a transformer module illustrated in an embodiment of this application.
As shown in Figure 8, the transformer module may include the independent superimposed attention network provided by the embodiments of this application, a linear layer and a multilayer perceptron. The independent superimposed attention network is used to extract features from the input data. The linear layer can be a layer normalization (layer normalization, LN) layer; the LN layer is used to normalize the output of the independent superimposed attention network. The multilayer perceptron (multilayer perceptron, MLP) is connected in series with the independent superimposed attention network. The multilayer perceptron can include multiple serial fully connected layers; specifically, the multilayer perceptron can also be called a fully connected neural network. The multilayer perceptron includes an input layer, hidden layers and an output layer, where the number of hidden layers can be one or more. The network layers in the multilayer perceptron are all fully connected layers; that is, the input layer and the hidden layer of the multilayer perceptron are fully connected, and the hidden layer and the output layer of the multilayer perceptron are also fully connected. A fully connected layer means that every neuron in the layer is connected to all neurons in the previous layer, and is used to combine the features extracted by the previous layer.
In some possible embodiments, the transformer module may also include another linear layer for performing layer normalization; computing normalization statistics through layer normalization can reduce computation time. This layer is located at the input end of the independent superimposed attention network, as shown in Figure 9, and is used to first perform layer normalization on the data input to the transformer module, so as to reduce the training cost.
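The module layout of Figure 9 (input LN, then the attention network, then LN and MLP) can be sketched as follows. Residual connections and the ReLU activation are assumptions; the patent text does not specify them, and all names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each element's feature vector to zero mean, unit variance
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp(x, W1, b1, W2, b2):
    # two fully connected layers; ReLU is an assumed stand-in activation
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

def transformer_block(x, attention, mlp_params):
    """Pre-LN transformer module sketch; residual links are assumed."""
    x = x + attention(layer_norm(x))          # input LN -> attention network
    x = x + mlp(layer_norm(x), *mlp_params)   # LN -> multilayer perceptron
    return x
```

Here `attention` would be the independent superimposed attention network; any callable that preserves the input shape can be plugged in.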
The solutions provided by the embodiments of this application are described in detail below with reference to several application scenarios.
Scenario 1: taking audio classification as an example. Refer to Figure 10, which is a schematic structural diagram of a classification network model. The classification network model includes an embedding generation module, M1 transformer modules and a classification module. The M1 transformer modules can be deployed in series. The transformer module adopts the structure shown in Figure 9. The embedding generation module is used to extract local features from the input audio data, which can also be understood as generating the encoded data of the audio data. The audio data can be mapped to a time coordinate axis and a frequency coordinate axis. The audio data can be divided into multiple audio points, for example into T*F audio points (patches), where T denotes the time dimension and F denotes the frequency dimension. For example, with input data of 10 s at 32000 Hz, the time-spectrum has 128 frequency dimensions and 1000 time dimensions, and the audio data is divided into 99 (time) * 12 (frequency) patches. For example, with time as the horizontal coordinate axis and frequency as the vertical coordinate axis, each row includes 99 audio points and each column includes 12 audio points. The feature dimension of the local features extracted by the embedding generation module from the input audio data is denoted E1. The independent superimposed attention network in the transformer module can adopt a multi-head attention mechanism. In the classification module, the results of the classification bit and the distillation bit are averaged, and the predicted value of each class is then obtained through a linear layer.
Refer to Figure 11, which is a schematic diagram of the workflow of the classification network model. The embedding generation module includes a convolution layer, which is used to perform convolution processing on the input audio data (time-spectrum) to generate an embedding representation and output a (T×F, E1) time-frequency vector. E1 denotes the feature dimension; the feature dimension of each patch is E1. Illustratively, the embedding generation module can use a two-dimensional convolution with a relatively large convolution stride (for example, a stride of about 10), so that each generated time-frequency vector represents local patch information. In the embodiments of this application, since the purpose is audio classification, a classification bit and a distillation bit can be merged in, for example through the concat function. In some embodiments, in order to improve classification accuracy, a position encoding can be added (add) to help learn position information; the manner of position encoding is not specifically limited in the embodiments of this application.
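The 99 × 12 patch grid quoted above is consistent with a standard no-padding convolution output size over the 1000 × 128 time-spectrum; the kernel size of 16 used below is an assumption (the text only states a stride of about 10), so this is an illustrative check, not the embodiment's exact configuration.

```python
def num_patches(length, kernel, stride):
    """Convolution output count along one axis (no padding assumed)."""
    return (length - kernel) // stride + 1

# 10 s at 32000 Hz -> a 1000 x 128 time-spectrum; with an assumed
# 16x16 kernel and stride 10 this yields the 99 x 12 grid in the text
t = num_patches(1000, 16, 10)   # time patches
f = num_patches(128, 16, 10)    # frequency patches
print(t, f, t * f)              # 99 12 1188
```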
The embedding vector output by the embedding generation module is input to the backbone network part formed by the transformer modules connected in series. When the embedding vector is input to the transformer module, a linear operation, such as layer normalization, can first be performed on the embedding vector through a linear layer. The layer-normalized data is input to the independent superimposed attention network. The independent superimposed attention network can raise the dimensionality of the layer-normalized data (for example, from E1-dimensional data to 3*E1-dimensional data), and further generate the Q, K and V corresponding to each patch as well as to the classification bit and the distillation bit. Taking the case in which the independent superimposed attention network adopts a multi-head attention mechanism as an example, the independent superimposed attention network further performs multi-head splitting, attention-weights the classification bit and the distillation bit against the other patches to obtain the feature vectors of the classification bit and the distillation bit, and performs the row attention-weighted computation along the time coordinate axis and the column attention-weighted computation along the frequency coordinate axis. The feature vector obtained by weighting the results of the row attention-weighted computation and the column attention-weighted computation is connected with the feature vectors of the classification bit and the distillation bit. Illustratively, the weights corresponding to the time coordinate axis and the frequency coordinate axis are the same, both being 0.5. Further, the independent superimposed attention network performs dimensionality reduction on the connected feature vectors, reducing the 3*E1-dimensional data to E1-dimensional data. Further, after being processed by the LN layer and the MLP layer in the transformer module, the data is input to the classification module. In the classification module, the results of the classification bit and the distillation bit are averaged, and the predicted value of each class is then obtained through a linear layer.
For example, the classification network model shown in Figure 10 is used to classify the following audio data sets 1) and 2) respectively. 1) AudioSet: includes an extended ontology of 632 audio event classes and a collection of 2M (million) manually labeled 10 s sound clips extracted from videos; the classes cover a wide range of human and animal sounds, musical instruments and styles, and common everyday environmental sounds. 2) OpenMIC 2018: a musical instrument sound classification data set with 20000 samples in total, 20 classes of instruments and an audio length of 10 s. The comparison results between the solution provided by this application and existing solutions in terms of classification accuracy as well as time and system performance requirements are shown in Table 2 and Table 3. As an example, system performance is expressed as the required number of floating-point operations per second (floating-point operations per second, FLOPs), and classification accuracy is expressed as the mean average precision (Mean Average Precision, mAP).
Table 2
As can be seen from Table 2 above, for both data sets, the transformer including the independent superimposed attention network provided by the embodiments of this application achieves improved prediction accuracy compared with the prior art.
Table 3
As can be seen from Table 3, using a transformer including the independent superimposed attention network provided by the embodiments of this application, or the independent superimposed attention network itself, can improve computational efficiency compared with the prior art. It should be understood that the runs of the above prior-art methods and the runs of the methods of the embodiments of this application were obtained in the same environment. Table 2 and Table 3 are only examples, and the results may differ when run in different environments.
Scenario 2: taking end-to-end image segmentation as an example. Refer to Figure 12, which is a schematic workflow diagram of an image segmentation network model. The segmentation network model includes an embedding generation module, M2 transformer modules and a pixel reconstruction module. The M2 transformer modules can be deployed in series. The transformer module adopts the structure shown in Figure 9. The embedding generation module is used to extract local features from the input image data, which can also be understood as generating the encoded data of the image data. The image data can be mapped to a horizontal coordinate axis and a vertical coordinate axis. The image data can be divided into multiple image blocks, for example into H*W image blocks (patches). The feature dimension of the local features extracted by the embedding generation module from the input image data is denoted E2. The independent superimposed attention network in the transformer module can adopt a multi-head attention mechanism. In the pixel reconstruction module, the pixel intensity values of each image block are restored.
The embedding generation module includes a convolution layer, which is used to perform convolution processing on the input image data to generate an embedding representation and output an (H×W, E2) image vector. E2 denotes the feature dimension; the feature dimension of each element is E2. In some embodiments, in order to improve accuracy, an (H×W, E2) position encoding can be added (add) to help learn position information; the manner of position encoding is not specifically limited in the embodiments of this application.
The embedding vector output by the embedding generation module is input to the backbone network part formed by the transformer modules connected in series. When the embedding vector is input to the transformer module, a linear operation, such as layer normalization, can first be performed on the embedding vector through a linear layer. The layer-normalized data is input to the independent superimposed attention network. The independent superimposed attention network can raise the dimensionality of the layer-normalized data (for example, from E2-dimensional data to 3*E2-dimensional data), and further generate the Q, K and V corresponding to each patch. Taking the case in which the independent superimposed attention network adopts a multi-head attention mechanism as an example, the independent superimposed attention network further performs multi-head splitting, and performs the row attention-weighted computation along the horizontal coordinate axis and the column attention-weighted computation along the vertical coordinate axis. The results of the row attention-weighted computation and the column attention-weighted computation are weighted to obtain the feature vector of the image data. Illustratively, the weights corresponding to the horizontal coordinate axis and the vertical coordinate axis are the same, both being 0.5. Further, the independent superimposed attention network can also perform dimensionality reduction on the obtained feature vector of the image data, reducing the 3*E2-dimensional data to E2-dimensional data. Further, after being processed by the LN layer and the MLP layer in the transformer module, the data is input to the pixel reconstruction module. In the pixel reconstruction module, the pixel intensity values of each image block are restored after layer normalization and fully connected layer processing.
Scenario 3: taking video action classification as an example. Refer to Figure 13, which is a schematic workflow diagram of a video classification network model. The classification network model includes an embedding generation module, M3 transformer modules and a classification module. The M3 transformer modules can be deployed in series. The transformer module adopts the structure shown in Figure 9. The embedding generation module is used to extract local features from the input video data, which can also be understood as generating the encoded data of the video data. The video data can be mapped to a time coordinate axis, a horizontal coordinate axis and a vertical coordinate axis. The video data can be divided into multiple image blocks, for example into H*W*T image blocks (patches), where T denotes the dimension along the time coordinate axis, H denotes the dimension along the horizontal coordinate axis, and W denotes the dimension along the vertical coordinate axis. The embedding generation module includes a three-dimensional convolution layer, which is used to perform convolution processing on the input video data to generate an embedding representation and output an (H*W*T, E3) video vector. E3 denotes the feature dimension; the feature dimension of each patch is E3. Illustratively, in the embodiments of this application, since the purpose is video action classification, a classification bit can be merged in; for example, the data of the classification bit can be connected with the (H*W*T, E3) video vector through the concat function. In some embodiments, in order to improve classification accuracy, a position encoding can be added (add) to help learn position information; the manner of position encoding is not specifically limited in the embodiments of this application. After the classification bit is added and the position encoding is superimposed, the embedding generation module outputs an embedding vector of dimension (H*W*T+1, E3).
The embedding vector output by the embedding generation module is input to the backbone network part formed by the transformer modules connected in series. When the embedding vector is input to the transformer module, a linear operation, such as layer normalization, can first be performed on the embedding vector through a linear layer. The layer-normalized data is input to the independent superimposed attention network. The independent superimposed attention network can raise the dimensionality of the layer-normalized data (for example, from E3-dimensional data to 3*E3-dimensional data), and further generate the Q, K and V corresponding to each patch as well as to the classification bit. Taking the case in which the independent superimposed attention network adopts a multi-head attention mechanism as an example, the independent superimposed attention network further performs multi-head splitting, attention-weights the classification bit against the other patches to obtain the feature vector of the classification bit, and performs the attention-weighted computation along the time coordinate axis, the row attention-weighted computation along the horizontal coordinate axis and the column attention-weighted computation along the vertical coordinate axis. The feature vector obtained by weighting the results of the attention-weighted computation along the time coordinate axis, the row attention-weighted computation and the column attention-weighted computation is connected with the feature vector of the classification bit. Further, the independent superimposed attention network performs dimensionality reduction on the connected feature vectors, reducing the 3*E3-dimensional data to E3-dimensional data. Further, after being processed by the LN layer and the MLP layer in the transformer module, the data is input to the classification module. In the classification module, the classification information corresponding to the classification bit is obtained through a linear layer, and the action classification prediction distribution is then obtained after fully connected layer processing.
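The three-axis case above generalizes to any number of coordinate axes: attention is computed along one axis at a time while all other coordinates are held fixed, and the per-axis results are combined by a weighted sum. A NumPy sketch under those assumptions (single head, pre-computed Q, K, V; names illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(q, k, v, axis):
    """Attention along one coordinate axis of (N1, ..., Nn, d) tensors,
    holding all other coordinates fixed."""
    q = np.moveaxis(q, axis, -2)
    k = np.moveaxis(k, axis, -2)
    v = np.moveaxis(v, axis, -2)
    d = q.shape[-1]
    a = softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(d), axis=-1)
    return np.moveaxis(a @ v, -2, axis)

def independent_superposition(q, k, v, weights):
    """Weighted sum of per-axis attention results over all coordinate axes."""
    n_axes = q.ndim - 1
    return sum(w * axis_attention(q, k, v, ax)
               for ax, w in zip(range(n_axes), weights))

# video-like input: T=4 frames, H=5, W=6, feature dimension 8
rng = np.random.default_rng(3)
q = k = v = rng.standard_normal((4, 5, 6, 8))
out = independent_superposition(q, k, v, [1/3, 1/3, 1/3])
print(out.shape)  # (4, 5, 6, 8)
```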
An embodiment of this application further provides a data processing apparatus. Refer to Figure 14, which is a schematic structural diagram of a data processing apparatus provided by an embodiment of this application. The data processing apparatus includes an input unit 1410, configured to receive data to be processed, the data to be processed including encoded data of multiple elements.
A processing unit 1420 is configured to perform feature extraction on the data to be processed through a neural network to obtain, for each of the multiple elements, the feature vectors corresponding to multiple coordinate axes, and to perform weighting processing on the feature vectors of each of the multiple elements corresponding to the multiple coordinate axes, so as to obtain the feature vector of the data to be processed.
In a possible implementation, the feature vector of a first element corresponding to a first coordinate axis is used to characterize the correlation between the first element and the other elements in a first region where the first element is located; the first element is any element among the multiple elements; the positions at which the other elements in the first region where the first element is located are mapped to coordinate axes other than the first coordinate axis are the same as and/or adjacent to the positions at which the first element is mapped to those other coordinate axes; and the first coordinate axis is any coordinate axis among the multiple coordinate axes.
In a possible implementation, the processing unit 1420 is specifically configured to: perform attention computation between the first element and the other elements in the first region corresponding to the first element, to obtain the attention values between the first element and each of the other elements, the first element being any element among the multiple elements; and perform weighting processing according to the attention values between the first element and each of the other elements, to obtain the feature vector of the first element corresponding to the first coordinate axis.
In a possible implementation, the data to be processed includes audio data, the audio data includes multiple audio points, and each audio point is mapped to a time coordinate axis and a frequency coordinate axis; or
the data to be processed includes image data, the image data includes multiple pixels or image blocks, and each pixel or image block is mapped to a horizontal coordinate axis and a vertical coordinate axis; or
the data to be processed includes video data, the video data includes multiple video frames, each video frame includes multiple pixels or image blocks, and each pixel or image block is mapped to a time coordinate axis and to a spatial horizontal coordinate axis and a spatial vertical coordinate axis.
In a possible implementation, the number of coordinate axes is equal to N, and the neural network includes a linear module 1421, N attention computation modules 1422, and a weighting module 1423.
The linear module 1421 is configured to generate a first query vector, a first key vector, and a first value vector based on the data to be processed. The i-th attention computation module 1422 is configured to obtain, according to the first query vector, the first key vector, and the first value vector, the feature vector of each element of the multiple elements corresponding to the i-th coordinate axis, i being a positive integer less than or equal to N. The weighting module 1423 is configured to weight the feature vectors of each element of the multiple elements respectively corresponding to the N coordinate axes.
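For the N = 2 case, a 2-D grid with a horizontal and a vertical axis, the module structure just described can be illustrated with the following NumPy sketch. The shared Q/K/V projections, the softmax, and the equal combining weights are all illustrative assumptions; the description does not fix these details:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, D = 4, 5, 8
x = rng.standard_normal((H, W, D))          # encoded elements on a 2-D grid

# Linear module: shared projections produce Q, K, V (illustrative weights).
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, k, v):
    # attention over the next-to-last axis of k/v for each query position
    s = np.einsum('...id,...jd->...ij', q, k) / np.sqrt(D)
    return np.einsum('...ij,...jd->...id', softmax(s), v)

# Attention module 1: along the horizontal axis (within each row).
feat_h = attend(q, k, v)
# Attention module 2: along the vertical axis (within each column).
swap = lambda t: t.transpose(1, 0, 2)
feat_v = swap(attend(swap(q), swap(k), swap(v)))

# Weighting module: combine the per-axis feature vectors.
alpha = np.array([0.5, 0.5])                # could be learned parameters
feature = alpha[0] * feat_h + alpha[1] * feat_v
assert feature.shape == (H, W, D)
```

Each element thus attends only along one axis per attention module, and the weighting module merges the N per-axis results into the final feature vector.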
In a possible implementation, the neural network further includes an (N+1)-th attention computation module 1424 and a feature fusion module 1425.
The linear module 1421 is further configured to generate a second query vector, a second key vector, and a second value vector based on the encoded data of at least one preset element.
The (N+1)-th attention computation module 1424 is configured to obtain, according to the second query vector, the second key vector, and the second value vector, a feature vector corresponding to the at least one preset element; the feature vector corresponding to a preset element is used to characterize the degree of association between the preset element and the multiple elements.
The feature fusion module 1425 is configured to perform feature fusion on the feature vector of the data to be processed and the feature vector of the at least one preset element.
In a possible implementation, the encoded data of the at least one preset element is obtained through multiple rounds of adjustment as a network parameter of the neural network during training of the neural network.
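Read together, the passages above resemble a learnable token that attends over the data elements. The following sketch is one plausible reading, with random stand-ins for the trained token and projection weights, and with concatenation as an assumed form of feature fusion:

```python
import numpy as np

rng = np.random.default_rng(1)
L_, D = 10, 8
x = rng.standard_normal((L_, D))       # encoded data of the multiple elements
token = rng.standard_normal((1, D))    # preset element: a trained parameter

Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
q = token @ Wq                         # second query vector (from the token)
k, v = x @ Wk, x @ Wv                  # second key/value vectors (from data)

s = (q @ k.T) / np.sqrt(D)
w = np.exp(s - s.max()); w /= w.sum()  # association of token with each element
token_feat = w @ v                     # feature vector of the preset element

fused = np.concatenate([x, token_feat], axis=0)   # simple feature fusion
```

In training, `token` would be updated by backpropagation like any other network parameter, which matches the "multiple rounds of adjustment" described above.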
Next, an execution device provided by an embodiment of this application is introduced. Refer to Figure 15, a schematic structural diagram of an execution device provided by an embodiment of this application. The execution device may be embodied as a mobile phone, a tablet, a laptop computer, a smart wearable device, a server, or the like, which is not limited here. Specifically, an embodiment of this application further provides another structure of the apparatus. As shown in Figure 15, the execution device 1500 may include a communication interface 1510 and a processor 1520. Optionally, the execution device 1500 may further include a memory 1530, which may be disposed inside or outside the device. In one example, each unit shown in Figure 14 may be implemented by the processor 1520. In another example, the function of the input unit is implemented by the communication interface 1510, and the function of the processing unit 1420 is implemented by the processor 1520. The processor 1520 receives the data to be processed through the communication interface 1510 and is configured to implement the methods described in Figure 3 and Figures 6 to 13. During implementation, each step of the processing flow may complete the methods described in Figure 3 and Figures 6 to 13 through an integrated logic circuit of hardware in the processor 1520 or through instructions in the form of software.
In the embodiments of this application, the communication interface 1510 may be a circuit, a bus, a transceiver, or any other apparatus that can be used for information exchange, where, for example, the other apparatus may be a device connected to the execution device 1500.
In the embodiments of this application, the processor 1520 may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the methods disclosed in the embodiments of this application may be performed directly by a hardware processor, or performed by a combination of hardware in the processor and software units. The program code executed by the processor 1520 to implement the above methods may be stored in the memory 1530. The memory 1530 is coupled to the processor 1520.
The coupling in the embodiments of this application is an indirect coupling or communication connection between apparatuses, units, or modules, which may be electrical, mechanical, or in other forms, and is used for information exchange between the apparatuses, units, or modules.
The processor 1520 may operate in cooperation with the memory 1530. The memory 1530 may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM). The memory 1530 is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The specific connection medium between the communication interface 1510, the processor 1520, and the memory 1530 is not limited in the embodiments of this application. In the embodiments of this application, the memory 1530, the processor 1520, and the communication interface 1510 are connected through a bus in Figure 15, the bus being represented by a thick line in Figure 15; the connection manners between other components are merely illustrative and not limiting. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in Figure 15, but this does not mean that there is only one bus or one type of bus.
Based on the above embodiments, an embodiment of this application further provides a computer storage medium storing a software program. When the software program is read and executed by one or more processors, it can implement the method provided by any one or more of the above embodiments. The computer storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random-access memory, a magnetic disk, or an optical disc.
Based on the above embodiments, an embodiment of this application further provides a chip. The chip includes a processor configured to implement the functions involved in any one or more of the above embodiments, for example obtaining or processing the information or messages involved in the above methods. Optionally, the chip further includes a memory for the program instructions and data necessary for the processor. The chip may consist of the chip alone, or may include the chip together with other discrete devices.
Specifically, refer to Figure 16, a schematic structural diagram of a chip provided by an embodiment of this application. The chip may be embodied as a neural-network processing unit NPU 1600. The NPU 1600 is mounted as a coprocessor to a host CPU, and the host CPU allocates tasks to it. The core part of the NPU is the arithmetic circuit 1603; the controller 1604 controls the arithmetic circuit 1603 to extract matrix data from memory and perform multiplication operations.
In some implementations, the arithmetic circuit 1603 internally includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 1603 is a two-dimensional systolic array. The arithmetic circuit 1603 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1603 is a general-purpose matrix processor.
For example, assume there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 1602 and caches it on each PE in the arithmetic circuit. The arithmetic circuit fetches the data of matrix A from the input memory 1601, performs matrix operations with matrix B, and stores partial or final results of the resulting matrix in the accumulator 1608.
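The accumulation of partial results described above can be mimicked in software. The rank-1 update below is only a behavioral illustration of the PE/accumulator flow, not the NPU's actual hardware dataflow:

```python
import numpy as np

A = np.arange(6, dtype=float).reshape(2, 3)  # input matrix A (input memory)
B = np.ones((3, 2))                          # weight matrix B (weight memory)

acc = np.zeros((2, 2))                       # accumulator 1608: partial sums
for kk in range(A.shape[1]):                 # one reduction step at a time
    acc += np.outer(A[:, kk], B[kk, :])      # rank-1 partial product
assert np.allclose(acc, A @ B)               # final result equals A @ B
```

Each loop iteration corresponds to one reduction step: the accumulator holds a partial result of the output matrix until the last step produces the final result.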
The unified memory 1606 is used to store input data and output data. Weight data is transferred directly to the weight memory 1602 through the direct memory access controller (DMAC) 1605, and input data is likewise transferred to the unified memory 1606 through the DMAC.
The BIU is the bus interface unit 1610, used for the interaction between the AXI bus and both the DMAC and the instruction fetch buffer (IFB) 1609.
The bus interface unit 1610 (BIU) is used by the instruction fetch buffer 1609 to obtain instructions from external memory, and is also used by the storage-unit access controller 1605 to obtain the original data of input matrix A or weight matrix B from external memory.
The DMAC is mainly used to transfer input data from the external memory DDR to the unified memory 1606, to transfer weight data to the weight memory 1602, or to transfer input data to the input memory 1601.
The vector computation unit 1607 includes multiple arithmetic processing units and, when needed, performs further processing on the output of the arithmetic circuit 1603, such as vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. It is mainly used for computations of non-convolutional/non-fully-connected layers in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
In some implementations, the vector computation unit 1607 can store processed output vectors to the unified memory 1606. For example, the vector computation unit 1607 may apply a linear or nonlinear function to the output of the arithmetic circuit 1603, for example performing linear interpolation on the feature planes extracted by a convolutional layer, or, as another example, applying it to a vector of accumulated values to generate activation values. In some implementations, the vector computation unit 1607 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vectors can be used as activation inputs to the arithmetic circuit 1603, for example for use in subsequent layers of the neural network.
The instruction fetch buffer 1609 connected to the controller 1604 is used to store instructions used by the controller 1604. The unified memory 1606, the input memory 1601, the weight memory 1602, and the instruction fetch buffer 1609 are all on-chip memories. The external memory is private to this NPU hardware architecture.
The processor mentioned in any of the above places may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above programs.
In addition, it should be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solutions of the embodiments. In addition, in the drawings of the apparatus embodiments provided in this application, a connection relationship between modules indicates that they have a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
Those skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, optical storage, and the like) containing computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the embodiments of this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to this application without departing from the scope of this application. In this way, if these modifications and variations of this application fall within the scope of the claims of this application and their technical equivalents, this application is also intended to include these modifications and variations.

Claims (19)

  1. A data processing method, characterized in that it comprises:
    receiving data to be processed, the data to be processed comprising encoded data of a plurality of elements; and
    performing feature extraction on the data to be processed through a neural network, to obtain, for each element of the plurality of elements, feature vectors respectively corresponding to a plurality of coordinate axes, and performing weighted processing on the feature vectors of each element of the plurality of elements respectively corresponding to the plurality of coordinate axes, to obtain a feature vector of the data to be processed.
  2. The method according to claim 1, characterized in that a feature vector of a first element corresponding to a first coordinate axis is used to characterize the correlation between the first element and the other elements in a first region in which the first element is located; the first element is any element of the plurality of elements; the other elements in the first region in which the first element is located are mapped to positions on coordinate axes other than the first coordinate axis that are the same as and/or adjacent to the positions to which the first element is mapped on those other coordinate axes; and the first coordinate axis is any one of the plurality of coordinate axes.
  3. The method according to claim 2, characterized in that obtaining the feature vectors of each element of the plurality of elements respectively corresponding to the plurality of coordinate axes comprises:
    performing attention computation between the first element and the other elements in the first region corresponding to the first element, to obtain attention values between the first element and each of the other elements, the first element being any element of the plurality of elements; and
    performing weighted processing according to the attention values between the first element and the other elements, to obtain the feature vector of the first element corresponding to the first coordinate axis.
  4. The method according to any one of claims 1 to 3, characterized in that the data to be processed comprises audio data, the audio data comprises a plurality of audio points, and each audio point is mapped to a time coordinate axis and a frequency coordinate axis; or
    the data to be processed comprises image data, the image data comprises a plurality of pixels or image blocks, and each pixel or image block is mapped to a horizontal coordinate axis and a vertical coordinate axis; or
    the data to be processed comprises video data, the video data comprises a plurality of video frames, each video frame comprises a plurality of pixels or image blocks, and each pixel or image block is mapped to a time coordinate axis and to a spatial horizontal coordinate axis and a spatial vertical coordinate axis.
  5. The method according to any one of claims 1 to 4, characterized in that the plurality of coordinate axes comprise a first coordinate axis and a second coordinate axis, and performing feature extraction on the data to be processed through the neural network to obtain the feature vectors of each element of the plurality of elements respectively corresponding to the plurality of coordinate axes comprises:
    generating, through the neural network, a first query vector, a first key vector, and a first value vector based on the data to be processed;
    obtaining, according to the first query vector, the first key vector, and the first value vector, the feature vector of each element of the plurality of elements corresponding to the first coordinate axis; and
    obtaining, according to the first query vector, the first key vector, and the first value vector, the feature vector of each element of the plurality of elements corresponding to the second coordinate axis.
  6. The method according to claim 5, characterized in that the method further comprises:
    generating a second query vector, a second key vector, and a second value vector based on encoded data of at least one preset element;
    obtaining, according to the second query vector, the second key vector, and the second value vector, a feature vector corresponding to the at least one preset element, the feature vector corresponding to a preset element being used to characterize the degree of association between the preset element and the plurality of elements; and
    performing feature fusion on the feature vector of the data to be processed and the feature vector of the at least one preset element.
  7. The method according to claim 6, characterized in that the encoded data of the at least one preset element is obtained through multiple rounds of adjustment as a network parameter of the neural network during training of the neural network.
  8. A data processing apparatus, characterized in that it comprises:
    an input unit, configured to receive data to be processed, the data to be processed comprising encoded data of a plurality of elements; and
    a processing unit, configured to perform feature extraction on the data to be processed through a neural network, to obtain, for each element of the plurality of elements, feature vectors respectively corresponding to a plurality of coordinate axes, and to perform weighted processing on the feature vectors of each element of the plurality of elements respectively corresponding to the plurality of coordinate axes, to obtain a feature vector of the data to be processed.
  9. The apparatus according to claim 8, characterized in that a feature vector of a first element corresponding to a first coordinate axis is used to characterize the correlation between the first element and the other elements in a first region in which the first element is located; the first element is any element of the plurality of elements; the other elements in the first region in which the first element is located are mapped to positions on coordinate axes other than the first coordinate axis that are the same as and/or adjacent to the positions to which the first element is mapped on those other coordinate axes; and the first coordinate axis is any one of the plurality of coordinate axes.
  10. The apparatus according to claim 9, characterized in that the processing unit is specifically configured to:
    perform attention computation between the first element and the other elements in the first region corresponding to the first element, to obtain attention values between the first element and each of the other elements, the first element being any element of the plurality of elements; and
    perform weighted processing according to the attention values between the first element and the other elements, to obtain the feature vector of the first element corresponding to the first coordinate axis.
  11. The apparatus according to any one of claims 8 to 10, characterized in that the data to be processed comprises audio data, the audio data comprises a plurality of audio points, and each audio point is mapped to a time coordinate axis and a frequency coordinate axis; or
    the data to be processed comprises image data, the image data comprises a plurality of pixels or image blocks, and each pixel or image block is mapped to a horizontal coordinate axis and a vertical coordinate axis; or
    the data to be processed comprises video data, the video data comprises a plurality of video frames, each video frame comprises a plurality of pixels or image blocks, and each pixel or image block is mapped to a time coordinate axis and to a spatial horizontal coordinate axis and a spatial vertical coordinate axis.
  12. The apparatus according to any one of claims 8 to 11, characterized in that the number of coordinate axes is equal to N;
    the linear module is configured to generate a first query vector, a first key vector, and a first value vector based on the data to be processed;
    an i-th attention computation module is configured to obtain, according to the first query vector, the first key vector, and the first value vector, the feature vector of each element of the plurality of elements corresponding to an i-th coordinate axis, i being a positive integer less than or equal to N; and
    a weighting module is configured to weight the feature vectors of each element of the plurality of elements respectively corresponding to the N coordinate axes.
  13. The apparatus according to claim 12, characterized in that the neural network further comprises an (N+1)-th attention computation module and a feature fusion module;
    the linear module is further configured to generate a second query vector, a second key vector, and a second value vector based on encoded data of at least one preset element;
    the (N+1)-th attention computation module is configured to obtain, according to the second query vector, the second key vector, and the second value vector, a feature vector corresponding to the at least one preset element, the feature vector corresponding to a preset element being used to characterize the degree of association between the preset element and the plurality of elements; and
    the feature fusion module is configured to perform feature fusion on the feature vector of the data to be processed and the feature vector of the at least one preset element.
  14. The apparatus according to claim 13, characterized in that the encoded data of the at least one preset element is obtained through multiple rounds of adjustment as a network parameter of the neural network during training of the neural network.
  15. A data processing system, characterized in that it comprises a user equipment and a cloud service device;
    the user equipment is configured to send a service request to the cloud service device, the service request carrying data to be processed, the data to be processed comprising encoded data of a plurality of elements, and the service request being used to request the cloud service device to complete a specified processing task on the data to be processed;
    the cloud service device is configured to perform feature extraction on the data to be processed through a neural network, to obtain, for each element of the plurality of elements, feature vectors respectively corresponding to a plurality of coordinate axes, and to perform weighted processing on the feature vectors of each element of the plurality of elements respectively corresponding to the plurality of coordinate axes, to obtain a feature vector of the data to be processed; to complete the specified processing task according to the feature vector of the data to be processed, to obtain a processing result; and to send the processing result to the user equipment; and
    the user equipment is further configured to receive the processing result from the cloud service device.
  16. An electronic device, characterized in that the electronic device comprises at least one processor and a memory;
    the memory stores instructions; and
    the at least one processor is configured to execute the instructions stored in the memory, to implement the method according to any one of claims 1 to 7.
  17. A chip system, wherein the chip system comprises at least one processor and a communication interface, the communication interface and the at least one processor being interconnected through a line;
    the communication interface is configured to receive data to be processed; and
    the processor is configured to perform the method according to any one of claims 1 to 7 on the data to be processed.
  18. A computer storage medium, wherein the computer storage medium stores instructions which, when executed by a computer, cause the computer to implement the method according to any one of claims 1 to 7.
  19. A computer program product, wherein the computer program product stores instructions which, when executed by a computer, cause the computer to implement the method according to any one of claims 1 to 7.
PCT/CN2023/093668 2022-05-24 2023-05-11 Data processing method and apparatus WO2023226783A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210569598.0A CN117172297A (en) 2022-05-24 2022-05-24 Data processing method and device
CN202210569598.0 2022-05-24

Publications (1)

Publication Number Publication Date
WO2023226783A1 true WO2023226783A1 (en) 2023-11-30

Family

ID=88918423

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/093668 WO2023226783A1 (en) 2022-05-24 2023-05-11 Data processing method and apparatus

Country Status (2)

Country Link
CN (1) CN117172297A (en)
WO (1) WO2023226783A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046847A (en) * 2019-12-30 2020-04-21 北京澎思科技有限公司 Video processing method and device, electronic equipment and medium
US20200193206A1 (en) * 2018-12-18 2020-06-18 Slyce Acquisition Inc. Scene and user-input context aided visual search
CN112183335A (en) * 2020-09-28 2021-01-05 中国人民大学 Handwritten image recognition method and system based on unsupervised learning


Also Published As

Publication number Publication date
CN117172297A (en) 2023-12-05

Similar Documents

Publication Publication Date Title
WO2021159714A1 (en) Data processing method and related device
CN111797893B (en) Neural network training method, image classification system and related equipment
US20220180202A1 (en) Text processing model training method, and text processing method and apparatus
Gao et al. MSCFNet: A lightweight network with multi-scale context fusion for real-time semantic segmentation
WO2022007823A1 (en) Text data processing method and device
CN113762322B (en) Video classification method, device and equipment based on multi-modal representation and storage medium
WO2021164772A1 (en) Method for training cross-modal retrieval model, cross-modal retrieval method, and related device
WO2019228358A1 (en) Deep neural network training method and apparatus
WO2022022274A1 (en) Model training method and apparatus
WO2022001805A1 (en) Neural network distillation method and device
WO2022253074A1 (en) Data processing method and related device
WO2020104499A1 (en) Action classification in video clips using attention-based neural networks
WO2023165361A1 (en) Data processing method and related device
WO2020062299A1 (en) Neural network processor, data processing method and related device
US20240185086A1 (en) Model distillation method and related device
WO2021136058A1 (en) Video processing method and device
WO2024001806A1 (en) Data valuation method based on federated learning and related device therefor
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN116958324A (en) Training method, device, equipment and storage medium of image generation model
Yi et al. Elanet: effective lightweight attention-guided network for real-time semantic segmentation
WO2024179485A1 (en) Image processing method and related device thereof
CN112529149A (en) Data processing method and related device
CN114282055A (en) Video feature extraction method, device and equipment and computer storage medium
CN113657272B (en) Micro video classification method and system based on missing data completion
WO2024114659A1 (en) Summary generation method and related device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23810856

Country of ref document: EP

Kind code of ref document: A1