WO2023226783A1 - Data processing method and apparatus - Google Patents

Data processing method and apparatus

Info

Publication number
WO2023226783A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
elements
vector
processed
attention
Prior art date
Application number
PCT/CN2023/093668
Other languages
French (fr)
Chinese (zh)
Inventor
蔡创坚
胡芝兰
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023226783A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a data processing method and device.
  • Self-attention networks have been widely applied in many natural language processing (NLP) tasks, such as machine translation, sentiment analysis, and question answering.
  • Self-attention networks, which originated in the field of natural language processing, have also achieved high performance in tasks such as image classification, object detection, and image processing.
  • The key to self-attention networks is to learn an alignment in which each element in the sequence learns to gather information from the other elements in the sequence.
  • The self-attention network differs from the general attention network in that it pays more attention to the internal correlations of the data or features and reduces the dependence on external information.
  • However, the currently used self-attention network obtains, for each element, relevant information from all other elements, resulting in high computational consumption.
  • Embodiments of the present application provide a data processing method and device to solve the problem of high computational consumption caused by calculating, for each element, its relevant information with respect to all other elements when using the current self-attention network.
  • Embodiments of the present application provide a data processing method, including: receiving data to be processed, where the data to be processed includes encoded data of multiple elements; performing feature extraction on the data to be processed through a neural network to obtain, for each of the multiple elements, a feature vector corresponding to each of multiple coordinate axes; and weighting the feature vectors of the multiple coordinate axes corresponding to each of the multiple elements to obtain the feature vector of the data to be processed.
  • the neural network is a self-attention network.
  • In this way, the attention calculation is no longer performed between one element and all other elements; instead, attention is calculated among the elements on the same coordinate axis, the calculation is performed separately for each of the coordinate axes, and the results are then weighted, which reduces computational consumption.
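As a rough illustration of the savings (not from the patent text): for a 2-D input of H × W elements, full self-attention forms H·W attention pairs for each of the H·W elements, while per-axis attention forms only H + W pairs per element:

```python
def full_attention_pairs(h: int, w: int) -> int:
    # every element attends to all h*w elements
    n = h * w
    return n * n

def axial_attention_pairs(h: int, w: int) -> int:
    # every element attends to the w elements of its row
    # and the h elements of its column
    return h * w * (h + w)

print(full_attention_pairs(32, 32))   # 1048576 pairs
print(axial_attention_pairs(32, 32))  # 65536 pairs
```

For a 32 × 32 grid the per-axis scheme computes 16× fewer attention pairs, and the gap widens as the grid grows.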
  • This application can be applied to computer vision or natural language processing, including machine translation, automatic summary generation, opinion extraction, text classification, question answering, text semantic comparison, speech recognition, image classification, object detection, semantic segmentation, and image generation.
  • The neural network may be a neural network used to classify images, to segment images, to detect images, to recognize images, to generate a specified image, to translate text, to paraphrase text, to generate specified text, to recognize speech, to translate speech, or to generate specified speech, etc.
  • the data to be processed can be audio data, video data, image data, text data, etc.
  • receiving the data to be processed includes receiving a service request from the user equipment, and the service request carries the data to be processed.
  • a service request is used to request completion of a specified processing task for the data to be processed.
  • the method further includes: completing the specified processing task according to the feature vector of the data to be processed to obtain a processing result, and sending the processing result to the user equipment.
  • For example, if the designated processing task is image classification, the attention network is an attention network used to classify images; after the feature vector is obtained, the image can be further classified according to the feature vector to obtain the classification result. For another example, if the designated processing task is image segmentation, the attention network is an attention network used to segment images; after the feature vector is obtained, the image can be further segmented based on the feature vector to obtain the segmentation result. For another example, if the designated processing task is image detection, the attention network is an attention network used to detect images; after the feature vector is obtained, image detection can be further performed based on the feature vector to obtain the detection result. For another example, if the designated processing task is speech recognition, the attention network is an attention network used to recognize speech.
  • speech recognition can be further performed based on the feature vector to obtain the recognition result.
  • the attention network is an attention network used to translate speech.
  • speech translation can be further performed based on the feature vector to obtain the translation result.
  • the feature vectors of elements corresponding to different coordinate axes are orthogonal to each other.
  • the feature vector corresponding to the first element on the first coordinate axis is used to represent the correlation between the first element and other elements in the first region where the first element is located;
  • the first element is any element among the plurality of elements; the other elements in the first region where the first element is located are elements whose positions, when mapped to the coordinate axes other than the first coordinate axis, are the same as and/or adjacent to the positions of the first element mapped to those other coordinate axes; the first coordinate axis is any coordinate axis among the plurality of coordinate axes.
  • For example, if the first coordinate axis is the horizontal coordinate axis, the elements whose coordinates in the vertical direction are the same as or adjacent to those of a given element participate in the calculation of that element's feature vector on the horizontal coordinate axis. Since all elements can participate in the calculations, global modeling can be achieved while the computational complexity is reduced.
  • Obtaining the feature vectors on the multiple coordinate axes corresponding to each of the plurality of elements includes: performing attention calculations between the first element and the other elements in the first region corresponding to the first element to obtain the attention values between the first element and the other elements, where the first element is any element among the plurality of elements; and performing weighting processing according to the attention values between the first element and the other elements to obtain the feature vector on the first coordinate axis corresponding to the first element.
  • The positions of two elements mapped to the other coordinate axes may be strictly adjacent, or the interval between the two elements' mapped positions may be within a set distance.
  • the data to be processed includes audio data
  • the audio data includes multiple audio points
  • each audio point is mapped to a time coordinate axis and a frequency coordinate axis;
  • the data to be processed includes image data, the image data includes a plurality of pixel points or image blocks, each pixel point or image block is mapped to a horizontal coordinate axis and a vertical coordinate axis; or,
  • the data to be processed includes video data.
  • the video data includes multiple video frames.
  • Each video frame includes multiple pixel points or image blocks.
  • Each pixel point or image block is mapped to a time coordinate axis and, in space, a horizontal coordinate axis and a vertical coordinate axis.
  • Performing feature extraction on the data to be processed through a neural network to obtain feature vectors corresponding to each of the multiple elements on multiple coordinate axes includes: the multiple coordinate axes include a first coordinate axis and a second coordinate axis; generating, through the neural network, a first query vector, a first key value vector and a first value vector based on the data to be processed; obtaining, according to the first query vector, the first key value vector and the first value vector, a feature vector corresponding to each element of the plurality of elements on the first coordinate axis; and obtaining, according to the first query vector, the first key value vector and the first value vector, a feature vector corresponding to each element of the plurality of elements on the second coordinate axis.
  • For 2-dimensional data to be processed that includes m rows and n columns:
  • q_(i,j) represents the Query of the element at position (i, j);
  • k_(i,j) represents the Key of the element at position (i, j);
  • v_(i,j) represents the Value of the element at position (i, j);
  • the value range of i is 0 to m-1, and the value range of j is 0 to n-1.
  • The feature vector corresponding to each element on the first coordinate axis can be determined using the following formula (the original equation image is not reproduced; the formula below is reconstructed in the standard scaled dot-product attention form that the surrounding text describes):

    h1_(i,j) = Σ_{j'=0..n-1} softmax_{j'}( q_(i,j) · k_(i,j') / sqrt(d_k) ) · v_(i,j')

    where d_k is the number of dimensions of the input data, and h1_(i,j) represents the feature vector of the element at position (i, j) corresponding to axis 1;
  • the feature vector corresponding to each element on the second coordinate axis can be determined analogously:

    h2_(i,j) = Σ_{i'=0..m-1} softmax_{i'}( q_(i,j) · k_(i',j) / sqrt(d_k) ) · v_(i',j)

  • the feature vector corresponding to the element on the first coordinate axis is weighted with the feature vector corresponding to the element on the second coordinate axis, determined using the following formula:

    h'_(i,j) = w1 · h1_(i,j) + w2 · h2_(i,j)

    where h'_(i,j) represents the feature vector of the element at position (i, j), w1 represents the weight of axis 1, and w2 represents the weight of axis 2.
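A minimal NumPy sketch of the two-axis attention described above (an illustration, not the patent's implementation; the tensor shapes, the softmax placement, and default weights w1 = w2 = 0.5 are assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(q, k, v, w1=0.5, w2=0.5):
    """q, k, v: arrays of shape (m, n, d_k) for an m-row, n-column grid."""
    d_k = q.shape[-1]
    scale = np.sqrt(d_k)
    # axis 1: each element attends only to the elements of its own row
    row_scores = np.einsum('ijd,ikd->ijk', q, k) / scale      # (m, n, n)
    h1 = np.einsum('ijk,ikd->ijd', softmax(row_scores), v)
    # axis 2: each element attends only to the elements of its own column
    col_scores = np.einsum('ijd,kjd->ijk', q, k) / scale      # (m, n, m)
    h2 = np.einsum('ijk,kjd->ijd', softmax(col_scores), v)
    # weight the two per-axis feature vectors into the final feature vector
    return w1 * h1 + w2 * h2
```

Because each softmax row sums to 1, feeding a constant value tensor through either axis returns that constant, so with w1 + w2 = 1 the output is the constant itself — a quick sanity check for the weighting step.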
  • The method further includes: generating a second query vector, a second key value vector and a second value vector based on the encoded data of at least one setting element; obtaining the feature vector corresponding to the at least one setting element according to the second query vector, the second key value vector and the second value vector, where the feature vector corresponding to the setting element is used to characterize the degree of correlation between the setting element and the multiple elements; and performing feature fusion on the feature vector of the data to be processed and the feature vector of the at least one setting element.
  • The encoded data of the at least one setting element may include classification bits and/or distillation bits, making the method applicable to classification scenarios as well.
  • the encoded data of the at least one setting element is obtained as a network parameter of the neural network through multiple rounds of adjustments during the process of training the neural network.
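The classification and distillation bits here play a role similar to the learnable class and distillation tokens in vision transformers. A hypothetical sketch of prepending such trained setting elements to the element encodings (the names, shapes, and concatenation layout are assumptions, not taken from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
# encoded data of the elements to be processed, e.g. 16 image patches
patches = rng.normal(size=(16, d_model))
# "setting element" encodings: one classification token and one
# distillation token, adjusted over many training rounds as network
# parameters rather than derived from the input
cls_token = rng.normal(size=(1, d_model))
dist_token = rng.normal(size=(1, d_model))
# processed together with the elements, so their output feature vectors
# characterize the degree of correlation with all elements
tokens = np.concatenate([cls_token, dist_token, patches], axis=0)
print(tokens.shape)  # (18, 8)
```

At inference, the output feature vector at the classification-token position would feed the classification head, and the distillation-token position the distillation head.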
  • embodiments of the present application provide a data processing device, including:
  • An input unit configured to receive data to be processed, where the data to be processed includes encoded data of multiple elements
  • a processing unit configured to perform feature extraction on the data to be processed through a neural network to obtain feature vectors corresponding to multiple coordinate axes for each of the multiple elements, and to weight the feature vectors of the multiple coordinate axes corresponding to each of the multiple elements to obtain the feature vector of the data to be processed.
  • the feature vector corresponding to the first element on the first coordinate axis is used to represent the correlation between the first element and other elements in the first region where the first element is located;
  • the first element is any element among the plurality of elements; the other elements in the first region where the first element is located are elements whose positions, when mapped to the coordinate axes other than the first coordinate axis, are the same as and/or adjacent to the positions of the first element mapped to those other coordinate axes; the first coordinate axis is any coordinate axis among the plurality of coordinate axes.
  • The processing unit is specifically configured to perform attention calculations between the first element and the other elements in the first region corresponding to the first element to obtain the attention values between the first element and the other elements.
  • the data to be processed includes audio data
  • the audio data includes multiple audio points
  • each audio point is mapped to a time coordinate axis and a frequency coordinate axis;
  • the data to be processed includes image data, the image data includes a plurality of pixel points or image blocks, each pixel point or image block is mapped to a horizontal coordinate axis and a vertical coordinate axis; or,
  • the data to be processed includes video data.
  • the video data includes multiple video frames.
  • Each video frame includes multiple pixel points or image blocks.
  • Each pixel point or image block is mapped to a time coordinate axis and, in space, a horizontal coordinate axis and a vertical coordinate axis.
  • The number of coordinate axes is equal to N; the linear module is used to generate the first query vector, the first key value vector and the first value vector based on the data to be processed; the i-th attention calculation module is configured to obtain, according to the first query vector, the first key value vector and the first value vector, the feature vector corresponding to the i-th coordinate axis for each of the plurality of elements; i is a positive integer less than or equal to N;
  • a weighting module is used to weight the feature vectors on N coordinate axes corresponding to each element in the plurality of elements.
  • the neural network also includes an (N+1)-th attention calculation module and a feature fusion module;
  • the linear module is configured to generate a second query vector, a second key value vector and a second value vector based on the encoded data of at least one setting element;
  • the (N+1)-th attention calculation module is used to obtain the feature vector corresponding to the at least one setting element according to the second query vector, the second key value vector, and the second value vector;
  • the corresponding feature vector of the setting element is used to represent the degree of association between the setting element and the multiple elements;
  • the feature fusion module is used to perform feature fusion on the feature vector of the data to be processed and the feature vector of the at least one setting element.
  • the encoded data of the at least one setting element is obtained as a network parameter of the neural network through multiple rounds of adjustments during the process of training the neural network.
  • this application provides a data processing system, including user equipment and cloud service equipment;
  • the user equipment is used to send a service request to the cloud service device, where the service request carries data to be processed, and the data to be processed includes encoded data of multiple elements; the service request is used to request the cloud service device to process the data to be processed to complete a designated processing task;
  • the cloud service device is used to perform feature extraction on the data to be processed through a neural network to obtain feature vectors corresponding to multiple coordinate axes for each of the multiple elements, weight the feature vectors of the multiple coordinate axes corresponding to each of the multiple elements to obtain the feature vector of the data to be processed, complete the designated processing task according to the feature vector of the data to be processed to obtain the processing result, and send the processing result to the user equipment;
  • the user equipment is also configured to receive the processing result from the cloud service equipment.
  • Embodiments of the present application provide an electronic device.
  • The electronic device includes at least one processor and a memory; instructions are stored in the memory; the at least one processor is configured to execute the instructions stored in the memory to implement the method described in the first aspect or any design of the first aspect.
  • the electronic device may also be called an execution device and is used to execute the data processing method provided by this application.
  • Embodiments of the present application provide a chip system.
  • The chip system includes at least one processor and a communication interface.
  • The communication interface and the at least one processor are interconnected through lines; the communication interface is used to receive data to be processed; the processor is configured to execute the method of the first aspect or any design of the first aspect on the data to be processed.
  • embodiments of the present application provide a computer-readable medium for storing a computer program, where the computer program includes instructions for executing the method in the first aspect or any optional implementation of the first aspect.
  • Embodiments of the present application provide a computer program product.
  • The computer program product stores instructions that, when executed by a computer, cause the computer to implement the method described in the first aspect or any optional design of the first aspect.
  • Figure 1 is a structural schematic diagram of the main framework of artificial intelligence
  • Figure 2 is a schematic diagram of a system architecture 200 provided by an embodiment of the present application.
  • Figure 3 is a schematic flow chart of a data processing method provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram of an axial attention calculation provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of another axial attention calculation provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of the processing flow of an independent superimposed attention network provided by an embodiment of the present application.
  • Figure 7 is a schematic diagram of the processing flow of another independent superimposed attention network provided by the embodiment of the present application.
  • Figure 8 is a schematic structural diagram of a transformer module illustrated in the embodiment of this application.
  • Figure 9 is a schematic structural diagram of another transformer module illustrated in the embodiment of the present application.
  • Figure 10 is a schematic structural diagram of a classification network model provided by an embodiment of the present application.
  • Figure 11 is a schematic diagram of the workflow of the classification network model provided by the embodiment of the present application.
  • Figure 12 is a schematic workflow diagram of an image segmentation network model provided by an embodiment of the present application.
  • Figure 13 is a schematic workflow diagram of a video classification network model provided by an embodiment of the present application.
  • Figure 14 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • Figure 15 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
  • Figure 16 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • Figure 1 shows a structural schematic diagram of the artificial intelligence main framework.
  • The artificial intelligence main framework is elaborated below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has gone through the condensation process of "data-information-knowledge-wisdom".
  • The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (providing and processing technology implementations) of artificial intelligence to the systematic industrial ecological process.
  • Infrastructure provides computing power support for artificial intelligence systems, enables communication with the external world, and supports it through basic platforms.
  • computing power is provided by smart chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, FPGA, etc.);
  • The basic platform includes distributed computing frameworks, networks, and other related platform guarantees and support, which can include cloud storage and computing, interconnection networks, etc.
  • sensors communicate with the outside world to obtain data, which are provided to smart chips in the distributed computing system provided by the basic platform for calculation.
  • The data at the layer above the infrastructure represents the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
  • machine learning and deep learning can perform symbolic and formal intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formal information to perform machine thinking and problem solving based on reasoning control strategies. Typical functions are search and matching.
  • Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
  • Some general capabilities can be formed based on the results of further data processing, such as algorithms or a general system, for example translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of overall artificial intelligence solutions, productizing intelligent information decision-making and realizing practical applications. The application fields mainly include: intelligent manufacturing, intelligent transportation, smart home, smart medical care, smart security, autonomous driving, smart city, smart terminal, etc.
  • the embodiments of the present application relate to the application of neural networks. To facilitate understanding, the relevant terms involved in the embodiments of the present application and related concepts such as neural networks are first introduced below.
  • The work of each layer in a neural network can be described mathematically as y = a(W·x + b). At the physical level, the work of each layer can be understood as completing a transformation from the input space (a collection of input vectors) to the output space (that is, from the row space to the column space of the matrix) through five operations on the input space: 1. raising/reducing the dimension; 2. zooming in/out; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are completed by W·x, operation 4 is completed by +b, and operation 5 is implemented by a(). The word "space" is used here because the object to be classified is not a single thing, but a class of things.
  • W is a weight vector, and each value in the vector represents the weight value of a neuron in the neural network of this layer.
  • This vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how to transform the space.
  • the purpose of training a neural network is to finally obtain the weight matrix of all layers of the trained neural network (a weight matrix formed by the vector W of many layers). Therefore, the training process of neural network is essentially to learn how to control spatial transformation, and more specifically, to learn the weight matrix.
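The per-layer transformation described above can be sketched directly; the matrix shapes and the tanh activation are illustrative choices, not fixed by the text:

```python
import numpy as np

def layer(x, W, b):
    # operations 1-3: W @ x raises/reduces dimension, scales and rotates;
    # operation 4: + b translates; operation 5: tanh "bends" the space
    return np.tanh(W @ x + b)

x = np.array([1.0, 2.0])
W = np.array([[0.5, -0.25],
              [1.0,  0.75],
              [-0.5, 0.5]])   # maps a 2-D input to a 3-D output
b = np.zeros(3)
y = layer(x, W, b)
print(y.shape)  # (3,)
```

Training adjusts W and b for every layer; stacking such layers, each with its own weight matrix, gives the overall weight matrix the passage refers to.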
  • loss function; objective function
  • The neural network uses the back propagation algorithm to adjust the values of the network parameters during training, making the reconstruction error loss of the neural network model smaller and smaller. Specifically, forward propagation of the input signal through to the output produces an error loss, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges.
  • the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrix.
  • Linear refers to a proportional, straight-line relationship between quantities; mathematically, it can be understood as a function whose first-order derivative is a constant. Linear operations can be, but are not limited to, addition operations, empty operations, identity operations, convolution operations, layer normalization (LN) operations, and pooling operations. Linear operations can also be called linear mappings. A linear mapping must satisfy two conditions: homogeneity and additivity; if either condition is not met, it is nonlinear.
  • Homogeneity means f(a·x) = a·f(x), and additivity means f(x + y) = f(x) + f(y). The x, a, and f(x) here are not necessarily scalars; they can be vectors or matrices, forming a linear space of any dimension.
  • the combination of multiple linear operations can be called a linear operation, and each linear operation included in the linear operation can also be called a sub-linear operation.
  • the attention model is a neural network that applies the attention mechanism.
  • the attention mechanism can be broadly defined as a weight vector describing the importance: using this weight vector to predict or infer an element. For example, for a certain pixel in an image or a word in a sentence, attention vectors can be used to quantitatively estimate the correlation between the target element and other elements, and the weighted sum of the attention vectors can be used as an approximation of the target.
  • The attention mechanism in deep learning simulates the attention mechanism of the human brain. For example, when humans look at a painting, although their eyes can see the entire painting, when they observe deeply and carefully, their eyes actually focus on only part of the pattern in the entire painting; at this time, the human brain mainly focuses on this small pattern. In other words, when humans carefully observe an image, the brain's attention to the entire image is not balanced but has a certain weight distinction. This is the core idea of the attention mechanism.
  • the human visual processing system tends to selectively focus on certain parts of the image and ignore other irrelevant information, thus contributing to the human brain's perception.
  • In some problems involving language, speech, or vision, some parts of the input may be more relevant than other parts. Through the attention mechanism, the attention model can dynamically focus only on the part of the input that helps it to effectively perform the task at hand.
  • the self-attention network is a neural network that applies the self-attention mechanism.
  • the self-attention mechanism is an extension of the attention mechanism.
  • the self-attention mechanism is actually an attention mechanism that associates different positions of a single sequence to calculate a representation of the same sequence.
  • Self-attention mechanism can play a key role in machine reading, abstract summary or image description generation. Taking the application of self-attention network to natural language processing as an example, the self-attention network processes input data of any length and generates new feature expressions of the input data, and then converts the feature expressions into target words.
  • The self-attention network layer in the self-attention network uses the attention mechanism to obtain the relationships between each word and all other words, thereby generating a new feature expression for each word.
  • the advantage of the self-attention network is that the attention mechanism can directly capture the relationship between all words in the sentence without considering the word position.
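A minimal sketch of the mechanism just described, where each word's new feature expression is a weighted sum over all words regardless of position (the projection matrices and sizes are illustrative, not the patent's):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) word encodings; each word attends to all words."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # new feature expression per word

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                   # e.g. a 5-word sentence
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
H = self_attention(X, Wq, Wk, Wv)
print(H.shape)  # (5, 8)
```

Note the contrast with the per-axis scheme above: here the score matrix covers every pair of positions, which is exactly the cost the patent's axial decomposition avoids.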
  • the data processing method provided by the embodiment of the present application can be executed by an execution device, or the attention model can be deployed in the execution device.
  • An execution device may be implemented by one or more computing devices.
  • Figure 2 shows a system architecture 200 provided by an embodiment of the present application. The system architecture 200 includes an execution device 210.
  • Execution device 210 may be implemented by one or more computing devices. Execution device 210 may be arranged on one physical site, or distributed across multiple physical sites.
  • System architecture 200 also includes data storage system 250.
  • the execution device 210 cooperates with other computing devices, such as data storage, routers, load balancers and other devices.
  • the execution device 210 can use the data in the data storage system 250, or call the program code in the data storage system 250 to implement the data processing method provided by this application.
  • One or more computing devices can be deployed in a cloud network.
  • the data processing method provided by the embodiment of the present application is deployed in one or more computing devices of the cloud network in the form of a service, and the user device accesses the cloud service through the network.
  • when the execution device is one or more computing devices of the cloud network, the execution device may also be called a cloud service device.
  • the data processing method provided by the embodiment of the present application can be deployed on one or more local computing devices in the form of a software tool.
  • Each local device can represent any computing device, such as a smartphone (mobile phone), personal computer (PC), laptop, tablet, smart TV, mobile internet device (MID), wearable device, smart camera, smart car, media consumption device, set-top box, game console, virtual reality (VR) device, augmented reality (AR) device, or a wireless electronic device in industrial control, self-driving, remote medical surgery, smart grid, transportation safety, smart city, or smart home.
  • Each user's local device can interact with the execution device 210 through a communication network of any communication mechanism/communication standard.
  • the communication network can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
  • one or more aspects of the execution device 210 may be implemented by each local device.
  • the local device 301 may provide local data or feedback prediction results to the execution device 210 .
  • the local device 301 implements the functions of the execution device 210 and provides services for its own users, or provides services for users of the local device 302 .
  • the local device 301 may be an electronic device.
  • the electronic device may be a server, smartphone (mobile phone), personal computer (PC), laptop, tablet, smart TV, mobile internet device (MID), wearable device, virtual reality (VR) device, augmented reality (AR) device, or a wireless electronic device in industrial control, self-driving, remote medical surgery, smart grid, transportation safety, smart city, or smart home, etc.
  • the data processing methods and attention models provided by the embodiments of this application can be applied to computer vision or natural language processing. That is, the electronic device or computing device can perform computer vision tasks or natural language processing tasks through the above-mentioned data processing method.
  • natural language processing is an important direction in the field of computer science and artificial intelligence.
  • Natural language processing research can realize various theories and methods for effective communication between humans and computers using natural language.
  • natural language processing tasks mainly include tasks such as machine translation, automatic summary generation, opinion extraction, text classification, question answering, text semantic comparison, and speech recognition.
  • Computer vision is the science of how to make machines "see". More specifically, computer vision refers to using cameras and computers instead of human eyes to identify, track, and measure targets, and further performing graphics processing so that the processed images are more suitable for human observation or for transmission to instruments for detection. Generally speaking, computer vision tasks include image recognition (image classification), object detection, semantic segmentation, and image generation.
  • Image recognition is a common classification problem, also commonly known as image classification. Specifically, in the image recognition task, the input of the neural network is image data, and the output value is the probability that the current image data belongs to each category. Usually, the category with the largest probability value is selected as the predicted category of image data. Image recognition is one of the earliest tasks to successfully apply deep learning. Typical network models include VGG series, Inception series, ResNet series, etc.
  • Target detection refers to automatically detecting the approximate location of common objects in images through algorithms. Bounding boxes are usually used to represent the approximate locations of objects, and the category information of objects in the bounding boxes is classified.
  • Semantic segmentation refers to automatically segmenting and identifying the content in images through algorithms. Semantic segmentation can be understood as the classification problem of each pixel, that is, analyzing the category of the object that each pixel belongs to.
  • Image generation refers to obtaining high-fidelity generated images by learning the distribution of real images and sampling from the learned distribution. For example, a clear image is generated based on a blurred image; a dehazed image is generated based on a hazy image.
  • the self-attention network is used to obtain relevant information for one element from all other elements, it will lead to high computational consumption.
  • One possible way is to use criss-cross attention. Considering only the correlation between elements in a cross-shaped area reduces the complexity compared to calculating attention over all pixels.
  • however, the data dimensions in the row direction and the column direction to which elements are mapped in the different cross directions are generally different, and in some cases the difference is very large. For example, for audio data, video data, etc., the dimension difference may be more than 10 times.
  • the use of cross attention will lead to excessive attention to the axis with the larger dimension, causing the calculation of the smaller dimension to be suppressed by the calculation of the larger dimension.
  • the neural network and data processing method provided by the embodiments of this application independently calculate the correlation of elements on each coordinate axis, that is, attention is calculated for each element along the tensor direction of each coordinate axis separately. During the weighted superposition, this prevents excessive attention to the axis with the higher dimension, which would otherwise suppress the axis with the lower dimension. Therefore, the neural network and data processing method provided by the embodiments of the present application can improve calculation efficiency while improving processing accuracy.
  • the neural network may use a convolutional neural network to implement correlation calculation between elements.
  • the neural network provided by the embodiments of the present application adopts a self-attention mechanism to implement correlation calculation between elements. In this case, the neural network may also be called an attention network.
  • the input of the attention network is data in the form of a sequence, that is, the input data of the attention network is sequence data.
  • the input data of the attention network can be a sequence of sentences composed of multiple consecutive words; for another example, the input data of the attention network can be a sequence of image blocks composed of multiple consecutive image blocks. Continuous image blocks are obtained by segmenting a complete image.
  • Sequence data can be understood as encoded data, such as the encodings of multiple consecutive words.
  • the encoded data of each element is obtained by performing embedding, such as convolution processing. Elements can also be called patches.
  • Each element in the input data can correspond to multiple coordinate axes. The coordinate axis mentioned here can be in terms of time, space or other dimensions.
  • An element can have parameter values mapped to multiple axes.
  • Input data can also be called data to be processed.
  • the data to be processed can be multimedia data, such as audio data, video data or image data.
  • the data to be processed is audio data, and each element in the audio data can be understood as an audio point.
  • Each audio point can be mapped to the time coordinate axis or the frequency coordinate axis.
  • the encoded data of each audio point may include a time parameter mapped to the time coordinate axis and a frequency parameter mapped to the frequency coordinate axis.
  • the data to be processed is image data, and the elements of the image data can be understood as pixels or image blocks. Each pixel or image block can be mapped to a horizontal coordinate axis and a vertical coordinate axis.
  • the data to be processed includes video data, and the video data can be mapped to three coordinate axes, such as the time coordinate axis, the horizontal coordinate axis, and the vertical coordinate axis.
  • Video data includes multiple video frames, and each video frame includes multiple pixels or image blocks.
  • the encoded data of each pixel or image block can be mapped to the time coordinate axis, with the time parameter of the time coordinate axis.
  • the encoded data of each pixel or image block can be mapped to the horizontal coordinate axis and the vertical coordinate axis in space, with the horizontal coordinate of the horizontal coordinate axis and the vertical coordinate of the vertical coordinate axis.
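The per-axis mapping described above can be illustrated with a toy NumPy tensor; the dimensions here are hypothetical examples, not values from the embodiment:

```python
import numpy as np

# Hypothetical toy dimensions: 4 frames, 6x8 pixels, 16-dim encoding per element.
T, H, W, E = 4, 6, 8, 16
video = np.zeros((T, H, W, E))

# An element at (t, h, w) is mapped to three coordinate axes at once:
t, h, w = 2, 3, 5
element = video[t, h, w]          # its encoded data, shape (E,)
# time-axis neighbours: same spatial position across all frames
time_line = video[:, h, w]        # shape (T, E)
# horizontal-axis neighbours: same frame and row, all columns
row_line = video[t, h, :]         # shape (W, E)
# vertical-axis neighbours: same frame and column, all rows
col_line = video[t, :, w]         # shape (H, E)
```

Each of the three slices is the set of elements the method would attend over when computing that element's feature vector for the corresponding axis.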
  • FIG. 3 shows a schematic flow chart of a data processing method provided by an embodiment of the present application, using a neural network; the attention network is taken as an example.
  • the data processing method may be executed by a service device, such as a cloud service device.
  • the user device can send a service request to the cloud service device, and the service request carries data to be processed.
  • the service request is used to request the cloud server to complete a specified processing task for the data to be processed.
  • the designated processing tasks can be natural language processing tasks, such as machine translation, automatic summary generation, opinion extraction, text classification, question answering, text semantic comparison, or speech recognition.
  • the specified processing task can be a computer vision task, such as image recognition, target detection, semantic segmentation, and image generation.
  • the data processing method may be executed by a local device, such as a local electronic device.
  • the data to be processed can be generated by the electronic device itself.
  • the eigenvectors of the axes are weighted to obtain the eigenvectors of the data to be processed.
  • the attention network can be an attention network used to classify images, segment images, detect images, recognize images, or generate specified images; or it can be an attention network used to translate text, paraphrase text, or generate specified text; or it can be an attention network used to recognize speech, translate speech, or generate specified speech, etc.
  • the specified processing task can be further completed according to the feature vector to obtain the processing result, and the processing result can be sent to the user device.
  • for example, if the designated processing task is image classification, the attention network is an attention network used to classify images; after obtaining the feature vector, the image can be further classified according to the feature vector to obtain the classification result. If the designated processing task is image segmentation, the attention network is an attention network used to segment images; after obtaining the feature vector, the image can be further segmented based on the feature vector to obtain the segmentation result. If the designated processing task is image detection, the attention network is an attention network used to detect images; after obtaining the feature vector, image detection can be further performed based on the feature vector to obtain the detection result. For another example, if the designated processing task is speech recognition, the attention network is an attention network used to recognize speech.
  • speech recognition can be further performed based on the feature vector to obtain the recognition result.
  • the attention network is an attention network used to translate speech.
  • speech translation can be further performed based on the feature vector to obtain the translation result.
  • For example, take the case where the elements included in the input data are mapped to N coordinate axes, which are axis 1 to axis N respectively.
  • the attention is calculated separately for the elements on each axis, and then the weighted sum of the calculation results for the elements on each axis is used as the output of the independent superimposed attention network.
  • the weights of different axes can be similar, such as a simple average.
  • After the data is input, the attention network performs feature extraction on the input data to obtain, for each of the multiple elements, the feature vectors corresponding to axis 1 to axis N; the N groups of feature vectors corresponding to axis 1 to axis N for each element are then weighted to obtain the feature vector corresponding to that element.
  • the first element is any element among multiple elements.
  • the feature vector corresponding to the first element on the first coordinate axis is used to represent the correlation between the first element and the other elements in the first region where the first element is located; the first element is any element among the multiple elements; the positions of the other elements in the first region, when mapped to coordinate axes other than the first coordinate axis, are the same as and/or adjacent to the positions of the first element mapped to those other axes.
  • the eigenvector corresponding to the first coordinate axis of the first element can be determined in the following way:
  • attention values between the first element and the other elements in the first region are first calculated, the first element being any element among the plurality of elements; weighting processing is then performed based on these attention values to obtain the feature vector of the first element corresponding to the first coordinate axis.
  • the adjacent positions of two elements mapped to other coordinate axes can be absolutely adjacent, or the interval between the positions mapped to other coordinate axes of the two elements can be within a set distance.
  • the adjacent positions of two elements mapped to other coordinate axes can be absolutely adjacent.
  • axis 1 is the horizontal-direction coordinate axis (referred to as the horizontal coordinate axis for short), and axis 2 is the vertical-direction coordinate axis (referred to as the vertical coordinate axis for short).
  • each row in the horizontal direction includes 10 elements
  • each column in the vertical direction includes 5 elements.
  • the attention between element 3-6 and the other elements in the same row (elements 3-1 to 3-5 and 3-7 to 3-10) can be calculated separately, and weighting processing is then performed based on the attention calculation results to obtain the feature vector of element 3-6 corresponding to the horizontal coordinate axis.
  • similarly, the attention between element 3-6 and the other elements in the same column (elements 1-6, 2-6, 4-6, and 5-6) can be calculated separately.
  • axis 1 is a horizontal coordinate axis (which may be referred to as a horizontal coordinate axis)
  • axis 2 is a vertical coordinate axis (which may be referred to as a vertical coordinate axis).
  • the attention calculation can be performed on the elements of the same row and one or more rows adjacent to the same row.
  • column attention calculations can be performed on elements in the same column and one or more columns adjacent to the same column.
  • the attention network provided by the embodiments of the present application can also be called an independent superimposed attention network or a self-independent superimposed attention network, and other names can also be used, which are not specifically limited in the embodiments of the present application.
  • the following description takes what is called an independent stacked attention network as an example.
  • the independent superimposed attention network determines the query vector (Query, Q), key vector (Key, K), and value vector (Value, V) based on the input data, then calculates attention in the directions of axis 1 to axis N based on Q, K, and V to obtain the feature vector of each element corresponding to each of axis 1 to axis N, and finally performs a weighted sum of the feature vectors of each element corresponding to axis 1 to axis N to obtain the feature vector of each element.
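A minimal NumPy sketch of this axis-independent attention for 2-dimensional data (N = 2) follows. It assumes the per-axis calculation is scaled dot-product attention restricted to the element's own row or column, which is consistent with the description above but is an illustration, not the patent's implementation; all dimensions and names are made up:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def independent_axis_attention(x, wq, wk, wv, w1=0.5, w2=0.5):
    """Axis-independent attention for 2-D data: each element at (i, j)
    attends only to its own row (axis 1) and, separately, its own column
    (axis 2); the two per-axis feature vectors are then combined with
    weights w1 and w2.

    x: (m, n, d) encoded elements in m rows and n columns.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    dk = q.shape[-1]
    # axis 1 (rows): within each row i, attend over the n columns
    row_scores = np.einsum('ijd,ikd->ijk', q, k) / np.sqrt(dk)
    h1 = np.einsum('ijk,ikd->ijd', softmax(row_scores), v)
    # axis 2 (columns): within each column j, attend over the m rows
    col_scores = np.einsum('ijd,kjd->ijk', q, k) / np.sqrt(dk)
    h2 = np.einsum('ijk,kjd->ijd', softmax(col_scores), v)
    return w1 * h1 + w2 * h2      # weighted superposition per element

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 10, 16))  # 5 rows x 10 columns, 16-dim encodings
w = [rng.normal(size=(16, 16)) for _ in range(3)]
out = independent_axis_attention(x, *w)
print(out.shape)                  # (5, 10, 16)
```

Because each axis is normalized by its own softmax before the weighted sum, a long axis cannot dominate a short one, which is the stated motivation for the independent superposition.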
  • the independent superimposed attention network provided by the embodiments of the present application can adopt a single-head attention mechanism or a multi-head attention mechanism, which is not specifically limited in the embodiments of the present application.
  • the independent stacked attention network groups the dimensions of the input data according to the number of heads. In each group, attention is calculated using the method provided by the embodiment of the present application, and then the results of multiple groups are spliced.
  • q (i, j) represents the Query of the element at position (i, j)
  • k (i, j) represents the Key of the element at position (i, j)
  • v (i, j) represents the Value of the element at position (i, j).
  • the value range of i is 0 to m-1.
  • the value range of j is 0 to n-1.
  • the 2-dimensional input data includes m rows and n columns.
  • the feature vector corresponding to each element on axis 1 can be determined using the following formula (2-1).
  • d_k is the number of dimensions of the input data; the result of formula (2-1) represents the feature vector of the element at position (i, j) corresponding to axis 1.
  • the feature vector corresponding to the element at position (i, j) on axis 2 is as shown in formula (2-2).
  • m represents the number of elements that each column of the data to be processed includes in the direction of the vertical coordinate axis, i.e., the number of rows.
  • h′ (i, j) represents the feature vector of the element at position (i, j).
  • w 1 represents the weight of axis 1
  • w 2 represents the weight of axis 2.
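The bodies of formulas (2-1) and (2-2) do not survive in this text. A plausible reconstruction from the surrounding definitions, treating each as scaled dot-product attention along a row and a column respectively (the symbols h¹, h², and the unnumbered combination formula are assumptions), is:

```latex
h^{(1)}_{(i,j)} = \sum_{j'=0}^{n-1}
  \operatorname{softmax}_{j'}\!\left(
    \frac{q_{(i,j)} \cdot k_{(i,j')}}{\sqrt{d_k}}
  \right) v_{(i,j')}
\tag{2-1}

h^{(2)}_{(i,j)} = \sum_{i'=0}^{m-1}
  \operatorname{softmax}_{i'}\!\left(
    \frac{q_{(i,j)} \cdot k_{(i',j)}}{\sqrt{d_k}}
  \right) v_{(i',j)}
\tag{2-2}

h'_{(i,j)} = w_1\, h^{(1)}_{(i,j)} + w_2\, h^{(2)}_{(i,j)}
```

Here the softmax in (2-1) is taken over the n elements of row i, and in (2-2) over the m elements of column j, matching the per-axis normalization described in the surrounding text.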
  • encoding data corresponding to at least one setting element can be added to the network parameters of the independent overlay attention network.
  • the encoded data corresponding to the at least one setting element is a learnable embedding input of the independent superposition attention network; that is, it can participate in training as a network parameter, and the encoded data corresponding to the at least one setting element can be adjusted each time the network parameters are adjusted during training.
  • At least one setting element may include a classification bit and/or a distillation bit.
  • the encoded data corresponding to the classification bits can also be called a class token, and the encoded data corresponding to the distillation bits can also be called a distillation token.
  • the student model can be trained using the knowledge distillation (KD) training mode with a teacher model. The student model can be understood as a smaller model compressed from the teacher model. Interactive learning with the teacher model is performed by adding distillation bits, and the output is finally passed through the distillation loss.
  • the Class token and distillation token are learnable embedding vectors.
  • the class token and distillation token model the global relationships between elements by performing attention operations with the encoded data of each element included in the input data; they fuse the information of all elements and are finally connected to the classifier for category prediction.
  • Figure 7 for a schematic diagram of the processing flow of another independent superimposed attention network.
  • Figure 7 also takes as an example that the elements included in the input data can be mapped to N coordinate axes, which are axis 1 to axis N respectively.
  • N coordinate axes which are axis 1 to axis N respectively.
  • the encoded data of the classification bit and the distillation bit are each subjected to attention-weighting calculation with the encoded data of all other elements, and the results are then feature-fused with the weighted sum over the N axes.
  • Feature fusion can use connection functions to connect features, such as the concat function.
  • the above formula (1) can be used to determine Q, K and V based on the input data.
  • Q, K and V corresponding to the classification bits can be determined by the following formula (4).
  • q c represents the Query of the classification bit
  • k c represents the Key of the classification bit
  • v c represents the Value of the classification bit
  • h c represents the coded data of the classification bit.
  • the weight matrices used to calculate Q, K, and V for the classification bit are the same as those used for the elements in formula (1).
  • the feature vector corresponding to the classification bit is determined through the following formula (5).
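A hedged sketch of the classification-bit calculation described by formulas (4) and (5): the token's Query attends over the Keys of the token itself and of every element, and its feature vector is the attention-weighted sum of the Values. All names and dimensions here are illustrative, not the patent's notation:

```python
import numpy as np

def class_token_feature(hc, elements, wq, wk, wv):
    """hc: (d,) encoded data of the classification bit; elements: (L, d)
    encoded data of all elements; wq/wk/wv: (d, d) shared projection
    matrices (formulas (1) and (4) use the same matrices)."""
    qc = hc @ wq                                   # Query of the classification bit
    keys = np.vstack([(hc @ wk)[None, :], elements @ wk])
    vals = np.vstack([(hc @ wv)[None, :], elements @ wv])
    scores = keys @ qc / np.sqrt(qc.shape[-1])     # one score per position
    a = np.exp(scores - scores.max())
    a /= a.sum()                                   # softmax over all positions
    return a @ vals                                # globally fused feature

rng = np.random.default_rng(2)
hc = rng.normal(size=(16,))                        # classification-bit embedding
elements = rng.normal(size=(50, 16))               # 50 element encodings
wq, wk, wv = (rng.normal(size=(16, 16)) for _ in range(3))
feat = class_token_feature(hc, elements, wq, wk, wv)
print(feat.shape)                                  # (16,)
```

Because the token attends over every element in a single pass, its feature vector aggregates global information, which is why it can be fed directly to the classifier.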
  • the independent stacked attention network can perform fully connected processing before performing attention calculation, and perform dimensionality enhancement processing on the input data. After completing the attention calculation, further full connection processing, such as dimensionality reduction processing, can be performed.
  • the dimensions of the input data of the independent stacked attention network are the same as the dimensions of the output data.
  • the solution provided by the embodiments of this application is applicable in multi-axis scenarios.
  • the computational complexity of the independent superimposed attention network is 0.1% of that of the conventional attention network.
  • the computational complexity of the independent superimposed attention network is 1.1×10⁻¹⁸ that of the conventional attention network.
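A back-of-envelope illustration of where the savings come from (this is my own arithmetic on one example grid, not the patent's reported figures, which will depend on the input sizes considered): full self-attention over an m×n grid scores every pair of elements, while axis-independent attention only scores within rows and columns.

```python
m, n = 99, 12                     # e.g. the audio patch grid from Scenario 1
full_pairs = (m * n) ** 2         # (m*n)^2 pairwise attention scores
axis_pairs = m * n * (m + n)      # each element scores n row + m column peers
ratio = axis_pairs / full_pairs
print(round(ratio, 3))            # fraction of the pairwise work remaining
```

The ratio shrinks as the grid grows, since it scales like (m+n)/(m·n), so larger inputs (and more axes) yield larger savings.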
  • the independent superimposed attention network provided by the embodiments of this application can be applied to the transformer module to process data, such as image classification, segmentation, and target positioning; video action classification, time positioning, and spatiotemporal positioning; audio and Music classification, sound source separation, etc.
  • FIG. 8 is a schematic structural diagram of a transformer module illustrated in an embodiment of the present application.
  • the transformer module may include the independent superimposed attention network provided by the embodiment of the present application, a linear layer, and a multi-layer perceptron.
  • Independent stacked attention networks are used to extract features from input data.
  • The linear layer can be a layer normalization (LN) layer.
  • LN is used to normalize the output of independent stacked attention networks.
  • Multilayer perceptron (MLP) is serially connected with independent superimposed attention network.
  • a multilayer perceptron can include multiple serial fully connected layers. Specifically, the multi-layer perceptron can also be called a fully connected neural network.
  • a multi-layer perceptron includes an input layer, a hidden layer and an output layer. The number of hidden layers can be one or more.
  • the network layers in the multi-layer perceptron are all fully connected layers. That is, the input layer and hidden layer of the multi-layer perceptron are fully connected, and the hidden layer and output layer of the multi-layer perceptron are also fully connected.
  • the fully connected layer means that each neuron in the fully connected layer is connected to all the neurons in the previous layer, which is used to synthesize the features extracted from the previous layer.
  • the transformer module may also include another linear layer for performing layer normalization; calculating normalized statistics through layer normalization can reduce calculation time. This layer is located at the input end of the independent stacked attention network (see Figure 9) and is used to first perform layer normalization on the data input to the transformer module, which reduces training costs.
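The block structure just described (LN at the input of the attention network, LN again before the MLP) can be sketched as follows; residual connections are assumed, as is conventional for transformer modules, and `attention`/`mlp` are stand-ins for the sub-networks:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each feature vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, attention, mlp):
    """Pre-norm transformer block: LN before the attention network and
    again before the MLP, with residual additions around each."""
    x = x + attention(layer_norm(x))
    return x + mlp(layer_norm(x))

rng = np.random.default_rng(4)
x = rng.normal(size=(10, 16))
attn_fn = lambda z: z * 0.1       # placeholder for the attention network
mlp_fn = lambda z: z * 0.1        # placeholder for the multi-layer perceptron
y = transformer_block(x, attn_fn, mlp_fn)
print(y.shape)                    # (10, 16)
```

The output keeps the input's dimensions, matching the statement above that the module's input and output dimensions are the same.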
  • Scenario 1 Take audio classification as an example.
  • FIG 10 is a schematic structural diagram of a classification network model.
  • the classification network model includes an embedding generation module, M1 transformer modules and a classification module.
  • M1 transformer modules can be deployed in series.
  • the transformer module adopts the structure shown in Figure 9.
  • the embedding generation module is used to extract local features from the input audio data, which can also be understood as generating the encoded data of the audio data. Audio data can be mapped to the time coordinate axis as well as the frequency coordinate axis, and can be divided into multiple audio points, for example into T*F audio points (patches), where T represents the time dimension and F represents the frequency dimension. For example, the input data is 10 s of audio at 32000 Hz.
  • the frequency dimension is 128 and the time dimension is 1000.
  • the number of patches into which audio data is divided is: 99 (time) * 12 (frequency).
  • each row includes 99 audio points and each column includes 12 audio points.
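The 99×12 patch grid can be reproduced with simple stride arithmetic. The patch size and stride below are illustrative values chosen to be consistent with the counts above (the text only says the stride is "about 10" and does not give the patch size):

```python
# A 10 s clip at 32000 Hz becomes a spectrogram of 1000 time steps x 128
# frequency bins; sliding a window over it with a stride of about 10 in
# each direction yields the 99 (time) x 12 (frequency) patch grid.
time_steps, freq_bins = 1000, 128
patch, stride = 16, 10            # illustrative values, not from the source
t_patches = (time_steps - patch) // stride + 1
f_patches = (freq_bins - patch) // stride + 1
print(t_patches, f_patches)       # 99 12
```

Any patch/stride pair satisfying the same sliding-window formula would produce the stated grid; the exact values depend on padding choices the text does not specify.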
  • the feature dimension of the local features extracted by the embedding generation module from the input audio data is represented by E1.
  • the independent stacked attention network in the transformer module can adopt a multi-head attention mechanism.
  • the classification module after averaging the results of classification bits and distillation bits, the predicted value of each class is obtained through the linear layer.
  • the embedding generation module includes a convolution layer, which is used to perform convolution processing on the input audio data (time spectrum) to generate an embedding representation, and output a time-frequency vector of (T ⁇ F, E1).
  • E1 represents the feature dimension.
  • the feature dimension of each patch is E1.
  • the embedding generation module can use a two-dimensional convolution with a larger convolution stride (for example, a stride of about 10), so that each generated time-frequency vector represents the local information of a patch with feature dimension E1.
  • the purpose is audio classification, and the classification bit and the distillation bit can be combined, for example, through the concat function.
  • position encoding can be added to help learn position information.
  • the method of position encoding is not specifically limited in the embodiment of this application.
  • the embedding vector output by the embedding generation module is input into the backbone network part formed by the series of Transformer blocks.
  • the embedding vector is input to the transformer module.
  • the layer-normalized data is input to the independent stacked attention network.
  • the independent superimposed attention network can perform dimension-raising processing on the layer-normalized data (for example, E1-dimensional data is raised to 3*E1-dimensional data), and further generate the Q, K, and V corresponding to each patch, the classification bit, and the distillation bit respectively.
  • the independent stacked attention network further performs multi-head splitting.
  • the independent stacked attention network performs attention weighting of the classification bit and the distillation bit with all other patches to obtain the feature vectors of the classification bit and the distillation bit, and performs row attention weighting calculation on the time coordinate axis and column attention weighting calculation on the frequency coordinate axis. The feature vector obtained by weighting the results of the row attention calculation and the column attention calculation is then connected with the feature vectors of the classification bit and the distillation bit. For example, the weights corresponding to the time coordinate axis and the frequency coordinate axis are the same, both 0.5.
  • the independent superimposed attention network then performs dimensionality reduction processing on the connected feature vectors, reducing the 3*E1-dimensional data to E1-dimensional data. Further, after being processed by the LN layer and MLP layer in the transformer module, the data is input to the classification module. In the classification module, after averaging the results of the classification bit and the distillation bit, the predicted value of each class is obtained through the linear layer.
  • the classification network model shown in Figure 10 is used to classify the following audio data data sets 1) and 2) respectively.
  • Audioset includes 632 audio event classes and a collection of over 2 million manually labeled 10-second sound clips extracted from videos. The categories cover a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.
  • the comparison results of classification accuracy, time, and system performance requirements between the solution provided by this application and the existing solution are shown in Table 2 and Table 3.
  • System performance is expressed, for example, in terms of the required number of floating-point operations (FLOPs).
  • Classification accuracy is expressed as Mean Average Precision (mAP) as an example.
  • Scenario 2 Take end-to-end image segmentation as an example. See Figure 12 for a schematic workflow diagram of an image segmentation network model.
  • the image segmentation network model includes an embedding generation module, M2 transformer modules, and a pixel reconstruction module.
  • M2 transformer modules can be deployed in series.
  • the transformer module adopts the structure shown in Figure 9.
  • the embedding generation module is used to extract local features from the input image data, which can also be understood as generating the encoded data of the image data.
  • Image data can be mapped to horizontal as well as vertical axes.
  • Image data can be divided into multiple image blocks. For example, it is divided into H*W image blocks (patches).
  • the feature dimension of the local features extracted by the embedding generation module from the input image data is represented by E2.
  • the independent stacked attention network in the transformer module can adopt a multi-head attention mechanism.
  • the intensity value of pixels of each image block is restored.
  • the embedding generation module includes a convolution layer, which is used to perform convolution processing on the input image data to generate an embedding representation, and outputs an image vector of (H*W, E2).
  • E2 represents the feature dimension.
  • the feature dimension of each element is E2.
  • position encoding of (H*W, E2) can be added to help learn position information.
  • the method of position encoding is not specifically limited in the embodiment of this application.
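A convolution whose stride equals the patch size is equivalent to a linear projection of flattened, non-overlapping patches; the sketch below uses that equivalence to produce the (H*W, E2) tokens and add a position encoding. The image size, patch size, E2 and the random projection are all assumed for illustration:

```python
import numpy as np

# Hypothetical sizes: a 32x32 single-channel image, 8x8 patches,
# so H*W = 4*4 = 16 tokens, each projected to E2 = 16 features.
P, E2 = 8, 16
rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
W_proj = rng.standard_normal((P * P, E2)) * 0.02  # stands in for the conv kernel

H, W = img.shape[0] // P, img.shape[1] // P
patches = img.reshape(H, P, W, P).swapaxes(1, 2).reshape(H * W, P * P)
tokens = patches @ W_proj                          # (H*W, E2) embedding vector

# Add a (H*W, E2) position encoding so the network can learn position info
pos = rng.standard_normal((H * W, E2)) * 0.02
tokens = tokens + pos
```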
  • the embedding vector output by the embedding generation module is input into the backbone network part formed by the series of Transformer blocks.
  • the embedding vector is input to the transformer module.
  • the layer-normalized data is input to the independent stacked attention network.
  • the independent superimposed attention network can perform dimensionality-raising processing on the layer-normalized data (for example, E2-dimensional data is raised to 3*E2-dimensional data), and further generate Q, K, and V corresponding to each patch.
  • the independent stacked attention network further performs multi-head splitting, and then performs row attention weighting calculations on the horizontal axis and column attention weighting calculations on the vertical axis.
  • the result of row attention weighting calculation and the result of column attention weighting calculation are weighted to obtain the feature vector of the image data.
  • the weights corresponding to the horizontal coordinate axis and the vertical coordinate axis are the same, both are 0.5.
  • the independent superimposed attention network can also perform dimensionality reduction processing on the obtained feature vectors of the image data, reducing the 3*E2-dimensional data to E2-dimensional data.
  • after being processed by the LN layer and MLP layer in the transformer module, the data is input to the pixel reconstruction module. In the pixel reconstruction module, the pixel intensity value of each image block is restored through layer normalization and fully connected layer processing.
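The benefit of splitting attention into row and column passes can be seen from a rough operation count: full self-attention over the H*W patch tokens costs on the order of (H*W)^2 * E multiply-adds for the score and value products, while the axial form costs H*W*(H+W)*E. A small sketch counting only those two matrix products (the grid sizes are illustrative):

```python
def full_attention_ops(n, e):
    # Q @ K^T and attn @ V over all n tokens: two n x n x e products
    return 2 * n * n * e

def axial_attention_ops(h, w, e):
    # row attention: h sequences of length w; column attention: w of length h
    return 2 * h * w * (h + w) * e

H, W, E = 64, 64, 128
ratio = full_attention_ops(H * W, E) / axial_attention_ops(H, W, E)
# the saving factor simplifies to n / (h + w) = 4096 / 128 = 32 for this grid
```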
  • Scenario 3: take video action classification as an example.
  • FIG 13 is a schematic workflow diagram of a video classification network model.
  • the classification network model includes an embedding generation module, M3 transformer modules and a classification module.
  • M3 transformer modules can be deployed in series.
  • the transformer module adopts the structure shown in Figure 9.
  • the embedding generation module is used to extract local features from the input video data, which can also be understood as generating the encoded data of the video data.
  • Video data can be mapped to the time axis, horizontal axis, and vertical axis.
  • Video data can be divided into image blocks. For example, it is divided into H*W*T image blocks (patches), where T represents the time coordinate axis dimension, H represents the horizontal coordinate axis dimension, and W represents the vertical coordinate axis dimension.
  • the embedding generation module includes a three-dimensional convolution layer, which is used to perform convolution processing on the input video data to generate an embedding representation, and output a video vector of (H*W*T, E3).
  • E3 represents the feature dimension.
  • the feature dimension of each patch is E3.
  • since the purpose is to classify video actions, a classification bit can be added.
  • the data of the classification bits can be connected with the video vector of (H*W*T, E3) through the concat function.
  • position encoding can be added to help learn position information.
  • the method of position encoding is not specifically limited in the embodiment of this application. After adding classification bits and superimposing position coding, the embedding generation module outputs an embedding vector with dimensions (H*W*T+1, E3).
  • the embedding vector output by the embedding generation module is input into the backbone network part formed by the series of Transformer blocks.
  • the embedding vector is input to the transformer module.
  • the layer-normalized data is input to the independent stacked attention network.
  • the independent superimposed attention network can perform dimensionality-raising processing on the layer-normalized data (for example, E3-dimensional data is raised to 3*E3-dimensional data), and further generates Q, K, and V corresponding to each patch and the classification bit. Taking the independent superimposed attention network using the multi-head attention mechanism as an example, it further performs multi-head splitting.
  • the independent superimposed attention network performs attention weighting between the classification bit and the other patches to obtain the feature vector of the classification bit, and performs the attention weighting calculation on the time axis, the row attention weighting calculation on the horizontal axis, and the column attention weighting calculation on the vertical axis. The feature vector obtained by weighting the results of the time-axis, row and column attention weighting calculations is concatenated with the feature vector of the classification bit. The independent superimposed attention network then performs dimensionality reduction processing on the concatenated feature vectors, reducing the 3*E3-dimensional data to E3-dimensional data. Further, after being processed by the LN layer and MLP layer in the transformer module, the data is input to the classification module. In the classification module, the classification information corresponding to the classification bit is obtained through the linear layer, and then processed through the fully connected layer to obtain the action classification prediction distribution.
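Generalizing the audio and image cases, the video scenario runs three independent axis attentions over a (T, H, W, E3) token grid and combines them. The sketch below uses equal 1/3 weights and the Q = K = V simplification as illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend_axis(x):
    # self-attention along the second-to-last axis, with Q = K = V = x
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

# Hypothetical sizes: T frames, an H x W patch grid, feature dim E3
T, H, W, E3 = 2, 3, 4, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((T, H, W, E3))

outs = []
for axis in range(3):                   # time, horizontal, vertical axes
    moved = np.moveaxis(x, axis, 2)     # bring the attended axis next to E3
    outs.append(np.moveaxis(attend_axis(moved), 2, axis))

fused = sum(outs) / 3.0                 # equal weighting of the three axes
```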
  • FIG. 14 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • the data processing device includes an input unit 1410 for receiving data to be processed, the data to be processed including encoded data of a plurality of elements.
  • the processing unit 1420 is configured to perform feature extraction on the data to be processed through a neural network to obtain, for each of the multiple elements, feature vectors corresponding to multiple coordinate axes, and to perform weighting processing on the feature vectors of each of the multiple elements corresponding to the multiple coordinate axes, to obtain the feature vector of the data to be processed.
  • the feature vector corresponding to the first element on the first coordinate axis is used to represent the correlation between the first element and other elements in the first region where the first element is located.
  • the first element is any element among the plurality of elements; the positions to which the other elements in the first area where the first element is located are mapped on coordinate axes other than the first coordinate axis are the same as and/or adjacent to the positions to which the first element is mapped on those other coordinate axes; the first coordinate axis is any coordinate axis among the plurality of coordinate axes.
  • the processing unit 1420 is specifically configured to: perform attention calculation between the first element and the other elements in the first area corresponding to the first element, to obtain the attention values between the first element and the other elements, where the first element is any element among the plurality of elements; and perform weighting processing according to the attention values between the first element and the other elements, to obtain the feature vector corresponding to the first element on the first coordinate axis.
  • the data to be processed includes audio data
  • the audio data includes multiple audio points
  • each audio point is mapped to a time coordinate axis and a frequency coordinate axis;
  • the data to be processed includes image data, the image data includes a plurality of pixel points or image blocks, each pixel point or image block is mapped to a horizontal coordinate axis and a vertical coordinate axis; or,
  • the data to be processed includes video data.
  • the video data includes multiple video frames.
  • Each video frame includes multiple pixel points or image blocks.
  • Each pixel point or image block is mapped to a time coordinate axis and, in space, a horizontal coordinate axis and a vertical coordinate axis.
  • the number of coordinate axes is equal to N; the neural network includes a linear module 1421, N attention calculation modules 1422 and a weighting module 1423.
  • the linear module 1421 is used to generate the first query vector, the first key value vector and the first value vector based on the data to be processed; the i-th attention calculation module 1422 is used to obtain, according to the first query vector, the first key value vector and the first value vector, a feature vector corresponding to each element of the plurality of elements on the i-th coordinate axis, where i is a positive integer less than or equal to N; the weighting module 1423 is used to weight the feature vectors of each of the plurality of elements corresponding to the N coordinate axes.
  • the neural network also includes an (N+1)-th attention calculation module 1424 and a feature fusion module 1425;
  • the linear module 1421 is further configured to generate a second query vector, a second key value vector and a second value vector based on the encoded data of at least one setting element;
  • the (N+1)-th attention calculation module 1424 is used to obtain, according to the second query vector, the second key value vector and the second value vector, the feature vector corresponding to the at least one setting element; the feature vector corresponding to the setting element is used to represent the correlation between the setting element and the multiple elements;
  • the feature fusion module 1425 is configured to perform feature fusion on the feature vector of the data to be processed and the feature vector of the at least one setting element.
  • the encoded data of the at least one setting element is obtained as a network parameter of the neural network through multiple rounds of adjustments during the process of training the neural network.
  • Figure 15 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
  • the execution device can be embodied as a mobile phone, tablet, notebook computer, smart wearable device, server, etc., which is not limited here.
  • the embodiment of the present application also provides another structure of the device.
  • the execution device 1500 may include a communication interface 1510 and a processor 1520.
  • the execution device 1500 may also include a memory 1530.
  • the memory 1530 may be provided inside the device or outside the device.
  • each of the units shown in FIG. 14 can be implemented by the processor 1520.
  • the function of the input unit is implemented by the communication interface 1510.
  • the functions of the processing unit 1420 are implemented by the processor 1520.
  • the processor 1520 receives the data to be processed through the communication interface 1510, and is used to implement the methods described in Figures 3, 6-13. During the implementation process, each step of the processing flow can complete the method described in FIG. 3 and FIG. 6 to FIG. 13 through the integrated logic circuit of hardware in the processor 1520 or instructions in the form of software.
  • the communication interface 1510 may be a circuit, bus, transceiver, or any other device that can be used for information exchange.
  • the other device may be a device connected to the execution device 1500.
  • the processor 1520 may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps and logical block diagrams disclosed in the embodiments of this application.
  • a general-purpose processor may be a microprocessor or any conventional processor, etc.
  • the steps of the methods disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware processor for execution, or can be executed by a combination of hardware and software units in the processor.
  • the program code executed by the processor 1520 to implement the above method may be stored in the memory 1530. Memory 1530 and processor 1520 are coupled.
  • the coupling in the embodiment of this application is an indirect coupling or communication connection between devices, units or modules, which may be in electrical, mechanical or other forms, and is used for information interaction between devices, units or modules.
  • the processor 1520 may cooperate with the memory 1530.
  • the memory 1530 may be a non-volatile memory, such as a hard disk drive (HDD) or solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM).
  • The memory 1530 may also be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • connection medium between the communication interface 1510, the processor 1520 and the memory 1530 is not limited in the embodiment of the present application.
  • the memory 1530, the processor 1520 and the communication interface 1510 are connected through a bus in Figure 15.
  • the bus is represented by a thick line in Figure 15.
  • the connection methods between other components are only schematically illustrated and are not limited thereto.
  • the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one thick line is used in Figure 15, but it does not mean that there is only one bus or one type of bus.
  • embodiments of the present application also provide a computer storage medium, which stores a software program.
  • when executed, the software program can implement the methods provided by any one or more of the above embodiments.
  • the computer storage media may include various media that can store program code, such as a USB flash drive, removable hard disk, read-only memory, random-access memory, magnetic disk, or optical disk.
  • embodiments of the present application also provide a chip, which includes a processor and is used to implement the functions involved in any one or more of the above embodiments, such as obtaining or processing the information or data involved in the above methods.
  • the chip further includes a memory, and the memory is used for necessary program instructions and data executed by the processor.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • Figure 16 is a structural schematic diagram of a chip provided by an embodiment of the present application.
  • the chip can be represented as a neural network processor NPU 1600.
  • the NPU 1600 serves as a co-processor and is mounted on the main CPU (Host CPU), which allocates tasks to it.
  • the core part of the NPU is the arithmetic circuit 1603.
  • the arithmetic circuit 1603 is controlled by the controller 1604 to extract the matrix data in the memory and perform multiplication operations.
  • the computing circuit 1603 includes multiple processing units (Process Engine, PE).
  • arithmetic circuit 1603 is a two-dimensional systolic array.
  • the arithmetic circuit 1603 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • arithmetic circuit 1603 is a general-purpose matrix processor.
  • the arithmetic circuit obtains the corresponding data of matrix B from the weight memory 1602 and caches it on each PE in the arithmetic circuit.
  • the operation circuit takes matrix A data from the input memory 1601 and performs matrix operations with matrix B, and the partial or final result of the matrix is stored in an accumulator 1608.
  • the unified memory 1606 is used to store input data and output data.
  • the weight data is transferred directly to the weight memory 1602 through the direct memory access controller (DMAC) 1605.
  • Input data is also transferred to unified memory 1606 via DMAC.
  • the bus interface unit (BIU) 1610 is used for the interaction between the AXI bus, the DMAC and the instruction fetch buffer (IFB) 1609.
  • the bus interface unit 1610 (BIU) is used for the instruction fetch buffer 1609 to obtain instructions from the external memory, and is also used for the storage unit access controller 1605 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1606 or the weight data to the weight memory 1602 or the input data to the input memory 1601 .
  • the vector calculation unit 1607 includes multiple arithmetic processing units, and if necessary, further processes the output of the arithmetic circuit 1603, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc.
  • vector calculation unit 1607 can store the processed output vectors to unified memory 1606 .
  • the vector calculation unit 1607 can apply a linear or nonlinear function to the output of the operation circuit 1603, such as performing linear interpolation on the feature planes extracted by the convolution layers, or accumulating vectors of values to generate activation values.
  • vector calculation unit 1607 generates normalized values, pixel-wise summed values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 1603, such as for use in a subsequent layer in a neural network.
  • the instruction fetch buffer 1609 connected to the controller 1604 is used to store instructions used by the controller 1604; the unified memory 1606, the input memory 1601, the weight memory 1602 and the instruction fetch memory 1609 are all On-Chip memories. External memory is private to the NPU hardware architecture.
  • the processor mentioned in any of the above places can be a general central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the above programs.
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units.
  • the physical unit can be located in one place, or it can be distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between modules indicates that there are communication connections between them, which can be specifically implemented as one or more communication buses or signal lines.
  • embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, etc.) having computer-usable program code embodied therein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

A data processing method and apparatus, which relate to the technical field of artificial intelligence and are used for reducing computational cost. In the method, when attention calculation is executed, it is executed among elements of the same coordinate axis rather than between one element and all the other elements, and weighting processing is executed after the attention calculation has been performed on each coordinate axis separately. For example, for a certain element, elements in the same row, or elements in the same row and in adjacent rows, participate in the calculation of the feature vector of that element. By decomposing the attention calculation in this way, not only can global modeling be realized, but the calculation complexity can also be reduced.

Description

A data processing method and device

Cross-reference to related applications

This application claims priority to the Chinese patent application filed with the Intellectual Property Office of the People's Republic of China on May 24, 2022, with application number 202210569598.0 and the invention title "A data processing method and device", the entire content of which is incorporated by reference in this application.

Technical field

This application relates to the field of artificial intelligence technology, and in particular to a data processing method and device.

Background
In recent years, self-attention networks have been widely applied in many natural language processing (NLP) tasks, such as machine translation, sentiment analysis and question answering. With their widespread application, self-attention networks originating from the field of natural language processing have also achieved high performance in tasks such as image classification, target detection and image processing.

The key to a self-attention network is to learn an alignment in which each element in the sequence learns to gather information from the other elements in the sequence. A self-attention network differs from a general attention network in that it pays more attention to the internal correlation of data or features and reduces the dependence on external information. However, the currently used self-attention network obtains, for one element, relevant information from all other elements, resulting in high computational consumption.
Summary of the invention

Embodiments of the present application provide a data processing method and device to solve the problem of high computational consumption caused by calculating the correlation of one element with all other elements when using the current self-attention network.
In a first aspect, embodiments of the present application provide a data processing method, including: receiving data to be processed, where the data to be processed includes encoded data of multiple elements; performing feature extraction on the data to be processed through a neural network to obtain, for each of the multiple elements, feature vectors corresponding to multiple coordinate axes; and performing weighting processing on the feature vectors of each of the multiple elements corresponding to the multiple coordinate axes, to obtain the feature vector of the data to be processed.

Illustratively, the neural network is a self-attention network. With the above solution, when obtaining the feature vectors, attention calculation is no longer performed between one element and all other elements; instead, attention calculation is performed among elements on the same coordinate axis, the calculation is performed separately for all coordinate axes, and weighting processing is then performed, which can reduce computational consumption.
Illustratively, this application can be applied to computer vision or natural language processing, for example machine translation, automatic summary generation, opinion extraction, text classification, question answering, text semantic comparison, speech recognition, image classification, object detection, semantic segmentation and image generation. The neural network may be a neural network used to classify, segment, detect or recognize images, or to generate a specified image; or a neural network used to translate text, paraphrase text or generate specified text; or a neural network used to recognize speech, translate speech or generate specified speech, etc.

The data to be processed can be audio data, video data, image data, text data, etc.
In one possible design, receiving the data to be processed includes receiving a service request from the user equipment, where the service request carries the data to be processed. The service request is used to request completion of a specified processing task for the data to be processed.

The method further includes: completing the specified processing task according to the feature vector of the data to be processed to obtain a processing result, and sending the processing result to the user equipment.

For example, if the specified processing task is image classification, the attention network is an attention network used to classify images, and after the feature vector is obtained, image classification can be further performed according to the feature vector to obtain the classification result. Likewise, if the specified processing task is image segmentation, image detection, speech recognition or speech translation, the attention network is an attention network used for the corresponding task, and after the feature vector is obtained, the segmentation, detection, recognition or translation result can be further obtained according to the feature vector.
In one example, the feature vectors of an element corresponding to different coordinate axes are orthogonal to each other.

In one possible design, the feature vector of the first element corresponding to the first coordinate axis is used to represent the correlation between the first element and the other elements in the first region where the first element is located; the first element is any element among the multiple elements; the positions to which the other elements in the first region are mapped on coordinate axes other than the first coordinate axis are the same as and/or adjacent to the positions to which the first element is mapped on those other coordinate axes; the first coordinate axis is any coordinate axis among the multiple coordinate axes.

For example, if the first coordinate axis is the horizontal coordinate axis, then for an element on the horizontal coordinate axis, the elements in the same row, or in the same row and adjacent rows, participate in the calculation of that element's feature vector. Since all elements can participate in the calculation, global modeling can be achieved and the computational complexity can be reduced.

In one possible design, obtaining the feature vectors of each of the multiple elements corresponding to the multiple coordinate axes includes: performing attention calculation between the first element and the other elements in the first region corresponding to the first element, to obtain the attention values between the first element and the other elements, where the first element is any element among the multiple elements; and performing weighting processing according to the attention values between the first element and the other elements, to obtain the feature vector of the first element corresponding to the first coordinate axis.

The positions of two elements mapped to other coordinate axes being adjacent may mean that they are strictly adjacent, or that the interval between the positions of the two elements mapped to the other coordinate axes is within a set distance.
In one possible design, the data to be processed includes audio data, the audio data comprising multiple audio points, each audio point mapping to a time axis and a frequency axis; or,
the data to be processed includes image data, the image data comprising multiple pixels or image patches, each pixel or image patch mapping to a horizontal axis and a vertical axis; or,
the data to be processed includes video data, the video data comprising multiple video frames, each frame comprising multiple pixels or image patches, each pixel or image patch mapping to a time axis and to horizontal and vertical spatial axes.
For multimedia data, the number of elements along different axes may differ, sometimes substantially. With the above scheme, elements on axes with more positions do not receive disproportionate attention; instead, attention is computed separately for each axis, preventing axes of larger dimension from suppressing smaller ones.
In one possible design, performing feature extraction on the data to be processed through the neural network to obtain the feature vectors of each of the plurality of elements on the multiple coordinate axes includes: the multiple coordinate axes include a first coordinate axis and a second coordinate axis; generating, through the neural network, a first query vector, a first key vector, and a first value vector based on the data to be processed; obtaining, based on the first query vector, the first key vector, and the first value vector, the feature vector of each of the plurality of elements on the first coordinate axis; and obtaining, based on the first query vector, the first key vector, and the first value vector, the feature vector of each of the plurality of elements on the second coordinate axis.
Because the same query, key, and value vectors are used when computing the feature vectors for different axes, the number of parameters to compute is reduced, further lowering the computational complexity.
In one possible design, with N = 2, the first query vector, first key vector, and first value vector can be generated from the data to be processed using the following formulas:

q_(i,j) = W_Q · h_(i,j),  k_(i,j) = W_K · h_(i,j),  v_(i,j) = W_V · h_(i,j)

where q_(i,j) denotes the Query of the element at position (i, j), k_(i,j) denotes the Key of the element at position (i, j), and v_(i,j) denotes the Value of the element at position (i, j). The value of i ranges from 0 to m−1, and the value of j ranges from 0 to n−1.
The two-dimensional data to be processed comprises m rows and n columns.
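As an illustrative sketch (the array shapes and NumPy usage are our own assumptions, not part of the application), the per-position Query/Key/Value projections above can be computed by applying the same three matrices to every element of an m×n grid of encoded vectors:

```python
import numpy as np

def make_qkv(h, W_Q, W_K, W_V):
    """Apply q_(i,j) = W_Q h_(i,j), k_(i,j) = W_K h_(i,j),
    v_(i,j) = W_V h_(i,j) at every position of an (m, n, d) grid."""
    # einsum applies the same (d_k, d) projection at every grid position
    q = np.einsum('kd,mnd->mnk', W_Q, h)
    k = np.einsum('kd,mnd->mnk', W_K, h)
    v = np.einsum('kd,mnd->mnk', W_V, h)
    return q, k, v

m, n, d, d_k = 4, 5, 8, 8
rng = np.random.default_rng(0)
h = rng.normal(size=(m, n, d))            # encoded data, m rows x n columns
W_Q, W_K, W_V = (rng.normal(size=(d_k, d)) for _ in range(3))
q, k, v = make_qkv(h, W_Q, W_K, W_V)
```

Note that a single set of projection matrices serves all axes, consistent with the parameter saving described above.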
In one possible design, the feature vector of each element on the first coordinate axis can be determined using the following formula:

h1_(i,j) = Σ_{j'=0…n−1} softmax_{j'}( q_(i,j) · k_(i,j') / √d_k ) v_(i,j')

where d_k is the number of dimensions of the input data, and h1_(i,j) denotes the feature vector of the element at position (i, j) corresponding to axis 1.
In one possible design, the feature vector of each element on the second coordinate axis can be determined using the following formula:

h2_(i,j) = Σ_{i'=0…m−1} softmax_{i'}( q_(i,j) · k_(i',j) / √d_k ) v_(i',j)

The feature vectors of the elements on the first coordinate axis and the feature vectors of the elements on the second coordinate axis are then weighted, as determined by the following formula:

h'_(i,j) = w1 · h1_(i,j) + w2 · h2_(i,j)

where h'_(i,j) denotes the feature vector of the element at position (i, j), w1 is the weight of axis 1, and w2 is the weight of axis 2.
In one possible design, w1 = w2 = 1/2.
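A hedged sketch of the two-axis attention and the weighted fusion above, assuming axis 1 mixes elements within a row and axis 2 mixes elements within a column (function names and shapes are illustrative, not from the application):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def axial_attention(q, k, v, w1=0.5, w2=0.5):
    """Axis-1 attention mixes positions within each row, axis-2 within
    each column; outputs are fused as h' = w1*h1 + w2*h2."""
    d_k = q.shape[-1]
    # axis 1 (rows): for each (i, j), scores over the row positions j'
    s1 = np.einsum('ijd,ipd->ijp', q, k) / np.sqrt(d_k)
    h1 = np.einsum('ijp,ipd->ijd', softmax(s1), v)
    # axis 2 (columns): for each (i, j), scores over the column positions i'
    s2 = np.einsum('ijd,pjd->ijp', q, k) / np.sqrt(d_k)
    h2 = np.einsum('ijp,pjd->ijd', softmax(s2), v)
    return w1 * h1 + w2 * h2

m, n, d = 3, 4, 2
rng = np.random.default_rng(1)
v = rng.normal(size=(m, n, d))
zeros = np.zeros((m, n, d))          # zero q, k -> uniform attention weights
out = axial_attention(zeros, zeros, v)
```

With zero queries and keys the attention is uniform, so each output reduces to half the row mean plus half the column mean of v, which makes the fusion easy to sanity-check.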
In one possible design, the method further includes: generating a second query vector, a second key vector, and a second value vector based on the encoded data of at least one set element; obtaining, based on the second query vector, the second key vector, and the second value vector, a feature vector corresponding to the at least one set element, where the feature vector of a set element characterizes the degree of association between the set element and the plurality of elements; and performing feature fusion on the feature vector of the data to be processed and the feature vector of the at least one set element.
The encoded data of the at least one set element may include a classification bit and/or a distillation bit, so the scheme is equally applicable to classification scenarios.
In one possible design, the encoded data of the at least one set element is obtained through multiple rounds of adjustment as a network parameter of the neural network during training.
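To illustrate the set-element design, the hypothetical sketch below lets a single learned token (e.g., a classification bit) attend over all element features to produce its own feature vector for later fusion; all names and shapes are assumptions for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def token_attention(token_emb, elem_feats, W_Q, W_K, W_V):
    """A learned set element attends over all element features; its
    output summarizes its association with the whole input."""
    q = W_Q @ token_emb                      # (d_k,) query of the set element
    K = elem_feats @ W_K.T                   # (num_elems, d_k) keys
    V = elem_feats @ W_V.T                   # (num_elems, d_k) values
    attn = softmax(K @ q / np.sqrt(len(q)))  # relevance to every element
    return attn @ V

d, num = 8, 20
rng = np.random.default_rng(2)
token_emb = rng.normal(size=d)               # adjusted during training, like a weight
elems = rng.normal(size=(num, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
token_feat = token_attention(token_emb, elems, W_Q, W_K, W_V)
```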
In a second aspect, an embodiment of this application provides a data processing apparatus, including:
an input unit, configured to receive data to be processed, where the data to be processed includes encoded data of multiple elements;
a processing unit, configured to perform feature extraction on the data to be processed through a neural network to obtain feature vectors of each of the multiple elements on multiple coordinate axes, and to perform weighting on the feature vectors of each of the multiple elements on the multiple coordinate axes to obtain the feature vector of the data to be processed.
In one possible design, the feature vector of a first element on a first coordinate axis is used to characterize the correlation between the first element and the other elements in a first region containing the first element; the first element is any one of the multiple elements; the other elements in the first region map to positions on coordinate axes other than the first coordinate axis that are the same as and/or adjacent to the positions to which the first element maps on those other axes; the first coordinate axis is any one of the multiple coordinate axes.
In one possible design, the processing unit is specifically configured to: perform attention computation between the first element and the other elements in the first region corresponding to the first element to obtain attention values between the first element and each of the other elements, where the first element is any one of the multiple elements; and perform weighting based on those attention values to obtain the feature vector of the first element on the first coordinate axis.
In one possible design, the data to be processed includes audio data, the audio data comprising multiple audio points, each audio point mapping to a time axis and a frequency axis; or,
the data to be processed includes image data, the image data comprising multiple pixels or image patches, each pixel or image patch mapping to a horizontal axis and a vertical axis; or,
the data to be processed includes video data, the video data comprising multiple video frames, each frame comprising multiple pixels or image patches, each pixel or image patch mapping to a time axis and to horizontal and vertical spatial axes.
In one possible design, the number of coordinate axes equals N; the linear module is configured to generate the first query vector, the first key vector, and the first value vector based on the data to be processed; the i-th attention computation module is configured to obtain, based on the first query vector, the first key vector, and the first value vector, the feature vector of each of the multiple elements on the i-th coordinate axis, where i is a positive integer less than or equal to N; and the weighting module is configured to weight the feature vectors of each of the multiple elements on the N coordinate axes.
In one possible design, the neural network further includes an (N+1)-th attention computation module and a feature fusion module;
the linear module is further configured to generate a second query vector, a second key vector, and a second value vector based on the encoded data of at least one set element;
the (N+1)-th attention computation module is configured to obtain, based on the second query vector, the second key vector, and the second value vector, a feature vector corresponding to the at least one set element, where the feature vector of a set element characterizes the degree of association between the set element and the multiple elements;
the feature fusion module is configured to perform feature fusion on the feature vector of the data to be processed and the feature vector of the at least one set element.
In one possible design, the encoded data of the at least one set element is obtained through multiple rounds of adjustment as a network parameter of the neural network during training.
In a third aspect, this application provides a data processing system, including a user device and a cloud service device.
The user device is configured to send a service request to the cloud service device, where the service request carries data to be processed, the data to be processed includes encoded data of multiple elements, and the service request requests the cloud service device to complete a specified processing task on the data to be processed.
The cloud service device is configured to: perform feature extraction on the data to be processed through a neural network to obtain feature vectors of each of the multiple elements on multiple coordinate axes; perform weighting on the feature vectors of each of the multiple elements on the multiple coordinate axes to obtain the feature vector of the data to be processed; complete the specified processing task based on the feature vector of the data to be processed to obtain a processing result; and send the processing result to the user device.
The user device is further configured to receive the processing result from the cloud service device.
In a fourth aspect, an embodiment of this application provides an electronic device. The electronic device includes at least one processor and a memory; the memory stores instructions; and the at least one processor is configured to execute the instructions stored in the memory to implement the method of the first aspect or any design of the first aspect. The electronic device may also be called an execution device and is used to execute the data processing method provided by this application.
In a fifth aspect, an embodiment of this application provides a chip system. The chip system includes at least one processor and a communication interface interconnected by a line. The communication interface is configured to receive data to be processed; the processor is configured to execute, on the data to be processed, the method of the first aspect or any design of the first aspect.
In a sixth aspect, an embodiment of this application provides a computer-readable medium for storing a computer program, where the computer program includes instructions for executing the method in the first aspect or any optional implementation of the first aspect.
In a seventh aspect, an embodiment of this application provides a computer program product storing instructions that, when executed by a computer, cause the computer to implement the method described in the first aspect or any optional design of the first aspect.
On the basis of the implementations provided in the above aspects, this application may be further combined to provide more implementations.
Description of the drawings
Figure 1 is a schematic structural diagram of the main framework of artificial intelligence;
Figure 2 is a schematic diagram of a system architecture 200 provided by an embodiment of this application;
Figure 3 is a schematic flowchart of a data processing method provided by an embodiment of this application;
Figure 4 is a schematic diagram of an axial attention computation provided by an embodiment of this application;
Figure 5 is a schematic diagram of another axial attention computation provided by an embodiment of this application;
Figure 6 is a schematic diagram of the processing flow of an independent superimposed attention network provided by an embodiment of this application;
Figure 7 is a schematic diagram of the processing flow of another independent superimposed attention network provided by an embodiment of this application;
Figure 8 is a schematic structural diagram of a transformer module illustrated in an embodiment of this application;
Figure 9 is a schematic structural diagram of another transformer module illustrated in an embodiment of this application;
Figure 10 is a schematic structural diagram of a classification network model provided by an embodiment of this application;
Figure 11 is a schematic workflow diagram of the classification network model provided by an embodiment of this application;
Figure 12 is a schematic workflow diagram of an image segmentation network model provided by an embodiment of this application;
Figure 13 is a schematic workflow diagram of a video classification network model provided by an embodiment of this application;
Figure 14 is a schematic structural diagram of a data processing apparatus provided by an embodiment of this application;
Figure 15 is a schematic structural diagram of an execution device provided by an embodiment of this application;
Figure 16 is a schematic structural diagram of a chip provided by an embodiment of this application.
Detailed description
The embodiments of this application are described below with reference to the accompanying drawings. Persons of ordinary skill in the art will appreciate that, as technology develops and new scenarios emerge, the technical solutions provided in the embodiments of this application apply equally to similar technical problems.
The terms "first", "second", and so on in the description, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that terms so used are interchangeable where appropriate; this is merely the way objects with the same attributes are distinguished when describing the embodiments of this application. Furthermore, the terms "include" and "have", and any variations thereof, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device comprising a series of units need not be limited to those units, but may include other units not explicitly listed or inherent to such a process, method, product, or device.
First, the overall workflow of the artificial intelligence system is described. Refer to Figure 1, which shows a schematic structural diagram of the main framework of artificial intelligence. The framework is elaborated below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the sequence of processes from data acquisition to processing, for example the general progression of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data undergoes a condensation from "data" to "information" to "knowledge" to "wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of machine intelligence and information (provision and processing technology) to the industrial ecology of systems.
(1) Infrastructure:
The infrastructure provides computing-power support for the artificial intelligence system, enables communication with the external world, and provides support through the base platform. The system communicates with the outside through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the base platform includes related platform guarantees and support such as a distributed computing framework and networking, and may include cloud storage and computing, and interconnection networks. For example, sensors communicate with the outside world to acquire data, and the data is provided to the smart chips in the distributed computing system offered by the base platform for computation.
(2) Data
The data at the layer above the infrastructure represents the data sources of the artificial intelligence field. The data involves graphics, images, speech, and text, as well as Internet-of-Things data from traditional devices, including business data from existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, and other methods.
Among them, machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and so on, on the data.
Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formalized information to perform machine thinking and problem solving based on reasoning control strategies; typical functions are search and matching.
Decision-making refers to the process of making decisions after intelligent information has been reasoned about, and usually provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the data has undergone the data processing described above, some general capabilities can further be formed based on the results of that processing, such as an algorithm or a general system, for example translation, text analysis, computer-vision processing, speech recognition, image recognition, and so on.
(5) Intelligent products and industry applications
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They encapsulate the overall artificial intelligence solution, productizing intelligent information decision-making and realizing practical applications. Application fields mainly include intelligent manufacturing, intelligent transportation, smart home, smart healthcare, smart security, autonomous driving, smart cities, smart terminals, and so on.
The embodiments of this application involve the application of neural networks. To facilitate understanding, the relevant terms and concepts such as neural networks involved in the embodiments of this application are first introduced below.
1) Neural network
The work of each layer in a neural network can be described by the mathematical expression y = a(W·x + b). At the physical level, the work of each layer can be understood as completing the transformation from the input space (the set of input vectors) to the output space (i.e., from the row space to the column space of the matrix) through five operations on the input space: 1. raising/reducing the dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are completed by W·x, operation 4 is completed by +b, and operation 5 is implemented by a(). The word "space" is used here because the object being classified is not a single thing but a class of things; space refers to the set of all individuals of that class. W is a weight vector, and each value in the vector represents the weight value of a neuron in that layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed.
The purpose of training a neural network is ultimately to obtain the weight matrices of all layers of the trained network (the weight matrix formed by the vectors W of many layers). Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrix.
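As a minimal illustration of the per-layer expression y = a(W·x + b), the sketch below uses ReLU as the nonlinearity a(); the choice of ReLU and the concrete numbers are our assumptions, not the application's:

```python
import numpy as np

def layer(x, W, b):
    """One neural-network layer: y = a(W.x + b). W scales/rotates
    (operations 1-3), +b translates (operation 4), and the ReLU
    nonlinearity a() 'bends' the space (operation 5)."""
    return np.maximum(0.0, W @ x + b)

W = np.array([[2.0, 0.0],
              [0.0, -1.0]])
b = np.array([1.0, 1.0])
y = layer(np.array([1.0, 3.0]), W, b)   # W.x + b = [3, -2], ReLU -> [3, 0]
```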
2) Loss function
In training a neural network, because we want the network's output to be as close as possible to the value we actually want to predict, we can compare the current network's predicted value with the truly desired target value and then update the weight vectors of each layer based on the difference between the two (of course, there is usually an initialization process before the first update, i.e., pre-configuring parameters for each layer of the network). For example, if the network's predicted value is too high, the weight vectors are adjusted to make it predict lower, and the adjustment continues until the neural network can predict the truly desired target value. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value". This is the loss function or objective function, an important equation used to measure the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) indicates a larger difference, so training the neural network becomes the process of shrinking this loss as much as possible.
3) Back-propagation algorithm
A neural network uses the back-propagation algorithm to correct the values of its network parameters during training, making the reconstruction error loss of the neural network model smaller and smaller. Specifically, forward-propagating the input signal to the output produces an error loss, and the parameters of the initial neural network model are updated by back-propagating the error-loss information, so that the error loss converges. The back-propagation algorithm is a back-propagation movement dominated by the error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrices.
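A minimal worked example of loss plus back-propagation, assuming a one-weight model y = w·x with squared-error loss (the model, input, and learning rate are illustrative):

```python
def train_step(w, x, target, lr=0.1):
    """One gradient-descent step on y = w*x with squared-error loss;
    the gradient of the loss is the back-propagated error signal."""
    pred = w * x
    loss = (pred - target) ** 2        # gap between prediction and target
    grad = 2.0 * (pred - target) * x   # dLoss/dw via the chain rule
    return w - lr * grad, loss

w, loss0 = train_step(1.0, x=2.0, target=6.0)   # predicts 2.0, target is 6.0
_, loss1 = train_step(w, x=2.0, target=6.0)     # loss shrinks after the update
```

The first step moves w from 1.0 toward the value 3.0 that would predict the target exactly, and the second evaluation shows a smaller loss, mirroring the convergence described above.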
4) Linear operation:
Linearity refers to a proportional, straight-line relationship between quantities; mathematically, it can be understood as a function whose first derivative is constant. Linear operations include, but are not limited to, the addition operation, the no-op, the identity operation, the convolution operation, the layer normalization (LN) operation, and the pooling operation. A linear operation may also be called a linear map; a linear map must satisfy two conditions, homogeneity and additivity, and if either condition is not satisfied, the map is nonlinear.
Homogeneity means f(ax) = a·f(x); additivity means f(x + y) = f(x) + f(y). For example, f(x) = ax is linear. Note that x, a, and f(x) here are not necessarily scalars; they can be vectors or matrices, forming a linear space of any dimension. If x and f(x) are n-dimensional vectors, then when a is a constant, homogeneity is equivalently satisfied, and when a is a matrix, additivity is equivalently satisfied. In contrast, a function whose graph is a straight line does not necessarily conform to a linear map; for example, f(x) = ax + b satisfies neither homogeneity nor additivity and is therefore a nonlinear map.
In the embodiments of this application, a composition of multiple linear operations may also be called a linear operation, and each linear operation included in such a composition may be called a sub-linear operation.
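The homogeneity and additivity conditions can be checked numerically; the sketch below (function names are illustrative) confirms that a matrix map passes both tests while the affine map f(x) = Ax + b fails:

```python
import numpy as np

def is_linear(f, dim, trials=50, seed=0):
    """Numerically check homogeneity f(a*x) == a*f(x) and additivity
    f(x + y) == f(x) + f(y) on random inputs."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.normal(size=dim), rng.normal(size=dim)
        a = rng.normal()
        if not np.allclose(f(a * x), a * f(x)):
            return False
        if not np.allclose(f(x + y), f(x) + f(y)):
            return False
    return True

A = np.array([[1.0, 2.0], [3.0, 4.0]])
linear_ok = is_linear(lambda x: A @ x, 2)         # matrix map: linear
affine_ok = is_linear(lambda x: A @ x + 1.0, 2)   # f(x) = Ax + b: nonlinear
```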
5)注意力模型。5) Attention model.
注意力模型是一种应用了注意力机制的神经网络。在深度学习中,注意力机制可以被广义地定义为一个描述重要性的权重向量:通过这个权重向量为了预测或者推断一个元素。比如,对于图像中的某个像素或句子中的某个单词,可以使用注意力向量定量地估计出目标元素与其他元素之间的相关性,并由注意力向量的加权和作为目标的近似值。The attention model is a neural network that applies the attention mechanism. In deep learning, the attention mechanism can be broadly defined as a weight vector describing the importance: using this weight vector to predict or infer an element. For example, for a certain pixel in an image or a word in a sentence, attention vectors can be used to quantitatively estimate the correlation between the target element and other elements, and the weighted sum of the attention vectors can be used as an approximation of the target.
深度学习中的注意力机制模拟的是人脑的注意力机制。举个例子来说,当人类观赏一幅画时,虽然人类的眼睛可以看到整幅画的全貌,但是在人类深入仔细地观察时,其实眼睛聚焦的只有整幅画中的一部分图案,这个时候人类的大脑主要关注在这一小块图案上。也就是说,在人类仔细观察图像时,人脑对整幅图像的关注并不是均衡的,是有一定的权重区分的,这就是注意力机制的核心思想。The attention mechanism in deep learning simulates the attention mechanism of the human brain. For example, when humans look at a painting, although their eyes can see the entire painting, when humans observe deeply and carefully, their eyes actually focus on only part of the pattern in the entire painting. This At this time, the human brain mainly focuses on this small pattern. In other words, when humans carefully observe an image, the human brain's attention to the entire image is not balanced, but has a certain weight distinction. This is the core idea of the attention mechanism.
Simply put, the human visual system tends to focus selectively on certain parts of an image while ignoring other irrelevant information, which aids the brain's perception. Similarly, in the attention mechanisms of deep learning, for some problems involving language, speech, or vision, certain parts of the input may be more relevant than others. The attention mechanism in an attention model therefore allows the model to dynamically focus only on the parts of the input that help it perform the task at hand effectively.
6) Self-attention network.
A self-attention network is a neural network that applies a self-attention mechanism. The self-attention mechanism is an extension of the attention mechanism: it is an attention mechanism that relates different positions of a single sequence to one another in order to compute a representation of that same sequence. Self-attention can play a key role in machine reading, abstractive summarization, and image caption generation. Taking the application of a self-attention network to natural language processing as an example, the network processes input data of arbitrary length, generates a new feature expression of the input data, and then converts that feature expression into the target words. The self-attention layer in the network uses the attention mechanism to capture the relationships between each word and all the other words, thereby generating a new feature expression for each word. The advantage of a self-attention network is that the attention mechanism can directly capture the relationships between all the words in a sentence regardless of their positions.
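The behaviour described above can be sketched as a minimal single-head self-attention computation. The sequence length, embedding dimension, and randomly initialised projection matrices below are illustrative assumptions, not values from the disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 4, 8                      # sequence length, embedding dimension (assumed)
H = rng.normal(size=(L, d))      # input sequence: one embedding per token

# learnable projections to query, key, value (randomly initialised here)
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))
Q, K, V = H @ W_q, H @ W_k, H @ W_v

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# every position attends to every position of the same sequence,
# regardless of word order
attn = softmax(Q @ K.T / np.sqrt(d))     # (L, L) relevance weights
out = attn @ V                           # new feature expression per token

assert out.shape == (L, d)
assert np.allclose(attn.sum(axis=1), 1.0)  # each row is a distribution
```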
The data processing method provided by the embodiments of this application may be performed by an execution device, or the attention model may be deployed on an execution device. An execution device may be implemented by one or more computing devices. For example, see Figure 2, which shows a system architecture 200 provided by an embodiment of this application. The system architecture 200 includes an execution device 210. The execution device 210 may be implemented by one or more computing devices, and may be arranged at one physical site or distributed across multiple physical sites. The system architecture 200 further includes a data storage system 250. Optionally, the execution device 210 cooperates with other computing devices, such as data storage, router, and load balancer devices. The execution device 210 may use the data in the data storage system 250, or call the program code in the data storage system 250, to implement the data processing method provided by this application. The one or more computing devices may be deployed in a cloud network. In one example, the data processing method provided by the embodiments of this application is deployed, in the form of a service, on one or more computing devices of a cloud network, and user equipment accesses the cloud service through the network. When the execution device is one or more computing devices of a cloud network, the execution device may also be called a cloud service device.
In another example, the data processing method provided by the embodiments of this application may be deployed, in the form of a software tool, on one or more local computing devices.
Users may operate their respective user devices (for example, local device 301 and local device 302) to interact with the execution device 210. Each local device may represent any computing device, such as a smartphone (mobile phone), a personal computer (PC), a laptop, a tablet, a smart TV, a mobile internet device (MID), a wearable device, a smart camera, a smart car or another type of cellular phone, a media consumption device, a set-top box, a game console, a virtual reality (VR) device, an augmented reality (AR) device, a wireless electronic device in industrial control, a wireless electronic device in self driving, a wireless electronic device in remote medical surgery, a wireless electronic device in a smart grid, a wireless electronic device in transportation safety, a wireless electronic device in a smart city, or a wireless electronic device in a smart home.
Each user's local device may interact with the execution device 210 through a communication network of any communication mechanism or communication standard. The communication network may be a wide area network, a local area network, a point-to-point connection, or the like, or any combination thereof.
In another implementation, one or more aspects of the execution device 210 may be implemented by each local device. For example, the local device 301 may provide local data to, or feed prediction results back to, the execution device 210.
It should be noted that all functions of the execution device 210 may also be implemented by a local device. For example, the local device 301 implements the functions of the execution device 210 and provides services for its own user, or provides services for a user of the local device 302. The local device 301 may be an electronic device; for example, the electronic device may be a server, a smartphone (mobile phone), a personal computer (PC), a laptop, a tablet, a smart TV, a mobile internet device (MID), a wearable device, a virtual reality (VR) device, an augmented reality (AR) device, a wireless electronic device in industrial control, a wireless electronic device in self driving, a wireless electronic device in remote medical surgery, a wireless electronic device in a smart grid, a wireless electronic device in transportation safety, a wireless electronic device in a smart city, a wireless electronic device in a smart home, or the like.
The data processing method and attention model provided by the embodiments of this application can be applied to computer vision or natural language processing. That is, through the above data processing method, an electronic device or computing device can perform computer vision tasks or natural language processing tasks.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. In general, natural language processing tasks mainly include machine translation, automatic summary generation, opinion extraction, text classification, question answering, text semantic comparison, speech recognition, and the like.
Computer vision is the science of how to make machines learn to see. More specifically, computer vision refers to machine vision in which cameras and computers replace human eyes to identify, track, and measure targets, with further graphics processing so that the processed images are better suited for human observation or for transmission to instruments for detection. Generally speaking, computer vision tasks include image classification, object detection, semantic segmentation, image generation, and the like.
Image recognition is a common classification problem and is usually also called image classification. Specifically, in an image recognition task, the input of the neural network is image data, and the output is the probability that the current image data belongs to each category; the category with the largest probability value is usually selected as the predicted category of the image data. Image recognition was one of the earliest tasks in which deep learning was successfully applied; classic network models include the VGG series, the Inception series, and the ResNet series.
Object detection refers to automatically detecting, by means of an algorithm, the approximate locations of common objects in an image. A bounding box is usually used to represent the approximate location of an object, and the category of the object within the bounding box is also classified.
Semantic segmentation refers to automatically segmenting and identifying the content in an image by means of an algorithm. Semantic segmentation can be understood as a per-pixel classification problem, that is, analyzing the category of the object to which each pixel belongs.
Image generation refers to learning the distribution of real images and sampling from the learned distribution to obtain generated images of high fidelity, for example, generating a sharp image from a blurred image, or generating a dehazed image from a hazy image.
As described in the background, if a self-attention network obtains, for each element, related information from all the other elements, the computational cost is high. In one possible approach, criss-cross attention is used: only the correlations between elements in a cross-shaped region are considered, which reduces complexity compared with computing over all pixels. However, the data dimensions along the row direction and the column direction of the cross are generally different, and in some cases the difference is large; for audio data, video data, and the like, the dimensions may differ by a factor of ten or more. With criss-cross attention, too much attention is paid to the data of the larger dimension, so that the computation along the smaller-dimension direction is suppressed by the computation along the larger-dimension direction.
On this basis, the neural network and data processing method provided by the embodiments of this application compute element correlations independently along each coordinate axis; that is, for each element, attention is computed separately along the tensor direction of each coordinate axis, and the results are then combined by weighted superposition. This prevents excessive attention to the high-dimension axis from suppressing the low-dimension axis. Therefore, the neural network and data processing method provided by the embodiments of this application can improve processing accuracy while improving computational efficiency. In some embodiments, the neural network may use a convolutional neural network to compute the correlations between elements. In other embodiments, the neural network provided by the embodiments of this application uses a self-attention mechanism to compute the correlations between elements; in this case, the neural network may also be called an attention network.
In this embodiment, the input of the attention network is data in the form of a sequence, that is, the input data of the attention network is sequence data. For example, the input data of the attention network may be a sentence sequence composed of multiple consecutive words; as another example, the input data of the attention network may be an image-block sequence composed of multiple consecutive image blocks obtained by partitioning a complete image. Sequence data can be understood as encoded data, for example data obtained by encoding multiple consecutive words. For instance, for some data to be processed, the encoded data of each element is obtained by performing embedding generation, for example convolution processing. An element may also be called a patch. Each element in the input data can correspond to multiple coordinate axes. The coordinate axes here may be temporal, spatial, or along other dimensions, and one element may have parameter values mapped onto multiple coordinate axes. The input data may also be called data to be processed. The data to be processed may be multimedia data, such as audio data, video data, or image data. For example, if the data to be processed is audio data, each element of the audio data can be understood as an audio point. Each audio point can be mapped onto a time coordinate axis and onto a frequency coordinate axis; for example, it may have a time parameter mapped onto the time axis and a frequency parameter mapped onto the frequency axis. As another example, if the data to be processed is image data, the elements of the image data can be understood as pixels or image blocks, and each pixel or image block can be mapped onto a horizontal coordinate axis and a vertical coordinate axis. As yet another example, the data to be processed includes video data, which can be mapped onto three coordinate axes, for example a time axis, a horizontal axis, and a vertical axis. Video data includes multiple video frames, and each video frame includes multiple pixels or image blocks. The encoded data of each pixel or image block can be mapped onto the time axis, having a time parameter on the time axis, and can be mapped onto the spatial horizontal and vertical axes, having a horizontal coordinate on the horizontal axis and a vertical coordinate on the vertical axis.
Referring to Figure 3, a schematic flowchart of a data processing method provided by an embodiment of this application is shown, taking as an example a neural network implemented as an attention network.
301. Obtain data to be processed, where the data to be processed includes encoded data of multiple elements.
In some embodiments, the data processing method may be performed by a service device, such as a cloud service device. The user equipment may send a service request to the cloud service device, the service request carrying the data to be processed. The service request is used to request the cloud server to complete a specified processing task on the data to be processed. The specified processing task may be a natural language processing task, such as machine translation, automatic summary generation, opinion extraction, text classification, question answering, text semantic comparison, or speech recognition, or it may be a computer vision task, such as image recognition, object detection, semantic segmentation, or image generation.
In other embodiments, the data processing method may be performed by a local device, such as a local electronic device. The data to be processed may be generated by the electronic device itself.
302. Perform feature extraction on the data to be processed through the attention network to obtain, for each of the multiple elements, feature vectors respectively corresponding to the multiple coordinate axes, and perform weighted processing on the feature vectors of each of the multiple elements respectively corresponding to the multiple coordinate axes to obtain the feature vector of the data to be processed.
The attention network may be an attention network for classifying images, an attention network for segmenting images, an attention network for performing detection on images, an attention network for recognizing images, an attention network for generating a specified image, an attention network for translating text, an attention network for paraphrasing text, an attention network for generating specified text, an attention network for recognizing speech, an attention network for translating speech, an attention network for generating specified speech, or the like.
In a possible implementation, after the feature vector of the data to be processed is obtained, the specified processing task may further be completed according to the feature vector to obtain a processing result, and the processing result may be sent to the user equipment.
As an example, if the specified processing task is image classification, the attention network is an attention network for classifying images; after the feature vector is obtained, image classification may further be performed according to the feature vector to obtain a classification result. If the specified processing task is image segmentation, the attention network is an attention network for segmenting images; after the feature vector is obtained, image segmentation may further be performed according to the feature vector to obtain a segmentation result. If the specified processing task is image detection, the attention network is an attention network for performing detection on images; after the feature vector is obtained, image detection may further be performed according to the feature vector to obtain a detection result. If the specified processing task is speech recognition, the attention network is an attention network for recognizing speech; after the feature vector is obtained, speech recognition may further be performed according to the feature vector to obtain a recognition result. If the specified processing task is speech translation, the attention network is an attention network for translating speech; after the feature vector is obtained, speech translation may further be performed according to the feature vector to obtain a translation result.
Take as an example that there are N coordinate axes, namely axis 1 to axis N, and that the elements included in the input data are mapped onto these N axes. Attention is computed separately for the elements along each axis, and the per-axis computation results for each element are then combined by a weighted sum as the output of the independent superimposed attention network. Illustratively, the weights of the different axes may be similar, for example a simple average. After the data is input, the attention network performs feature extraction on the input data to obtain, for each of the multiple elements, the feature vectors respectively corresponding to axis 1 to axis N, and then performs weighted processing on the N groups of feature vectors of each element corresponding to axis 1 to axis N, thereby obtaining the feature vector corresponding to each element.
Illustratively, take a first element as an example, where the first element is any one of the multiple elements. The feature vector of the first element corresponding to a first coordinate axis is used to characterize the correlation between the first element and the other elements in a first region in which the first element is located. The positions at which those other elements in the first region are mapped onto the coordinate axes other than the first coordinate axis are the same as and/or adjacent to the positions at which the first element is mapped onto those other coordinate axes.
The feature vector of the first element corresponding to the first coordinate axis may be determined as follows:
Attention computation is performed between the first element and each of the other elements in the first region corresponding to the first element, to obtain the attention values between the first element and those other elements, where the first element is any one of the multiple elements; weighted processing is then performed according to the attention values between the first element and the other elements to obtain the feature vector of the first element corresponding to the first coordinate axis.
Two elements being mapped onto adjacent positions on the other coordinate axes may mean that the positions are strictly adjacent, or that the interval between the positions at which the two elements are mapped onto the other coordinate axes is within a set distance.
In one case, the positions at which the two elements are mapped onto the other coordinate axes are strictly adjacent. Taking N=2 as an example, axis 1 is the horizontal-direction coordinate axis (referred to as the horizontal axis for short), and axis 2 is the vertical-direction coordinate axis (referred to as the vertical axis for short). When computing the feature vector of an element corresponding to the horizontal axis, attention computation is performed with the elements of the same row; when computing the feature vector of an element corresponding to the vertical axis, attention computation is performed with the elements of the same column.
As shown in Figure 4, each row in the horizontal direction includes 10 elements, and each column in the vertical direction includes 5 elements. As an example, take element 3-6. When computing the feature vector of element 3-6 corresponding to the horizontal axis, the attention computation results between element 3-6 and the other elements of the same row (elements 3-1 to 3-5 and 3-7 to 3-10) may be computed separately, and weighted processing is then performed according to those attention computation results to obtain the feature vector of element 3-6 corresponding to the horizontal axis. When computing the feature vector of element 3-6 corresponding to the vertical axis, the attention computation results between element 3-6 and the other elements of the same column (elements 1-6 to 2-6 and 4-6 to 5-6) may be computed separately, and weighted processing is then performed according to those attention computation results to obtain the feature vector of element 3-6 corresponding to the vertical axis.
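The element selection of the strictly-adjacent case above can be sketched as follows, using the 5-row by 10-column grid of Figure 4 and element 3-6 (1-indexed, as in the text); the listing of partner elements is illustrative only:

```python
rows, cols = 5, 10   # grid of Figure 4: 5 elements per column, 10 per row
i, j = 3, 6          # the example element 3-6 (row 3, column 6, 1-indexed)

# horizontal-axis attention partners: the other elements of the same row
row_partners = [(i, c) for c in range(1, cols + 1) if c != j]

# vertical-axis attention partners: the other elements of the same column
col_partners = [(r, j) for r in range(1, rows + 1) if r != i]

assert len(row_partners) == 9   # elements 3-1..3-5 and 3-7..3-10
assert len(col_partners) == 4   # elements 1-6..2-6 and 4-6..5-6
```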
In the other case, the interval between the positions at which the two elements are mapped onto the other coordinate axes is within a set distance. Taking N=2 as an example, axis 1 is the horizontal-direction coordinate axis (the horizontal axis for short), and axis 2 is the vertical-direction coordinate axis (the vertical axis for short). When computing the feature vector of an element corresponding to the horizontal axis, attention computation may be performed with the elements of the same row and of one or more rows adjacent to that row; when computing the feature vector of an element corresponding to the vertical axis, column attention computation may be performed with the elements of the same column and of one or more columns adjacent to that column. Referring to Figure 5 as an example, when computing the feature vector of element 3-6 corresponding to the horizontal axis, the attention computation results between element 3-6 and the other elements of the same row and the adjacent rows (elements 3-1 to 3-5, 3-7 to 3-10, 2-1 to 2-10, and 4-1 to 4-10) may be computed separately, and weighted processing is then performed according to those results to obtain the feature vector of element 3-6 corresponding to the horizontal axis. When computing the feature vector of element 3-6 corresponding to the vertical axis, the attention computation results between element 3-6 and the other elements of the same column and the adjacent columns (elements 1-6 to 2-6, 4-6 to 5-6, 1-5 to 5-5, and 1-7 to 5-7) may be computed separately, and weighted processing is then performed according to those results to obtain the feature vector of element 3-6 corresponding to the vertical axis.
It should be noted that the attention network provided by the embodiments of this application may also be called an independent superimposed attention network or a self-independent superimposed attention network; other names may also be used, which are not specifically limited in the embodiments of this application. In the following description, the name independent superimposed attention network is used as an example.
Specifically, as shown in Figure 6, the independent superimposed attention network determines a query vector (Query, Q), a key vector (Key, K), and a value vector (Value, V) from the input data, then performs attention computation along the axis-1 direction through the axis-N direction according to Q, K, and V to obtain, for each element, the feature vectors corresponding to axis 1 to axis N, and then performs a weighted sum on the feature vectors of each element corresponding to axis 1 to axis N to obtain the feature vector of each element.
It should be noted that the independent superimposed attention network provided by the embodiments of this application may use a single-head attention mechanism or a multi-head attention mechanism, which is not specifically limited in the embodiments of this application. When a multi-head attention mechanism is used, after receiving the input data, the independent superimposed attention network groups the dimensions of the input data according to the number of heads, performs the attention computation in each group in the manner provided by the embodiments of this application, and then splices the results of the multiple groups together.
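The grouping-then-splicing behaviour of the multi-head variant can be sketched as follows. For brevity the per-group attention here uses the group features directly as query, key, and value (the projection matrices of a full implementation are omitted), and all shapes are illustrative assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(H, heads):
    """Split the channel dimension into `heads` groups, attend per group,
    then splice the group results back together."""
    L, d = H.shape
    assert d % heads == 0, "channel dimension must divide evenly into heads"
    d_h = d // heads
    outputs = []
    for g in range(heads):
        Hg = H[:, g * d_h:(g + 1) * d_h]          # one group of channels
        attn = softmax(Hg @ Hg.T / np.sqrt(d_h))  # per-group attention weights
        outputs.append(attn @ Hg)
    return np.concatenate(outputs, axis=1)        # splice the group results

H = np.random.default_rng(1).normal(size=(6, 8))  # 6 elements, 8 channels
out = multi_head_attention(H, heads=2)
assert out.shape == (6, 8)
```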
For example, taking N=2 as an example, Q, K and V can be determined from the input data using the following formula (1):

q_(i,j) = W_Q·h_(i,j),  k_(i,j) = W_K·h_(i,j),  v_(i,j) = W_V·h_(i,j)        Formula (1).

Here, q_(i,j) denotes the Query of the element at position (i,j), k_(i,j) denotes the Key of the element at position (i,j), v_(i,j) denotes the Value of the element at position (i,j), and h_(i,j) denotes the encoded data of the element at position (i,j). The value of i ranges from 0 to m-1, and the value of j ranges from 0 to n-1; the 2-dimensional input data includes m rows and n columns.
Take the feature vector of each element corresponding to axis 1 as an example. The feature vector of each element corresponding to axis 1 can be determined using the following formula (2-1):

h^1_(i,j) = Σ_{j'=0}^{n-1} softmax_{j'}( q_(i,j)·k_(i,j') / √d_k ) · v_(i,j')        Formula (2-1).

Here, d_k is the number of dimensions of the input data, and h^1_(i,j) denotes the feature vector of the element at position (i,j) corresponding to axis 1.
Combining the above, the feature vector of the element at position (i,j) corresponding to axis 2 is determined as shown in formula (2-2):

h^2_(i,j) = Σ_{i'=0}^{m-1} softmax_{i'}( q_(i,j)·k_(i',j) / √d_k ) · v_(i',j)        Formula (2-2).

Here, m represents the number of elements included in each row of the data to be processed in the direction of the horizontal coordinate axis.
The feature vectors of each element corresponding to the two axes are then weighted, as determined by the following formula (3):

h′_(i,j) = w_1·h^1_(i,j) + w_2·h^2_(i,j)        Formula (3).

Here, h′_(i,j) denotes the feature vector of the element at position (i,j), w_1 denotes the weight of axis 1, and w_2 denotes the weight of axis 2.
Illustratively, w_1 = w_2 = 1/2.
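The two-axis computation described above (formulas (1) through (3)) can be sketched in NumPy. This is a minimal single-head illustration under the assumptions stated in the comments; the function and variable names are illustrative, not part of the embodiments.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention_2d(h, W_Q, W_K, W_V, w1=0.5, w2=0.5):
    """Independent superimposed attention over a 2-axis input.

    h: (m, n, d) encoded data. Q, K, V are obtained per formula (1);
    attention is computed separately along the two axes (formulas
    (2-1) and (2-2)) and combined by the weighted sum of formula (3).
    Single-head; the reconstructed per-axis formulas are assumptions.
    """
    d = h.shape[-1]
    q = h @ W_Q.T  # q_(i,j) = W_Q · h_(i,j)
    k = h @ W_K.T
    v = h @ W_V.T
    # attention along one axis: element (i, j) attends over (i, j')
    s1 = softmax(np.einsum('ijd,ild->ijl', q, k) / np.sqrt(d), axis=-1)
    h1 = np.einsum('ijl,ild->ijd', s1, v)
    # attention along the other axis: element (i, j) attends over (i', j)
    s2 = softmax(np.einsum('ijd,ljd->ijl', q, k) / np.sqrt(d), axis=-1)
    h2 = np.einsum('ijl,ljd->ijd', s2, v)
    return w1 * h1 + w2 * h2  # formula (3)
```

Note that each of the two einsum-based attention steps touches only one axis at a time, which is what yields the complexity advantage discussed later in the text.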
In some cases involving classification scenarios, encoded data corresponding to at least one setting element can be added to the network parameters of the independent superimposed attention network. The encoded data corresponding to the at least one setting element is a learnable embedded input of the independent superimposed attention network; that is, it can participate in training as a network parameter, and each time the network parameters are adjusted during training, the encoded data corresponding to the at least one setting element can be adjusted.
As an example, the at least one setting element may include a classification bit and/or a distillation bit. The encoded data corresponding to the classification bit may also be called a classification token (class token), and the encoded data corresponding to the distillation bit may also be called a distillation token. A student model can be trained using a knowledge distillation (KD) training mode with a teacher model; the student model can be understood as a smaller model compressed from the teacher model. The distillation bit is added to learn interactively with the teacher model, and the output is finally obtained through the distillation loss. The class token and the distillation token are learnable embedding vectors. By performing attention operations with the encoded data of each element included in the input data, the class token and the distillation token model the global relationships between elements and fuse the information of all elements, and are finally connected to a classifier for class prediction.
Refer to Figure 7, which is a schematic diagram of the processing flow of another independent superimposed attention network. Figure 7 also takes as an example that the elements included in the input data can be mapped to N coordinate axes, namely axis 1 through axis N. Attention computation is performed separately for the elements on each axis, and weighting is then performed on the computation results for each axis. The encoded data of the classification bit and the distillation bit are each attention-weighted against the encoded data of all other elements, and the results are then feature-fused with the weighted sum over the N axes. Feature fusion can use a connection function to connect features, for example the concat function.
For example, taking N=2 as an example, the above formula (1) can be used to determine Q, K and V from the input data. The Q, K and V corresponding to the classification bit can be determined by the following formula (4):

q_c = W_Q·h_c,  k_c = W_K·h_c,  v_c = W_V·h_c        Formula (4).

Here, q_c denotes the Query of the classification bit, k_c denotes the Key of the classification bit, v_c denotes the Value of the classification bit, and h_c denotes the encoded data of the classification bit.
It should be noted that the vector matrices W_Q, W_K and W_V used to compute the Q, K and V corresponding to each element in the input data are the same as the vector matrices used to compute the Q, K and V corresponding to the classification bit (and/or the distillation bit).
The feature vector corresponding to the classification bit is determined through the following formula (5):

h′_c = Σ_(i,j) softmax( q_c·k_(i,j) / √d_k ) · v_(i,j)        Formula (5).
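The attention of the classification bit over all element encodings (formulas (4) and (5)) can be sketched as follows. This is an illustrative single-head sketch; it assumes the class token attends over the element positions only, and all names are hypothetical.

```python
import numpy as np

def softmax_1d(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def class_token_attention(h_c, h, W_Q, W_K, W_V):
    """Feature vector of a classification bit (class token).

    h_c: (d,) learnable class-token embedding; h: (num_elements, d)
    flattened element encodings. The same projection matrices W_Q,
    W_K, W_V are used for the token and the elements, per the text.
    """
    d_k = h_c.shape[-1]
    q_c = W_Q @ h_c                         # formula (4)
    k = h @ W_K.T
    v = h @ W_V.T
    attn = softmax_1d(q_c @ k.T / np.sqrt(d_k))
    return attn @ v                         # attention-weighted sum of Values
```

The distillation bit would be handled identically with its own embedding h_d.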
In some possible implementations, the independent superimposed attention network may perform fully connected processing before the attention computation, to raise the dimensionality of the input data. After the attention computation is completed, further fully connected processing, such as dimensionality reduction, may be performed. The dimensionality of the input data of the independent superimposed attention network is the same as the dimensionality of its output data.
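The up-projection from E dimensions to 3*E dimensions followed by a split into Q, K and V, used repeatedly in the scenarios below, can be sketched with a hypothetical helper (the weight matrix W and function name are assumptions, not part of the embodiments):

```python
import numpy as np

def project_qkv(h, W):
    """Fully connected up-projection from E to 3*E, split into Q, K, V.

    h: (num_elements, E) encodings; W: (E, 3*E) assumed learnable weight.
    """
    q, k, v = np.split(h @ W, 3, axis=-1)
    return q, k, v

h = np.ones((7, 4))                        # 7 elements, E = 4
q, k, v = project_qkv(h, np.ones((4, 12)))
print(q.shape, k.shape, v.shape)           # (7, 4) (7, 4) (7, 4)
```

A matching (3*E, E) projection after the attention step would restore the original dimensionality, as the text describes.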
Table 1 provides a comparison of the computational complexity of a conventional attention network and the attention network provided by the embodiments of this application, taking two coordinate axes as an example, where m and n are the dimensions of the two axes and C is the feature dimension.

Table 1

Conventional attention network: Ω = 2(mn)^2·C
Independent superimposed attention network: Ω = 2mn(m+n)·C
For multiple axes, assuming that the dimension of the i-th axis is N_i, the computational complexity of the independent superimposed attention network is given by (6), the complexity of a conventional attention network is given by (7), and the ratio of the two is given by (8). It can be seen from formulas (6), (7) and (8) that the solution provided by the embodiments of this application also has the advantage of low complexity in scenarios where the dimensions of the axes are comparable.

Ω(independent superposition) = 2C(∏N_i)(∑N_i)        (6)

Ω(conventional) = 2C(∏N_i)^2        (7)

Ω(independent superposition)/Ω(conventional) = (∑N_i)/(∏N_i)        (8)
The solution provided by the embodiments of this application is applicable in multi-axis scenarios. For example, for video data, assuming a spatial size of 128×128 and a temporal size of 16, the computational complexity of the independent superimposed attention network is 0.1% of that of a conventional attention network. For example, with 10 coordinate axes and an element dimension of 128 on each axis, the computational complexity of the independent superimposed attention network is 1.1×10^-18 of that of a conventional attention network.
In some scenarios, the independent superimposed attention network provided by the embodiments of this application can be applied in a transformer module for processing data, for example image classification, segmentation and target localization; video action classification, temporal localization and spatio-temporal localization; audio and music classification, sound source separation, and so on. As an example, refer to Figure 8, which is a schematic structural diagram of a transformer module illustrated in an embodiment of this application.
As shown in Figure 8, the transformer module may include the independent superimposed attention network provided by the embodiments of this application, a linear layer and a multilayer perceptron. The independent superimposed attention network is used to extract features from the input data. The linear layer can be a layer normalization (layer normalization, LN) layer; the LN layer is used to normalize the output of the independent superimposed attention network. The multilayer perceptron (multilayer perceptron, MLP) is connected in series with the independent superimposed attention network. The multilayer perceptron can include multiple serial fully connected layers; specifically, the multilayer perceptron can also be called a fully connected neural network. The multilayer perceptron includes an input layer, hidden layers and an output layer, where the number of hidden layers can be one or more. The network layers in the multilayer perceptron are all fully connected layers; that is, the input layer and the hidden layer of the multilayer perceptron are fully connected, and the hidden layer and the output layer of the multilayer perceptron are also fully connected. A fully connected layer means that every neuron in the layer is connected to all neurons in the previous layer, and is used to combine the features extracted by the previous layer.
In some possible embodiments, the transformer module may also include another linear layer for performing layer normalization; computing normalization statistics through layer normalization can reduce computation time. This layer is located at the input end of the independent superimposed attention network, as shown in Figure 9, and is used to first perform layer normalization on the data input to the transformer module, so as to reduce the training cost.
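The module layout of Figure 9 (input LN, then the attention network, then LN and MLP) can be sketched as follows. Residual connections and the ReLU activation are assumptions; the patent text does not specify them, and all names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each element's feature vector to zero mean, unit variance
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp(x, W1, b1, W2, b2):
    # two fully connected layers; ReLU is an assumed stand-in activation
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

def transformer_block(x, attention, mlp_params):
    """Pre-LN transformer module sketch; residual links are assumed."""
    x = x + attention(layer_norm(x))          # input LN -> attention network
    x = x + mlp(layer_norm(x), *mlp_params)   # LN -> multilayer perceptron
    return x
```

Here `attention` would be the independent superimposed attention network; any callable that preserves the input shape can be plugged in.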
The solutions provided by the embodiments of this application are described in detail below with reference to several application scenarios.
Scenario 1: taking audio classification as an example. Refer to Figure 10, which is a schematic structural diagram of a classification network model. The classification network model includes an embedding generation module, M1 transformer modules and a classification module. The M1 transformer modules can be deployed in series. The transformer module adopts the structure shown in Figure 9. The embedding generation module is used to extract local features from the input audio data, which can also be understood as generating the encoded data of the audio data. The audio data can be mapped to a time coordinate axis and a frequency coordinate axis. The audio data can be divided into multiple audio points, for example into T*F audio points (patches), where T denotes the time dimension and F denotes the frequency dimension. For example, with input data of 10 s at 32000 Hz, the time-spectrum has 128 frequency dimensions and 1000 time dimensions, and the audio data is divided into 99 (time) * 12 (frequency) patches. For example, with time as the horizontal coordinate axis and frequency as the vertical coordinate axis, each row includes 99 audio points and each column includes 12 audio points. The feature dimension of the local features extracted by the embedding generation module from the input audio data is denoted E1. The independent superimposed attention network in the transformer module can adopt a multi-head attention mechanism. In the classification module, the results of the classification bit and the distillation bit are averaged, and the predicted value of each class is then obtained through a linear layer.
Refer to Figure 11, which is a schematic diagram of the workflow of the classification network model. The embedding generation module includes a convolution layer, which is used to perform convolution processing on the input audio data (time-spectrum) to generate an embedding representation and output a (T×F, E1) time-frequency vector. E1 denotes the feature dimension; the feature dimension of each patch is E1. Illustratively, the embedding generation module can use a two-dimensional convolution with a relatively large convolution stride (for example, a stride of about 10), so that each generated time-frequency vector represents local patch information. In the embodiments of this application, since the purpose is audio classification, a classification bit and a distillation bit can be merged in, for example through the concat function. In some embodiments, in order to improve classification accuracy, a position encoding can be added (add) to help learn position information; the manner of position encoding is not specifically limited in the embodiments of this application.
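The 99 × 12 patch grid quoted above is consistent with a standard no-padding convolution output size over the 1000 × 128 time-spectrum; the kernel size of 16 used below is an assumption (the text only states a stride of about 10), so this is an illustrative check, not the embodiment's exact configuration.

```python
def num_patches(length, kernel, stride):
    """Convolution output count along one axis (no padding assumed)."""
    return (length - kernel) // stride + 1

# 10 s at 32000 Hz -> a 1000 x 128 time-spectrum; with an assumed
# 16x16 kernel and stride 10 this yields the 99 x 12 grid in the text
t = num_patches(1000, 16, 10)   # time patches
f = num_patches(128, 16, 10)    # frequency patches
print(t, f, t * f)              # 99 12 1188
```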
The embedding vector output by the embedding generation module is input to the backbone network part formed by the transformer modules connected in series. When the embedding vector is input to the transformer module, a linear operation, such as layer normalization, can first be performed on the embedding vector through a linear layer. The layer-normalized data is input to the independent superimposed attention network. The independent superimposed attention network can raise the dimensionality of the layer-normalized data (for example, from E1-dimensional data to 3*E1-dimensional data), and further generate the Q, K and V corresponding to each patch as well as to the classification bit and the distillation bit. Taking the case in which the independent superimposed attention network adopts a multi-head attention mechanism as an example, the independent superimposed attention network further performs multi-head splitting, attention-weights the classification bit and the distillation bit against the other patches to obtain the feature vectors of the classification bit and the distillation bit, and performs the row attention-weighted computation along the time coordinate axis and the column attention-weighted computation along the frequency coordinate axis. The feature vector obtained by weighting the results of the row attention-weighted computation and the column attention-weighted computation is connected with the feature vectors of the classification bit and the distillation bit. Illustratively, the weights corresponding to the time coordinate axis and the frequency coordinate axis are the same, both being 0.5. Further, the independent superimposed attention network performs dimensionality reduction on the connected feature vectors, reducing the 3*E1-dimensional data to E1-dimensional data. Further, after being processed by the LN layer and the MLP layer in the transformer module, the data is input to the classification module. In the classification module, the results of the classification bit and the distillation bit are averaged, and the predicted value of each class is then obtained through a linear layer.
For example, the classification network model shown in Figure 10 is used to classify the following audio data sets 1) and 2) respectively. 1) AudioSet: includes an extended ontology of 632 audio event classes and a collection of 2M (million) manually labeled 10 s sound clips extracted from videos; the classes cover a wide range of human and animal sounds, musical instruments and styles, and common everyday environmental sounds. 2) OpenMIC 2018: a musical instrument sound classification data set with 20000 samples in total, 20 classes of instruments and an audio length of 10 s. The comparison results between the solution provided by this application and existing solutions in terms of classification accuracy as well as time and system performance requirements are shown in Table 2 and Table 3. As an example, system performance is expressed as the required number of floating-point operations per second (floating-point operations per second, FLOPs), and classification accuracy is expressed as the mean average precision (Mean Average Precision, mAP).
Table 2
As can be seen from Table 2 above, for both data sets, the transformer including the independent superimposed attention network provided by the embodiments of this application achieves improved prediction accuracy compared with the prior art.
Table 3
As can be seen from Table 3, using a transformer including the independent superimposed attention network provided by the embodiments of this application, or the independent superimposed attention network itself, can improve computational efficiency compared with the prior art. It should be understood that the runs of the above prior-art methods and the runs of the methods of the embodiments of this application were obtained in the same environment. Table 2 and Table 3 are only examples, and the results may differ when run in different environments.
Scenario 2: taking end-to-end image segmentation as an example. Refer to Figure 12, which is a schematic workflow diagram of an image segmentation network model. The segmentation network model includes an embedding generation module, M2 transformer modules and a pixel reconstruction module. The M2 transformer modules can be deployed in series. The transformer module adopts the structure shown in Figure 9. The embedding generation module is used to extract local features from the input image data, which can also be understood as generating the encoded data of the image data. The image data can be mapped to a horizontal coordinate axis and a vertical coordinate axis. The image data can be divided into multiple image blocks, for example into H*W image blocks (patches). The feature dimension of the local features extracted by the embedding generation module from the input image data is denoted E2. The independent superimposed attention network in the transformer module can adopt a multi-head attention mechanism. In the pixel reconstruction module, the pixel intensity values of each image block are restored.
The embedding generation module includes a convolution layer, which is used to perform convolution processing on the input image data to generate an embedding representation and output an (H×W, E2) image vector. E2 denotes the feature dimension; the feature dimension of each element is E2. In some embodiments, in order to improve accuracy, an (H×W, E2) position encoding can be added (add) to help learn position information; the manner of position encoding is not specifically limited in the embodiments of this application.
The embedding vector output by the embedding generation module is input to the backbone network part formed by the transformer modules connected in series. When the embedding vector is input to the transformer module, a linear operation, such as layer normalization, can first be performed on the embedding vector through a linear layer. The layer-normalized data is input to the independent superimposed attention network. The independent superimposed attention network can raise the dimensionality of the layer-normalized data (for example, from E2-dimensional data to 3*E2-dimensional data), and further generate the Q, K and V corresponding to each patch. Taking the case in which the independent superimposed attention network adopts a multi-head attention mechanism as an example, the independent superimposed attention network further performs multi-head splitting, and performs the row attention-weighted computation along the horizontal coordinate axis and the column attention-weighted computation along the vertical coordinate axis. The results of the row attention-weighted computation and the column attention-weighted computation are weighted to obtain the feature vector of the image data. Illustratively, the weights corresponding to the horizontal coordinate axis and the vertical coordinate axis are the same, both being 0.5. Further, the independent superimposed attention network can also perform dimensionality reduction on the obtained feature vector of the image data, reducing the 3*E2-dimensional data to E2-dimensional data. Further, after being processed by the LN layer and the MLP layer in the transformer module, the data is input to the pixel reconstruction module. In the pixel reconstruction module, the pixel intensity values of each image block are restored after layer normalization and fully connected layer processing.
Scenario 3: taking video action classification as an example. Refer to Figure 13, which is a schematic workflow diagram of a video classification network model. The classification network model includes an embedding generation module, M3 transformer modules and a classification module. The M3 transformer modules can be deployed in series. The transformer module adopts the structure shown in Figure 9. The embedding generation module is used to extract local features from the input video data, which can also be understood as generating the encoded data of the video data. The video data can be mapped to a time coordinate axis, a horizontal coordinate axis and a vertical coordinate axis. The video data can be divided into multiple image blocks, for example into H*W*T image blocks (patches), where T denotes the dimension along the time coordinate axis, H denotes the dimension along the horizontal coordinate axis, and W denotes the dimension along the vertical coordinate axis. The embedding generation module includes a three-dimensional convolution layer, which is used to perform convolution processing on the input video data to generate an embedding representation and output an (H*W*T, E3) video vector. E3 denotes the feature dimension; the feature dimension of each patch is E3. Illustratively, in the embodiments of this application, since the purpose is video action classification, a classification bit can be merged in; for example, the data of the classification bit can be connected with the (H*W*T, E3) video vector through the concat function. In some embodiments, in order to improve classification accuracy, a position encoding can be added (add) to help learn position information; the manner of position encoding is not specifically limited in the embodiments of this application. After the classification bit is added and the position encoding is superimposed, the embedding generation module outputs an embedding vector of dimension (H*W*T+1, E3).
The embedding vector output by the embedding generation module is input to the backbone network part formed by the transformer modules connected in series. When the embedding vector is input to the transformer module, a linear operation, such as layer normalization, can first be performed on the embedding vector through a linear layer. The layer-normalized data is input to the independent superimposed attention network. The independent superimposed attention network can raise the dimensionality of the layer-normalized data (for example, from E3-dimensional data to 3*E3-dimensional data), and further generate the Q, K and V corresponding to each patch as well as to the classification bit. Taking the case in which the independent superimposed attention network adopts a multi-head attention mechanism as an example, the independent superimposed attention network further performs multi-head splitting, attention-weights the classification bit against the other patches to obtain the feature vector of the classification bit, and performs the attention-weighted computation along the time coordinate axis, the row attention-weighted computation along the horizontal coordinate axis and the column attention-weighted computation along the vertical coordinate axis. The feature vector obtained by weighting the results of the attention-weighted computation along the time coordinate axis, the row attention-weighted computation and the column attention-weighted computation is connected with the feature vector of the classification bit. Further, the independent superimposed attention network performs dimensionality reduction on the connected feature vectors, reducing the 3*E3-dimensional data to E3-dimensional data. Further, after being processed by the LN layer and the MLP layer in the transformer module, the data is input to the classification module. In the classification module, the classification information corresponding to the classification bit is obtained through a linear layer, and the action classification prediction distribution is then obtained after fully connected layer processing.
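The three-axis case above generalizes to any number of coordinate axes: attention is computed along one axis at a time while all other coordinates are held fixed, and the per-axis results are combined by a weighted sum. A NumPy sketch under those assumptions (single head, pre-computed Q, K, V; names illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(q, k, v, axis):
    """Attention along one coordinate axis of (N1, ..., Nn, d) tensors,
    holding all other coordinates fixed."""
    q = np.moveaxis(q, axis, -2)
    k = np.moveaxis(k, axis, -2)
    v = np.moveaxis(v, axis, -2)
    d = q.shape[-1]
    a = softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(d), axis=-1)
    return np.moveaxis(a @ v, -2, axis)

def independent_superposition(q, k, v, weights):
    """Weighted sum of per-axis attention results over all coordinate axes."""
    n_axes = q.ndim - 1
    return sum(w * axis_attention(q, k, v, ax)
               for ax, w in zip(range(n_axes), weights))

# video-like input: T=4 frames, H=5, W=6, feature dimension 8
rng = np.random.default_rng(3)
q = k = v = rng.standard_normal((4, 5, 6, 8))
out = independent_superposition(q, k, v, [1/3, 1/3, 1/3])
print(out.shape)  # (4, 5, 6, 8)
```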
An embodiment of this application further provides a data processing apparatus. Refer to Figure 14, which is a schematic structural diagram of a data processing apparatus provided by an embodiment of this application. The data processing apparatus includes an input unit 1410, configured to receive data to be processed, the data to be processed including encoded data of multiple elements.
A processing unit 1420 is configured to perform feature extraction on the data to be processed through a neural network to obtain, for each of the multiple elements, the feature vectors corresponding to multiple coordinate axes, and to perform weighting processing on the feature vectors of each of the multiple elements corresponding to the multiple coordinate axes, so as to obtain the feature vector of the data to be processed.
In a possible implementation, the feature vector of a first element corresponding to a first coordinate axis is used to characterize the correlation between the first element and the other elements in a first region where the first element is located; the first element is any element among the multiple elements; the positions at which the other elements in the first region where the first element is located are mapped to coordinate axes other than the first coordinate axis are the same as and/or adjacent to the positions at which the first element is mapped to those other coordinate axes; and the first coordinate axis is any coordinate axis among the multiple coordinate axes.
In a possible implementation, the processing unit 1420 is specifically configured to: perform attention computation between the first element and the other elements in the first region corresponding to the first element, to obtain the attention values between the first element and each of the other elements, the first element being any element among the multiple elements; and perform weighting processing according to the attention values between the first element and each of the other elements, to obtain the feature vector of the first element corresponding to the first coordinate axis.
In a possible implementation, the data to be processed includes audio data, the audio data includes multiple audio points, and each audio point is mapped to a time coordinate axis and a frequency coordinate axis; or
the data to be processed includes image data, the image data includes multiple pixels or image blocks, and each pixel or image block is mapped to a horizontal coordinate axis and a vertical coordinate axis; or
the data to be processed includes video data, the video data includes multiple video frames, each video frame includes multiple pixels or image blocks, and each pixel or image block is mapped to a time coordinate axis and to a spatial horizontal coordinate axis and a spatial vertical coordinate axis.
In a possible implementation, the number of coordinate axes is equal to N, and the neural network includes a linear module 1421, N attention computation modules 1422, and a weighting module 1423.
The linear module 1421 is configured to generate a first query vector, a first key vector, and a first value vector based on the data to be processed. The i-th attention computation module 1422 is configured to obtain, according to the first query vector, the first key vector, and the first value vector, the feature vector of each element of the multiple elements corresponding to the i-th coordinate axis, i being a positive integer less than or equal to N. The weighting module 1423 is configured to weight the feature vectors of each element of the multiple elements respectively corresponding to the N coordinate axes.
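For the N = 2 case, a 2-D grid with a horizontal and a vertical axis, the module structure just described can be illustrated with the following NumPy sketch. The shared Q/K/V projections, the softmax, and the equal combining weights are all illustrative assumptions; the description does not fix these details:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, D = 4, 5, 8
x = rng.standard_normal((H, W, D))          # encoded elements on a 2-D grid

# Linear module: shared projections produce Q, K, V (illustrative weights).
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, k, v):
    # attention over the next-to-last axis of k/v for each query position
    s = np.einsum('...id,...jd->...ij', q, k) / np.sqrt(D)
    return np.einsum('...ij,...jd->...id', softmax(s), v)

# Attention module 1: along the horizontal axis (within each row).
feat_h = attend(q, k, v)
# Attention module 2: along the vertical axis (within each column).
swap = lambda t: t.transpose(1, 0, 2)
feat_v = swap(attend(swap(q), swap(k), swap(v)))

# Weighting module: combine the per-axis feature vectors.
alpha = np.array([0.5, 0.5])                # could be learned parameters
feature = alpha[0] * feat_h + alpha[1] * feat_v
assert feature.shape == (H, W, D)
```

Each element thus attends only along one axis per attention module, and the weighting module merges the N per-axis results into the final feature vector.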
In a possible implementation, the neural network further includes an (N+1)-th attention computation module 1424 and a feature fusion module 1425.
The linear module 1421 is further configured to generate a second query vector, a second key vector, and a second value vector based on the encoded data of at least one preset element.
The (N+1)-th attention computation module 1424 is configured to obtain, according to the second query vector, the second key vector, and the second value vector, a feature vector corresponding to the at least one preset element; the feature vector corresponding to a preset element is used to characterize the degree of association between the preset element and the multiple elements.
The feature fusion module 1425 is configured to perform feature fusion on the feature vector of the data to be processed and the feature vector of the at least one preset element.
In a possible implementation, the encoded data of the at least one preset element is obtained through multiple rounds of adjustment as a network parameter of the neural network during training of the neural network.
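Read together, the passages above resemble a learnable token that attends over the data elements. The following sketch is one plausible reading, with random stand-ins for the trained token and projection weights, and with concatenation as an assumed form of feature fusion:

```python
import numpy as np

rng = np.random.default_rng(1)
L_, D = 10, 8
x = rng.standard_normal((L_, D))       # encoded data of the multiple elements
token = rng.standard_normal((1, D))    # preset element: a trained parameter

Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
q = token @ Wq                         # second query vector (from the token)
k, v = x @ Wk, x @ Wv                  # second key/value vectors (from data)

s = (q @ k.T) / np.sqrt(D)
w = np.exp(s - s.max()); w /= w.sum()  # association of token with each element
token_feat = w @ v                     # feature vector of the preset element

fused = np.concatenate([x, token_feat], axis=0)   # simple feature fusion
```

In training, `token` would be updated by backpropagation like any other network parameter, which matches the "multiple rounds of adjustment" described above.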
Next, an execution device provided by an embodiment of this application is introduced. Refer to Figure 15, a schematic structural diagram of an execution device provided by an embodiment of this application. The execution device may be embodied as a mobile phone, a tablet, a laptop computer, a smart wearable device, a server, or the like, which is not limited here. Specifically, an embodiment of this application further provides another structure of the apparatus. As shown in Figure 15, the execution device 1500 may include a communication interface 1510 and a processor 1520. Optionally, the execution device 1500 may further include a memory 1530, which may be disposed inside or outside the device. In one example, each unit shown in Figure 14 may be implemented by the processor 1520. In another example, the function of the input unit is implemented by the communication interface 1510, and the function of the processing unit 1420 is implemented by the processor 1520. The processor 1520 receives the data to be processed through the communication interface 1510 and is configured to implement the methods described in Figure 3 and Figures 6 to 13. During implementation, each step of the processing flow may complete the methods described in Figure 3 and Figures 6 to 13 through an integrated logic circuit of hardware in the processor 1520 or through instructions in the form of software.
In the embodiments of this application, the communication interface 1510 may be a circuit, a bus, a transceiver, or any other apparatus that can be used for information exchange, where, for example, the other apparatus may be a device connected to the execution device 1500.
In the embodiments of this application, the processor 1520 may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the methods disclosed in the embodiments of this application may be performed directly by a hardware processor, or performed by a combination of hardware in the processor and software units. The program code executed by the processor 1520 to implement the above methods may be stored in the memory 1530. The memory 1530 is coupled to the processor 1520.
The coupling in the embodiments of this application is an indirect coupling or communication connection between apparatuses, units, or modules, which may be electrical, mechanical, or in other forms, and is used for information exchange between the apparatuses, units, or modules.
The processor 1520 may operate in cooperation with the memory 1530. The memory 1530 may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM). The memory 1530 is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The specific connection medium between the communication interface 1510, the processor 1520, and the memory 1530 is not limited in the embodiments of this application. In the embodiments of this application, the memory 1530, the processor 1520, and the communication interface 1510 are connected through a bus in Figure 15, the bus being represented by a thick line in Figure 15; the connection manners between other components are merely illustrative and not limiting. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in Figure 15, but this does not mean that there is only one bus or one type of bus.
Based on the above embodiments, an embodiment of this application further provides a computer storage medium storing a software program. When the software program is read and executed by one or more processors, it can implement the method provided by any one or more of the above embodiments. The computer storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random-access memory, a magnetic disk, or an optical disc.
Based on the above embodiments, an embodiment of this application further provides a chip. The chip includes a processor configured to implement the functions involved in any one or more of the above embodiments, for example obtaining or processing the information or messages involved in the above methods. Optionally, the chip further includes a memory for the program instructions and data necessary for the processor. The chip may consist of the chip alone, or may include the chip together with other discrete devices.
Specifically, refer to Figure 16, a schematic structural diagram of a chip provided by an embodiment of this application. The chip may be embodied as a neural-network processing unit NPU 1600. The NPU 1600 is mounted as a coprocessor to a host CPU, and the host CPU allocates tasks to it. The core part of the NPU is the arithmetic circuit 1603; the controller 1604 controls the arithmetic circuit 1603 to extract matrix data from memory and perform multiplication operations.
In some implementations, the arithmetic circuit 1603 internally includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 1603 is a two-dimensional systolic array. The arithmetic circuit 1603 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1603 is a general-purpose matrix processor.
For example, assume there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 1602 and caches it on each PE in the arithmetic circuit. The arithmetic circuit fetches the data of matrix A from the input memory 1601, performs matrix operations with matrix B, and stores partial or final results of the resulting matrix in the accumulator 1608.
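The accumulation of partial results described above can be mimicked in software. The rank-1 update below is only a behavioral illustration of the PE/accumulator flow, not the NPU's actual hardware dataflow:

```python
import numpy as np

A = np.arange(6, dtype=float).reshape(2, 3)  # input matrix A (input memory)
B = np.ones((3, 2))                          # weight matrix B (weight memory)

acc = np.zeros((2, 2))                       # accumulator 1608: partial sums
for kk in range(A.shape[1]):                 # one reduction step at a time
    acc += np.outer(A[:, kk], B[kk, :])      # rank-1 partial product
assert np.allclose(acc, A @ B)               # final result equals A @ B
```

Each loop iteration corresponds to one reduction step: the accumulator holds a partial result of the output matrix until the last step produces the final result.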
The unified memory 1606 is used to store input data and output data. Weight data is transferred directly to the weight memory 1602 through the direct memory access controller (DMAC) 1605, and input data is likewise transferred to the unified memory 1606 through the DMAC.
The BIU is the bus interface unit 1610, used for the interaction between the AXI bus and both the DMAC and the instruction fetch buffer (IFB) 1609.
The bus interface unit 1610 (BIU) is used by the instruction fetch buffer 1609 to obtain instructions from external memory, and is also used by the storage-unit access controller 1605 to obtain the original data of input matrix A or weight matrix B from external memory.
The DMAC is mainly used to transfer input data from the external memory DDR to the unified memory 1606, to transfer weight data to the weight memory 1602, or to transfer input data to the input memory 1601.
The vector computation unit 1607 includes multiple arithmetic processing units and, when needed, performs further processing on the output of the arithmetic circuit 1603, such as vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. It is mainly used for computations of non-convolutional/non-fully-connected layers in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
In some implementations, the vector computation unit 1607 can store processed output vectors to the unified memory 1606. For example, the vector computation unit 1607 may apply a linear or nonlinear function to the output of the arithmetic circuit 1603, for example performing linear interpolation on the feature planes extracted by a convolutional layer, or, as another example, applying it to a vector of accumulated values to generate activation values. In some implementations, the vector computation unit 1607 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vectors can be used as activation inputs to the arithmetic circuit 1603, for example for use in subsequent layers of the neural network.
The instruction fetch buffer 1609 connected to the controller 1604 is used to store instructions used by the controller 1604. The unified memory 1606, the input memory 1601, the weight memory 1602, and the instruction fetch buffer 1609 are all on-chip memories. The external memory is private to this NPU hardware architecture.
The processor mentioned in any of the above places may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above programs.
In addition, it should be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solutions of the embodiments. In addition, in the drawings of the apparatus embodiments provided in this application, a connection relationship between modules indicates that they have a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
Those skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, optical storage, and the like) containing computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the embodiments of this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to this application without departing from the scope of this application. In this way, if these modifications and variations of this application fall within the scope of the claims of this application and their technical equivalents, this application is also intended to include these modifications and variations.

Claims (19)

  1. A data processing method, characterized in that it comprises:
    receiving data to be processed, the data to be processed comprising encoded data of a plurality of elements; and
    performing feature extraction on the data to be processed through a neural network, to obtain, for each element of the plurality of elements, feature vectors respectively corresponding to a plurality of coordinate axes, and performing weighted processing on the feature vectors of each element of the plurality of elements respectively corresponding to the plurality of coordinate axes, to obtain a feature vector of the data to be processed.
  2. The method according to claim 1, characterized in that a feature vector of a first element corresponding to a first coordinate axis is used to characterize the correlation between the first element and the other elements in a first region in which the first element is located; the first element is any element of the plurality of elements; the other elements in the first region in which the first element is located are mapped to positions on coordinate axes other than the first coordinate axis that are the same as and/or adjacent to the positions to which the first element is mapped on those other coordinate axes; and the first coordinate axis is any one of the plurality of coordinate axes.
  3. The method according to claim 2, characterized in that obtaining the feature vectors of each element of the plurality of elements respectively corresponding to the plurality of coordinate axes comprises:
    performing attention computation between the first element and the other elements in the first region corresponding to the first element, to obtain attention values between the first element and each of the other elements, the first element being any element of the plurality of elements; and
    performing weighted processing according to the attention values between the first element and the other elements, to obtain the feature vector of the first element corresponding to the first coordinate axis.
  4. The method according to any one of claims 1 to 3, characterized in that the data to be processed comprises audio data, the audio data comprises a plurality of audio points, and each audio point is mapped to a time coordinate axis and a frequency coordinate axis; or
    the data to be processed comprises image data, the image data comprises a plurality of pixels or image blocks, and each pixel or image block is mapped to a horizontal coordinate axis and a vertical coordinate axis; or
    the data to be processed comprises video data, the video data comprises a plurality of video frames, each video frame comprises a plurality of pixels or image blocks, and each pixel or image block is mapped to a time coordinate axis and to a spatial horizontal coordinate axis and a spatial vertical coordinate axis.
  5. The method according to any one of claims 1 to 4, characterized in that the plurality of coordinate axes comprise a first coordinate axis and a second coordinate axis, and performing feature extraction on the data to be processed through the neural network to obtain the feature vectors of each element of the plurality of elements respectively corresponding to the plurality of coordinate axes comprises:
    generating, through the neural network, a first query vector, a first key vector, and a first value vector based on the data to be processed;
    obtaining, according to the first query vector, the first key vector, and the first value vector, the feature vector of each element of the plurality of elements corresponding to the first coordinate axis; and
    obtaining, according to the first query vector, the first key vector, and the first value vector, the feature vector of each element of the plurality of elements corresponding to the second coordinate axis.
  6. The method according to claim 5, characterized in that the method further comprises:
    generating a second query vector, a second key vector, and a second value vector based on encoded data of at least one preset element;
    obtaining, according to the second query vector, the second key vector, and the second value vector, a feature vector corresponding to the at least one preset element, the feature vector corresponding to a preset element being used to characterize the degree of association between the preset element and the plurality of elements; and
    performing feature fusion on the feature vector of the data to be processed and the feature vector of the at least one preset element.
  7. The method according to claim 6, characterized in that the encoded data of the at least one preset element is obtained through multiple rounds of adjustment as a network parameter of the neural network during training of the neural network.
  8. A data processing apparatus, characterized in that it comprises:
    an input unit, configured to receive data to be processed, the data to be processed comprising encoded data of a plurality of elements; and
    a processing unit, configured to perform feature extraction on the data to be processed through a neural network, to obtain, for each element of the plurality of elements, feature vectors respectively corresponding to a plurality of coordinate axes, and to perform weighted processing on the feature vectors of each element of the plurality of elements respectively corresponding to the plurality of coordinate axes, to obtain a feature vector of the data to be processed.
  9. The apparatus according to claim 8, characterized in that a feature vector of a first element corresponding to a first coordinate axis is used to characterize the correlation between the first element and the other elements in a first region in which the first element is located; the first element is any element of the plurality of elements; the other elements in the first region in which the first element is located are mapped to positions on coordinate axes other than the first coordinate axis that are the same as and/or adjacent to the positions to which the first element is mapped on those other coordinate axes; and the first coordinate axis is any one of the plurality of coordinate axes.
  10. The apparatus according to claim 9, characterized in that the processing unit is specifically configured to:
    perform attention computation between the first element and the other elements in the first region corresponding to the first element, to obtain attention values between the first element and each of the other elements, the first element being any element of the plurality of elements; and
    perform weighted processing according to the attention values between the first element and the other elements, to obtain the feature vector of the first element corresponding to the first coordinate axis.
  11. The apparatus according to any one of claims 8 to 10, characterized in that the data to be processed comprises audio data, the audio data comprises a plurality of audio points, and each audio point is mapped to a time coordinate axis and a frequency coordinate axis; or
    the data to be processed comprises image data, the image data comprises a plurality of pixels or image blocks, and each pixel or image block is mapped to a horizontal coordinate axis and a vertical coordinate axis; or
    the data to be processed comprises video data, the video data comprises a plurality of video frames, each video frame comprises a plurality of pixels or image blocks, and each pixel or image block is mapped to a time coordinate axis and to a spatial horizontal coordinate axis and a spatial vertical coordinate axis.
  12. The apparatus according to any one of claims 8 to 11, characterized in that the number of coordinate axes is equal to N;
    the linear module is configured to generate a first query vector, a first key vector, and a first value vector based on the data to be processed;
    an i-th attention computation module is configured to obtain, according to the first query vector, the first key vector, and the first value vector, the feature vector of each element of the plurality of elements corresponding to an i-th coordinate axis, i being a positive integer less than or equal to N; and
    a weighting module is configured to weight the feature vectors of each element of the plurality of elements respectively corresponding to the N coordinate axes.
  13. The apparatus according to claim 12, characterized in that the neural network further comprises an (N+1)-th attention computation module and a feature fusion module;
    the linear module is further configured to generate a second query vector, a second key vector, and a second value vector based on encoded data of at least one preset element;
    the (N+1)-th attention computation module is configured to obtain, according to the second query vector, the second key vector, and the second value vector, a feature vector corresponding to the at least one preset element, the feature vector corresponding to a preset element being used to characterize the degree of association between the preset element and the plurality of elements; and
    the feature fusion module is configured to perform feature fusion on the feature vector of the data to be processed and the feature vector of the at least one preset element.
  14. The apparatus according to claim 13, characterized in that the encoded data of the at least one preset element is obtained through multiple rounds of adjustment as a network parameter of the neural network during training of the neural network.
  15. A data processing system, characterized in that it comprises a user equipment and a cloud service device;
    the user equipment is configured to send a service request to the cloud service device, the service request carrying data to be processed, the data to be processed comprising encoded data of a plurality of elements, and the service request being used to request the cloud service device to complete a specified processing task on the data to be processed;
    the cloud service device is configured to perform feature extraction on the data to be processed through a neural network, to obtain, for each element of the plurality of elements, feature vectors respectively corresponding to a plurality of coordinate axes, and to perform weighted processing on the feature vectors of each element of the plurality of elements respectively corresponding to the plurality of coordinate axes, to obtain a feature vector of the data to be processed; to complete the specified processing task according to the feature vector of the data to be processed, to obtain a processing result; and to send the processing result to the user equipment; and
    the user equipment is further configured to receive the processing result from the cloud service device.
  16. An electronic device, characterized in that the electronic device comprises at least one processor and a memory;
    the memory stores instructions; and
    the at least one processor is configured to execute the instructions stored in the memory, to implement the method according to any one of claims 1 to 7.
  17. A chip system, wherein the chip system comprises at least one processor and a communication interface, the communication interface and the at least one processor being interconnected through a line;
    the communication interface is configured to receive data to be processed; and
    the processor is configured to perform the method according to any one of claims 1 to 7 on the data to be processed.
  18. A computer storage medium, wherein the computer storage medium stores instructions which, when executed by a computer, cause the computer to implement the method according to any one of claims 1 to 7.
  19. A computer program product, wherein the computer program product stores instructions which, when executed by a computer, cause the computer to implement the method according to any one of claims 1 to 7.
PCT/CN2023/093668 2022-05-24 2023-05-11 Data processing method and apparatus WO2023226783A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210569598.0A CN117172297A (en) 2022-05-24 2022-05-24 Data processing method and device
CN202210569598.0 2022-05-24

Publications (1)

Publication Number Publication Date
WO2023226783A1 true WO2023226783A1 (en) 2023-11-30

Family

ID=88918423

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/093668 WO2023226783A1 (en) 2022-05-24 2023-05-11 Data processing method and apparatus

Country Status (2)

Country Link
CN (1) CN117172297A (en)
WO (1) WO2023226783A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046847A (en) * 2019-12-30 2020-04-21 北京澎思科技有限公司 Video processing method and device, electronic equipment and medium
US20200193206A1 (en) * 2018-12-18 2020-06-18 Slyce Acquisition Inc. Scene and user-input context aided visual search
CN112183335A (en) * 2020-09-28 2021-01-05 中国人民大学 Handwritten image recognition method and system based on unsupervised learning


Also Published As

Publication number Publication date
CN117172297A (en) 2023-12-05

Similar Documents

Publication Publication Date Title
WO2021159714A1 (en) Data processing method and related device
CN111797893B (en) Neural network training method, image classification system and related equipment
US20220180202A1 (en) Text processing model training method, and text processing method and apparatus
Gao et al. MSCFNet: A lightweight network with multi-scale context fusion for real-time semantic segmentation
WO2022007823A1 (en) Text data processing method and device
CN113762322B (en) Video classification method, device and equipment based on multi-modal representation and storage medium
WO2021164772A1 (en) Method for training cross-modal retrieval model, cross-modal retrieval method, and related device
WO2019228358A1 (en) Deep neural network training method and apparatus
WO2022022274A1 (en) Model training method and apparatus
WO2022001805A1 (en) Neural network distillation method and device
WO2022253074A1 (en) Data processing method and related device
WO2020104499A1 (en) Action classification in video clips using attention-based neural networks
WO2023165361A1 (en) Data processing method and related device
WO2020062299A1 (en) Neural network processor, data processing method and related device
US20240185086A1 (en) Model distillation method and related device
WO2021136058A1 (en) Video processing method and device
WO2024001806A1 (en) Data valuation method based on federated learning and related device therefor
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN116958324A (en) Training method, device, equipment and storage medium of image generation model
Yi et al. Elanet: effective lightweight attention-guided network for real-time semantic segmentation
WO2024179485A1 (en) Image processing method and related device thereof
CN112529149A (en) Data processing method and related device
CN114282055A (en) Video feature extraction method, device and equipment and computer storage medium
CN113657272B (en) Micro video classification method and system based on missing data completion
WO2024114659A1 (en) Summary generation method and related device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23810856

Country of ref document: EP

Kind code of ref document: A1