CN114925320B - Data processing method and related device - Google Patents

Data processing method and related device

Info

Publication number
CN114925320B
CN114925320B (granted from application CN202111146270.XA)
Authority
CN
China
Prior art keywords
matrix
target
blocks
sparse
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111146270.XA
Other languages
Chinese (zh)
Other versions
CN114925320A (en)
Inventor
胡智恒
刘艳琳
王永忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202111146270.XA
Publication of CN114925320A
Application granted
Publication of CN114925320B
Legal status: Active
Anticipated expiration: not stated

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Optimization (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The application discloses a data processing method applied in the field of artificial intelligence. The method comprises the following steps: obtaining blocking information of a sparse matrix, where the blocking information indicates a plurality of matrix blocks into which the sparse matrix is divided; obtaining, according to the blocking information, the matrix data corresponding to each of the matrix blocks from a first matrix and a second matrix; performing matrix multiplication on the matrix data to obtain the plurality of matrix blocks, where the matrix blocks together contain all the non-zero elements of the sparse matrix and each matrix block contains a plurality of elements; and splicing the matrix blocks to obtain a target matrix, where the target matrix is used to perform the operations related to the sparse matrix in a sparse attention network. With this scheme, repeated transfers of matrix data are effectively avoided, the number of data-transfer instructions is reduced, and data processing efficiency is improved.

Description

Data processing method and related device
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a data processing method and a related device.
Background
Artificial intelligence (artificial intelligence, AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of intelligent machines, so that the machines can perceive, reason, and make decisions.
In recent years, self-attention networks have proven highly useful in many natural language processing (Natural Language Processing, NLP) tasks, such as machine translation, sentiment analysis, and question answering. With their widespread use, self-attention networks originating in natural language processing have also achieved very high performance in tasks such as image classification, object detection, and image processing.
Because part of the computational information in the self-attention network is redundant, sparse attention networks that reduce the amount of computation have been developed. In a sparse attention network, the core computation produces, from two dense matrices, a sparse matrix that characterizes the sparse attention features. At present, the sparse matrix is calculated by obtaining, based on the positions of the non-zero elements in the sparse matrix, the matrix data required to compute each non-zero element, and computing the non-zero elements one by one.
However, this way of computing the sparse matrix requires frequent transfers of the matrix data needed for computation, and the same matrix data is transferred repeatedly. As a result, the number of data-transfer instructions increases dramatically, the compute-to-memory-access ratio of the matrix operation drops, and data processing efficiency is low.
Disclosure of Invention
The application provides a data processing method that effectively avoids repeated transfers of matrix data, reduces the number of data-transfer instructions, and improves data processing efficiency.
A first aspect of the application provides a data processing method applied to a sparse attention network in the field of artificial intelligence. The method comprises the following steps: obtaining blocking information of a sparse matrix, where the sparse matrix is an intermediate matrix produced while performing the operations of the sparse attention network and is used for subsequent operations in the network. In a sparse attention network, the non-zero elements of the sparse matrix are distributed regularly. Therefore, by dividing the sparse matrix, a plurality of matrix blocks containing all the non-zero elements of the sparse matrix can be obtained; the blocking information of the sparse matrix indicates the plurality of matrix blocks into which the sparse matrix is divided.
According to the blocking information of the sparse matrix, the matrix data corresponding to each of the matrix blocks is obtained from a first matrix and a second matrix, where the first matrix and the second matrix are the matrices from which the sparse matrix is calculated. For example, when the operation is performed on a graphics processor (Graphics Processing Unit, GPU), the first matrix and the second matrix are stored in the GPU memory, and the GPU executes data-transfer instructions to move the matrix data corresponding to a matrix block from memory to its arithmetic unit, in order to compute the concrete values of the elements in that block.
Matrix multiplication is then performed on the matrix data to obtain the plurality of matrix blocks, where the matrix blocks together contain all the non-zero elements of the sparse matrix and each matrix block contains a plurality of elements.
The matrix blocks are spliced to obtain a target matrix, and the target matrix is used to perform the operations related to the sparse matrix in the sparse attention network. That is, the target matrix obtained by splicing the matrix blocks replaces the sparse matrix in the other sparse-matrix-related operations of the sparse attention network.
In this scheme, the sparse matrix is divided in advance to obtain a plurality of matrix blocks. During the computation of the sparse matrix, the matrix data required to compute each matrix block is fetched and the actual values of each block are calculated. Finally, the computed matrix blocks are spliced into a target matrix that replaces the sparse matrix in the subsequent sparse-matrix-related computations of the sparse attention network. Because the sparse matrix is computed block by block, repeated transfers of matrix data are effectively avoided, the number of data-transfer instructions is reduced, and the compute-to-memory-access ratio of the matrix operation is raised, thereby improving data processing efficiency.
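As an illustrative sketch only (not part of the patent text), the block-wise computation can be expressed in NumPy as follows; the tuple-based block descriptors standing in for the blocking information, and all names, are hypothetical:

```python
import numpy as np

def compute_blocks(Q, K, block_info):
    """Compute each matrix block of the sparse product of Q and K-transpose.

    block_info is a list of (row_start, row_end, col_start, col_end)
    tuples, one tuple per matrix block; together the blocks cover all
    the non-zero elements of the sparse matrix.
    """
    blocks = []
    for r0, r1, c0, c1 in block_info:
        # One contiguous slab of rows of Q and one of rows of K per block,
        # instead of one transfer per non-zero element.
        blocks.append(Q[r0:r1, :] @ K[c0:c1, :].T)
    return blocks
```

Each iteration moves one contiguous slab of the first matrix and one of the second, so the number of data transfers scales with the number of blocks rather than with the number of non-zero elements.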
In one possible implementation, when the matrix blocks of the sparse matrix contain a target element, the values of the target elements in the matrix blocks are adjusted to zero to obtain adjusted matrix blocks, where a target element is an element whose value is zero in the sparse matrix. The adjusted matrix blocks are then spliced to obtain the target matrix. In this way, the element value at each position of the target matrix is the same as the element value at the corresponding position of the sparse matrix.
That is, the computed matrix blocks may hold non-zero values at positions that correspond to zero-valued elements of the sparse matrix. In this case, the values at those positions in the matrix blocks need to be adjusted to zero so that every position in the matrix blocks matches the corresponding position in the sparse matrix.
In this scheme, by adjusting the values of the target elements in the matrix blocks, the target matrix obtained by splicing the adjusted blocks becomes element-wise identical to the sparse matrix, so the target matrix can replace the sparse matrix in the subsequent operations and those operations proceed normally.
In addition, when the matrix blocks contain target elements, a small amount of redundant computation is added compared with the prior art, but the number of data transfers is greatly reduced and repeated transfers are avoided, which improves data processing efficiency.
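A minimal sketch of this adjustment, assuming a per-block boolean mask that marks the positions holding zero in the sparse matrix (names hypothetical):

```python
import numpy as np

def zero_target_elements(blocks, zero_masks):
    """Zero each block position that corresponds to a zero of the sparse matrix.

    zero_masks[i] is a boolean array of the same shape as blocks[i];
    True marks a target element (zero-valued in the sparse matrix).
    """
    return [np.where(m, 0.0, b) for b, m in zip(blocks, zero_masks)]
```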
In one possible implementation, every matrix block of the sparse matrix has the same number of rows, or the same number of columns, and the matrix blocks represent local attention features. For example, if the sparse matrix subsequently undergoes a normalization operation: when normalization is performed with the rows of the sparse matrix as the basic unit, each matrix block has the same number of columns; when normalization is performed with the columns of the sparse matrix as the basic unit, each matrix block has the same number of rows.
When each matrix block has the same number of rows, the blocks are spliced in the column dimension to obtain the target matrix, and the target matrix has the same number of rows as the blocks.
When each matrix block has the same number of columns, the blocks are spliced in the row dimension to obtain the target matrix, and the target matrix has the same number of columns as the blocks.
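In NumPy terms, the two splicing cases reduce to concatenation along opposite axes; a sketch under the same hypothetical naming:

```python
import numpy as np

def splice_blocks(blocks, same_rows):
    # Blocks with the same number of rows are spliced in the column
    # dimension (axis=1); blocks with the same number of columns are
    # spliced in the row dimension (axis=0).
    return np.concatenate(blocks, axis=1 if same_rows else 0)
```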
In one possible implementation, the blocking information of the sparse matrix includes first blocking information and second blocking information. The first blocking information indicates a plurality of first matrix blocks divided in the sparse matrix; the first matrix blocks have the same number of columns, and their total number of rows equals the number of rows of the sparse matrix. The second blocking information indicates a plurality of second matrix blocks divided in the sparse matrix; the second matrix blocks have the same number of rows as the sparse matrix, and their number of columns is smaller than the number of columns of the sparse matrix. The first matrix blocks may represent local attention features, and the second matrix blocks may represent global attention features.
After the first matrix blocks and the second matrix blocks are computed based on the first and second blocking information, the first matrix blocks are spliced in the row dimension to obtain a first target matrix, and the second matrix blocks are spliced in the column dimension to obtain a second target matrix. The first target matrix and the second target matrix are then spliced in the column dimension to obtain the target matrix.
In one possible implementation, before the first target matrix and the second target matrix are spliced, it may be determined whether they contain elements that occupy the same position in the sparse matrix, that is, whether the two matrices contain repeated elements.
For example, if a first element of the first target matrix and a second element of the second target matrix occupy the same position in the sparse matrix, the value of the second element in the second target matrix is adjusted to zero to obtain an adjusted second target matrix; the first target matrix and the adjusted second target matrix are then spliced in the column dimension to obtain the target matrix.
In this scheme, different types of matrix blocks of the sparse matrix are computed from different blocking information; blocks of the same type are spliced first, and the two resulting matrices are then spliced into the target matrix, as sketched below. This splicing order keeps the data contiguous while reducing the amount of matrix multiplication, so repeated data movement is minimized. Moreover, when the target matrix subsequently undergoes a normalization operation, both the amount of computation and the number of data-transfer instructions required by the normalization are reduced.
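The following sketch illustrates the splicing with repeated-element adjustment, assuming a precomputed boolean mask of the repeated positions (all names hypothetical):

```python
import numpy as np

def combine_target_matrices(first_target, second_target, duplicate_mask):
    """Splice the first (local) and second (global) target matrices.

    duplicate_mask is True where an element of second_target occupies the
    same position in the sparse matrix as an element of first_target;
    such repeated elements are adjusted to zero before splicing.
    Both matrices must have the same number of rows.
    """
    adjusted = np.where(duplicate_mask, 0.0, second_target)
    return np.concatenate([first_target, adjusted], axis=1)
```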
In one possible implementation, the data processing method further includes: normalizing the target matrix to obtain a normalized target matrix; and obtaining an output matrix based on the normalized target matrix and a third matrix, where the output matrix is the output of the attention module in the sparse attention network, and the first, second, and third matrices are computed from the same matrix with different weights.
For example, the first matrix may be the matrix Q of the sparse attention network, the second matrix the matrix K, and the third matrix the matrix V; the matrices Q, K, and V are each computed from the same input matrix with different weights.
In one possible implementation, obtaining the output matrix based on the normalized target matrix and the third matrix includes: splitting the normalized target matrix in the column dimension, according to the way the target matrix was spliced, into a third target matrix and a fourth target matrix, where the third target matrix has the same size as the first target matrix and the fourth target matrix has the same size as the second target matrix; splitting the third target matrix and the third matrix, according to the way the first target matrix was spliced, into matrix blocks of the third target matrix and matrix blocks of the third matrix that correspond one to one; multiplying each matrix block of the third target matrix with its corresponding block of the third matrix, and splicing the resulting blocks in the row dimension to obtain a first output matrix; multiplying the fourth target matrix with a sub-matrix of the third matrix, formed by several rows of the third matrix, to obtain a second output matrix; and adding the first output matrix and the second output matrix to obtain the output matrix.
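A heavily simplified sketch of this procedure in NumPy, assuming square first matrix blocks (so that the column count of the third target matrix equals each block's row count); all names are hypothetical:

```python
import numpy as np

def attention_output(target, V, local_width, local_row_ranges, global_rows):
    """Sketch of the output computation described above.

    target           : spliced target matrix (local columns | global columns)
    V                : the third matrix
    local_width      : number of columns of the third (local) target matrix
    local_row_ranges : (start, end) row ranges of the first matrix blocks;
                       V is split over the same ranges, and square local
                       blocks are assumed, i.e. end - start == local_width
    global_rows      : indices of the rows of V that form the sub-matrix
                       multiplied with the fourth (global) target matrix
    """
    # Normalization: row-wise softmax over the target matrix.
    e = np.exp(target - target.max(axis=1, keepdims=True))
    normalized = e / e.sum(axis=1, keepdims=True)

    # Split into the third and fourth target matrices in the column dimension.
    third, fourth = normalized[:, :local_width], normalized[:, local_width:]

    # First output matrix: block-wise products spliced in the row dimension.
    first_out = np.concatenate(
        [third[r0:r1] @ V[r0:r1] for r0, r1 in local_row_ranges], axis=0)

    # Second output matrix: product with the sub-matrix of selected rows of V.
    second_out = fourth @ V[global_rows]

    return first_out + second_out
```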
A second aspect of the present application provides a data processing method based on a sparse attention network, including: acquiring data to be processed; processing the data to be processed based on a sparse attention network to obtain output data; wherein during processing of the data to be processed based on the sparse attention network, an operation related to a sparse matrix in the sparse attention network is performed according to the method of the first aspect or any implementation manner of the first aspect.
In one possible implementation, the data to be processed includes image data, text data, or voice data.
A third aspect of the present application provides a data processing apparatus comprising an acquisition unit and a processing unit. The acquisition unit is configured to obtain blocking information of a sparse matrix, where the sparse matrix is an intermediate matrix produced while performing the operations of a sparse attention network, and the blocking information indicates a plurality of matrix blocks into which the sparse matrix is divided. The acquisition unit is further configured to obtain, according to the blocking information, the matrix data corresponding to each of the matrix blocks from a first matrix and a second matrix, where the first matrix and the second matrix are the matrices from which the sparse matrix is calculated. The processing unit is configured to perform matrix multiplication on the matrix data to obtain the plurality of matrix blocks, where the matrix blocks together contain all the non-zero elements of the sparse matrix and each block contains a plurality of elements. The processing unit is further configured to splice the matrix blocks into a target matrix, and the target matrix is used to perform the operations related to the sparse matrix in the sparse attention network.
In a possible implementation, the processing unit is specifically configured to: when the matrix blocks contain target elements, adjust the values of the target elements in the matrix blocks to zero to obtain adjusted matrix blocks, where a target element is an element whose value is zero in the sparse matrix; and splice the adjusted matrix blocks to obtain the target matrix.
In one possible implementation, every matrix block has the same number of rows, or the same number of columns, and the matrix blocks represent local attention features. The processing unit is specifically configured to: when each matrix block has the same number of rows, splice the blocks in the column dimension to obtain the target matrix, whose number of rows equals that of the blocks; when each matrix block has the same number of columns, splice the blocks in the row dimension to obtain the target matrix, whose number of columns equals that of the blocks.
In one possible implementation, the blocking information includes first blocking information and second blocking information. The first blocking information indicates a plurality of first matrix blocks divided in the sparse matrix; the first matrix blocks have the same number of columns, and their total number of rows equals the number of rows of the sparse matrix. The second blocking information indicates a plurality of second matrix blocks divided in the sparse matrix; the second matrix blocks have the same number of rows as the sparse matrix, and their number of columns is smaller than the number of columns of the sparse matrix. The processing unit is specifically configured to: splice the first matrix blocks in the row dimension to obtain a first target matrix; splice the second matrix blocks in the column dimension to obtain a second target matrix; and splice the first target matrix and the second target matrix in the column dimension to obtain the target matrix.
In one possible implementation, the plurality of first matrix blocks are used to represent local attention features and the plurality of second matrix blocks are used to represent global attention features.
In a possible implementation, the processing unit is specifically configured to: if a first element of the first target matrix and a second element of the second target matrix occupy the same position in the sparse matrix, adjust the value of the second element in the second target matrix to zero to obtain an adjusted second target matrix; and splice the first target matrix and the adjusted second target matrix in the column dimension to obtain the target matrix.
In a possible implementation, the processing unit is further configured to: normalize the target matrix to obtain a normalized target matrix; and obtain an output matrix based on the normalized target matrix and a third matrix, where the output matrix is the output of the attention module in the sparse attention network, and the first, second, and third matrices are computed from the same matrix with different weights.
In a possible implementation, the processing unit is specifically configured to: split the normalized target matrix in the column dimension, according to the way the target matrix was spliced, into a third target matrix and a fourth target matrix, where the third target matrix has the same size as the first target matrix and the fourth target matrix has the same size as the second target matrix; split the third target matrix and the third matrix, according to the way the first target matrix was spliced, into matrix blocks of the third target matrix and matrix blocks of the third matrix that correspond one to one; multiply each matrix block of the third target matrix with its corresponding block of the third matrix, and splice the resulting blocks in the row dimension to obtain a first output matrix; multiply the fourth target matrix with a sub-matrix of the third matrix, formed by several rows of the third matrix, to obtain a second output matrix; and add the first output matrix and the second output matrix to obtain the output matrix.
A fourth aspect of the present application provides a data processing apparatus comprising: an acquisition unit and a processing unit; the acquisition unit is used for acquiring data to be processed; the processing unit is used for processing the data to be processed based on a sparse attention network to obtain output data; wherein during processing of the data to be processed based on the sparse attention network, an operation related to a sparse matrix in the sparse attention network is performed according to the method of the first aspect or any implementation manner of the first aspect.
In one possible implementation, the data to be processed includes image data, text data, or voice data.
A fifth aspect of the present application provides an electronic device, which may comprise a processor coupled to a memory, the memory storing program instructions that, when executed by the processor, implement the method of the first aspect or any implementation thereof. For the steps of each possible implementation of the first aspect executed by the processor, reference may be made to the first aspect, and details are not repeated here.
A sixth aspect of the application provides a computer readable storage medium having a computer program stored therein, which when run on a computer causes the computer to perform the method of the first aspect or any implementation of the first aspect.
A seventh aspect of the application provides circuitry comprising processing circuitry configured to perform the method of the first aspect or any implementation of the first aspect.
An eighth aspect of the application provides a computer program product which, when run on a computer, causes the computer to perform the method of the first aspect or any implementation of the first aspect.
A ninth aspect of the present application provides a chip system comprising a processor configured to support a server or a data processing device in implementing the functions referred to in the first aspect or any implementation thereof, for example, sending or processing the data and/or information involved in the method. In one possible design, the chip system further includes a memory for holding the program instructions and data necessary for the server or the communication device. The chip system may consist of a chip, or may include a chip and other discrete devices.
Drawings
FIG. 1 is a schematic structural diagram of a Transformer according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a Self-Attention structure according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a matrix and sequence association method of Local Attention according to an embodiment of the present application;
FIG. 4 is a schematic diagram of another matrix and sequence association method of Local Attention according to an embodiment of the present application;
FIG. 5 is a schematic diagram of another matrix and sequence association method of Local Attention according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a matrix and sequence association method of Global Attention according to an embodiment of the present application;
FIG. 7 is a schematic diagram of another Global Attention matrix according to an embodiment of the present application;
FIG. 8 is a schematic diagram of the calculation flow of a Sparse Attention structure according to an embodiment of the present application;
FIG. 9 is a schematic diagram of calculating a sparse matrix based on a conventional sparse operation according to an embodiment of the present application;
FIG. 10 is a flowchart of a data processing method according to an embodiment of the present application;
FIG. 11 is a schematic diagram of partitioning a sparse matrix according to an embodiment of the present application;
FIG. 12 is a schematic diagram of computing a matrix block of a sparse matrix according to an embodiment of the present application;
FIG. 13 is a schematic diagram of splicing the matrix blocks of a sparse matrix according to an embodiment of the present application;
FIG. 14 is a schematic diagram illustrating a matrix block splicing manner according to an embodiment of the present application;
FIG. 15 is a schematic diagram of dividing matrix blocks based on first blocking information and second blocking information according to an embodiment of the present application;
FIG. 16 is a schematic diagram of computing matrix blocks based on first blocking information and second blocking information according to an embodiment of the present application;
FIG. 17 is a schematic diagram of dividing matrix blocks based on first blocking information and second blocking information according to an embodiment of the present application;
FIG. 18 is a schematic flow chart of the computation after splitting the normalized target matrix according to an embodiment of the present application;
FIG. 19 is a schematic diagram of the calculation flow of the attention module in a sparse attention network according to an embodiment of the present application;
FIG. 20 shows an application scenario of a data processing method based on a sparse attention network according to an embodiment of the present application;
FIG. 21 is a schematic diagram of a data processing flow of GPT-3 according to an embodiment of the present application;
FIG. 22 is a schematic structural diagram of a data processing apparatus 2200 according to an embodiment of the present application;
FIG. 23 is a schematic structural diagram of an execution device according to an embodiment of the present application;
FIG. 24 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings. As a person of ordinary skill in the art will appreciate, as technology develops and new scenarios emerge, the technical solutions provided in the embodiments of the application are likewise applicable to similar technical problems.
The terms "first", "second", and the like in the description, the claims, and the above figures are used to distinguish between similar objects and are not necessarily used to describe a particular order or sequence. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely a way of distinguishing objects with the same attributes when the embodiments of the application are described. Furthermore, the terms "comprise", "include", and "have", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, technical terms related to embodiments of the present application are described below.
Sparse matrix: a special kind of matrix in which the number of zero-valued elements far exceeds the number of non-zero elements; that is, most elements of a sparse matrix are zero.
Attention network: a network model that uses an attention mechanism to speed up model training. A typical attention network today is the Transformer model. A model that applies the attention mechanism can assign a different weight to each part of the input sequence, extracting the more important feature information in the sequence and making the final output more accurate.
In deep learning, the attention mechanism may be implemented by a weight vector that describes importance: when an element is predicted or inferred, the association between that element and other elements is determined by the weight vector. For example, for a pixel in an image or a word in a sentence, the correlation between the target element and other elements can be estimated quantitatively with an attention vector, and the weighted sum given by the attention vector is taken as an approximation of the target value.
The attention mechanism in deep learning imitates the attention mechanism of the human brain. For example, although the human eye can see the full view of a picture, when a person looks closely and carefully, the eye focuses on only a portion of the picture, and the brain concentrates mainly on that small region. In other words, when a human carefully observes an image, the brain's attention across the whole image is not evenly distributed but weighted; this is the core idea of the attention mechanism.
In brief, the human visual system tends to focus selectively on certain parts of an image while ignoring irrelevant information, which aids perception. Similarly, in deep-learning attention mechanisms, some parts of the input may be more relevant than others in problems involving language, speech, or vision. The attention mechanism therefore lets an attention model treat different parts of the input data differently, so that the model dynamically attends only to the data relevant to the task.
Self-Attention mechanism: a variant of the attention mechanism that reduces reliance on external information and is better at capturing the internal dependencies of data or features.
Self-attention network: a neural network that employs a self-attention mechanism. The self-attention mechanism is an extension of the attention mechanism: it relates different positions of a single sequence in order to compute a representation of that same sequence. Self-attention can play a key role in machine reading, summarization, and image-description generation.
Taking natural language processing as an example application, a self-attention network processes input data of any length, generates a new feature representation of the input, and then converts that representation into target words. The self-attention layer in the network uses the attention mechanism to obtain the relationships between each word and all the other words, generating a new feature representation for each word. An advantage of the self-attention network is that the attention mechanism directly captures the relationships between all the words in a sentence regardless of their positions.
Sparse attention mechanism: a variant of the self-attention mechanism in which the attended positions are sparser than in self-attention.
Sparse attention network: a network model that uses a sparse attention mechanism, reducing both memory usage and the amount of computation of the model.
Taking natural language processing as an example again, a sparse attention network processes input data of any length, generates a new feature representation of the input, and converts it into target words. The attention layer in a sparse attention network uses the sparse attention mechanism to obtain the relationship between each word and only a subset of the other words, generating a new feature representation for each word.
Matrix multiplication operation (MatMul): a binary operation that produces a third matrix, commonly called the matrix product, from two matrices. A matrix can represent a linear map, and the matrix product then represents the composition of linear maps.
The Softmax function, also known as the normalized exponential function, is a generalization of the logistic function. It transforms a K-dimensional vector Z of arbitrary real numbers into another K-dimensional vector σ(Z) such that every element of σ(Z) lies in the range (0, 1) and all the elements sum to 1. The Softmax function is calculated as shown in Equation 1:

$$\sigma(Z)_j = \frac{e^{Z_j}}{\sum_{k=1}^{K} e^{Z_k}}, \qquad j = 1, \dots, K \tag{1}$$

where σ(Z)_j denotes the value of the j-th element of the transformed vector, Z_j denotes the value of the j-th element of the vector Z, Z_k denotes the value of the k-th element of Z, and Σ denotes summation over all K elements.
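A quick numeric check of Equation 1 (a hypothetical example, not from the patent):

```python
import numpy as np

z = np.array([1.0, 2.0, 3.0])        # a 3-dimensional vector Z
sigma = np.exp(z) / np.exp(z).sum()  # Equation 1 applied to each element
print(sigma)        # [0.09003057 0.24472847 0.66524096]
print(sigma.sum())  # 1.0: every element lies in (0, 1) and they sum to 1
```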
Pre-trained model: a network that has already been trained on a large data set and stored.
Artificial intelligence accelerator: a microprocessor or computing system dedicated to the hardware acceleration of artificial intelligence (especially artificial neural networks, machine vision, and machine learning). Typical applications include robotics, the Internet of things, and other data-intensive or sensor-driven tasks. Examples include the graphics processor (Graphics Processing Unit, GPU), the neural network processor (Neural Network Processing Unit, NPU), and the tensor processor (Tensor Processing Unit, TPU).
The Transformer architecture is a powerful sequence model, but the time and memory it requires grow quadratically with the sequence length, which greatly increases the model's demands on the storage and computing power of the hardware. Referring to FIG. 1, FIG. 1 is a schematic structural diagram of a Transformer according to an embodiment of the present application. As shown in FIG. 1, the Transformer mainly comprises a plurality of core modules connected in sequence, each consisting of an attention module and a multi-layer perceptron connected in series.
The core structure of the Transformer is the Self-Attention structure, that is, the attention module. The operations in the Self-Attention structure mainly comprise two matrix multiplications and one Softmax operation. Referring to FIG. 2, FIG. 2 is a schematic structural diagram of a Self-Attention structure according to an embodiment of the present application.
As shown in FIG. 2, in the Self-Attention structure, after the dimensionality of the input sequence is raised, matrix multiplications with different weights are performed on the matrix of the input sequence to obtain three different weighted matrices: the matrix Q, the matrix K, and the matrix V. Matrix multiplication is then performed on the matrix Q and the matrix K to obtain a matrix A; after a normalization operation is performed on the matrix A, a matrix A1 is obtained. Finally, matrix multiplication is performed on the matrix A1 and the matrix V to obtain the final output matrix.
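The flow of FIG. 2 can be sketched as a dense reference computation; the function and weight names below are hypothetical, and the scaling factor used by some Transformer variants is omitted to match the description above:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Dense Self-Attention following the flow of FIG. 2.

    X is the input-sequence matrix after its dimensionality is raised;
    Wq, Wk and Wv are the three different weight matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # three weighted matrices
    A = Q @ K.T                              # first matrix multiplication
    e = np.exp(A - A.max(axis=-1, keepdims=True))
    A1 = e / e.sum(axis=-1, keepdims=True)   # normalization (row-wise Softmax)
    return A1 @ V                            # second matrix multiplication
```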
During model training, the batch size is the number of samples of the data set used for one back-propagation update of the model weights, abbreviated b. The sequence length is the length of the input sequence, abbreviated s. The embedding size is the length of each unit of the input sequence, abbreviated e. The head number is the number of parts into which each unit of the sequence is divided when attention is computed, abbreviated h. The embedding size per head is the length of each such part after the division, abbreviated e1.
Studies have shown that part of the computational information in the Self-Attention structure is redundant, which means that the amount of computation and memory required by a Transformer structure can be reduced by eliminating the redundant computation. Sparse Attention is an artificially defined, regularly sparse structure. In practice, Sparse Attention can be represented in the form of a sparse matrix. The basic idea of Sparse Attention is to assume that each element is related to only some of the elements in the sequence, thereby reducing the computation of relevance. With Sparse Attention, the amount of computation of the attention structure can be reduced with essentially no effect on the final accuracy of the model.
Sparse Attention not only reduces memory usage and improves throughput, but also allows a longer input sequence, improving the model's long-range perception. It is therefore worthwhile to implement the computation of Sparse Attention efficiently.
In general, existing Sparse Attention mainly comprises a Local Attention structure (Local Attention). In some cases, Sparse Attention comprises both Local Attention and a Global Attention structure (Global Attention).
In particular, in computer vision (Computer Vision, CV) and NLP tasks, the association between a unit of the input sequence and the other units within a local range tends to be high, while its association with units farther away in the sequence tends to be low. Local Attention is based on this characteristic of high association within a local range; that is, Local Attention mainly focuses on learning the relevance between local contexts.
Referring to FIG. 3, FIG. 3 is a schematic diagram of the matrix of Local Attention and the corresponding sequence associations according to an embodiment of the present application. As shown in (a) of FIG. 3, the matrix of Local Attention is a sparse matrix: only the values of some elements of the matrix are valid, while the values of the other elements are invalid. That is, when the matrix of Local Attention is computed, only the values of some elements need to be calculated, and the remaining elements need not be calculated.
As shown in fig. 3 (b), a cell in the input sequence has an association with only a cell in a local range where the cell is located. For example, the input sequence shown in fig. 3 (b) includes 16 units, and the 16 units are divided into 4 partial ranges, and the 4 partial ranges are respectively composed of 4 units at different positions in the input sequence. For any one local scope, any one unit in the local scope has relevance to all units in the local scope.
Assume the matrix of Local Attention is calculated from an input matrix 1 and an input matrix 2; then the element in row n and column m of the matrix is calculated from row n of input matrix 1 and column m of input matrix 2. Therefore, when the input sequence is represented as an input matrix, the associations between the units of the input sequence shown in (b) of FIG. 3 can be represented by the matrix of Local Attention shown in (a) of FIG. 3, where the elements of each row represent the association between one unit and all the units of the local range where it lies.
Referring to FIG. 4, FIG. 4 is a schematic diagram of another Local Attention matrix and sequence association according to an embodiment of the present application. As shown in (b) of FIG. 4, a unit of the input sequence has an association only with the units that are within its local range and precede it. In this case, the associations between the units of the input sequence shown in (b) of FIG. 4 can be represented by the matrix of Local Attention shown in (a) of FIG. 4, where the elements of each row represent the association between one unit and the units that precede it in its local range.
Referring to FIG. 5, FIG. 5 is a schematic diagram of another Local Attention matrix and sequence association according to an embodiment of the present application. As shown in (b) of FIG. 5, a unit of the input sequence has an association only with the k units before it, the k units after it, and itself. In this case, the associations between the units of the input sequence shown in (b) of FIG. 5 can be represented by the matrix of Local Attention shown in (a) of FIG. 5, where the elements of each row represent the association between one unit, the k units before and after it, and itself.
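For illustration, the three Local Attention patterns of FIG. 3 to FIG. 5 can be generated as valid-element masks; the helper below is a hypothetical sketch, not part of the patent:

```python
import numpy as np

def local_attention_mask(n, block, k, variant):
    """Valid-element masks for the Local Attention patterns of FIG. 3 to FIG. 5.

    n is the sequence length, block the size of a local range, k the band
    half-width used by the FIG. 5 pattern; 1 marks a valid element.
    """
    i, j = np.indices((n, n))
    if variant == "full":    # FIG. 3: all units in the same local range
        return (i // block == j // block).astype(int)
    if variant == "causal":  # FIG. 4: only preceding units in the range
        return ((i // block == j // block) & (j <= i)).astype(int)
    if variant == "band":    # FIG. 5: the k units before and after, plus itself
        return (np.abs(i - j) <= k).astype(int)
    raise ValueError(variant)
```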
In general, Local Attention learns local relevance well while greatly reducing the amount of computation required for the matrix operation. Its main disadvantage, however, is that some global association information is lost: Local Attention cannot represent the association between a unit of the input sequence and the other units of the entire sequence.
Global Attention can represent the association between a unit of the input sequence and the units of the entire sequence, and thus compensates well for the global association information lost by Local Attention. However, Global Attention is relatively computation-heavy and is generally suited to tasks that require long-range association information, such as question answering. In general, what the Global Attention matrix represents is related to the task of the model.
Referring to FIG. 6, FIG. 6 is a schematic diagram of the matrix of Global Attention and the corresponding sequence associations according to an embodiment of the present application. As shown in (b) of FIG. 6, the input sequence is divided into 4 local ranges, and a unit at a specific position of each local range (for example, the last unit of each range) has an association with all the units of the entire sequence. In this case, the associations between the units of the input sequence shown in (b) of FIG. 6 can be represented by the matrix of Global Attention shown in (a) of FIG. 6, where the elements of each such column represent the association between one unit and all the units of the entire sequence.
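Correspondingly, a sketch of the Global Attention pattern of FIG. 6, under the same hypothetical conventions:

```python
import numpy as np

def global_attention_mask(n, block):
    """Valid-element mask for the Global Attention pattern of FIG. 6.

    The last unit of each local range is associated with all n units of
    the sequence, so the corresponding columns are fully valid.
    """
    mask = np.zeros((n, n), dtype=int)
    last_units = np.arange(block - 1, n, block)  # last unit of each range
    mask[:, last_units] = 1
    return mask
```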
In addition, referring to FIG. 7, FIG. 7 is a schematic diagram of another Global Attention matrix according to an embodiment of the present application.
At present, a sparse attention network is obtained by improving the Self-Attention structure of the Transformer, that is, by converting the Self-Attention structure into a Sparse Attention structure. Specifically, taking FIG. 8 as an example, after the matrix Q, the matrix K, and the matrix V are obtained by calculation, sparse computation is performed on the matrix Q and the matrix K to obtain a sparse matrix including Local Attention. Then, after a normalization operation is performed on the sparse matrix, sparse computation is performed on the normalized matrix and the matrix V to obtain the output of the Sparse Attention structure.
Specifically, the Sparse Attention structure comprises three sparse computation operators: Dense MatMul Dense to Sparse, Sparse Softmax, and Sparse MatMul Dense to Dense. These three operators are described below with reference to the accompanying drawings.
(1) Dense MatMul Dense to Sparse.
Dense MatMul Dense to Sparse implements the multiplication of the matrix Q and the matrix K to obtain the sparse matrix A. It obtains the corresponding matrix data according to the position information of each target calculation position (that is, the position of a valid element of the sparse matrix A) and calculates the element value of that position. During the computation of the sparse matrix, calculating the element values of all the target positions requires executing many matrix-multiplication instructions and repeatedly transferring matrix data from the matrix Q and the matrix K.
Referring to FIG. 8, FIG. 8 is a schematic diagram of the calculation flow of a Sparse Attention structure according to an embodiment of the present application. As shown in the figure, during the computation of the sparse matrix A, the corresponding matrix data of the matrix Q and the matrix K are transferred from a storage unit (for example, the video memory of a GPU) to an arithmetic unit according to the positions of the valid elements of the sparse matrix A, and matrix multiplication is performed in the arithmetic unit to obtain the values of the valid elements. The valid elements of the sparse matrix A are calculated one by one to obtain the sparse matrix A.
(2) Sparse Softmax.
Sparse Softmax implements the Softmax operation of the sparse matrix A: the Softmax operation is performed on the last dimension of the sparse matrix A to obtain the sparse matrix A1. Specifically, Sparse Softmax mainly determines the valid inputs of the Softmax operation from the position information of each valid element of the sparse matrix. As shown in FIG. 8, the sparse matrix A1 is obtained by performing a Softmax operation on each row of the sparse matrix A; the positions of the valid elements of the sparse matrix A1 are the same as those of the sparse matrix A.
(3) Sparse MatMul Dense to Dense.
Sparse MatMul Dense to Dense implements the multiplication of the sparse matrix A1 and the matrix V to obtain the output matrix. Specifically, the valid calculation blocks of the matrix V are found through the position information of the valid elements of the sparse matrix A1, and matrix multiplication is performed between those blocks of the matrix V and the corresponding calculation blocks of the sparse matrix A1 until all the calculation blocks of the sparse matrix A1 have been computed, yielding the output matrix of Sparse Attention.
As described above, the existing Sparse Attention implementations perform data processing with conventional sparse operations. Compared with the Self-Attention implementation, they effectively reduce the amount of computation and memory, but the data of the sparse matrix is partitioned too finely during the operation, so the transfer of matrix data becomes overly complex, a large number of transfer instructions are added, and the compute-to-memory-access ratio is very low. In practice, the time spent executing transfer instructions is much longer than the time spent executing the matrix multiplications on that data, so adding a large number of transfer instructions makes data processing inefficient.
In addition, when matrix multiplication is performed on an AI accelerator, the arithmetic unit of the accelerator can process a large matrix at a time, so the acceleration benefit of the existing Sparse Attention implementations on an AI accelerator is relatively low, and computing power is easily wasted.
For example, referring to FIG. 9, FIG. 9 is a schematic diagram of calculating a sparse matrix with conventional sparse operations according to an embodiment of the present application. As shown in FIG. 9, the sparse matrix is a 16×16 matrix with 64 valid elements, so computing it with conventional sparse operations requires 64 matrix-multiplication operations. Moreover, before each matrix multiplication, the matrix data of the matrix Q and the matrix K must be transferred from the storage unit to the arithmetic unit; that is, 128 data-transfer instructions must be executed.
Take the valid elements of the first 4 rows and first 4 columns of the sparse matrix as an example; they include 16 valid elements. Calculating the value of the valid element in the first row and first column requires transferring the first row of the matrix Q and the first column of the matrix K; calculating the valid element in the first row and second column requires transferring the first row of the matrix Q and the second column of the matrix K; calculating the valid element in the second row and first column requires transferring the second row of the matrix Q and the first column of the matrix K; and so on. In computing the values of these 16 valid elements, each of the first four rows of the matrix Q and each of the first four columns of the matrix K must be transferred four times, causing repeated transfers of matrix data and reducing the efficiency of the sparse-matrix computation.
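An illustrative count of the data-transfer instructions for this 16×16 example, assuming, for the sake of the example, that the 64 valid elements form four 4×4 diagonal blocks:

```python
valid_elements = 64                # valid elements of the 16x16 sparse matrix
element_wise = valid_elements * 2  # one row of Q plus one column of K each
num_blocks = 4                     # four 4x4 blocks covering all 64 elements
block_wise = num_blocks * 2        # one slab of Q plus one slab of K each
print(element_wise, block_wise)    # 128 versus 8 data-transfer instructions
```

The exact counts depend on the block layout; the point is that element-wise computation transfers each slab of data once per element, while block-wise computation transfers it once per block.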
In view of this, an embodiment of the application provides a data processing method that divides the sparse matrix in advance to obtain a plurality of matrix blocks. During the computation of the sparse matrix, the matrix data required to compute each matrix block is fetched and the actual values of each block are calculated. Finally, the computed matrix blocks are spliced into a target matrix that replaces the sparse matrix in the subsequent sparse-matrix-related computations of the sparse attention network. Because the sparse matrix is computed block by block, repeated transfers of matrix data are effectively avoided, the number of data-transfer instructions is reduced, and the compute-to-memory-access ratio of the matrix operation is raised, thereby improving data processing efficiency.
The data processing method provided by the embodiments of the application can be applied to electronic devices, in particular devices that need to perform data processing tasks based on a sparse attention network. By way of example, the electronic device may be a server, a smart phone (mobile phone), a personal computer (personal computer, PC), a notebook computer, a tablet computer, a smart television, a mobile internet device (mobile internet device, MID), a wearable device, a virtual reality (virtual reality, VR) device, an augmented reality (augmented reality, AR) device, a wireless electronic device in industrial control (industrial control), a wireless electronic device in self-driving (self driving), a wireless electronic device in remote medical surgery (remote medical surgery), a wireless electronic device in a smart grid (smart grid), a wireless electronic device in transportation safety (transportation safety), a wireless electronic device in a smart city (smart city), a wireless electronic device in a smart home (smart home), and the like.
The apparatus to which the data processing method provided by the embodiment of the present application is applied is described above, and a scenario to which the data processing method provided by the embodiment of the present application is applied will be described below.
The data processing method provided by the embodiment of the application can be applied to a sparse attention network, where the sparse attention network is used to execute computer vision tasks or natural language processing tasks. That is, while the electronic device performs a computer vision task or a natural language processing task through the sparse attention network, operations may be performed using the data processing method provided by the embodiment of the present application.
Among them, natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Generally, natural language processing tasks mainly include machine translation, public opinion monitoring, automatic summarization, viewpoint extraction, text classification, question answering, text semantic comparison, and speech recognition.
Computer vision is the science of how to make machines "see". More specifically, computer vision means using cameras and computers instead of human eyes to identify, track, and measure objects, and to further process the resulting images so that they become more suitable for human observation or for transmission to instruments for detection. Generally, computer vision tasks include image recognition (image classification), object detection, semantic segmentation, and image generation.
Image recognition is a common classification problem, also commonly referred to as image classification. Specifically, in an image recognition task, the input of the neural network is image data, and the output is the probability that the image belongs to each category; the category with the highest probability is generally taken as the predicted category of the image. Image recognition was one of the earliest tasks to which deep learning was successfully applied; classic network models include the VGG series, the Inception series, and the ResNet series.
Object detection refers to automatically detecting the approximate location of common objects in an image through an algorithm, generally representing the location with a bounding box and classifying the category of the object within the bounding box.
Semantic segmentation refers to automatically segmenting and identifying the content in an image through an algorithm. It can be understood as a per-pixel classification problem, i.e., determining the object category to which each pixel belongs.
Image generation refers to obtaining generated images with high fidelity by learning the distribution of real images and sampling from the learned distribution, for example, generating a sharp image from a blurred image, or generating a defogged image from a foggy image.
The scenario of the application of the data processing method provided by the embodiment of the present application is described above, and a specific flow of the data processing method provided by the embodiment of the present application will be described below. Referring to fig. 10, fig. 10 is a flow chart of a data processing method according to an embodiment of the application. As shown in fig. 10, the data processing method provided by the embodiment of the present application is applied to a sparse attention network, and includes the following steps 1001 to 1004.
In step 1001, block information of a sparse matrix is obtained, where the sparse matrix is an intermediate matrix obtained during an operation performed based on the sparse attention network, and the block information is used to indicate a plurality of matrix blocks divided in the sparse matrix.
Generally, in the process of performing an operation based on a sparse attention network, a sparse matrix is calculated according to two matrices, where the sparse matrix is an intermediate matrix obtained in the process of performing the operation, and the sparse matrix is used for performing a subsequent operation in the sparse attention network. Illustratively, as shown in fig. 8, during the operation based on the sparse attention network, sparse computation is performed on the matrix Q and the matrix K, resulting in a sparse matrix.
Since the sparse matrix in a sparse attention network is typically a regular sparse matrix, i.e., the distribution of its non-zero elements is regular, the sparse matrix can be divided into a plurality of matrix blocks in the embodiment of the present application. The divided matrix blocks include all non-zero elements of the sparse matrix, so calculating the values of the elements in the matrix blocks yields the values of all non-zero elements of the sparse matrix, thereby realizing the calculation of the sparse matrix. Each divided matrix block includes a plurality of non-zero elements of the sparse matrix. Moreover, since the non-zero elements in the sparse matrix are far fewer than the zero elements, the total number of elements in the divided matrix blocks is also far smaller than the total number of elements in the sparse matrix.
For example, referring to fig. 11, fig. 11 is a schematic diagram illustrating the partitioning of a sparse matrix according to an embodiment of the present application. The sparse matrix shown in fig. 11 (a) is divided into 4 matrix blocks, each with 4 rows and 4 columns; each matrix block includes 16 non-zero elements, and the 4 matrix blocks together include all non-zero elements of the sparse matrix.

The sparse matrix shown in fig. 11 (b) is likewise divided into 4 matrix blocks, in the same manner as in fig. 11 (a). Each matrix block includes 10 non-zero elements, and the 4 matrix blocks together contain all non-zero elements of the sparse matrix.

The sparse matrix shown in fig. 11 (c) is divided into 4 matrix blocks, each with 4 rows and 8 columns; the 4 matrix blocks contain all non-zero elements of the sparse matrix.
In addition, in the three division manners shown in fig. 11, there is no overlap between the matrix blocks divided in the sparse matrix, each matrix block has the same number of columns, and the total number of rows of the matrix blocks equals the number of rows of the sparse matrix.
Specifically, before an operation related to the sparse matrix is performed in the sparse attention network (i.e., an operation that takes the sparse matrix as an input), the blocking information of the sparse matrix may be acquired to determine how the matrix blocks are divided in the sparse matrix. The blocking information indicates the plurality of matrix blocks divided in the sparse matrix; for example, it may indicate the position information of each of the matrix blocks.
Taking fig. 11 (a) as an example, the sparse matrix is divided into 4 matrix blocks, and the blocking information corresponding to the sparse matrix may indicate the following: the position of the first matrix block is the intersection of the first to fourth rows and the first to fourth columns; the position of the second matrix block is the intersection of the fifth to eighth rows and the fifth to eighth columns; the position of the third matrix block is the intersection of the ninth to twelfth rows and the ninth to twelfth columns; and the position of the fourth matrix block is the intersection of the thirteenth to sixteenth rows and the thirteenth to sixteenth columns.
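As a concrete illustration, the blocking information could be represented as a simple list of index ranges, one entry per matrix block (a hypothetical encoding; the patent does not prescribe this representation):

    # blocking information for the sparse matrix of fig. 11 (a):
    # each entry gives (row_start, row_end, col_start, col_end), end-exclusive
    block_info = [
        (0, 4, 0, 4),      # first matrix block:  rows 1-4,   columns 1-4
        (4, 8, 4, 8),      # second matrix block: rows 5-8,   columns 5-8
        (8, 12, 8, 12),    # third matrix block:  rows 9-12,  columns 9-12
        (12, 16, 12, 16),  # fourth matrix block: rows 13-16, columns 13-16
    ]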
Step 1002, obtaining matrix data corresponding to each matrix block in the plurality of matrix blocks from a first matrix and a second matrix according to the block information, where the first matrix and the second matrix are matrices for calculating the sparse matrix.
After the block information is obtained, the positions of a plurality of matrix blocks needing to be calculated in the sparse matrix can be determined. Based on the position of each matrix block in the sparse matrix, the matrix data required to calculate each matrix block may be determined.
For example, in the case where the sparse matrix is calculated based on the first matrix and the second matrix, the corresponding matrix data may be acquired from the first matrix and the second matrix based on the positional information of each matrix block to realize the calculation of each matrix block. For example, in the case of performing an operation based on the GPU, the first matrix and the second matrix are stored in a memory of the GPU, and the GPU may be configured to transfer matrix data corresponding to the matrix block from the memory to an operation unit of the GPU by executing a data transfer instruction to calculate a specific value of an element in the matrix block.
The first matrix may be, for example, a matrix Q in the sparse attention network, and the second matrix may be, for example, a matrix K in the sparse attention network.
And step 1003, performing matrix multiplication operation according to the matrix data to obtain a plurality of matrix blocks, wherein the matrix blocks comprise all non-zero elements in the sparse matrix, and each matrix block in the matrix blocks comprises a plurality of elements.
In the process of calculating each matrix block, after the electronic device transfers the matrix data corresponding to the matrix block to the operation unit by executing data transfer instructions, the operation unit performs a matrix multiplication operation on the matrix data, thereby calculating the actual value of each element in the matrix block.
In this embodiment, each of the calculated matrix blocks includes a portion of the non-zero elements of the sparse matrix, and the plurality of matrix blocks together include all non-zero elements of the sparse matrix.
For example, referring to fig. 12, fig. 12 is a schematic diagram of a matrix block in a computation sparse matrix according to an embodiment of the present application. As shown in fig. 12, for 4 matrix blocks divided in the sparse matrix, the 1 st matrix block may be obtained by performing a matrix multiplication operation on the first 4 rows of matrix data in the first matrix and the first 4 columns of matrix data in the second matrix; the 2 nd matrix block can be obtained by performing matrix multiplication operation on matrix data of 5-8 rows in the first matrix and matrix data of 5-8 columns in the second matrix; the 3 rd matrix block can be obtained by performing matrix multiplication operation on 9-12 rows of matrix data in the first matrix and 9-12 columns of matrix data in the second matrix; the 4 th matrix block may be obtained by performing a matrix multiplication operation on matrix data of 13-16 rows in the first matrix and matrix data of 13-16 columns in the second matrix.
Therefore, when calculating the 1st matrix block, the corresponding matrix data in the first matrix and the second matrix can be transferred to the operation unit by 2 data transfer instructions. Similarly, calculating each of the 2nd, 3rd, and 4th matrix blocks requires executing only 2 data transfer instructions. That is, in the process of calculating the 4 matrix blocks of the sparse matrix, 8 data transfer instructions are executed in total, and each instruction transfers different data, i.e., no repeated data transfer occurs.
As can be seen from comparing fig. 12 with fig. 9, for the same sparse matrix, the existing sparse matrix calculation method needs to execute 128 data transfer instructions and repeatedly transfers data of the input matrices, whereas calculating by matrix blocks in this embodiment needs only 8 data transfer instructions and never transfers the same input data twice.
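Steps 1002 and 1003 can be sketched as follows (hypothetical Python/NumPy, reusing the block_info layout shown earlier; each slice of the first or second matrix models one data transfer instruction):

    import numpy as np

    d = 16
    Q = np.random.rand(16, d)   # first matrix
    K = np.random.rand(d, 16)   # second matrix
    block_info = [(0, 4, 0, 4), (4, 8, 4, 8), (8, 12, 8, 12), (12, 16, 12, 16)]

    blocks, transfers = [], 0
    for r0, r1, c0, c1 in block_info:
        q_data = Q[r0:r1, :]    # one data transfer instruction: 4 rows of Q
        k_data = K[:, c0:c1]    # one data transfer instruction: 4 columns of K
        transfers += 2
        blocks.append(q_data @ k_data)   # one matrix multiplication per block

    print(transfers)            # 8 transfers in total, none repeated

Compared with the element-by-element sketch above, the same 64 effective elements are produced with 8 transfers instead of 128.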
And step 1004, splicing the matrix blocks to obtain a target matrix, wherein the target matrix is used for executing the operation related to the sparse matrix in the sparse attention network.
After the plurality of matrix blocks of the sparse matrix are calculated, they may be spliced to obtain a target matrix, and subsequent operations related to the sparse matrix in the sparse attention network are performed based on the target matrix. Referring to fig. 13, fig. 13 is a schematic diagram of splicing the matrix blocks of a sparse matrix according to an embodiment of the present application. As shown in fig. 13, after the matrix blocks of the sparse matrix are calculated, they may be spliced in the row dimension, thereby obtaining the target matrix.
That is, the target matrix obtained by splicing the plurality of matrix blocks is used to replace the sparse matrix to realize other operations related to the sparse matrix in the sparse attention network.
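Step 1004 then amounts to a concatenation; a minimal sketch under the fig. 13 shapes (four 4×4 blocks sharing a column count):

    import numpy as np

    blocks = [np.random.rand(4, 4) for _ in range(4)]   # the 4 computed matrix blocks
    target = np.concatenate(blocks, axis=0)             # splice in the row dimension
    print(target.shape)                                 # (16, 4)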
For example, in the calculation flow shown in fig. 2, the first matrix described in this embodiment corresponds to the matrix Q in fig. 2, and the second matrix corresponds to the matrix K in fig. 2. After the above method 1000 is performed on the matrix Q and the matrix K, a target matrix is obtained, where the target matrix corresponds to the matrix a in fig. 2, that is, the target matrix is used to perform a subsequent normalization operation to obtain the matrix A1.
In this embodiment, a plurality of matrix blocks are obtained by dividing the sparse matrix in advance. During the calculation of the sparse matrix, the matrix data required to calculate each matrix block is obtained, and the actual values of each matrix block are calculated. Finally, a target matrix that replaces the sparse matrix is obtained by splicing the plurality of calculated matrix blocks, and this target matrix can be used for subsequent operations related to the sparse matrix in the sparse attention network. By dividing the sparse matrix into a plurality of matrix blocks for calculation, repeated transfers of matrix data are effectively avoided, the number of data transfer instructions is reduced, and the compute-to-memory-access ratio of the matrix operation is improved, thereby improving data processing efficiency.
Optionally, in the case where the sparse matrix is used to extract local attention features of the input sequence, the plurality of matrix blocks divided from the sparse matrix are used to represent the local attention features. In addition, to facilitate splicing the target matrix from the plurality of matrix blocks, the number of rows or the number of columns of each matrix block is the same.
Specifically, in the case where a normalization operation subsequently needs to be performed on the sparse matrix: when the normalization operation is performed with the rows of the sparse matrix as the basic unit, the number of columns of each of the plurality of matrix blocks is the same; when the normalization operation is performed with the columns of the sparse matrix as the basic unit, the number of rows of each of the plurality of matrix blocks is the same.
For different matrix block division modes in the sparse matrix, the splicing modes of the matrix blocks are also different. The relation between the splicing mode of the matrix blocks and the dividing mode of the matrix blocks is as follows.
When the number of rows of each matrix block in the plurality of matrix blocks is the same, the plurality of matrix blocks are spliced in the dimension of columns to obtain a target matrix, and the number of rows of the target matrix is the same as the number of rows of the plurality of matrix blocks.
When the number of columns of each matrix block in the plurality of matrix blocks is the same, the plurality of matrix blocks are spliced in the dimension of the row to obtain the target matrix, and the number of columns of the target matrix is the same as the number of columns of the plurality of matrix blocks.
Referring to fig. 14, fig. 14 is a schematic diagram illustrating matrix block splicing manners according to an embodiment of the present application. As shown in fig. 14 (a), when the subsequent normalization operation on the sparse matrix is performed with the columns of the sparse matrix as the basic unit, the matrix blocks divided in the sparse matrix have the same number of rows; accordingly, when the matrix blocks are spliced, they are spliced in the column dimension to obtain the target matrix.
As shown in fig. 14 (b), when the subsequent normalization operation on the sparse matrix is performed with the rows of the sparse matrix as the basic unit, the matrix blocks divided in the sparse matrix have the same number of columns; accordingly, when the matrix blocks are spliced, they are spliced in the row dimension to obtain the target matrix.
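The two splicing rules can be sketched as follows (hypothetical NumPy; axis=0 is the row dimension and axis=1 the column dimension):

    import numpy as np

    blocks = [np.random.rand(4, 4) for _ in range(4)]

    # same number of columns per block -> splice in the row dimension;
    # each target row then holds the effective elements of one sparse-matrix
    # row, as required for row-wise normalization
    target_row_splice = np.concatenate(blocks, axis=0)   # shape (16, 4)

    # same number of rows per block -> splice in the column dimension;
    # each target column then holds one sparse-matrix column's effective
    # elements, as required for column-wise normalization
    target_col_splice = np.concatenate(blocks, axis=1)   # shape (4, 16)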
It will be appreciated that, in addition to the non-zero elements of the sparse matrix, the plurality of matrix blocks divided from the sparse matrix may also include elements whose values in the sparse matrix are zero. However, when the matrix blocks are calculated from the matrix data, a value is calculated for every element of every matrix block; that is, the calculated values at positions corresponding to zero-valued elements of the sparse matrix may not be zero. In this case, the values of the elements at these positions need to be adjusted to zero so that the value of each element in the matrix blocks matches the value of the element at the corresponding position in the sparse matrix.
In one possible embodiment, after the plurality of matrix blocks are obtained, it may be determined whether the matrix blocks include target elements, where a target element is an element whose value in the sparse matrix is zero. For example, for the sparse matrix shown in fig. 11 (b), the first matrix block includes 6 target elements whose values in the sparse matrix are 0: the elements in the first row, second column; first row, third column; first row, fourth column; second row, third column; second row, fourth column; and third row, fourth column. After the first matrix block is calculated, the actual values of these 6 target elements may not be 0, so their values need to be adjusted to 0.
Specifically, when the plurality of matrix blocks include the target element, the values of the target element in the plurality of matrix blocks are adjusted to zero, so as to obtain a plurality of adjusted matrix blocks. And in the splicing stage of the matrix blocks, splicing the plurality of adjusted matrix blocks to obtain the target matrix. Thus, the element value of each position in the target matrix is the same as the element value of the corresponding position in the sparse matrix.
In this embodiment, by adjusting the values of the target elements in the plurality of matrix blocks, the element value of each position in the target matrix obtained by splicing the adjusted matrix blocks is the same as the element value of the corresponding position in the sparse matrix, so that the target matrix can replace the sparse matrix in the subsequent operation process, and the normal execution of the subsequent operation is ensured.
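A minimal sketch of the adjustment (hypothetical; block_mask is assumed to encode the portion of the sparsity pattern covered by one matrix block, with 1 at non-zero positions and 0 at target elements, matching the lower-triangular pattern of fig. 11 (b) with its 10 non-zero positions and 6 target elements):

    import numpy as np

    block = np.random.rand(4, 4)            # a calculated matrix block
    block_mask = np.tril(np.ones((4, 4)))   # fig. 11 (b) pattern: 10 non-zero positions
    adjusted = block * block_mask           # the 6 target elements become zero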
In addition, when the plurality of matrix blocks contain target elements, although a small amount of redundant calculation is added compared with the prior art, the number of data transfers is greatly reduced and repeated transfers are avoided, thereby improving the efficiency of data processing.
Optionally, in some cases the sparse matrix may include both the Local Attention and the Global Attention structures described above, i.e., the sparse matrix is used to represent both local and global attention features of the input sequence. In this case, because the Local Attention and Global Attention structures are distributed very differently within the sparse matrix, it may be difficult to divide the matrix blocks of the sparse matrix using a single division manner.
Based on this, in one possible embodiment, the electronic device may obtain the position information of the matrix blocks related to the Local Attention and the position information of the matrix blocks related to the Global Attention based on different pieces of blocking information.
Illustratively, the blocking information acquired by the electronic device includes first blocking information and second blocking information. The first blocking information indicates a plurality of first matrix blocks divided in the sparse matrix; the first matrix blocks have the same number of columns, and their total number of rows equals the number of rows of the sparse matrix. Furthermore, to facilitate splicing of the first matrix blocks, each first matrix block is distributed over different rows of the sparse matrix. Specifically, the plurality of first matrix blocks are used to represent local attention features.
The second blocking information indicates a plurality of second matrix blocks divided in the sparse matrix; the number of rows of each second matrix block equals the number of rows of the sparse matrix, and the number of columns of each second matrix block is smaller than the number of columns of the sparse matrix. Specifically, the plurality of second matrix blocks are used to represent global attention features.
Referring to fig. 15, fig. 15 is a schematic diagram of dividing matrix blocks based on the first blocking information and the second blocking information according to an embodiment of the present application. As shown in fig. 15, dividing the sparse matrix based on the first blocking information yields 4 first matrix blocks of identical shape, each with 4 rows and 4 columns. These 4 first matrix blocks cover the positions related to the Local Attention in the sparse matrix, i.e., they can be used to represent local attention features.
Referring to fig. 16, fig. 16 is a schematic diagram of calculating a matrix block based on first block information and second block information according to an embodiment of the present application. As shown in fig. 16, after determining the position information of a plurality of first matrix blocks based on the first block information, matrix data corresponding to each matrix block may be acquired from the first matrix and the second matrix, respectively, and a matrix multiplication operation may be performed on the acquired matrix data, thereby calculating each first matrix block. In brief, the process of calculating the first matrix block is to obtain corresponding matrix blocks from the first matrix and the second matrix respectively, and perform matrix multiplication operation on the matrix blocks obtained from the first matrix and the second matrix to obtain the first matrix block.
After the position information of the plurality of second matrix blocks is determined based on the second blocking information, the data of the entire first matrix may be acquired, together with the matrix data in the second matrix corresponding to each second matrix block. Each second matrix block is then obtained by performing a matrix multiplication operation on the first matrix and the matrix data corresponding to that second matrix block in the second matrix. In short, the spliced plurality of second matrix blocks can be obtained by splicing the matrix data corresponding to each second matrix block in the second matrix and performing a matrix multiplication operation between the first matrix and the spliced matrix.
After the sparse matrix is divided based on the second blocking information, 4 second matrix blocks of identical shape are obtained; the number of rows of each second matrix block is 16 and the number of columns is 1. These 4 second matrix blocks cover the positions related to the Global Attention in the sparse matrix, i.e., they can be used to represent global attention features.
After a plurality of first matrix blocks and a plurality of second matrix blocks are obtained through calculation based on the first block information and the second block information, the plurality of first matrix blocks are spliced in the dimension of a row to obtain a first target matrix; and splicing the plurality of second matrix blocks in the dimension of the columns to obtain a second target matrix.
And finally, splicing the first target matrix and the second target matrix in the dimension of the columns to obtain the target matrix.
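Under the shapes described for figs. 15 to 17 (four 4×4 first matrix blocks and four 16×1 second matrix blocks, an assumption taken from those figures), the assembly of the target matrix can be sketched as:

    import numpy as np

    first_blocks = [np.random.rand(4, 4) for _ in range(4)]    # Local Attention blocks
    second_blocks = [np.random.rand(16, 1) for _ in range(4)]  # Global Attention blocks

    first_target = np.concatenate(first_blocks, axis=0)    # row splice    -> (16, 4)
    second_target = np.concatenate(second_blocks, axis=1)  # column splice -> (16, 4)

    target = np.concatenate([first_target, second_target], axis=1)   # -> (16, 8)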
Optionally, before the first target matrix and the second target matrix are spliced, it may be determined whether the first target matrix and the second target matrix contain elements that are identical in position in the sparse matrix, that is, whether the first target matrix and the second target matrix contain repeated elements.
Illustratively, if a first element in the first target matrix and a second element in the second target matrix occupy the same position in the sparse matrix, the value of the second element in the second target matrix is adjusted to zero to obtain an adjusted second target matrix; then, the first target matrix and the adjusted second target matrix are spliced in the column dimension to obtain the target matrix.
Referring to fig. 17, fig. 17 is a schematic diagram of splicing the matrix blocks obtained based on the first blocking information and the second blocking information according to an embodiment of the present application. As shown in fig. 17, after the 4 first matrix blocks of the sparse matrix are obtained based on the first blocking information, they are spliced in order in the row dimension to obtain a first target matrix with 16 rows and 4 columns. Similarly, after the 4 second matrix blocks of the sparse matrix are obtained based on the second blocking information, they are spliced in order in the column dimension to obtain a second target matrix with 16 rows and 4 columns.
As can be seen from fig. 17, for the calculated second matrix blocks, the position of the first non-zero element of each second matrix block is also covered by the corresponding first matrix block. Thus, after the second target matrix is obtained by splicing the plurality of second matrix blocks, the first non-zero element of each column of the second target matrix may be adjusted to 0, thereby obtaining the adjusted second target matrix.
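The adjustment can be sketched as follows (a hypothetical sketch that implements the stated rule literally; which positions are actually duplicated depends on the concrete sparsity pattern of fig. 17):

    import numpy as np

    second_target = np.random.rand(16, 4)   # spliced Global Attention blocks
    # assumption per the description above: the first non-zero element of each
    # column duplicates a position already covered by a first matrix block,
    # so its value is adjusted to zero before the final splice
    for j in range(second_target.shape[1]):
        nz = np.flatnonzero(second_target[:, j])
        if nz.size:
            second_target[nz[0], j] = 0.0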
In this scheme, matrix blocks of different types in the sparse matrix are calculated based on different pieces of blocking information; matrix blocks of the same type are first spliced together, and the two resulting matrices are then spliced to obtain the target matrix. This splicing manner reduces the amount of matrix multiplication while preserving data continuity, thereby minimizing repeated movement of data. In addition, when a normalization operation subsequently needs to be performed on the target matrix, both the amount of computation and the number of data transfer instructions required for the normalization operation are reduced.
It can be understood that, as can be seen from the foregoing description of the sparse attention network, after the sparse matrix is obtained by calculation, it is generally further required to perform a normalization operation on the sparse matrix, and perform a matrix multiplication operation on the normalized sparse matrix and another matrix, so as to obtain an output of the attention module in the sparse attention network.
Optionally, after the target matrix that replaces the sparse matrix is obtained by splicing the plurality of matrix blocks, a normalization operation may be performed on the target matrix to obtain a normalized target matrix, and an output matrix may then be obtained based on the normalized target matrix and a third matrix. The output matrix is the output of the attention module in the sparse attention network, and the first matrix, the second matrix, and the third matrix are calculated from the same matrix based on different weights.
For example, the first matrix may be a matrix Q in the sparse attention network, the second matrix may be a matrix K in the sparse attention network, and the third matrix may be a matrix V in the sparse attention network. The matrix Q, the matrix K and the matrix V are respectively obtained by calculating the same matrix based on different weights.
Optionally, since the normalized target matrix is actually formed by splicing local attention features and global attention features, the output matrix cannot be obtained by directly performing a matrix multiplication operation on the normalized target matrix and the third matrix. In the actual calculation process, the normalized target matrix can be split according to the manner in which the local and global attention features were spliced, the split matrices can be processed with the third matrix accordingly, and the processed matrices can finally be combined.
Specifically, referring to fig. 18, fig. 18 is a schematic flow chart of the calculation after splitting the normalized target matrix according to an embodiment of the present application. As shown in fig. 18, the process of obtaining the output matrix based on the normalized target matrix and the third matrix includes the following steps S1 to S5.
Step S1, splitting the normalized target matrix in the dimension of columns based on the splicing mode of the target matrix to obtain a third target matrix and a fourth target matrix, wherein the size of the third target matrix is the same as that of the first target matrix, and the size of the fourth target matrix is the same as that of the second target matrix.
In brief, the target matrix is obtained by splicing the first target matrix and the second target matrix in the dimension of the columns, and the shape of the normalized target matrix is the same as that of the target matrix. After normalization operation is carried out on the target matrix, the obtained normalized target matrix can be split according to the original splicing mode, so that a third target matrix corresponding to the first target matrix and a fourth target matrix corresponding to the second target matrix are obtained.
And S2, splitting the third target matrix and the third matrix based on the splicing mode of the first target matrix to obtain a plurality of matrix blocks in the third target matrix and a plurality of matrix blocks in the third matrix, wherein the matrix blocks in the third target matrix and the matrix blocks in the third matrix have a one-to-one correspondence.
Because the first target matrix is obtained by splicing a plurality of first matrix blocks in the row dimension, and the third target matrix corresponds to the first target matrix, the third target matrix can be split based on the splicing mode of the first target matrix, so that a plurality of matrix blocks in the third target matrix are obtained. In addition, since corresponding matrix multiplication operations are to be performed on the third target matrix and the third matrix subsequently, the third matrix can be split based on the same splitting manner, so as to obtain a plurality of matrix blocks in the third matrix. As shown in fig. 18, matrix blocks in the third target matrix have a one-to-one correspondence with matrix blocks in the third matrix.
And S3, performing matrix multiplication operation on the third target matrix and matrix blocks with corresponding relations in the third matrix, and splicing the matrix blocks obtained after performing the matrix multiplication operation on the dimension of the rows to obtain a first output matrix.
Since the matrix blocks in the third target matrix have a one-to-one correspondence with the matrix blocks in the third matrix, a plurality of pairs of matrix blocks can be obtained, each pair of matrix blocks including one matrix block in the third target matrix and one corresponding matrix block in the third matrix. Then, by performing a matrix multiplication operation on each of the plurality of pairs of matrix blocks, a plurality of matrix blocks on which the matrix multiplication operation is performed can be obtained. And splicing the plurality of matrix blocks subjected to matrix multiplication operation in the row dimension to obtain a first output matrix.
And S4, performing a matrix multiplication operation on the fourth target matrix and a sub-matrix of the third matrix to obtain a second output matrix, where the sub-matrix is formed by a plurality of rows of elements of the third matrix.
Specifically, a plurality of rows of elements in the third matrix have a corresponding relation with the fourth target matrix, and a submatrix of the third matrix can be formed by acquiring a plurality of rows of elements in the third matrix which have a corresponding relation with the fourth target matrix. Then, a matrix multiplication operation is performed on the sub-matrices of the fourth target matrix and the third matrix, so that a second output matrix can be obtained.
As shown in fig. 18, the 4th, 8th, 12th, and 16th rows of the third matrix correspond to the fourth target matrix, and the sub-matrix of the third matrix can be constructed by acquiring these 4 rows of elements from the third matrix.
And S5, adding the first output matrix and the second output matrix to obtain the output matrix.
Specifically, since the shapes of the first output matrix and the second output matrix are the same, the final output matrix, that is, the output of the attention module in the sparse attention network, can be obtained by adding the elements at the same position in the first output matrix and the second output matrix.
For easy understanding, the following describes in detail, with reference to the accompanying drawings, a process of performing an operation by using the data processing method provided by the embodiment of the present application by an attention module in a sparse attention network.
Referring to fig. 19, fig. 19 is a schematic diagram of a calculation flow of an attention module in a sparse attention network according to an embodiment of the present application. As shown in fig. 19, the calculation process of the attention module in the sparse attention network includes the following steps 1901 to 1909.
In step 1901, the blocking information of the Local Attention is acquired.

In step 1902, the blocking information of the Global Attention is acquired.
When the sparse matrix includes both a Local Attention structure and a Global Attention structure, the blocking information of the Local Attention and the blocking information of the Global Attention in the sparse matrix are acquired separately, so as to determine the positions of the matrix blocks in the Local Attention structure and in the Global Attention structure. The blocking information of the Local Attention in step 1901 may refer to, for example, the first blocking information described in the above embodiment, and the blocking information of the Global Attention in step 1902 may refer to, for example, the second blocking information described in the above embodiment.
In step 1903, matrix data of the matrix Q and the matrix K are acquired based on the blocking information of the Local Attention, and the matrix blocks of the Local Attention are calculated.
After the plurality of matrix blocks in the Local Attention structure are determined based on the blocking information of the Local Attention, the matrix data corresponding to each matrix block is obtained from the matrix Q and the matrix K, and the matrix blocks of the Local Attention structure are calculated from the matrix data.
The plurality of matrix blocks in the Local Attention structure described in this step may refer to, for example, the plurality of first matrix blocks described above.
Step 1904, obtaining matrix data of the matrix Q and the matrix K based on the blocking information of the Global Attention, and calculating the matrix blocks of the Global Attention.
After the plurality of matrix blocks in the Global Attention structure are determined based on the blocking information of the Global Attention, the matrix data corresponding to each matrix block is obtained from the matrix Q and the matrix K, and the matrix blocks of the Global Attention structure are calculated from the matrix data.
The plurality of matrix blocks in the Global Attention structure described in this step may refer to, for example, the above-described plurality of second matrix blocks.
In step 1905, the matrix blocks of Local Attention and the matrix blocks of Global Attention are spliced.
After a plurality of matrix blocks of Local Attention and a plurality of matrix blocks of Global Attention are obtained, firstly splicing the plurality of matrix blocks of Local Attention to obtain a splicing matrix of Local Attention; and splicing the plurality of matrix blocks of the Global Attention to obtain a splicing matrix of the Global Attention. Then, the splice matrix of Local Attention and the splice matrix of Global Attention are further spliced to obtain a total splice matrix.
The splicing matrix of the Local Attention may be, for example, the first target matrix described in the foregoing embodiment; the splicing matrix of Global Attention may be, for example, the second target matrix described in the above embodiment; the total stitching matrix may be, for example, the target matrix described in the above embodiments.
In step 1906, a normalization operation is performed on the total splice matrix, which is then split.
After the total splicing matrix is obtained by splicing, normalization operation can be performed on the total splicing matrix, and the matrix after normalization operation is split based on the splicing mode of the Local Attention splicing matrix and the Global Attention splicing matrix, so that the normalized Local Attention matrix and the normalized Global Attention matrix are obtained.
The normalized matrix of the Local Attention may be, for example, the third target matrix described in the foregoing embodiment; the normalized matrix of the Global Attention may be, for example, the fourth target matrix described in the above embodiment.
In step 1907, a matrix multiplication operation is performed on the matrix V and the normalized Local Attention matrix.
After the normalized Local Attention matrix is obtained, the normalized Local Attention matrix and the matrix V can be blocked based on the same blocking mode, and matrix multiplication operation is performed on the blocked matrix, so that the output of the Local Attention is obtained.
The output of the Local Attention may be, for example, the first output matrix described in the above embodiment.
In step 1908, a matrix multiplication operation is performed on the matrix V and the normalized Global Attention matrix.
After the normalized Global Attention matrix is obtained, matrix data of a part of rows may be extracted from the matrix V to form a sub-matrix of the matrix V. Then, matrix multiplication operation is carried out on the normalized Global Attention matrix and the submatrices of the matrix V, and the output of the Global Attention is obtained.
The output of Global Attention may be, for example, the second output matrix described in the above embodiment.
In step 1909, the outputs of the Local Attention and the Global Attention are superimposed.
And finally, adding elements at the same position in the output of the Local Attention and the output of the Global Attention to obtain a final output matrix, wherein the output matrix is the output of the Attention module in the sparse Attention network.
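Putting steps 1901 to 1909 together, the whole attention-module flow can be sketched as follows (a hypothetical end-to-end Python/NumPy sketch; the block layout, the Global Attention column positions, and the row-wise softmax are illustrative assumptions, and the target-element and duplicate-element adjustments of the earlier embodiments are omitted for brevity):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    d = 16
    Q, K, V = np.random.rand(16, d), np.random.rand(d, 16), np.random.rand(16, d)
    local_info = [(0, 4, 0, 4), (4, 8, 4, 8), (8, 12, 8, 12), (12, 16, 12, 16)]
    global_cols = [0, 4, 8, 12]   # assumed Global Attention column positions

    # steps 1903/1904: calculate the Local and Global Attention matrix blocks
    local_blocks = [Q[r0:r1, :] @ K[:, c0:c1] for r0, r1, c0, c1 in local_info]
    global_target = Q @ K[:, global_cols]                 # (16, 4), one column per block

    # step 1905: splice local blocks (rows), then both splice matrices (columns)
    local_target = np.concatenate(local_blocks, axis=0)   # (16, 4)
    target = np.concatenate([local_target, global_target], axis=1)   # (16, 8)

    # step 1906: normalize the total splice matrix row-wise, then split it again
    norm = softmax(target, axis=1)
    local_norm, global_norm = norm[:, :4], norm[:, 4:]

    # steps 1907/1908: block-wise product with V, and product with V's sub-matrix
    local_out = np.concatenate(
        [local_norm[4*i:4*(i+1), :] @ V[4*i:4*(i+1), :] for i in range(4)], axis=0)
    global_out = global_norm @ V[global_cols, :]

    # step 1909: superimpose the two outputs
    output = local_out + global_out                       # (16, d)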
The embodiment of the application also provides a data processing method based on a sparse attention network, which includes: acquiring data to be processed; and processing the data to be processed based on the sparse attention network to obtain output data. During the processing of the data to be processed based on the sparse attention network, the operations related to the sparse matrix in the sparse attention network are performed according to the data processing method of the above embodiment.
Optionally, the data to be processed includes image data, text data or voice data.
Specifically, referring to fig. 20, fig. 20 shows an application scenario of the data processing method based on a sparse attention network according to an embodiment of the present application. As shown in fig. 20, a sparse attention network containing an attention module is deployed on an electronic device, and the sparse attention network is used to perform natural language processing tasks or computer vision tasks. An AI accelerator in the electronic device is used to perform the operations of the sparse attention network. In actual application, text data, image data, or voice data is fed into the sparse attention network as input data, and the operations related to the sparse attention network are executed by the AI accelerator. The AI accelerator performs the operations related to the sparse matrix in the attention module based on the data processing method described in the above embodiments, thereby improving the operation efficiency of the sparse attention network.
In order to verify the beneficial effects of the data processing method provided by this embodiment, the sparse operator in an existing sparse attention network model is replaced, so that the replacement sparse operator performs data processing based on the data processing method described in the foregoing embodiments.
Illustratively, the network model in which the sparse operator is replaced is Generative Pre-trained Transformer 3 (GPT-3). GPT-3 is an autoregressive language model that uses deep learning to generate human-like text. The main input of GPT-3 is text data, and the output is the prediction result for the input text. Referring to fig. 21, fig. 21 is a schematic diagram of the data processing flow of GPT-3 according to an embodiment of the present application. As shown in fig. 21, in this embodiment, the Multi-Head Attention in the original model is replaced with the Sparse Attention that executes the data processing method provided in this embodiment. The implementation flow of the Sparse Attention is as described in the data processing method 1000 of the above embodiment, and is not repeated here.
In one possible embodiment, an embodiment of the present application provides a data processing apparatus. Referring to fig. 22, fig. 22 is a schematic structural diagram of a data processing apparatus 2200 according to an embodiment of the present application. As shown in fig. 22, the data processing apparatus 2200 includes: an acquisition unit 2201 and a processing unit 2202; the obtaining unit 2201 is configured to obtain blocking information of a sparse matrix, where the sparse matrix is an intermediate matrix obtained during an operation performed based on the sparse attention network, and the blocking information is used to indicate a plurality of matrix blocks divided in the sparse matrix; the obtaining unit 2201 is further configured to obtain, according to the blocking information, matrix data corresponding to each of the plurality of matrix blocks from a first matrix and a second matrix, where the first matrix and the second matrix are matrices for calculating the sparse matrix; the processing unit 2202 is configured to perform a matrix multiplication operation according to the matrix data, so as to obtain a plurality of matrix blocks, where the plurality of matrix blocks include all non-zero elements in the sparse matrix, and each matrix block in the plurality of matrix blocks includes a plurality of elements; the processing unit 2202 is further configured to splice the plurality of matrix blocks to obtain a target matrix, where the target matrix is used to perform an operation related to the sparse matrix in the sparse attention network.
In one possible implementation, the processing unit 2202 is specifically configured to: when the matrix blocks comprise target elements, adjusting the values of the target elements in the matrix blocks to zero to obtain a plurality of adjusted matrix blocks, wherein the target elements are elements with zero values in the sparse matrix; and splicing the plurality of adjusted matrix blocks to obtain the target matrix.
In one possible implementation, the number of rows or columns of each matrix block in the plurality of matrix blocks is the same, and the plurality of matrix blocks are used for representing local attention features; the processing unit 2202 is specifically configured to: when the number of lines of each matrix block in the plurality of matrix blocks is the same, splicing the plurality of matrix blocks in the dimension of columns to obtain the target matrix, wherein the number of lines of the target matrix is the same as the number of lines of the plurality of matrix blocks; when the number of columns of each matrix block in the plurality of matrix blocks is the same, the plurality of matrix blocks are spliced in the dimension of the row to obtain the target matrix, and the number of columns of the target matrix is the same as the number of columns of the plurality of matrix blocks.
In one possible implementation manner, the blocking information includes first blocking information and second blocking information, the first blocking information is used for indicating a plurality of first matrix blocks divided in the sparse matrix, the columns of the plurality of first matrix blocks are the same and the total number of rows of the plurality of first matrix blocks is equal to the number of rows of the sparse matrix, the second blocking information is used for indicating a plurality of second matrix blocks divided in the sparse matrix, the number of rows of the plurality of second matrix blocks is the same as the number of rows of the sparse matrix and the number of columns of the plurality of second matrix blocks is smaller than the number of columns of the sparse matrix; the processing unit 2202 is specifically configured to: splicing the plurality of first matrix blocks in the dimension of the row to obtain a first target matrix; splicing the plurality of second matrix blocks in the dimension of the columns to obtain a second target matrix; and splicing the first target matrix and the second target matrix in the dimension of the columns to obtain the target matrix.
In one possible implementation, the plurality of first matrix blocks are used to represent local attention features and the plurality of second matrix blocks are used to represent global attention features.
In one possible implementation, the processing unit 2202 is specifically configured to: if the positions of the first element in the first target matrix and the second element in the second target matrix in the sparse matrix are the same, adjusting the value of the second element in the second target matrix to be zero to obtain an adjusted second target matrix; and splicing the first target matrix and the adjusted second target matrix in the dimension of the columns to obtain the target matrix.
In one possible implementation, the processing unit 2202 is further configured to: normalizing the target matrix to obtain a normalized target matrix; and obtaining an output matrix based on the normalized target matrix and the third matrix, wherein the output matrix is the output of the attention module in the sparse attention network, and the first matrix, the second matrix and the third matrix are obtained by calculating the same matrix based on different weights.
In one possible implementation, the processing unit 2202 is specifically configured to: splitting the normalized target matrix in the column dimension based on the splicing mode of the target matrix to obtain a third target matrix and a fourth target matrix, wherein the size of the third target matrix is the same as that of the first target matrix, and the size of the fourth target matrix is the same as that of the second target matrix; splitting the third target matrix and the third matrix based on the splicing mode of the first target matrix to obtain a plurality of matrix blocks in the third target matrix and a plurality of matrix blocks in the third matrix, wherein the matrix blocks in the third target matrix and the matrix blocks in the third matrix have a one-to-one correspondence; performing matrix multiplication operation on the third target matrix and matrix blocks with corresponding relation in the third matrix, and splicing the matrix blocks obtained after performing the matrix multiplication operation in the row dimension to obtain a first output matrix; performing matrix multiplication operation on the sub-matrix of the fourth target matrix and the third matrix to obtain a second output matrix, wherein the sub-matrix is formed by a plurality of rows of elements in the third matrix; and adding the first output matrix and the second output matrix to obtain the output matrix.
In another possible embodiment, the acquiring unit 2201 is configured to acquire data to be processed; the processing unit 2202 is configured to process the data to be processed based on a sparse attention network, so as to obtain output data; wherein during processing of the data to be processed based on the sparse attention network, an operation related to a sparse matrix in the sparse attention network is performed according to the method of the first aspect or any implementation manner of the first aspect.
In one possible implementation, the data to be processed includes image data, text data, or voice data.
Next, referring to fig. 23, fig. 23 is a schematic structural diagram of an execution device provided by an embodiment of the present application. The execution device 2300 may be embodied as a mobile phone, a tablet, a notebook computer, an intelligent wearable device, a server, etc., which is not limited here. The data processing apparatus described in the corresponding embodiment of fig. 22 may be deployed on the execution device 2300 to implement the data processing functions of that embodiment. Specifically, the execution device 2300 includes: a receiver 2301, a transmitter 2302, a processor 2303, and a memory 2304 (the number of processors 2303 in the execution device 2300 may be one or more; one processor is illustrated in fig. 23), where the processor 2303 may include an application processor 23031 and a communication processor 23032. In some embodiments of the application, the receiver 2301, the transmitter 2302, the processor 2303, and the memory 2304 may be connected by a bus or other means.
Memory 2304 may include read only memory and random access memory, and provides instructions and data to the processor 2303. A portion of memory 2304 may also include non-volatile random access memory (NVRAM). The memory 2304 stores a processor and operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for performing various operations.
The processor 2303 controls the operation of the execution device. In a specific application, the individual components of the execution device are coupled together by a bus system, which may include, in addition to a data bus, a power bus, a control bus, a status signal bus, etc. For clarity of illustration, however, the various buses are referred to in the figures as bus systems.
The methods disclosed in the embodiments of the present application described above may be applied to the processor 2303 or implemented by the processor 2303. The processor 2303 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the methods described above may be performed by integrated logic circuitry in hardware or by instructions in software form in the processor 2303. The processor 2303 may be a general purpose processor, a digital signal processor (digital signal processing, DSP), a microprocessor or a microcontroller, and may further include an application specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The processor 2303 may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present application. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. Which is located in a memory 2304, and a processor 2303 reads information in the memory 2304, in combination with its hardware, to perform the steps of the method described above.
The receiver 2301 may be used to receive input numeric or character information and to generate signal inputs related to performing device related settings and function control. The transmitter 2302 may be used to output numeric or character information via a first interface; the transmitter 2302 may also be used to send instructions to the disk stack via the first interface to modify data in the disk stack; the transmitter 2302 may also include a display device such as a display screen.
Embodiments of the present application also provide a computer program product which, when run on a computer, causes the computer to perform the steps as performed by the aforementioned performing device, or causes the computer to perform the steps as performed by the aforementioned training device.
An embodiment of the present application further provides a computer-readable storage medium storing a program for signal processing which, when run on a computer, causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
The execution device, training device or electronic device provided in the embodiments of the present application may specifically be a chip. The chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin or a circuit. The processing unit may execute computer-executable instructions stored in a storage unit, so that the chip in the execution device performs the data processing method described in the foregoing embodiments, or so that the chip in the training device performs the data processing method described in the foregoing embodiments. Optionally, the storage unit is a storage unit inside the chip, such as a register or a cache; the storage unit may alternatively be a storage unit located outside the chip on the wireless access device side, such as a read-only memory (read-only memory, ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (random access memory, RAM).
Specifically, referring to fig. 24, fig. 24 is a schematic structural diagram of a chip provided by an embodiment of the present application. The chip may be embodied as a neural network processor NPU 2400. The NPU 2400 is mounted on a host CPU (Host CPU) as a coprocessor, and the Host CPU allocates tasks. The core part of the NPU is an arithmetic circuit 2403; a controller 2404 controls the arithmetic circuit 2403 to extract matrix data from the memories and perform multiplication operations.
In some implementations, the arithmetic circuit 2403 internally includes a plurality of processing units (PEs). In some implementations, the arithmetic circuit 2403 is a two-dimensional systolic array. The arithmetic circuit 2403 may alternatively be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2403 is a general-purpose matrix processor.
For example, assume that there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 2402 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit fetches the data of the matrix A from the input memory 2401, performs a matrix operation with the matrix B, and stores the obtained partial result or final result of the matrix in an accumulator (accumulator) 2408.
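As an illustrative aside, this accumulate-as-you-go dataflow can be modelled in a few lines of Python/NumPy. The tile size and function name below are assumptions made only for this sketch, not properties of the NPU 2400:

    import numpy as np

    def tiled_matmul(a, b, tile=4):
        # Model of C = A x B computed tile by tile; the running sum in c
        # plays the role of the partial results held in accumulator 2408.
        m, k = a.shape
        k2, n = b.shape
        assert k == k2, "inner dimensions must match"
        c = np.zeros((m, n), dtype=np.result_type(a, b))
        for p in range(0, k, tile):
            a_tile = a[:, p:p + tile]   # data fetched from input memory 2401
            b_tile = b[p:p + tile, :]   # weights buffered on the PEs from weight memory 2402
            c += a_tile @ b_tile        # partial result added to the accumulator
        return c

For instance, tiled_matmul(np.random.rand(8, 16), np.random.rand(16, 8)) agrees with a plain matrix product; the tiling merely mirrors how a systolic array consumes the shared dimension slice by slice.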
The unified memory 2406 is used for storing input data and output data. Weight data is directly transferred to the weight memory 2402 through a direct memory access controller (Direct Memory Access Controller, DMAC) 2405. Input data is also carried into the unified memory 2406 through the DMAC.
The bus interface unit (Bus Interface Unit, BIU) 2424 is used for interaction between the AXI bus on one side and the DMAC and the instruction fetch buffer (Instruction Fetch Buffer, IFB) 2409 on the other. Specifically, the bus interface unit 2424 is used by the instruction fetch buffer 2409 to obtain instructions from an external memory, and is further used by the direct memory access controller 2405 to obtain the raw data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 2406, to transfer weight data to the weight memory 2402, or to transfer input data to the input memory 2401.
The vector calculation unit 2407 includes a plurality of arithmetic processing units and, if necessary, performs further processing on the output of the arithmetic circuit 2403, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. It is mainly used for non-convolutional/fully connected layer network calculation in the neural network, such as batch normalization (batch normalization), pixel-level summation, and up-sampling of a feature plane.
In some implementations, the vector calculation unit 2407 can store a processed output vector to the unified memory 2406. For example, the vector calculation unit 2407 may apply a linear function or a nonlinear function to the output of the arithmetic circuit 2403, for example, perform linear interpolation on a feature plane extracted by a convolutional layer, or accumulate a vector of values to generate an activation value. In some implementations, the vector calculation unit 2407 generates a normalized value, a pixel-level summed value, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 2403, for example, for use in a subsequent layer of the neural network.
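As an illustrative aside, the post-processing role of the vector calculation unit 2407 can be sketched in Python/NumPy; the specific choice of batch normalization followed by a ReLU activation is an assumption made for the example only, not a statement of which kernels the NPU runs:

    import numpy as np

    def vector_postprocess(x, gamma=1.0, beta=0.0, eps=1e-5):
        # Toy model of the vector unit: batch-normalize the arithmetic
        # circuit's output, then apply a nonlinear activation (ReLU).
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        normed = gamma * (x - mean) / np.sqrt(var + eps) + beta
        return np.maximum(normed, 0.0)   # activation values fed to later layers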
The instruction fetch buffer (instruction fetch buffer) 2409 is connected to the controller 2404 and is used for storing instructions used by the controller 2404.
The unified memory 2406, the input memory 2401, the weight memory 2402 and the instruction fetch buffer 2409 are all on-chip (On-Chip) memories. The external memory is a memory external to the NPU and is private to the NPU hardware architecture.
The processor mentioned in any of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above-mentioned programs.
It should be further noted that the apparatus embodiments described above are merely illustrative. The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present application, the connection relationship between modules indicates that they have a communication connection, which may be specifically implemented as one or more communication buses or signal lines.
From the above description of the implementations, a person skilled in the art can clearly understand that the present application may be implemented by software plus necessary general-purpose hardware, or certainly by dedicated hardware including an application-specific integrated circuit, a dedicated CPU, a dedicated memory, dedicated components, and the like. Generally, any function performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structures used to implement the same function can be varied, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software program implementation is the preferred implementation in most cases. Based on such an understanding, the technical solutions of the present application essentially, or the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc of a computer, and includes several instructions for causing a computer device (which may be a personal computer, a training device, a network device, or the like) to perform the methods described in the embodiments of the present application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used for implementation, the embodiments may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a training device or a data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (Solid State Disk, SSD)), or the like.

Claims (22)

1. A method of data processing, comprising:
obtaining block information of a sparse matrix, wherein the sparse matrix is an intermediate matrix generated during execution of operations based on a sparse attention network, the distribution of non-zero elements in the sparse matrix is regular, the block information is used for indicating a plurality of matrix blocks divided in the sparse matrix, and the sparse attention network is used for performing a computer vision task or a natural language processing task;
obtaining, according to the block information, matrix data corresponding to each of the plurality of matrix blocks from a first matrix and a second matrix, wherein the first matrix and the second matrix are matrices used for calculating the sparse matrix;
performing a matrix multiplication operation according to the matrix data to obtain the plurality of matrix blocks, wherein the plurality of matrix blocks comprise all the non-zero elements in the sparse matrix, and each of the plurality of matrix blocks comprises a plurality of elements;
and splicing the matrix blocks to obtain a target matrix, wherein the target matrix is used for executing the operation related to the sparse matrix in the sparse attention network.
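As an illustrative sketch rather than claim language, the four steps above can be mirrored in Python/NumPy. The encoding of the block information as (row slice, column slice) pairs, and the assumption that all blocks share a row count (the local-attention case of claim 3, which allows splicing in the dimension of columns), are choices made only for this example:

    import numpy as np

    def sparse_attention_scores(q, k_t, block_info):
        # q: the first matrix; k_t: the second matrix; computed densely,
        # q @ k_t would yield the full sparse score matrix.
        # block_info: a list of (row_slice, col_slice), one per matrix block.
        blocks = []
        for rows, cols in block_info:
            q_part = q[rows, :]              # matrix data from the first matrix
            k_part = k_t[:, cols]            # matrix data from the second matrix
            blocks.append(q_part @ k_part)   # block-wise matrix multiplication
        # splice: equal-row blocks are concatenated in the dimension of columns
        return np.concatenate(blocks, axis=1)

Only the indicated blocks are ever multiplied, so the zero regions of the sparse matrix cost no computation, while the spliced target matrix stays dense and friendly to hardware such as the arithmetic circuit of fig. 24.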
2. The method of claim 1, wherein the splicing the plurality of matrix blocks to obtain the target matrix comprises:
when the plurality of matrix blocks comprise target elements, adjusting the values of the target elements in the plurality of matrix blocks to zero to obtain a plurality of adjusted matrix blocks, wherein the target elements are elements with zero values in the sparse matrix;
and splicing the plurality of adjusted matrix blocks to obtain the target matrix.
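An illustrative sketch of this adjustment (not claim language), assuming the positions of zero-valued sparse-matrix elements are available as a boolean mask per block, which is a representation chosen only for the example:

    import numpy as np

    def zero_target_elements(block, zero_mask):
        # zero_mask: True where the corresponding sparse-matrix element is zero
        adjusted = block.copy()
        adjusted[zero_mask] = 0.0   # adjust target elements to zero
        return adjusted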
3. The method according to claim 1 or 2, wherein the number of rows or columns of each matrix block of the plurality of matrix blocks is the same, the plurality of matrix blocks being used to represent local attention features;
the splicing the matrix blocks to obtain a target matrix comprises the following steps:
when the number of rows of each matrix block in the plurality of matrix blocks is the same, splicing the plurality of matrix blocks in the dimension of columns to obtain the target matrix, wherein the number of rows of the target matrix is the same as the number of rows of the plurality of matrix blocks;
when the number of columns of each matrix block in the plurality of matrix blocks is the same, splicing the plurality of matrix blocks in the dimension of the row to obtain the target matrix, wherein the number of columns of the target matrix is the same as the number of columns of the plurality of matrix blocks.
4. The method according to claim 1 or 2, wherein the block information includes first block information for indicating a plurality of first matrix blocks divided in the sparse matrix, the number of columns of the plurality of first matrix blocks being the same and the total number of rows of the plurality of first matrix blocks being equal to the number of rows of the sparse matrix, and second block information for indicating a plurality of second matrix blocks divided in the sparse matrix, the number of rows of the plurality of second matrix blocks being the same as the number of rows of the sparse matrix and the number of columns of the plurality of second matrix blocks being smaller than the number of columns of the sparse matrix;
the splicing the matrix blocks to obtain a target matrix comprises the following steps:
splicing the plurality of first matrix blocks in the dimension of the row to obtain a first target matrix;
splicing the plurality of second matrix blocks in the dimension of the columns to obtain a second target matrix;
and splicing the first target matrix and the second target matrix in the dimension of the columns to obtain the target matrix.
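An illustrative sketch of the two splicing directions of claim 4 (not claim language); the block shapes are assumed only so as to satisfy the stated row and column constraints:

    import numpy as np

    def splice_local_and_global(first_blocks, second_blocks):
        # first_blocks: local-attention blocks with equal column counts whose
        # row counts sum to the sparse matrix's row count.
        # second_blocks: global-attention blocks whose row count equals the
        # sparse matrix's row count.
        first_target = np.concatenate(first_blocks, axis=0)    # splice along rows
        second_target = np.concatenate(second_blocks, axis=1)  # splice along columns
        # the two targets share a row count, so they join along columns
        return np.concatenate([first_target, second_target], axis=1)

For instance, four 2x2 first matrix blocks and one 8x1 second matrix block yield an 8x3 target matrix for an 8-row sparse matrix.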
5. The method of claim 4, wherein the plurality of first matrix blocks are used to represent local attention features and the plurality of second matrix blocks are used to represent global attention features.
6. The method according to claim 4 or 5, wherein the splicing the first target matrix and the second target matrix in the column dimension to obtain the target matrix comprises:
if the positions of the first element in the first target matrix and the second element in the second target matrix in the sparse matrix are the same, adjusting the value of the second element in the second target matrix to be zero to obtain an adjusted second target matrix;
and splicing the first target matrix and the adjusted second target matrix in the dimension of the columns to obtain the target matrix.
7. The method according to any one of claims 4-6, further comprising:
normalizing the target matrix to obtain a normalized target matrix;
and obtaining an output matrix based on the normalized target matrix and the third matrix, wherein the output matrix is the output of the attention module in the sparse attention network, and the first matrix, the second matrix and the third matrix are obtained by calculating the same matrix based on different weights.
8. The method of claim 7, wherein the obtaining an output matrix based on the normalized target matrix and the third matrix comprises:
splitting the normalized target matrix in the column dimension based on the splicing mode of the target matrix to obtain a third target matrix and a fourth target matrix, wherein the size of the third target matrix is the same as that of the first target matrix, and the size of the fourth target matrix is the same as that of the second target matrix;
splitting the third target matrix and the third matrix based on the splicing mode of the first target matrix to obtain a plurality of matrix blocks in the third target matrix and a plurality of matrix blocks in the third matrix, wherein the matrix blocks in the third target matrix and the matrix blocks in the third matrix have a one-to-one correspondence;
performing a matrix multiplication operation on matrix blocks having the corresponding relationship in the third target matrix and the third matrix, and splicing the matrix blocks obtained after performing the matrix multiplication operation in the row dimension to obtain a first output matrix;
performing a matrix multiplication operation on the fourth target matrix and a sub-matrix of the third matrix to obtain a second output matrix, wherein the sub-matrix is formed by a plurality of rows of elements in the third matrix;
and adding the first output matrix and the second output matrix to obtain the output matrix.
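An illustrative sketch of this recombination (not claim language), with bookkeeping assumed only for the example: square local blocks of width block_col, and an index set global_rows selecting the sub-matrix rows of the third matrix:

    import numpy as np

    def recombine_outputs(norm_target, v, block_col, row_blocks, global_rows):
        # norm_target: the normalized target matrix of claim 7; v: the third matrix.
        # Split along columns into the third and fourth target matrices.
        third_target = norm_target[:, :block_col]
        fourth_target = norm_target[:, block_col:]
        # Block-wise multiply corresponding blocks, then splice along rows.
        first_out = np.concatenate(
            [third_target[rs, :] @ v[rs, :] for rs in row_blocks], axis=0)
        # Multiply the fourth target matrix by the sub-matrix of v formed
        # from several of its rows.
        second_out = fourth_target @ v[global_rows, :]
        return first_out + second_out   # output of the attention module

Under the square-block assumption, third_target[rs, :] is (block_col x block_col) and v[rs, :] is (block_col x d) for each row slice rs, so every product is well-formed and the spliced first output matrix matches the shape of the second.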
9. A data processing method based on a sparse attention network, comprising:
acquiring data to be processed;
processing the data to be processed based on a sparse attention network to obtain output data;
wherein during processing of the data to be processed based on the sparse attention network, an operation related to a sparse matrix in the sparse attention network is performed according to the method of any one of claims 1-8.
10. The method of claim 9, wherein the data to be processed comprises image data, text data, or voice data.
11. A data processing apparatus, comprising: an acquisition unit and a processing unit;
The acquisition unit is configured to acquire block information of a sparse matrix, where the sparse matrix is an intermediate matrix generated during execution of operations based on a sparse attention network, the distribution of non-zero elements in the sparse matrix is regular, the block information is used for indicating a plurality of matrix blocks divided in the sparse matrix, and the sparse attention network is used for performing a computer vision task or a natural language processing task;
the acquisition unit is further configured to obtain, according to the block information, matrix data corresponding to each of the plurality of matrix blocks from a first matrix and a second matrix, where the first matrix and the second matrix are matrices used for calculating the sparse matrix;
the processing unit is configured to perform matrix multiplication according to the matrix data to obtain a plurality of matrix blocks, where the plurality of matrix blocks include all non-zero elements in the sparse matrix, and each matrix block in the plurality of matrix blocks includes a plurality of elements;
the processing unit is further configured to splice the plurality of matrix blocks to obtain a target matrix, where the target matrix is used to perform an operation related to the sparse matrix in the sparse attention network.
12. The apparatus according to claim 11, wherein the processing unit is specifically configured to:
when the plurality of matrix blocks comprise target elements, adjusting the values of the target elements in the plurality of matrix blocks to zero to obtain a plurality of adjusted matrix blocks, wherein the target elements are elements with zero values in the sparse matrix;
and splicing the plurality of adjusted matrix blocks to obtain the target matrix.
13. The apparatus of claim 11 or 12, wherein the number of rows or columns of each matrix block in the plurality of matrix blocks is the same, the plurality of matrix blocks being used to represent local attention features;
the processing unit is specifically configured to:
when the number of rows of each matrix block in the plurality of matrix blocks is the same, splicing the plurality of matrix blocks in the dimension of columns to obtain the target matrix, wherein the number of rows of the target matrix is the same as the number of rows of the plurality of matrix blocks;
when the number of columns of each matrix block in the plurality of matrix blocks is the same, splicing the plurality of matrix blocks in the dimension of the row to obtain the target matrix, wherein the number of columns of the target matrix is the same as the number of columns of the plurality of matrix blocks.
14. The apparatus according to claim 11 or 12, wherein the block information includes first block information for indicating a plurality of first matrix blocks divided in the sparse matrix, the number of columns of the plurality of first matrix blocks being the same and the total number of rows of the plurality of first matrix blocks being equal to the number of rows of the sparse matrix, and second block information for indicating a plurality of second matrix blocks divided in the sparse matrix, the number of rows of the plurality of second matrix blocks being the same as the number of rows of the sparse matrix and the number of columns of the plurality of second matrix blocks being smaller than the number of columns of the sparse matrix;
the processing unit is specifically configured to:
splicing the plurality of first matrix blocks in the dimension of the row to obtain a first target matrix;
splicing the plurality of second matrix blocks in the dimension of the columns to obtain a second target matrix;
and splicing the first target matrix and the second target matrix in the dimension of the columns to obtain the target matrix.
15. The apparatus of claim 14, wherein the plurality of first matrix blocks are used to represent local attention features and the plurality of second matrix blocks are used to represent global attention features.
16. The apparatus according to claim 14 or 15, wherein the processing unit is specifically configured to:
if the positions of the first element in the first target matrix and the second element in the second target matrix in the sparse matrix are the same, adjusting the value of the second element in the second target matrix to be zero to obtain an adjusted second target matrix;
and splicing the first target matrix and the adjusted second target matrix in the dimension of the columns to obtain the target matrix.
17. The apparatus according to any one of claims 14-16, wherein the processing unit is further configured to:
normalizing the target matrix to obtain a normalized target matrix;
and obtaining an output matrix based on the normalized target matrix and the third matrix, wherein the output matrix is the output of the attention module in the sparse attention network, and the first matrix, the second matrix and the third matrix are obtained by calculating the same matrix based on different weights.
18. The apparatus according to claim 17, wherein the processing unit is specifically configured to:
splitting the normalized target matrix in the column dimension based on the splicing mode of the target matrix to obtain a third target matrix and a fourth target matrix, wherein the size of the third target matrix is the same as that of the first target matrix, and the size of the fourth target matrix is the same as that of the second target matrix;
splitting the third target matrix and the third matrix based on the splicing mode of the first target matrix to obtain a plurality of matrix blocks in the third target matrix and a plurality of matrix blocks in the third matrix, wherein the matrix blocks in the third target matrix and the matrix blocks in the third matrix have a one-to-one correspondence;
performing a matrix multiplication operation on matrix blocks having the corresponding relationship in the third target matrix and the third matrix, and splicing the matrix blocks obtained after performing the matrix multiplication operation in the row dimension to obtain a first output matrix;
performing a matrix multiplication operation on the fourth target matrix and a sub-matrix of the third matrix to obtain a second output matrix, wherein the sub-matrix is formed by a plurality of rows of elements in the third matrix;
and adding the first output matrix and the second output matrix to obtain the output matrix.
19. A data processing apparatus, comprising: an acquisition unit and a processing unit;
the acquisition unit is used for acquiring data to be processed;
the processing unit is used for processing the data to be processed based on a sparse attention network to obtain output data;
wherein during processing of the data to be processed based on the sparse attention network, an operation related to a sparse matrix in the sparse attention network is performed according to the method of any one of claims 1-8.
20. The apparatus of claim 19, wherein the data to be processed comprises image data, text data, or voice data.
21. An electronic device comprising a memory and a processor; the memory stores code, and the processor is configured to execute the code; when the code is executed, the electronic device performs the method of any one of claims 1 to 10.
22. A computer storage medium storing instructions which, when executed by a computer, cause the computer to carry out the method of any one of claims 1 to 10.
CN202111146270.XA 2021-09-28 2021-09-28 Data processing method and related device Active CN114925320B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111146270.XA CN114925320B (en) 2021-09-28 2021-09-28 Data processing method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111146270.XA CN114925320B (en) 2021-09-28 2021-09-28 Data processing method and related device

Publications (2)

Publication Number Publication Date
CN114925320A CN114925320A (en) 2022-08-19
CN114925320B true CN114925320B (en) 2023-10-20

Family

ID=82804049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111146270.XA Active CN114925320B (en) 2021-09-28 2021-09-28 Data processing method and related device

Country Status (1)

Country Link
CN (1) CN114925320B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115982398B (en) * 2023-03-13 2023-05-16 苏州浪潮智能科技有限公司 Graph structure data processing method, system, computer device and storage medium
CN117149778B (en) * 2023-10-30 2024-01-16 之江实验室 Sparse tensor operation acceleration method, system, computer device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3343391A1 (en) * 2016-12-31 2018-07-04 INTEL Corporation Heterogeneous hardware accelerator architecture for processing sparse matrix data with skewed non-zero distributions
CN111313999A (en) * 2020-02-18 2020-06-19 五邑大学 Radio frequency tomography method based on zero sparse data driven weight model
CN112560356A (en) * 2019-09-26 2021-03-26 无锡江南计算技术研究所 Sparse matrix vector multiply many-core optimization method for many-core architecture
WO2021169347A1 (en) * 2020-02-25 2021-09-02 华为技术有限公司 Method and device for extracting text keywords

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3343391A1 (en) * 2016-12-31 2018-07-04 INTEL Corporation Heterogeneous hardware accelerator architecture for processing sparse matrix data with skewed non-zero distributions
CN112560356A (en) * 2019-09-26 2021-03-26 无锡江南计算技术研究所 Sparse matrix vector multiply many-core optimization method for many-core architecture
CN111313999A (en) * 2020-02-18 2020-06-19 五邑大学 Radio frequency tomography method based on zero sparse data driven weight model
WO2021169347A1 (en) * 2020-02-25 2021-09-02 华为技术有限公司 Method and device for extracting text keywords

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Large matrix multiplication processing method based on Hadoop; Sun Yuanshuai; Chen Yao; Guan Xinjun; Lin Chen; Journal of Computer Applications (12); 3339-3358 *

Also Published As

Publication number Publication date
CN114925320A (en) 2022-08-19

Similar Documents

Publication Publication Date Title
WO2022007823A1 (en) Text data processing method and device
CN111368993B (en) Data processing method and related equipment
WO2022068623A1 (en) Model training method and related device
Sze Designing hardware for machine learning: The important role played by circuit designers
CN111401406B (en) Neural network training method, video frame processing method and related equipment
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
GB2571825A (en) Semantic class localization digital environment
WO2022068627A1 (en) Data processing method and related device
US11423288B2 (en) Neuromorphic synthesizer
CN112183747A (en) Neural network training method, neural network compression method and related equipment
CN114925320B (en) Data processing method and related device
CN111882031A (en) Neural network distillation method and device
CN111898636B (en) Data processing method and device
CN108171328B (en) Neural network processor and convolution operation method executed by same
CN113240079A (en) Model training method and device
CN113065997B (en) Image processing method, neural network training method and related equipment
CN113505193A (en) Data processing method and related equipment
CN115221846A (en) Data processing method and related equipment
WO2023284716A1 (en) Neural network searching method and related device
CN113939801A (en) Reducing the computational load of neural networks using self-correcting codes
CN113536970A (en) Training method of video classification model and related device
WO2022222854A1 (en) Data processing method and related device
CN114821096A (en) Image processing method, neural network training method and related equipment
CN113627163A (en) Attention model, feature extraction method and related device
CN112132281B (en) Model training method, device, server and medium based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant