CN117131901A - Sparse attention calculation model and method, electronic device and storage medium - Google Patents

Sparse attention calculation model and method, electronic device and storage medium

Info

Publication number
CN117131901A
CN117131901A (Application No. CN202210531111.XA)
Authority
CN
China
Prior art keywords
layer
sparse
model
transformer
transformers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210531111.XA
Other languages
Chinese (zh)
Inventor
屠要峰
杨智
竺沈涵
郭子瑜
栗伟清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
ZTE Corp
Original Assignee
Peking University
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University and ZTE Corp
Priority to CN202210531111.XA
Priority to PCT/CN2023/094288 (published as WO2023221940A1)
Publication of CN117131901A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/042Knowledge-based neural networks; Logical representations of neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a sparse attention calculation model and method, an electronic device and a storage medium. The model comprises a plurality of sequentially connected Transformer layers, of which a preset number of leading layers are shallow Transformer layers and the remaining layers are deep Transformer layers; input data are processed by each Transformer layer in sequence, and an attention calculation result is output. A mode selector is connected to the last shallow Transformer layer and to each deep Transformer layer; it receives the hidden vectors output by the last shallow Transformer layer, outputs a weight for each of a plurality of preset sparse modes according to those hidden vectors, and feeds these weights to each deep Transformer layer, so that each deep Transformer layer performs sparse attention calculation based on the weights of the preset sparse modes. The scheme provided by the invention reduces the number of floating point operations and the runtime memory footprint while keeping the model accuracy essentially intact.

Description

Sparse attention calculation model and method, electronic device and storage medium
Technical Field
The present invention relates to the field of sparse attention mechanisms, and in particular, to a sparse attention calculation model and method, an electronic device, and a storage medium.
Background
Attention mechanisms (Attention) were first proposed in the field of natural language processing: by computing an attention matrix, the model obtains information about which parts of the input it should attend to. Using this mechanism effectively can greatly improve model accuracy. The Transformer, a deep learning model that uses attention mechanisms to completely replace recurrent neural networks (RNNs) and LSTM structures, has attracted increasing interest owing to its excellent performance on a variety of natural language processing tasks. However, the matrix computation of the attention mechanism incurs time and space consumption that is quadratic in the length of the input sequence, which limits the application of the many Transformer models that use attention mechanisms to long-sequence tasks. Moreover, as attention-based deep learning models grow larger, they require more computing power; especially when deployed on the edge side, the models are often constrained by computing power and memory and cannot run.
Disclosure of Invention
The inventors' research shows that many existing works use attention sparsification, i.e., reducing the number of attention weights that need to be computed, to reduce the time and space consumption of the attention mechanism, and manually design a sparse mode based on the attention weight distribution observed in a specific task.
However, the prior art typically uses a fixed sparse mode: the same sparse mode is used for different input sequence instances of the same task, and most works also use the same or similar sparse modes across different tasks. In practice, however, when input sequence instances of different tasks are fed into a Transformer model, the intermediate attention weight distributions differ; even within the same task, different input sequence instances yield different attention weight distributions. Therefore, computing attention with the same sparse mode for all instances can cause the sparse attention computed for some instances to deviate greatly from the original attention, which affects the model output and reduces model accuracy.
The present application provides a Transformer model inference optimization method based on an instance-level adaptive sparse mode. A mode selector adaptively selects an atomic sparse mode, or a combination of atomic sparse modes, for each input sequence instance to guide sparse attention calculation in the deep layers. This reduces the number of floating point operations and the runtime memory footprint while leaving model accuracy essentially unharmed, and the Transformer model does not need to be retrained during this process.
In order to solve the technical problem that related models are difficult to deploy on the edge side because they occupy a large amount of computing power and memory, embodiments of the present invention provide a sparse attention calculation model and method, an electronic device and a storage medium.
The technical scheme of the embodiment of the invention is realized as follows:
An embodiment of the invention provides a sparse attention calculation model, which comprises:
a plurality of sequentially connected Transformer layers, of which a preset number of leading layers are shallow Transformer layers and the remaining layers are deep Transformer layers; each Transformer layer processes the input data in sequence, and an attention calculation result is output;
a mode selector, connected to the last shallow Transformer layer and to each deep Transformer layer, configured to receive the hidden vectors output by the last shallow Transformer layer; to output, according to those hidden vectors, a weight for each of a plurality of preset sparse modes; and to feed the weights of the preset sparse modes to each deep Transformer layer, so that each deep Transformer layer performs sparse attention calculation based on those weights.
In the above scheme, the number of preset sparse modes is 5, comprising a block sparse mode, a stripe sparse mode, a dilated (hole) sparse mode, a global sparse mode and a random sparse mode.
In the above scheme, the mode selector comprises:
a downsampling layer, configured to reduce the dimension of the hidden vectors output by the shallow Transformer layer to obtain a one-dimensional tensor;
linear layers and a GELU layer between them, connected to the downsampling layer and configured to process the one-dimensional tensor to obtain a second tensor;
and a normalization layer, connected to the linear layers and the GELU layer between them, configured to normalize the second tensor and output a weight for each of the preset sparse modes.
In the above scheme, the model further comprises:
a plurality of predictors, each connected to one corresponding deep Transformer layer and configured to receive the intermediate prediction result output by that deep Transformer layer and to obtain the loss between the intermediate prediction result and the label of the input data.
An embodiment of the invention also provides a sparse attention calculation method, applied to a sparse attention calculation model comprising a plurality of sequentially connected Transformer layers, of which a preset number of leading layers are shallow Transformer layers and the remaining layers are deep Transformer layers; the method comprises:
receiving input data;
processing the data sequentially through each shallow Transformer layer in the sparse attention calculation model and outputting hidden vectors;
outputting a weight for each of a plurality of preset sparse modes according to the hidden vectors output by the last shallow Transformer layer;
and feeding the weights of the preset sparse modes to each deep Transformer layer, so that each deep Transformer layer in turn performs sparse attention calculation on the output of the previous Transformer layer based on those weights.
In the above scheme, outputting a weight for each of the preset sparse modes according to the hidden vectors output by the last shallow Transformer layer comprises:
reducing the dimension of the hidden vectors output by the shallow Transformer layer to obtain a one-dimensional tensor;
processing the one-dimensional tensor with linear layers and a GELU layer between them to obtain a second tensor;
and normalizing the second tensor and outputting a weight for each of the preset sparse modes.
In the above scheme, the training process of the sparse attention calculation model comprises:
training an initial model and determining the model parameters of the initial model;
fixing the model parameters of the initial model, training the predictors in the model and determining the model parameters of the predictors;
and fixing the model parameters of the initial model and the model parameters of the predictors, training the mode selector in the model and determining the model parameters of the mode selector.
In the above scheme, training the initial model and determining the model parameters of the initial model comprises:
acquiring input training set data;
passing the training set data through each Transformer layer in sequence to obtain a calculation result;
calculating the loss between the calculation result and the label of the input data;
and back-propagating the loss to update the model parameters of the initial model over a number of iterations.
In the above scheme, fixing the model parameters of the initial model, training the predictors in the model and determining the model parameters of the predictors comprises:
fixing the model parameters of the initial model and acquiring input training set data;
passing the training set data through each Transformer layer in sequence to obtain the hidden vectors output by each deep Transformer layer, and calculating the loss between the hidden vectors output by each deep Transformer layer and the label of the input data;
and back-propagating the loss to update the model parameters of the predictors over a number of iterations.
In the above scheme, fixing the model parameters of the initial model and the model parameters of the predictors, training the mode selector in the model and determining the model parameters of the mode selector comprises:
fixing the model parameters of the initial model and the model parameters of the predictors, and acquiring input training set data;
passing the training set data through each Transformer layer in sequence to obtain the hidden vectors output by the last shallow Transformer layer, and outputting a weight for each of the preset sparse modes based on those hidden vectors;
feeding the weights of the preset sparse modes to each deep Transformer layer, so that each deep Transformer layer in turn performs sparse attention calculation on the output of the previous Transformer layer based on those weights;
obtaining the hidden vectors output by each deep Transformer layer and calculating the loss between those hidden vectors and the label of the input data;
and back-propagating the loss to update the model parameters of the mode selector over a number of iterations.
An embodiment of the invention also provides an electronic device, comprising a processor and a memory for storing a computer program capable of running on the processor; wherein
the processor is configured to perform the steps of any one of the above methods when running the computer program.
An embodiment of the invention also provides a storage medium storing a computer program which, when executed by a processor, implements the steps of any one of the above methods.
The sparse attention calculation model and method, electronic device and storage medium of the invention comprise a plurality of sequentially connected Transformer layers, of which a preset number of leading layers are shallow Transformer layers and the remaining layers are deep Transformer layers; input data are processed by each Transformer layer in sequence and an attention calculation result is output. A mode selector is connected to the last shallow Transformer layer and to each deep Transformer layer; it receives the hidden vectors output by the last shallow Transformer layer, outputs a weight for each of a plurality of preset sparse modes according to those hidden vectors, and feeds these weights to each deep Transformer layer, so that each deep Transformer layer performs sparse attention calculation based on the weights of the preset sparse modes. The scheme provided by the invention reduces the number of floating point operations and the runtime memory footprint while keeping the model accuracy essentially intact.
Drawings
FIG. 1 is a schematic structural diagram of a sparse attention computation model according to an embodiment of the present invention;
FIG. 2 is a block atomic sparse mode schematic diagram according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a stripe atomic sparse mode according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a dilated atomic sparse mode according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a global atomic sparse mode according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a random atomic sparse mode according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a mode selector according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of another mode selector according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of another sparse attention calculation model according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a Transformer block according to an embodiment of the present invention;
FIG. 11 is a flowchart of a sparse attention calculation method according to an embodiment of the present invention;
FIG. 12 is a flowchart of another sparse attention calculation method according to an embodiment of the present invention;
FIG. 13 is a flowchart of another sparse attention calculation method according to an embodiment of the present invention;
FIG. 14 is a flowchart of another sparse attention calculation method according to an embodiment of the present invention;
FIG. 15 is a flowchart of another sparse attention calculation method according to an embodiment of the present invention;
FIG. 16 is a flowchart of another sparse attention calculation method according to an embodiment of the present invention;
fig. 17 is an internal structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
Before describing embodiments of the present invention, the following terms will be explained.
Attention (Attn, attention): in neural networks, attention is a technique that mimics attention in cognitive science. For input matrices Q, K and V, the attention module computes softmax(QK^T)V and outputs the result. The effect of the attention module is to strengthen some parts of the input data and weaken others, forcing the network to concentrate more attention on a small but important part of the input.
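By way of illustration only (this sketch is not part of the claimed subject matter; the array sizes and variable names are arbitrary assumptions), the dense attention computation softmax(QK^T)V can be written in Python/NumPy as:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # dense attention: softmax(QK^T)V
    scores = Q @ K.T              # (seq_len, seq_len) attention matrix
    weights = softmax(scores)     # row-wise normalization
    return weights @ V            # (seq_len, d_v)

# usage: a toy sequence of length 4 with feature dimension 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = attention(Q, K, V)
```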
Normalized exponential function (softmax): a normalization function used as an activation function in the field of deep learning. softmax: R^d → R^d; for an input x = [x_1, x_2, ..., x_d], the i-th element is mapped to exp(x_i) / Σ_{j=1}^{d} exp(x_j), which takes a value in [0, 1].
Floating point operations (FLOPs, Floating Point Operations): the total number of floating point operations performed by a computer during a certain computation of a program; an index for measuring the amount of computation of a piece of a program. The smaller the number, the smaller the theoretical amount of computation.
Gaussian error linear unit (GELU, Gaussian Error Linear Unit): a neural network activation function. For each neuron with input x, the Gaussian error linear unit multiplies the input by a random variable m that follows the Bernoulli distribution Bernoulli(Φ(x)), where Φ(x) = P(X ≤ x), with X ~ N(0, 1), is the cumulative distribution function of the standard normal distribution. The formula can be written as GELU(x) = x·P(X ≤ x) = x·Φ(x).
Sparse mode (sparse pattern): a 0-1 matrix with the same shape as the attention matrix, used to identify the attention positions that need to be computed. Computing attention with a sparse mode means that only the attention values at the positions specified by the sparse mode need to be computed, which can be used to reduce the space and time consumed in computing the attention matrix.
Sparse attention (sparse attention): the attention obtained by computing with a sparse mode.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
An embodiment of the present invention provides a sparse attention calculation model, as shown in FIG. 1, comprising:
a plurality of sequentially connected Transformer layers, of which a preset number of leading layers are shallow Transformer layers and the remaining layers are deep Transformer layers; each Transformer layer processes the input data in sequence, and an attention calculation result is output;
a mode selector, connected to the last shallow Transformer layer and to each deep Transformer layer, configured to receive the hidden vectors output by the last shallow Transformer layer; to output, according to those hidden vectors, a weight for each of a plurality of preset sparse modes; and to feed the weights of the preset sparse modes to each deep Transformer layer, so that each deep Transformer layer performs sparse attention calculation based on those weights.
In particular, this embodiment is applicable to inference scenarios of attention-mechanism models on CPU/GPU hardware, covering a variety of tasks in the computer vision and natural language processing fields.
This embodiment realizes adaptive selection of sparse modes based on instance-level features: a mode selector is inserted after the shallow Transformer layers to accept the hidden vectors output by that layer and to output weights for the different atomic modes (i.e., sparse modes). The higher a weight value, the more suitable the corresponding atomic mode is for the subsequent attention calculation of those hidden vectors, i.e., the better it can capture the distribution characteristics of the subsequent attention scores in the matrix.
In this embodiment, the Transformer layers before the mode selector are defined as shallow layers, and the Transformer layers after the mode selector are defined as deep layers.
In practical applications, the division into shallow and deep Transformer layers can be chosen differently according to the specific task, the model hyperparameters and the requirements on inference time and space. For example, a 4-layer model may take the first layer as shallow and the rest as deep, while a 12-layer model may take the first three layers as shallow and the rest as deep. For a Transformer model with few layers, the first layer may be taken as shallow and the remainder as deep. For a fixed classification task, the division strategy with the highest model accuracy on a specific data set can be selected and used in the actual application scenario. If the optimization requirements on inference time and space are higher and a larger tolerance on model accuracy is acceptable, a shallower division (with fewer shallow layers) can be chosen.
In an embodiment, the number of preset sparse modes is 5, comprising a block sparse mode, a stripe sparse mode, a dilated (hole) sparse mode, a global sparse mode and a random sparse mode.
This embodiment can use five atomic sparse modes to capture different distribution characteristics in the attention weight matrix, including three diagonal sparse modes, block, stripe (band) and dilated, and two off-diagonal sparse modes, global and random. These five atomic sparse modes are explained below with reference to FIGS. 2-6, where x denotes the average number of positions per row that need to be computed for each sparse mode (an estimate).
Referring to FIG. 2, FIG. 2 shows the block atomic sparse mode. The block atomic sparse mode only requires computing the blocks located on the diagonal of the attention matrix. The figure shows a block atomic sparse mode with x = 3.
Referring to FIG. 3, FIG. 3 shows the stripe atomic sparse mode. The stripe (band) atomic sparse mode computes only the diagonal and several positions beside the diagonal. The figure shows a stripe atomic sparse mode with x = 3.
Referring to FIG. 4, FIG. 4 shows the dilated atomic sparse mode. The dilated atomic sparse mode is similar to the stripe mode, except that in each row there are holes (positions that do not need to be computed) between the positions that do need to be computed. The figure shows a dilated atomic sparse mode with x = 3.
Referring to FIG. 5, FIG. 5 shows the global atomic sparse mode. The global atomic sparse mode computes the attention weights between several elements at the beginning of the sequence and the whole sequence, which corresponds to computing a whole column on the left and a whole row at the top of the attention matrix. The figure shows a global atomic sparse mode with x = 4.
Referring to FIG. 6, FIG. 6 shows the random atomic sparse mode. The random atomic sparse mode randomly selects several positions in each row for computation. The figure shows a random atomic sparse mode with x = 2.
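By way of illustration, the five atomic sparse modes can be generated as 0-1 masks roughly as in the following Python sketch (the block size, band width, dilation gap and other parameters are assumptions chosen to resemble the x values in FIGS. 2-6, not values fixed by the embodiment):

```python
import numpy as np

def block_mask(n, block=3):
    # blocks of size `block` on the diagonal of the attention matrix
    m = np.zeros((n, n), dtype=np.int8)
    for s in range(0, n, block):
        m[s:s+block, s:s+block] = 1
    return m

def band_mask(n, width=1):
    # diagonal plus `width` positions on each side of it
    idx = np.arange(n)
    return (np.abs(idx[:, None] - idx[None, :]) <= width).astype(np.int8)

def dilated_mask(n, width=2, gap=2):
    # like the band mask, but with holes between the computed positions
    d = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    return ((d <= width * gap) & (d % gap == 0)).astype(np.int8)

def global_mask(n, g=2):
    # the first g rows and first g columns are fully computed
    m = np.zeros((n, n), dtype=np.int8)
    m[:g, :] = 1
    m[:, :g] = 1
    return m

def random_mask(n, k=2, seed=0):
    # k randomly chosen positions per row
    rng = np.random.default_rng(seed)
    m = np.zeros((n, n), dtype=np.int8)
    for i in range(n):
        m[i, rng.choice(n, size=k, replace=False)] = 1
    return m
```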
In this embodiment, the mode selector receives the hidden vectors output by the last shallow Transformer layer, outputs a weight for each of the preset sparse modes according to those hidden vectors, and feeds the weights to each deep Transformer layer, so that each deep Transformer layer performs sparse attention calculation based on them. Because the sparse mode weights are determined from the hidden vectors output by the last shallow Transformer layer, which reflect the features of each instance, the sparse mode can be selected adaptively according to the features of different instances.
Further, referring to FIG. 7, in one embodiment, the mode selector comprises:
a downsampling layer 701, configured to reduce the dimension of the hidden vectors output by the shallow Transformer layer to obtain a one-dimensional tensor;
linear layers and a GELU layer between them (702), connected to the downsampling layer and configured to process the one-dimensional tensor to obtain a second tensor;
and a normalization layer 703, connected to the linear layers and the GELU layer between them, configured to normalize the second tensor and output a weight for each of the preset sparse modes.
In practical applications, there may be 2 linear layers with 1 GELU layer between them. Specifically, referring to FIG. 8, the mode selector accepts the hidden vectors of the model as input; the dimension of the input tensor is (sequence length, feature dimension). The mode selector first reduces each hidden vector (i.e., word feature) to one dimension by downsampling, obtaining a tensor of dimension (sequence length), and then passes this tensor through two linear layers with a GELU (Gaussian Error Linear Unit) activation between them to obtain the final output. The first linear layer reduces the dimension of the downsampled tensor to a certain constant (e.g., 25), the second linear layer reduces the dimension of its input tensor to 5, and softmax normalization finally yields the weights of the five atomic sparse modes. That is, the mode selector finally outputs a tensor of dimension (5), whose entries correspond to the weights of the five atomic sparse modes.
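A minimal PyTorch sketch of such a mode selector is given below (the use of mean pooling as the downsampling step and the concrete layer sizes follow the example above but are otherwise assumptions):

```python
import torch
import torch.nn as nn

class ModeSelector(nn.Module):
    """Maps hidden vectors of shape (seq_len, feature_dim) to weights over 5 atomic sparse modes."""
    def __init__(self, seq_len, hidden_dim=25, num_modes=5):
        super().__init__()
        self.fc1 = nn.Linear(seq_len, hidden_dim)    # first linear layer: seq_len -> 25
        self.act = nn.GELU()                         # GELU between the two linear layers
        self.fc2 = nn.Linear(hidden_dim, num_modes)  # second linear layer: 25 -> 5

    def forward(self, hidden):                       # hidden: (seq_len, feature_dim)
        x = hidden.mean(dim=-1)                      # downsample each hidden vector to one value
        x = self.fc2(self.act(self.fc1(x)))          # tensor of dimension (num_modes,)
        return torch.softmax(x, dim=-1)              # weights of the five atomic sparse modes

# usage: sequence length 128, feature dimension 768
selector = ModeSelector(seq_len=128)
weights = selector(torch.randn(128, 768))            # tensor of shape (5,)
```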
Of course, the structure of the mode selector in this embodiment is not limited to the above; it may also be any other small classification network built from stacked linear layers and activation functions, or a simple classification algorithm such as XGBoost. The mode selector in this embodiment is intended to keep the computation and space consumption of training and use lightweight; it is essentially a small classifier.
Further, referring to FIG. 9, in an embodiment, the model further comprises:
a plurality of predictors, each connected to one corresponding deep Transformer layer and configured to receive the intermediate prediction result output by that deep Transformer layer and to obtain the loss between the intermediate prediction result and the label of the input data.
Specifically, referring to FIG. 10, in practice this embodiment can follow the original Transformer model design, where each layer (each Transformer block) contains an attention module followed by a multi-layer perceptron (MLP) module.
A predictor is inserted after each deep Transformer block. The structure of this predictor may be the same as the classification layer inserted at the end of the model when the Transformer model is used to perform classification tasks. In addition, these predictors are only used during the training phase of the mode selector.
This embodiment adaptively selects different sparse modes for different input sequence instances to compute sparse attention by inserting a mode selector into the original Transformer model. Compared with existing methods that use a single, fixed sparse mode, this instance-level adaptive adjustment helps to improve model accuracy; compared with the original Transformer model using the original attention, the accuracy is comparable and even improves on some tasks. The attention sparsification greatly reduces the number of floating point operations of the attention calculation and thus the time consumed by model inference, while sparse matrix storage reduces the storage consumed by the attention matrix and thus the memory footprint of model inference.
In addition, this embodiment can be applied in an inference engine (such as Adlik) to sparsify the model adaptively, automatically selecting a suitable sparse mode according to the features of the input instance, and it can be combined with the acceleration techniques of the underlying hardware to efficiently invoke the sparse operator of the selected mode, realizing inference acceleration.
It should be noted that, in practical applications, the above modules of this embodiment may be implemented by a processor in the device. The division of the apparatus into the above program modules is only an example; in practical applications, the above processing may be distributed to different program modules as needed, that is, the internal structure of the terminal may be divided into different program modules to complete all or part of the above processing.
An embodiment of the invention also provides a sparse attention calculation method, applied to a sparse attention calculation model comprising a plurality of sequentially connected Transformer layers, of which a preset number of leading layers are shallow Transformer layers and the remaining layers are deep Transformer layers; as shown in FIG. 11, the method comprises:
Step 1101: receiving input data;
Step 1102: processing the data sequentially through each shallow Transformer layer in the sparse attention calculation model and outputting hidden vectors;
Step 1103: outputting a weight for each of a plurality of preset sparse modes according to the hidden vectors output by the last shallow Transformer layer;
Step 1104: feeding the weights of the preset sparse modes to each deep Transformer layer, so that each deep Transformer layer in turn performs sparse attention calculation on the output of the previous Transformer layer based on those weights.
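By way of illustration, steps 1101-1104 can be sketched as the following forward pass (the module interfaces, in particular the sparse_weights and sparse_masks arguments of the deep layers, are assumptions introduced only for this sketch):

```python
import torch

def sparse_attention_forward(x, shallow_layers, deep_layers, mode_selector, masks):
    # dense attention in the shallow layers, mode-weighted sparse attention in the deep layers
    h = x
    for layer in shallow_layers:                       # step 1102: shallow Transformer layers
        h = layer(h)
    weights = mode_selector(h)                         # step 1103: weights of the preset sparse modes
    for layer in deep_layers:                          # step 1104: deep layers receive the weights
        h = layer(h, sparse_weights=weights, sparse_masks=masks)
    return h
```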
In particular, this embodiment is applicable to inference scenarios of attention-mechanism models on CPU/GPU hardware, covering a variety of tasks in the computer vision and natural language processing fields.
This embodiment realizes adaptive selection of sparse modes based on instance-level features: a mode selector is inserted after the shallow Transformer layers to accept the hidden vectors output by that layer and to output weights for the different atomic modes (i.e., sparse modes). The higher a weight value, the more suitable the corresponding atomic mode is for the subsequent attention calculation of those hidden vectors, i.e., the better it can capture the distribution characteristics of the subsequent attention scores in the matrix.
In this embodiment, the Transformer layers before the mode selector are defined as shallow layers, and the Transformer layers after the mode selector are defined as deep layers.
In practical applications, the division into shallow and deep Transformer layers can be chosen differently according to the specific task, the model hyperparameters and the requirements on inference time and space. For example, a 4-layer model may take the first layer as shallow and the rest as deep, while a 12-layer model may take the first three layers as shallow and the rest as deep. For a Transformer model with few layers, the first layer may be taken as shallow and the remainder as deep. For a fixed classification task, the division strategy with the highest model accuracy on a specific data set can be selected and used in the actual application scenario. If the optimization requirements on inference time and space are higher and a larger tolerance on model accuracy is acceptable, a shallower division (with fewer shallow layers) can be chosen.
In an embodiment, the number of preset sparse modes is 5, comprising a block sparse mode, a stripe sparse mode, a dilated (hole) sparse mode, a global sparse mode and a random sparse mode.
This embodiment can use five atomic sparse modes to capture different distribution characteristics in the attention weight matrix, including three diagonal sparse modes, block, stripe (band) and dilated, and two off-diagonal sparse modes, global and random.
In this embodiment, the mode selector receives the hidden vectors output by the last shallow Transformer layer, outputs a weight for each of the preset sparse modes according to those hidden vectors, and feeds the weights to each deep Transformer layer, so that each deep Transformer layer performs sparse attention calculation based on them. Because the sparse mode weights are determined from the hidden vectors output by the last shallow Transformer layer, which reflect the features of each instance, the sparse mode can be selected adaptively according to the features of different instances.
Further, as shown in FIG. 12, in an embodiment, outputting a weight for each of the preset sparse modes according to the hidden vectors output by the last shallow Transformer layer comprises:
Step 1201: reducing the dimension of the hidden vectors output by the shallow Transformer layer to obtain a one-dimensional tensor;
Step 1202: processing the one-dimensional tensor with linear layers and a GELU layer between them to obtain a second tensor;
Step 1203: normalizing the second tensor and outputting a weight for each of the preset sparse modes.
In practical applications, there may be 2 linear layers with 1 GELU layer between them. Specifically, referring to FIG. 8, the mode selector accepts the hidden vectors of the model as input; the dimension of the input tensor is (sequence length, feature dimension). The mode selector first reduces each hidden vector (i.e., word feature) to one dimension by downsampling, obtaining a tensor of dimension (sequence length), and then passes this tensor through two linear layers with a GELU (Gaussian Error Linear Unit) activation between them to obtain the final output. The first linear layer reduces the dimension of the downsampled tensor to a certain constant (e.g., 25), the second linear layer reduces the dimension of its input tensor to 5, and softmax normalization finally yields the weights of the five atomic sparse modes. That is, the mode selector finally outputs a tensor of dimension (5), whose entries correspond to the weights of the five atomic sparse modes.
Of course, the structure of the mode selector in this embodiment is not limited to the above; it may also be any other small classification network built from stacked linear layers and activation functions, or a simple classification algorithm such as XGBoost. The mode selector in this embodiment is intended to keep the computation and space consumption of training and use lightweight; it is essentially a small classifier.
Further, as shown in FIG. 13, in an embodiment, the training process of the sparse attention calculation model comprises:
Step 1301: training an initial model and determining the model parameters of the initial model;
Step 1302: fixing the model parameters of the initial model, training the predictors in the model and determining the model parameters of the predictors;
Step 1303: fixing the model parameters of the initial model and the model parameters of the predictors, training the mode selector in the model and determining the model parameters of the mode selector.
Further, as shown in FIG. 14, in an embodiment, training the initial model and determining the model parameters of the initial model comprises:
Step 1401: acquiring input training set data;
Step 1402: passing the training set data through each Transformer layer in sequence to obtain a calculation result;
Step 1403: calculating the loss between the calculation result and the label of the input data;
Step 1404: back-propagating the loss to update the model parameters of the initial model over a number of iterations.
Further, as shown in FIG. 15, in an embodiment, fixing the model parameters of the initial model, training the predictors in the model and determining the model parameters of the predictors comprises:
Step 1501: fixing the model parameters of the initial model and acquiring input training set data;
Step 1502: passing the training set data through each Transformer layer in sequence to obtain the hidden vectors output by each deep Transformer layer, and calculating the loss between the hidden vectors output by each deep Transformer layer and the label of the input data;
Step 1503: back-propagating the loss to update the model parameters of the predictors over a number of iterations.
Further, as shown in FIG. 16, in an embodiment, fixing the model parameters of the initial model and the model parameters of the predictors, training the mode selector in the model and determining the model parameters of the mode selector comprises:
Step 1601: fixing the model parameters of the initial model and the model parameters of the predictors, and acquiring input training set data;
Step 1602: passing the training set data through each Transformer layer in sequence to obtain the hidden vectors output by the last shallow Transformer layer, and outputting a weight for each of the preset sparse modes based on those hidden vectors;
Step 1603: feeding the weights of the preset sparse modes to each deep Transformer layer, so that each deep Transformer layer in turn performs sparse attention calculation on the output of the previous Transformer layer based on those weights;
Step 1604: obtaining the hidden vectors output by each deep Transformer layer and calculating the loss between those hidden vectors and the label of the input data;
Step 1605: back-propagating the loss to update the model parameters of the mode selector over a number of iterations.
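By way of illustration, the three training stages of FIGS. 13-16 can be organized as follows (the optimizer, learning rate, loss closures and iteration counts are assumptions; only the parameter freezing and the staged updates reflect the described procedure):

```python
import torch

def train_stage(modules_to_train, modules_to_freeze, loss_fn, data_loader, steps):
    # freeze the parameters trained in earlier stages, train only the current stage
    for m in modules_to_freeze:
        for p in m.parameters():
            p.requires_grad = False
    params = [p for m in modules_to_train for p in m.parameters()]
    opt = torch.optim.Adam(params, lr=1e-4)
    for _, (x, y) in zip(range(steps), data_loader):
        loss = loss_fn(x, y)   # forward computation and loss for this stage
        opt.zero_grad()
        loss.backward()        # back-propagate the loss
        opt.step()             # update only the trainable parameters

# stage 1: train (or fine-tune) the initial Transformer model
# train_stage([model], [], task_loss, loader, steps=20000)
# stage 2: freeze the model, train the deep-layer predictors
# train_stage(predictors, [model], predictor_stage_loss, loader, steps=5000)
# stage 3: freeze the model and predictors, train the mode selector
# train_stage([mode_selector], [model, *predictors], overall_loss, loader, steps=5000)
```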
Specifically, the above process is described in detail through the following example.
Step (1): input the training set data of the Transformer model (e.g., text data for text classification and the corresponding labels) into a working computer equipped with the Transformer model;
Step (2): input the weight data of the pre-trained Transformer model into the working computer equipped with the Transformer model;
Step (3): fine-tune the pre-trained model for the specific downstream task: input a batch of data into the model, perform the forward computation, compute the loss (the loss function may be, e.g., a cross-entropy loss function) between the output of the classification layer at the end of the model and the correct label of the input data, and back-propagate to update the model parameters. This step is iterated until a stop condition is reached (typically the error falls below a certain value or a certain number of iterations is reached, e.g., 20000 iterations).
If step (2) is omitted, step (3) becomes training a randomly initialized model for the specific task, namely: input a batch of data into the model, perform the forward computation, compute the loss (the loss function may be, e.g., a cross-entropy loss function) between the output of the classification layer at the end of the model and the correct label of the input data, and back-propagate to update the model parameters. This step is iterated until a stop condition is reached (typically the error falls below a certain value or a certain number of iterations is reached, e.g., 20000 iterations).
Step (4): train the newly added predictors, which can be regarded as fine-tuning the overall model for the downstream task. The loss function of the predictor of the i-th layer is defined as L_i(θ) = Σ_{(x,y)∈D} H(f_i(x; θ), y),
where D is the data set used for fine-tuning, θ is the set of model parameters, (x, y) is a feature-label pair of the input sequence, H is the cross-entropy loss function, and f_i is the output of the i-th layer predictor. The specific fine-tuning operations are as follows:
input a batch of data into the model that has been fine-tuned in step (3) and perform the forward computation; the predictor after each deep Transformer block receives the hidden vectors of that layer as input and computes the loss using the loss function L_i defined above (where i runs over all deep layers), after which the predictor parameters are updated. In this step the parameters of the Transformer model fine-tuned in step (3) are fixed and remain unchanged. This step is iterated until a stop condition is reached at which the model accuracy tends to stabilize (e.g., 5000 iterations).
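A minimal sketch of this per-layer predictor loss, and of the overall loss used later as the sum over all deep-layer predictors, is given below under the assumption that H is the cross-entropy loss:

```python
import torch.nn.functional as F

def predictor_loss(predictor, hidden, labels):
    # L_i = H(f_i(x), y): cross-entropy between the i-th predictor output and the labels
    return F.cross_entropy(predictor(hidden), labels)

def overall_loss(predictors, deep_hiddens, labels):
    # overall model loss: sum of the predictor losses over all deep layers
    return sum(predictor_loss(p, h, labels) for p, h in zip(predictors, deep_hiddens))
```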
Step (5): train the newly added mode selector; this can likewise be regarded as fine-tuning the overall model for the downstream task, and the model loss function is the same as in step (4). The specific fine-tuning operations are as follows:
input a batch of data into the model, and perform conventional (dense) attention calculation in the shallow layers of the model;
input the hidden vectors output by the last Transformer block of the shallow layers into the mode selector, which outputs the weights of the five atomic sparse modes;
normalize the weights output by the mode selector using the normalized exponential function (softmax); if the sparse attention corresponding to the i-th atomic sparse mode is Attn_i and its normalized weight is denoted w_i, each deep Transformer block accepts these weights together with the hidden vectors of the previous layer as input and performs the weighted sparse attention calculation Attn = Σ_i w_i · Attn_i (an illustrative sketch of this weighted combination is given after this list of operations); the subsequent computation is performed using this sparse attention instead of the original attention;
input the hidden vectors output by each deep Transformer block into the corresponding predictor to compute the overall loss of the model;
back-propagate the overall loss of the model to update the mode selector parameters; in this step the parameters of the Transformer model fine-tuned in step (3) and of the per-layer predictors trained in step (4) are fixed and remain unchanged.
This step is iterated until a stop condition is reached at which the model accuracy tends to stabilize (e.g., 5000 iterations).
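A minimal sketch of the weighted sparse attention referred to above (a dense masked computation is used here purely for clarity; a real implementation would call sparse operators, and the tensor shapes are assumptions):

```python
import torch

def masked_attention(Q, K, V, mask):
    # compute attention only at the positions where the 0-1 mask equals 1
    scores = Q @ K.transpose(-2, -1)
    scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V

def weighted_sparse_attention(Q, K, V, masks, weights):
    # Attn = sum_i w_i * Attn_i over the five atomic sparse modes
    return sum(w * masked_attention(Q, K, V, m) for w, m in zip(weights, masks))
```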
The above is the training and fine-tuning process of the model; through this process the final parameters of the model can be obtained and the final Transformer model for the classification task can be built.
When the model is used for the classification task, the following steps are carried out:
Step S1: input the test data to be classified (text or images, organized into sequences) into the working computer;
Step S2: input the sequence instance to be classified into the model, and perform conventional (dense) attention calculation in the shallow layers of the model;
Step S3: the mode selector receives the hidden vectors of the last Transformer block of the shallow layers as input and outputs the weights of the five atomic sparse modes;
Step S4: select the atomic sparse mode with the largest weight obtained in step S3; all deep Transformer blocks use this atomic sparse mode for sparse attention calculation. The positions of the attention matrix other than those specified by the sparse mode do not need to be computed, so the computation of the sparse attention matrix reduces the number of floating point operations and reduces the runtime storage consumption of the attention matrix;
Alternatively, step S4 may be replaced with: select both the diagonal atomic sparse mode with the largest weight and the off-diagonal atomic sparse mode with the largest weight obtained in step S3, and combine them to obtain a new sparse mode. The positions to be computed by the new combined sparse mode are the union of the positions to be computed by the atomic sparse modes participating in the combination. All deep Transformer blocks use this combined sparse mode for sparse attention calculation. Again, the positions of the attention matrix other than those specified by the sparse mode do not need to be computed, so the computation of the sparse attention matrix reduces the number of floating point operations and reduces the runtime storage consumption of the attention matrix. An illustrative sketch of both selection variants is given after step S5 below.
Step S5: the model outputs a probability that the input sequence belongs to a certain class and determines the class of the input sequence instance based on this probability.
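By way of illustration, the selection in step S4 and its replacement (combining the best diagonal and best off-diagonal atomic sparse modes) can be sketched as follows (the ordering of the modes in the weight vector and the mask helpers are assumptions carried over from the earlier sketches):

```python
import numpy as np

DIAGONAL = [0, 1, 2]      # assumed indices of the block, stripe and dilated modes
OFF_DIAGONAL = [3, 4]     # assumed indices of the global and random modes

def select_single_mask(weights, masks):
    # step S4: use the atomic sparse mode with the largest weight
    return masks[int(np.argmax(weights))]

def select_combined_mask(weights, masks):
    # alternative step S4: union of the best diagonal and the best off-diagonal modes
    best_diag = max(DIAGONAL, key=lambda i: weights[i])
    best_off = max(OFF_DIAGONAL, key=lambda i: weights[i])
    return np.maximum(masks[best_diag], masks[best_off])
```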
Compared with the prior art, this embodiment has the following key technical features:
(1) A multi-stage training method is designed for the mode selector; the original Transformer model does not need to be retrained when the model is sparsified, and only a certain amount of fine-tuning for the target task is needed to train the lightweight sparsification mode selector.
(2) As the Transformer layers deepen, the distribution characteristics of the attention scores in the matrix become apparent. The distribution characteristics of shallow attention are extracted and used for the sparse calculation of deep attention, realizing accelerated inference and reduced runtime memory footprint without affecting model accuracy.
(3) In the inference stage, for each input sequence instance, the mode selector inserted into the original Transformer model automatically and adaptively selects one atomic sparse mode, or a combination of several atomic sparse modes, to obtain the sparse mode that finally participates in the computation, realizing instance-level adaptive sparsification of attention.
The method of this embodiment greatly reduces the time and space consumed by attention calculation, and the instance-level adaptive selection of atomic modes (and their combinations) can, to a certain extent, avoid the problem that a fixed sparse mode behaves very differently across tasks. Model accuracy is superior to methods that compute attention with a single atomic sparse mode, and is similar to that of the original Transformer using dense attention, with some improvement on certain tasks.
The scheme of this embodiment is characterized as follows. Sparse attention computation is performed with a mode selector that adaptively selects a corresponding sparse mode for each input sequence instance. The Transformer model is divided into a shallow part and a deep part according to the actual application requirements, and shallow attention features are extracted to guide the sparse calculation of deep attention. Attention weights are computed only at the positions specified by the sparse mode, and sparse attention calculation is performed only in the Transformer blocks of the deep part of the model. The training of the newly added predictors and of the mode selector is performed separately, in sequence, after the training of the overall model. The loss L_i defined above is used as the loss function of the i-th layer, and the overall model loss function is the sum of the losses of all deep-layer predictors. The model parameters include the weights and biases of the fully connected layers (including the fully connected layers in the attention module and in the multi-layer perceptron module) and the weights and biases of the regularization layers.
This embodiment designs a multi-stage training method for the mode selector; this training method can be regarded as an additional fine-tuning process. Additional fine-tuning can be performed directly on a pre-trained model, saving the time and resource consumption of pre-training a large model.
The sparse attention calculation model and method, electronic device and storage medium of the invention comprise a plurality of sequentially connected Transformer layers, of which a preset number of leading layers are shallow Transformer layers and the remaining layers are deep Transformer layers; input data are processed by each Transformer layer in sequence and an attention calculation result is output. A mode selector is connected to the last shallow Transformer layer and to each deep Transformer layer; it receives the hidden vectors output by the last shallow Transformer layer, outputs a weight for each of a plurality of preset sparse modes according to those hidden vectors, and feeds these weights to each deep Transformer layer, so that each deep Transformer layer performs sparse attention calculation based on the weights of the preset sparse modes. The scheme provided by the invention reduces the number of floating point operations and the runtime memory footprint while keeping the model accuracy essentially intact.
To implement the method of the embodiments of the present invention, the embodiments of the present invention also provide a computer program product comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the steps of the method described above.
Based on the hardware implementation of the program modules, and in order to implement the method of the embodiments of the present invention, an embodiment of the present invention further provides an electronic device (computer device). Specifically, in one embodiment, the computer device may be a terminal, and its internal structure may be as shown in FIG. 17. The computer device includes a processor A01, a network interface A02, a display screen A04, an input device A05 and a memory (not shown in the figure) connected through a system bus. The processor A01 of the computer device provides computing and control capabilities. The memory of the computer device includes an internal memory A03 and a non-volatile storage medium A06. The non-volatile storage medium A06 stores an operating system B01 and a computer program B02. The internal memory A03 provides an environment for running the operating system B01 and the computer program B02 stored in the non-volatile storage medium A06. The network interface A02 of the computer device is used for communication with an external terminal through a network connection. The computer program B02, when executed by the processor A01, performs the method of any one of the embodiments described above. The display screen A04 of the computer device may be a liquid crystal display or an electronic ink display, and the input device A05 of the computer device may be a touch layer covering the display screen, a key, trackball or touchpad arranged on the casing of the computer device, or an external keyboard, touchpad or mouse.
It will be appreciated by those skilled in the art that the structure shown in FIG. 17 is merely a block diagram of part of the structure related to the present solution and does not limit the computer device to which the present solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The device provided by the embodiment of the application comprises a processor, a memory and a program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the method of any one of the embodiments.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It will be appreciated that the memory of embodiments of the invention may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), ferroelectric random access memory (FRAM), flash memory, magnetic surface memory, an optical disc, or compact disc read-only memory (CD-ROM); the magnetic surface memory may be disk memory or tape memory. The volatile memory may be random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), synchronous static random access memory (SSRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), SyncLink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM). The memory described in the embodiments of the present invention is intended to comprise, without being limited to, these and any other suitable types of memory.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit it. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall be included in the scope of the claims of the present application.

Claims (12)

1. A sparse attention calculation model, the model comprising:
a plurality of sequentially connected Transformer layers, wherein a preset number of front Transformer layers are shallow Transformer layers and the remaining Transformer layers are deep Transformer layers, and each Transformer layer processes the input data in sequence and outputs an attention calculation result; and
a mode selector connected to the last shallow Transformer layer and to each deep Transformer layer, configured to: receive a hidden vector output by the last shallow Transformer layer; output, according to the hidden vector output by the last shallow Transformer layer, weights respectively corresponding to a plurality of preset sparse modes; and input the weights respectively corresponding to the plurality of preset sparse modes into each deep Transformer layer, so that each deep Transformer layer performs sparse attention calculation based on the weights respectively corresponding to the plurality of preset sparse modes.
2. The model of claim 1, wherein the number of preset sparse modes is five, comprising a block sparse mode, a strip (bar) sparse mode, a dilated (hole) sparse mode, a global sparse mode, and a random sparse mode.
3. The model of claim 1, wherein the mode selector comprises:
a downsampling layer, configured to reduce the dimension of the hidden vector output by the last shallow Transformer layer to obtain a one-dimensional tensor;
linear layers and a GELU layer between the linear layers, connected to the downsampling layer and configured to process the one-dimensional tensor to obtain a second tensor; and
a normalization layer, connected to the linear layers and the GELU layer between them, and configured to normalize the second tensor and output weights respectively corresponding to the plurality of preset sparse modes.
4. The model of claim 1, wherein the model further comprises:
a plurality of predictors, wherein each predictor is connected to a corresponding deep Transformer layer and is configured to receive an intermediate prediction result output by the corresponding deep Transformer layer and obtain a loss between the intermediate prediction result and a label of the input data.
5. A sparse attention calculation method, applied to a sparse attention calculation model, wherein the sparse attention calculation model comprises a plurality of sequentially connected Transformer layers, a preset number of front Transformer layers are shallow Transformer layers, and the remaining Transformer layers are deep Transformer layers; the method comprises:
receiving input data;
processing the input data sequentially through each shallow Transformer layer in the sparse attention calculation model, and outputting a hidden vector;
outputting, according to the hidden vector output by the last shallow Transformer layer, weights respectively corresponding to a plurality of preset sparse modes; and
inputting the weights respectively corresponding to the plurality of preset sparse modes into each deep Transformer layer, so that each deep Transformer layer in turn performs sparse attention calculation on the output of the preceding Transformer layer based on the weights respectively corresponding to the plurality of preset sparse modes.
6. The method of claim 5, wherein outputting, according to the hidden vector output by the last shallow Transformer layer, weights respectively corresponding to the plurality of preset sparse modes comprises:
performing dimension reduction on the hidden vector output by the last shallow Transformer layer to obtain a one-dimensional tensor;
processing the one-dimensional tensor by using linear layers and a GELU layer between the linear layers to obtain a second tensor; and
normalizing the second tensor, and outputting the weights respectively corresponding to the plurality of preset sparse modes.
7. The method of claim 5, wherein the training process of the sparse attention calculation model comprises:
training an initial model, and determining model parameters of the initial model;
fixing model parameters of the initial model, training a predictor in the model, and determining the model parameters of the predictor;
and fixing the model parameters of the initial model and the model parameters of the predictor, training a mode selector in the model, and determining model parameters of the mode selector.
8. The method of claim 7, wherein training an initial model and determining model parameters of the initial model comprises:
acquiring input training set data;
passing the training set data sequentially through each Transformer layer to obtain a calculation result;
calculating a loss between the calculation result and a label of the input data; and
updating the model parameters of the initial model by back-propagating the loss over a plurality of iterations.
9. The method of claim 7, wherein fixing the model parameters of the initial model, training the predictors in the model, and determining the model parameters of the predictors comprises:
fixing model parameters of the initial model, and acquiring input training set data;
passing the training set data sequentially through each Transformer layer to obtain the hidden vector output by each deep Transformer layer, and calculating a loss between the hidden vector output by each deep Transformer layer and the label of the input data; and
updating the model parameters of the predictors by back-propagating the loss over a plurality of iterations.
10. The method of claim 7, wherein fixing the model parameters of the initial model and the model parameters of the predictor, training the mode selector in the model, and determining the model parameters of the mode selector comprises:
fixing the model parameters of the initial model and the model parameters of the predictor, and acquiring input training set data;
passing the training set data sequentially through each Transformer layer to obtain the hidden vector output by the last shallow Transformer layer, and outputting, based on the hidden vector, weights respectively corresponding to the plurality of preset sparse modes;
inputting the weights respectively corresponding to the plurality of preset sparse modes into each deep Transformer layer, so that each deep Transformer layer in turn performs sparse attention calculation on the output of the preceding Transformer layer based on the weights respectively corresponding to the plurality of preset sparse modes;
obtaining the hidden vector output by each deep Transformer layer, and calculating a loss between the hidden vector output by each deep Transformer layer and the label of the input data; and
updating the model parameters of the mode selector by back-propagating the loss over a plurality of iterations.
11. An electronic device, comprising: a processor and a memory for storing a computer program capable of running on the processor; wherein,
the processor being adapted to perform the steps of the method of any of claims 5 to 10 when the computer program is run.
12. A storage medium having a computer program stored therein, which, when executed by a processor, implements the steps of the method of any one of claims 5 to 10.
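As an illustration of the mode selector and the preset sparse modes recited in claims 2, 3, 5, and 6 above, the following hedged sketch builds five example masks and combines them with the selector's weights inside a single attention call. The particular mask constructions, the mean pooling used as the downsampling step, and the softmax used as the normalization layer are assumptions chosen for brevity; the claims do not prescribe these exact choices.

```python
# Hedged sketch: example sparse masks, a mode selector, and weighted sparse attention.
import torch
import torch.nn as nn

def preset_sparse_masks(seq_len, block=8, stride=4, dilation=2, n_global=2, p_rand=0.1):
    # Five illustrative boolean masks: block, strip, dilated, global, random.
    idx = torch.arange(seq_len)
    block_mask = (idx.unsqueeze(0) // block) == (idx.unsqueeze(1) // block)
    strip_mask = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() < stride
    dilated_mask = ((idx.unsqueeze(0) - idx.unsqueeze(1)).abs() % dilation) == 0
    global_mask = (idx.unsqueeze(0) < n_global) | (idx.unsqueeze(1) < n_global)
    random_mask = torch.rand(seq_len, seq_len) < p_rand
    return torch.stack([block_mask, strip_mask, dilated_mask,
                        global_mask, random_mask]).float()   # (5, seq, seq)

class ModeSelector(nn.Module):
    # Downsampling, linear/GELU/linear, then normalization into pattern weights.
    def __init__(self, d_model, n_patterns=5):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model // 4), nn.GELU(),
                                 nn.Linear(d_model // 4, n_patterns))

    def forward(self, hidden):                 # hidden: (batch, seq, d_model)
        pooled = hidden.mean(dim=1)            # downsample to one vector per sample
        return torch.softmax(self.mlp(pooled), dim=-1)   # weights over the preset modes

def weighted_sparse_attention(q, k, v, masks, weights):
    # q, k, v: (batch, heads, seq, d_head); masks: (5, seq, seq); weights: (batch, 5)
    combined = torch.einsum('bp,pij->bij', weights, masks)     # soft union of patterns
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(combined.unsqueeze(1) <= 0, float('-inf'))
    return torch.softmax(scores, dim=-1) @ v
```

With, for example, masks = preset_sparse_masks(128) and selector weights of shape (batch, 5), positions not covered by any preset pattern are masked out, while the continuous weights let every preset mode contribute to the combined mask; this is one plausible reading of performing sparse attention "based on the weights", not the only one.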
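The three-stage training procedure of claims 7 to 10 above could be organized along the following lines. This is a hedged sketch: the helpers deep_hidden_states and shallow_hidden, the Adam optimizer, and the shared loss_fn are assumptions introduced purely for illustration and are not specified by the claims.

```python
# Hedged sketch of the staged training described in claims 7-10; helper methods are assumed.
import torch

def train_three_stages(model, predictors, mode_selector, loader, loss_fn, epochs=1):
    # Stage 1: train the initial (dense) model end to end.
    opt = torch.optim.Adam(model.parameters())
    for _ in range(epochs):
        for x, y in loader:
            loss = loss_fn(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: freeze the initial model and train one predictor per deep layer
    # on the hidden vectors those layers emit.
    for p in model.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(predictors.parameters())
    for _ in range(epochs):
        for x, y in loader:
            hiddens = model.deep_hidden_states(x)           # assumed helper: per-deep-layer hiddens
            loss = sum(loss_fn(pred(h), y) for pred, h in zip(predictors, hiddens))
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 3: freeze model and predictors; train only the mode selector, whose
    # weights steer the sparse attention in the deep layers.
    for p in predictors.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(mode_selector.parameters())
    for _ in range(epochs):
        for x, y in loader:
            weights = mode_selector(model.shallow_hidden(x))            # assumed helper
            hiddens = model.deep_hidden_states(x, sparse_weights=weights)
            loss = sum(loss_fn(pred(h), y) for pred, h in zip(predictors, hiddens))
            opt.zero_grad(); loss.backward(); opt.step()
```

Freezing the earlier stages keeps the already-trained parameters fixed while still letting gradients flow through them to the component currently being trained, which matches the "fix, then train" structure of the claims.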
CN202210531111.XA 2022-05-16 2022-05-16 Sparse attention calculation model and method, electronic device and storage medium Pending CN117131901A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210531111.XA CN117131901A (en) 2022-05-16 2022-05-16 Sparse attention calculation model and method, electronic device and storage medium
PCT/CN2023/094288 WO2023221940A1 (en) 2022-05-16 2023-05-15 Sparse attention computation model and method, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210531111.XA CN117131901A (en) 2022-05-16 2022-05-16 Sparse attention calculation model and method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN117131901A true CN117131901A (en) 2023-11-28

Family

ID=88834630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210531111.XA Pending CN117131901A (en) 2022-05-16 2022-05-16 Sparse attention calculation model and method, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN117131901A (en)
WO (1) WO2023221940A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11551028B2 (en) * 2017-04-04 2023-01-10 Hailo Technologies Ltd. Structured weight based sparsity in an artificial neural network
WO2022016257A1 (en) * 2020-07-21 2022-01-27 The Governing Council Of The University Of Toronto System and method for using sparsity to accelerate deep learning networks
US20210110269A1 (en) * 2020-12-21 2021-04-15 Intel Corporation Neural network dense layer sparsification and matrix compression
CN113144624B (en) * 2021-05-17 2022-11-18 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN114298287A (en) * 2022-01-11 2022-04-08 平安科技(深圳)有限公司 Knowledge distillation-based prediction method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2023221940A1 (en) 2023-11-23

Similar Documents

Publication Publication Date Title
JP7109560B2 (en) Conversation state tracking using global-local encoders
Teerapittayanon et al. Branchynet: Fast inference via early exiting from deep neural networks
CN109948149B (en) Text classification method and device
US11675997B2 (en) Device and method for processing convolution operation using kernel
CN112529153B (en) BERT model fine tuning method and device based on convolutional neural network
CN108985335B (en) Integrated learning prediction method for irradiation swelling of nuclear reactor cladding material
CN113837370B (en) Method and apparatus for training a model based on contrast learning
JP6521440B2 (en) Neural network and computer program therefor
CN112580369B (en) Sentence repeating method, method and device for training sentence repeating model
US20240126833A1 (en) Apparatus and method of performing matrix multiplication operation of neural network
CN115146731A (en) Model training method, business wind control method and business wind control device
CN117131901A (en) Sparse attention calculation model and method, electronic device and storage medium
CN111831805A (en) Model creation method and device, electronic equipment and readable storage device
JP7024881B2 (en) Pattern recognition device and pattern recognition method
CN110808036A (en) Incremental voice command word recognition method
CN112784003A (en) Method for training statement repeat model, statement repeat method and device thereof
KR102583943B1 (en) A neural network apparatus and neural network learning method for performing continuous learning using a correlation analysis algorithm between tasks
US20230073835A1 (en) Structured Pruning of Vision Transformer
CN112749799B (en) Hardware accelerator, acceleration method and image classification method of full-frequency-domain convolutional neural network based on self-adaptive ReLU
CN116778300B (en) Knowledge distillation-based small target detection method, system and storage medium
CN110517669B (en) Method and device for predicting pronunciation of words, electronic equipment and storage medium
Żurek et al. Accelerating Deep Convolutional Neural on GPGPU
Giles The majorization minimization principle and some applications in convex optimization
CN115222039A (en) Sparse training method and deep language computing system of pre-training language model
WO2024100564A1 (en) A system and a method for obtaining a processed output image having quality index selectable by an user

Legal Events

Date Code Title Description
PB01 Publication