CN117688287A - Self-attention-based data processing method, device, medium and terminal - Google Patents



Publication number
CN117688287A
Authority
CN
China
Prior art keywords
matrix, floating, binary, query, point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311762425.1A
Other languages
Chinese (zh)
Inventor
祝永新
郑小盈
段皋翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Advanced Research Institute of CAS
Original Assignee
Shanghai Advanced Research Institute of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Advanced Research Institute of CAS filed Critical Shanghai Advanced Research Institute of CAS
Priority to CN202311762425.1A priority Critical patent/CN117688287A/en
Publication of CN117688287A publication Critical patent/CN117688287A/en
Pending legal-status Critical Current

Abstract

The application provides a self-attention-based data processing method, device, medium and terminal. A floating-point query matrix and a floating-point keyword matrix are converted to binary form by a TIF conversion algorithm, and the similarity between each vector in the binary query matrix and each vector in the binary keyword matrix is measured and compared using the Hamming distance. This constitutes a novel attention mechanism based on bit operations, which reduces the need for high-precision computing units while retaining global feature-extraction capability. It lowers the computational complexity and difficulty when a Transformer model processes data, avoids large-scale floating-point operations, improves computing efficiency, reduces energy consumption and maintains computing accuracy, which is crucial for edge devices with limited energy and computing capability.

Description

Self-attention-based data processing method, device, medium and terminal
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method, apparatus, medium, and terminal based on self-attention.
Background
In recent years, deep learning with large models has attracted wide attention and achieved good results on tasks such as image processing and natural language processing. However, the inference and training of large models are very demanding on computational resources, which makes large models expensive to use and unfriendly to devices with limited computing power, such as mobile phones.
Existing deep learning large-model technology is built on the Transformer model, which is widely applied in fields such as image processing and natural language processing. The most important algorithm in the Transformer model is the self-attention algorithm, which compares the input data pairwise, measures their similarity by the magnitude of the inner product, and then re-represents the output data according to that similarity.
Compared with earlier algorithms, the self-attention mechanism is better at capturing long-range dependencies and offers better parallelism. The cost, however, is higher computational complexity: the need to compare the input data pairwise brings quadratic complexity, which presents a number of computational challenges. In particular, deploying self-attention mechanisms on devices with low computational precision is difficult.
Some existing simplifications of the self-attention mechanism focus on reducing the number of comparisons. But while this reduces the computational load, it also weakens the algorithm's ability to capture long-range dependencies. In addition, even with fewer operations, large numbers of floating-point matrix multiplications are essentially unavoidable, and these remain quite expensive. As a result, when the Transformer model processes data such as images or natural language, its computation is relatively complex and difficult, its energy consumption is high, its performance is low, and its accuracy is not guaranteed.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, an object of the present application is to provide a self-attention-based data processing method, device, medium and terminal, which address the technical problem of letting a Transformer model retain its high performance while processing data, reducing computational complexity and difficulty, and avoiding large-scale floating-point operations, so as to improve computing efficiency, reduce energy consumption, reduce reliance on high-precision computing units and maintain computational accuracy.
To achieve the above and other related objects, a first aspect of the present application provides a self-attention-based data processing method, including:
Obtaining data to be processed and calculating to obtain a corresponding floating point number matrix;
inputting the floating-point number matrix into a Transformer model, and performing linear projection processing on the floating-point number matrix through a Transformer block of the Transformer model to obtain a floating-point query matrix, a floating-point keyword matrix and a floating-point assignment matrix corresponding to the floating-point number matrix;
binary conversion is carried out on the floating point type query matrix and the floating point type keyword matrix based on a binary conversion algorithm so as to obtain a corresponding binary query matrix and a binary keyword matrix;
calculating the Hamming distance between the binary query matrix and the sub-matrix at each corresponding position in the binary keyword matrix to obtain a self-attention score matrix;
multiplying the self-attention score matrix by the floating-point assignment matrix to obtain a floating-point self-attention matrix, so that the Transformer model can obtain a processing result of the data to be processed by computing on the floating-point self-attention matrix.
In some embodiments of the first aspect of the present application, performing binary conversion on the floating-point query matrix and the floating-point keyword matrix based on the binary conversion algorithm to obtain the corresponding binary query matrix and binary keyword matrix includes:
Performing linear quantization processing on the floating-point type query matrix and the floating-point type keyword matrix to obtain a corresponding floating-point type quantization query matrix and a floating-point type quantization keyword matrix;
binary conversion is respectively carried out on the floating-point type quantization query matrix and the floating-point type quantization keyword matrix based on a binary conversion algorithm, and a plurality of binary query submatrices and binary keyword submatrices are obtained through calculation;
and calculating to obtain a corresponding binary query matrix and a binary keyword matrix according to the plurality of binary query sub-matrices and the binary keyword sub-matrices.
In some embodiments of the first aspect of the present application, performing linear quantization processing on the floating-point query matrix and the floating-point keyword matrix to obtain a corresponding floating-point quantized query matrix and a corresponding floating-point quantized keyword matrix includes:
Q_f = (Q'_f - min(Q'_f)) / (max(Q'_f) - min(Q'_f));
K_f = (K'_f - min(K'_f)) / (max(K'_f) - min(K'_f));
wherein Q_f represents the floating-point quantized query matrix; K_f represents the floating-point quantized keyword matrix; Q'_f represents the floating-point query matrix; K'_f represents the floating-point keyword matrix; min(Q'_f) and max(Q'_f) represent the minimum and maximum values in the floating-point query matrix Q'_f; min(K'_f) and max(K'_f) represent the minimum and maximum values in the floating-point keyword matrix K'_f.
In some embodiments of the first aspect of the present application, the binary conversion algorithm comprises a TIF conversion algorithm; the calculation process of the TIF conversion algorithm comprises the following steps:
V_i = (V_{i-1} + Q_f)[1 - Θ(V_{i-1} + Q_f - 1)] + (V_{i-1} + Q_f - 1)[Θ(V_{i-1} + Q_f - 1)];
wherein V_i represents the cumulative value in the process of obtaining the i-th binary query sub-matrix Q_b^i; Q_f represents the floating-point quantized query matrix; V_{i-1} represents the cumulative value in the process of obtaining the (i-1)-th binary query sub-matrix Q_b^{i-1}; Θ(x) represents the Heaviside function; Q_b represents the binary query matrix; Q_b^i represents the i-th binary query sub-matrix of Q_b; T represents the number of time steps.
In some embodiments of the first aspect of the present application, the binary conversion algorithm comprises a TIF conversion algorithm; the calculation process of the TIF conversion algorithm comprises the following steps:
V'_i = (V'_{i-1} + K_f)[1 - Θ(V'_{i-1} + K_f - 1)] + (V'_{i-1} + K_f - 1)[Θ(V'_{i-1} + K_f - 1)];
wherein V'_i represents the cumulative value in the process of obtaining the i-th binary keyword sub-matrix K_b^i; K_f represents the floating-point quantized keyword matrix; V'_{i-1} represents the cumulative value in the process of obtaining the (i-1)-th binary keyword sub-matrix K_b^{i-1}; Θ(x) represents the Heaviside function; K_b represents the binary keyword matrix; K_b^i represents the i-th binary keyword sub-matrix of K_b; T represents the number of time steps.
In some embodiments of the first aspect of the present application, calculating the hamming distance between the binary query matrix and the sub-matrix at each corresponding position in the binary keyword matrix to obtain the self-attention score matrix includes:
wherein the element A_i(m, n) is computed from the Hamming distance Ham(Q_b^{i,m}, K_b^{i,n}), and the self-attention score matrix A is obtained by accumulating the matrices A_i over the T time steps; A_i represents the i-th self-attention matrix; A_i(m, n) represents the element in the m-th row and n-th column of the i-th self-attention matrix A_i; Q_b^{i,m} represents the m-th row vector in the i-th binary query sub-matrix Q_b^i; K_b^{i,n} represents the n-th row vector in the i-th binary keyword sub-matrix K_b^i; Ham(Q_b^{i,m}, K_b^{i,n}) represents the Hamming distance between them; A represents the self-attention score matrix; T represents the number of time steps.
To achieve the above and other related objects, a second aspect of the present application provides a self-attention-based data processing apparatus, comprising:
the data preprocessing module is used for acquiring data to be processed and calculating to obtain a corresponding floating point number matrix;
the floating-point matrix acquisition module is used for inputting the floating-point number matrix into a Transformer model, and performing linear projection processing on the floating-point number matrix through a Transformer block of the Transformer model to obtain a floating-point query matrix, a floating-point keyword matrix and a floating-point assignment matrix corresponding to the floating-point number matrix;
The binary matrix conversion module is used for binary conversion of the floating-point type query matrix and the floating-point type keyword matrix based on a binary conversion algorithm so as to obtain a corresponding binary query matrix and a binary keyword matrix;
the self-attention score matrix calculation module is used for calculating the Hamming distance between the binary query matrix and the submatrices at each corresponding position in the binary keyword matrix to obtain a self-attention score matrix;
and the data processing output module is used for multiplying the self-attention score matrix by the floating-point assignment matrix to obtain a floating-point self-attention matrix, so that the Transformer model can obtain a processing result of the data to be processed by computing on the floating-point self-attention matrix.
In some embodiments of the second aspect of the present application, the binary matrix conversion module is further configured to perform the following steps:
performing linear quantization processing on the floating-point type query matrix and the floating-point type keyword matrix to obtain a corresponding floating-point type quantization query matrix and a floating-point type quantization keyword matrix;
binary conversion is respectively carried out on the floating-point type quantization query matrix and the floating-point type quantization keyword matrix based on a binary conversion algorithm, and a plurality of binary query submatrices and binary keyword submatrices are obtained through calculation;
And calculating to obtain a corresponding binary query matrix and a binary keyword matrix according to the plurality of binary query sub-matrices and the binary keyword sub-matrices.
To achieve the above and other related objects, a third aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the self-attention-based data processing method as described above.
To achieve the above and other related objects, a fourth aspect of the present application provides an electronic terminal, including: a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the computer program stored in the memory to cause the terminal to perform the self-attention based data processing method as described above.
As described above, the self-attention-based data processing method, device, medium and terminal of the present application have the following beneficial effects: binary conversion is performed on the floating-point query matrix and the floating-point keyword matrix through the TIF conversion algorithm, and the similarity between each vector in the binary query matrix and each vector in the binary keyword matrix is measured and compared using the Hamming distance, forming a novel attention mechanism based on bit operations. In addition, the binary conversion process based on the TIF conversion algorithm represents each input floating-point number as a combination of T binary values, so the whole calculation process stays within a quantifiable error range. Performing attention operations on low-precision devices thus trades a small but quantifiable amount of precision for significant benefits in reducing power consumption while improving computing efficiency.
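The quantifiable-error claim can be checked numerically. Assuming a reset-by-subtraction integrate-and-fire accumulator consistent with Formula (6) of the description (an assumption of this sketch, since the sub-matrix definition is not fully reproduced in the text), the mean of the T binary sub-matrices approximates each quantized value with absolute error below 1/T:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 16
M_f = rng.random((4, 4))                  # quantized values in [0, 1)

V = M_f.copy()                            # V_0 = M_f
total = np.zeros_like(M_f)                # spike count per entry
for _ in range(T):
    V = V + M_f
    spike = (V >= 1.0).astype(np.float64) # Heaviside Θ(V - 1)
    total += spike
    V -= spike                            # reset by subtraction

# mean of the T binary matrices vs. the original quantized values
err = np.abs(total / T - M_f).max()
assert err <= 1.0 / T + 1e-9              # error bounded by one spike over T steps
```

The bound follows from conservation: after T steps the residual accumulator lies in [0, 1), so the spike count can differ from (T+1)·M_f by less than 1, giving a per-entry error under 1/T; larger T tightens the approximation at the cost of more time steps.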
Drawings
FIG. 1A is a flow chart of a self-attention based data processing method according to an embodiment of the present application.
FIG. 1B is a flow chart illustrating a binary conversion process according to an embodiment of the present application.
FIG. 2A is a diagram of a novel attention mechanism architecture based on bit manipulation in one embodiment of the present application.
FIG. 2B is a table comparing text classification performance with other models according to an embodiment of the present application.
FIG. 2C is a table comparing performance on an image classification task with other models according to an embodiment of the present application.
FIG. 2D is a table comparing performance with other models under a hardware configuration according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of a self-attention-based data processing device according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of an electronic terminal according to an embodiment of the present application.
Detailed Description
Other advantages and effects of the present application will become apparent to those skilled in the art from the present disclosure, when the following description of the embodiments is taken in conjunction with the accompanying drawings. The present application may be embodied or carried out in other specific embodiments, and the details of the present application may be modified or changed from various points of view and applications without departing from the spirit of the present application. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict.
It is noted that in the following description, reference is made to the accompanying drawings, which describe several embodiments of the present application. It is to be understood that other embodiments may be utilized and that mechanical, structural, electrical, and operational changes may be made without departing from the spirit and scope of the present application. The following detailed description is not to be taken in a limiting sense, and the scope of embodiments of the present application is defined only by the claims of the issued patent. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. Spatially relative terms, such as "upper," "lower," "left," and "right," may be used herein to facilitate the description of one element or feature as illustrated in the figures relative to another element or feature.
Furthermore, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including" specify the presence of stated features, operations, elements, components, items, categories, and/or groups, but do not preclude the presence or addition of one or more other features, operations, elements, components, items, categories, and/or groups. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C." An exception to this definition occurs only when a combination of elements, functions or operations is in some way inherently mutually exclusive.
The Transformer model is widely applied in many fields such as image processing and natural language processing, and the most important algorithm in the Transformer model is the self-attention algorithm, which compares the input data pairwise, measures their similarity by the magnitude of the inner product, and then re-represents the output data according to that similarity. Compared with earlier algorithms, the self-attention mechanism is better at capturing long-range dependencies and offers better parallelism. However, the pairwise comparison of input data brings quadratic complexity, so the algorithm has higher computational complexity than earlier algorithms. In particular, deploying self-attention mechanisms on devices with low computational precision is difficult.
Some existing simplifications of the self-attention mechanism focus on reducing the number of comparisons. But while this reduces the computational load, it also weakens the algorithm's ability to capture long-range dependencies. In addition, even with fewer operations, large numbers of floating-point matrix multiplications are essentially unavoidable, and these remain quite expensive. As a result, when the Transformer model processes data such as images or natural language, its computation is relatively complex and difficult, its energy consumption is high, its performance is low, and its accuracy is not guaranteed.
In order to solve the problems in the background art, the invention provides a self-attention-based data processing method, device, medium and terminal, which aim to solve the technical problems of reducing computational complexity and difficulty, avoiding large-scale floating-point operations, improving computing efficiency, reducing energy consumption, reducing reliance on high-precision computing units and maintaining computational accuracy while the Transformer model retains its high performance during data processing.
Before explaining the present invention in further detail, terms and terminology involved in the embodiments of the present invention will be explained, and the terms and terminology involved in the embodiments of the present invention are applicable to the following explanation:
<1> Transformer model: a neural network model based on the self-attention mechanism for processing sequence data.
<2> BiLSTM (Bidirectional Long Short-Term Memory): a bidirectional long short-term memory network, a variant of the long short-term memory network (LSTM). BiLSTM processes forward and reverse sequence data simultaneously to better capture the context information in the sequence.
<3> IMDb (Internet Movie Database): an online database of movie actors, movies, television shows, television stars, video games, and movie production.
<4> CIFAR: an image classification dataset consisting of the two subsets CIFAR-10 and CIFAR-100. CIFAR-10 contains 10 different classes of images (airplanes, motor vehicles, birds, etc.), each class having 6000 color images of 32x32 pixels. CIFAR-100 contains 100 different fine-grained categories, each with 600 images. The images of the CIFAR dataset come from everyday objects and animals in the real world, with high complexity and diversity.
<5> ViT (Vision Transformer): a deep learning model for image classification that uses an architecture similar to the Transformer in natural language processing (NLP), decomposing an image into patches of fixed size and then treating these patches as sequence data for the Transformer.
<6> Hybrid Training: a hybrid training method that can be customized and optimized according to specific tasks and data characteristics.
<7> STBP-tdBN: a deep learning method capable of training very deep SNNs (Spiking Neural Networks) directly and implementing inference efficiently on the corresponding hardware.
<8> Spikformer: a structure in which a spiking neural network model is applied to a Transformer network.
<10> FPGA (Field Programmable Gate Array): a programmable logic circuit consisting of a large number of programmable logic units, memory cells and interconnect resources.
<11> BRAM (Block RAM): a storage resource in an FPGA, mainly used for storing user data and implementing functions such as look-up tables (LUT), data storage and operation (RAM), and buffers (FIFO).
<12> LUT (Look-Up Table): a memory structure used to implement a particular logical function or operation. In FPGAs, LUTs are typically used to implement combinational or sequential logic functions.
<13> FF (Flip-Flop): a register element in an FPGA, commonly reported as a resource alongside LUT, BRAM and DSP usage.
<14> DSP (Digital Signal Processor): a processing device dedicated to processing digital signals.
Meanwhile, in order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be further described in detail by the following examples with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As shown in fig. 1A, a flow chart of a self-attention-based data processing method in an embodiment of the present invention is shown. The self-attention-based data processing method in this embodiment mainly includes the following steps:
s101: and obtaining data to be processed and calculating to obtain a corresponding floating point number matrix.
In this embodiment, the data to be processed includes one or a combination of more of image data, video data, text data, and natural language data.
In this embodiment, encoding processing is performed on the obtained data to be processed, so as to calculate and obtain a floating point number matrix X after encoding the data to be processed.
S102: the floating-point number matrix is input into a Transformer model, and linear projection processing is performed on the floating-point number matrix through a Transformer block of the Transformer model to obtain a floating-point query matrix, a floating-point keyword matrix and a floating-point assignment matrix corresponding to the floating-point number matrix.
In this embodiment, the Transformer model is a neural network model based on the self-attention mechanism for processing sequence data. Compared with the traditional recurrent neural network model, the Transformer model has better parallel performance and shorter training time, and is therefore widely applied in the field of natural language processing. As a neural network model based on the self-attention mechanism, the Transformer model is able to model each element in a sequence globally and build links between the elements. In the Transformer model, techniques such as residual connections and layer normalization are also used to accelerate model convergence and improve model performance. When a task such as image classification or text generation is processed by a Transformer-based deep learning model, the model input is the picture or text to be processed, and the output differs according to the task target: it can be translated text, a picture category, and so on. The Transformer model is made up of a number of identical Transformer blocks containing a self-attention layer. The invention modifies the self-attention layer of the original algorithm.
In this embodiment, the floating-point number matrix is input into the Transformer model, and the floating-point query matrix Q'_f, the floating-point keyword matrix K'_f and the floating-point assignment matrix V_f required in the self-attention calculation are obtained through linear projection, the subscript f indicating that the matrix is a floating-point number matrix. The calculation formulas of the floating-point query matrix Q'_f, the floating-point keyword matrix K'_f and the floating-point assignment matrix V_f are as follows:
Q'_f = X W_q;  Formula (1)
K'_f = X W_k;  Formula (2)
V_f = X W_v;  Formula (3)
wherein Q'_f represents the floating-point query matrix; K'_f represents the floating-point keyword matrix; V_f represents the floating-point assignment matrix; X represents the floating-point number matrix; W_q represents the linear projection matrix corresponding to the floating-point query matrix Q'_f; W_k represents the linear projection matrix corresponding to the floating-point keyword matrix K'_f; W_v represents the linear projection matrix corresponding to the floating-point assignment matrix V_f.
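A minimal sketch of the three linear projections follows; the matrix sizes (n tokens, d_model features) and the random weights are illustrative assumptions, not values from the patent:

```python
import numpy as np

n, d_model = 4, 8
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d_model))          # encoded floating-point matrix X

# Learned projection weights (random stand-ins here)
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

Q_f_prime = X @ W_q   # floating-point query matrix   Q'_f = X W_q
K_f_prime = X @ W_k   # floating-point keyword matrix K'_f = X W_k
V_f = X @ W_v         # floating-point assignment matrix V_f = X W_v
```

Each row of X is projected independently, so all three products keep the (n, d_model) shape of the input.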
S103: and binary conversion is carried out on the floating-point type query matrix and the floating-point type keyword matrix based on a binary conversion algorithm so as to obtain a corresponding binary query matrix and a binary keyword matrix.
In this embodiment, as shown in fig. 1B, which is a flow chart of the binary conversion process in an embodiment of the invention, performing binary conversion on the floating-point query matrix and the floating-point keyword matrix based on the binary conversion algorithm to obtain the corresponding binary query matrix and binary keyword matrix comprises the following steps:
S1031: and carrying out linear quantization processing on the floating-point type query matrix and the floating-point type keyword matrix to obtain a corresponding floating-point type quantization query matrix and a floating-point type quantization keyword matrix.
In this embodiment, linear quantization is performed on the floating-point query matrix Q'_f and the floating-point keyword matrix K'_f to ensure that all values are positive and lie in the range [0, 1]. This preserves the distinguishability of the data in the calculation and keeps the accuracy loss of the Transformer model within an acceptable range, so that the Transformer model can guarantee the accuracy of the processing result while improving its inference speed when processing the data to be processed.
In this embodiment, the calculation process of the floating-point quantized query matrix Q_f and the floating-point quantized keyword matrix K_f comprises:
Q_f = (Q'_f - min(Q'_f)) / (max(Q'_f) - min(Q'_f));  Formula (4)
K_f = (K'_f - min(K'_f)) / (max(K'_f) - min(K'_f));  Formula (5)
wherein Q_f represents the floating-point quantized query matrix; K_f represents the floating-point quantized keyword matrix; Q'_f represents the floating-point query matrix; K'_f represents the floating-point keyword matrix; min(Q'_f) and max(Q'_f) represent the minimum and maximum values in the floating-point query matrix Q'_f; min(K'_f) and max(K'_f) represent the minimum and maximum values in the floating-point keyword matrix K'_f.
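A minimal sketch of this min-max linear quantization, assuming plain NumPy arrays (the helper name and test matrix are illustrative):

```python
import numpy as np

def linear_quantize(M: np.ndarray) -> np.ndarray:
    """Map all values linearly into [0, 1] via (M - min) / (max - min)."""
    return (M - M.min()) / (M.max() - M.min())

M = np.array([[1.0, 3.0],
              [5.0, 2.0]])
Q_f_demo = linear_quantize(M)   # values now span exactly [0, 1]
```

After quantization the minimum entry maps to 0 and the maximum to 1, so every value is positive and bounded, as the description requires before the TIF conversion.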
S1032: and respectively carrying out binary conversion on the floating-point type quantization query matrix and the floating-point type quantization keyword matrix based on a binary conversion algorithm, and calculating to obtain a plurality of binary query submatrices and binary keyword submatrices.
In this embodiment, the binary conversion algorithm includes a TIF conversion algorithm, and based on the TIF conversion algorithm, binary conversion is performed on the floating-point type quantized query matrix and the floating-point type quantized keyword matrix, so that T binary query sub-matrices and binary keyword sub-matrices can be calculated.
S1033: and calculating to obtain a corresponding binary query matrix and a binary keyword matrix according to the plurality of binary query sub-matrices and the binary keyword sub-matrices.
In this embodiment, the binary query matrix Q calculated based on the TIF conversion algorithm b Comprising T binary query sub-matricesTIF-based conversion calculationBinary keyword matrix K obtained by calculation b Comprising T binary keyword submatrices +.>Where the subscript b represents that the matrix is a binary type of data. The binary query matrix Q is described below b Binary keyword matrix K b The calculation process of (2) is described in detail.
In this embodiment, let V_i be the cumulative value in the whole process of obtaining the i-th binary query sub-matrix Q_b^i, with V_0 = Q_f; then the process of calculating the binary query matrix Q_b based on the TIF conversion algorithm comprises:

V_i = (V_{i-1} + Q_f)[1 - Θ(V_{i-1} + Q_f - 1)] + (V_{i-1} + Q_f - 1)[Θ(V_{i-1} + Q_f - 1)]; formula (6)

wherein V_i represents the cumulative value in the whole process of obtaining the i-th binary query sub-matrix Q_b^i; Q_f represents the floating-point quantized query matrix; V_{i-1} represents the cumulative value in the whole process of obtaining the (i-1)-th binary query sub-matrix Q_b^{i-1}; Θ(x) represents the Heaviside function; Q_b represents the binary query matrix; Q_b^i represents the i-th binary query sub-matrix of Q_b; T represents the time step.
In this embodiment, let V'_i be the cumulative value in the whole process of obtaining the i-th binary keyword sub-matrix K_b^i, with V'_0 = K_f; then the process of calculating the binary keyword matrix K_b based on the TIF conversion algorithm comprises:

V'_i = (V'_{i-1} + K_f)[1 - Θ(V'_{i-1} + K_f - 1)] + (V'_{i-1} + K_f - 1)[Θ(V'_{i-1} + K_f - 1)]; formula (10)

wherein V'_i represents the cumulative value in the whole process of obtaining the i-th binary keyword sub-matrix K_b^i; K_f represents the floating-point quantized keyword matrix; V'_{i-1} represents the cumulative value in the whole process of obtaining the (i-1)-th binary keyword sub-matrix K_b^{i-1}; Θ(x) represents the Heaviside function; K_b represents the binary keyword matrix; K_b^i represents the i-th binary keyword sub-matrix of K_b; T represents the time step.
In this embodiment, the Heaviside function is a step function that jumps abruptly at a specific location; it depends on one variable and takes the two values 0 and 1, being 0 when the variable is less than zero and 1 when the variable is greater than or equal to zero.
In this embodiment, the binary query matrix Q_b output based on the TIF conversion algorithm can be expressed as a function of the floating-point quantized query matrix Q_f and the time step T, as shown in formula (9). The binary keyword matrix K_b output based on the TIF conversion algorithm can be expressed as a function of the floating-point quantized keyword matrix K_f and the time step T, as shown in formula (13); the time step T can be chosen according to the actual use requirement.
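The TIF conversion of formulas (6) and (10) can be sketched as an integrate-and-fire loop: the quantized input is added to an accumulator each time step, and wherever the accumulator reaches the threshold 1, a binary 1 is emitted and 1 is subtracted. Formula (9), which defines the spike output, appears only as an image here, so the rule Q_b^i = Θ(V_{i-1} + Q_f - 1) used below is our reading of the surrounding description; all names are illustrative:

```python
import numpy as np

def heaviside(x):
    # Θ(x): 1 where x >= 0, 0 where x < 0 (elementwise).
    return (x >= 0).astype(float)

def tif_convert(Qf, T):
    # V_0 = Q_f; each step V_i = V_{i-1} + Q_f, minus 1 wherever a
    # spike fires -- equivalent to formula (6) written per branch.
    V = Qf.copy()
    subs = []
    for _ in range(T):
        fired = heaviside(V + Qf - 1.0)  # assumed reading of formula (9)
        subs.append(fired)
        V = V + Qf - fired
    return subs                          # T binary sub-matrices

Qf = np.array([[0.5, 0.25]])
subs = tif_convert(Qf, T=4)
approx = sum(subs) / 4   # averaging the T sub-matrices recovers Q_f
```

Averaging the T binary sub-matrices reconstructs Q_f to within roughly 1/T, which is the quantifiable error range discussed above; a larger T trades more time steps for higher fidelity.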
It is worth noting that performing binary conversion on the floating-point query matrix and the floating-point keyword matrix based on the TIF conversion algorithm reduces the amount of computation and the demand for high-precision computing units, so that the Transformer model enjoys high-performance expression while processing data; at the same time, it reduces computational complexity and difficulty, avoids large-scale floating-point operations, improves computing efficiency, reduces energy consumption, and improves computing accuracy.
S104: and calculating the Hamming distance between the binary query matrix and the submatrices at each corresponding position in the binary keyword matrix to obtain a self-attention score matrix.
In this embodiment, the Hamming distance, which is used in error-control coding for data transmission, represents the number of positions at which the corresponding characters of two equal-length strings differ. The present invention uses the Hamming distance to measure the similarity between each vector in the binary query matrix Q_b and each vector in the binary keyword matrix K_b: two vectors a and b are compared bit by bit, and the number of positions at which the characters differ is their Hamming distance. The calculation formula of the Hamming distance is as follows:

D(a, b) = Σ_i XOR(a_i, b_i); formula (14)

wherein a represents a vector of the binary query matrix Q_b; b represents a vector of the binary keyword matrix K_b; a_i represents the character of vector a at position i; b_i represents the character of vector b at the corresponding position; XOR(a_i, b_i) is 1 when a_i and b_i differ and 0 otherwise.
In this embodiment, the calculating the hamming distance between the binary query matrix and the sub-matrix at each same position in the binary keyword matrix to obtain the self-attention score matrix includes:
A_i(m, n) = D(Q_b^i(m), K_b^i(n)); formula (15)

wherein A_i represents the i-th self-attention matrix; A_i(m, n) represents the element in the m-th row and n-th column of the i-th self-attention matrix A_i; Q_b^i(m) represents the m-th row vector of the i-th binary query sub-matrix Q_b^i; K_b^i(n) represents the n-th row vector of the i-th binary keyword sub-matrix K_b^i; D(Q_b^i(m), K_b^i(n)) represents the Hamming distance between them; A represents the self-attention score matrix; T represents the time step.
In this embodiment, the Hamming distance between the i-th binary query sub-matrix Q_b^i and the i-th binary keyword sub-matrix K_b^i is calculated to obtain the i-th self-attention matrix A_i; the element A_i(m, n) in the m-th row and n-th column of A_i is obtained from the m-th row vector of Q_b^i and the n-th row vector of K_b^i, as shown in formula (15). The larger the Hamming distance, the smaller the similarity.
In this embodiment, the binary query matrices Q_b and binary keyword matrices K_b of all T time steps are combined to calculate the final self-attention score matrix A, as shown in formula (17); 1 is added to the denominator to prevent the extreme case A_i = 0 from affecting the calculation result.
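Formulas (15)-(17) appear only as images here, so the sketch below is a hedged reading of the description: the per-step score A_i(m, n) is the Hamming distance between row m of Q_b^i and row n of K_b^i, and the final score inverts each distance with a +1 in the denominator (guarding against A_i = 0) and averages over the T time steps. The function name and the exact combination rule are assumptions:

```python
import numpy as np

def attention_scores(Qb_subs, Kb_subs):
    # Qb_subs, Kb_subs: lists of T binary sub-matrices of equal width.
    T = len(Qb_subs)
    A = np.zeros((Qb_subs[0].shape[0], Kb_subs[0].shape[0]))
    for Qb, Kb in zip(Qb_subs, Kb_subs):
        # Hamming distance between every row pair, via broadcast XOR.
        Ai = (Qb[:, None, :] != Kb[None, :, :]).sum(axis=2)
        A += 1.0 / (1.0 + Ai)   # +1 prevents division by zero when Ai = 0
    return A / T

Qb = [np.array([[1, 0], [1, 1]])]
Kb = [np.array([[1, 0], [0, 1]])]
A = attention_scores(Qb, Kb)    # identical rows score highest
```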
It should be noted that the existing standard self-attention mechanism compares the similarity of each vector in the floating-point query matrix with each vector in the floating-point keyword matrix through floating-point inner products. The present invention instead uses the Hamming distance, computed with bitwise XOR operations, to measure and compare the similarity between each vector in the binary query matrix Q_b and each vector in the binary keyword matrix K_b; this avoids complex floating-point multiplications and additions, improving computing efficiency and reducing energy consumption, which is critical for edge devices with limited energy resources and computing capacity.
Further, the binary conversion process based on the TIF conversion algorithm of the invention represents the input floating-point number as a combination of T binary values, so that the whole calculation process stays within a quantifiable error range. Performing attention operations on low-precision devices thus trades a small but quantifiable amount of precision for significant benefits in reduced power consumption during computation and improved computing efficiency.
S105: Multiplying the self-attention score matrix by the floating-point assignment matrix to obtain a floating-point self-attention matrix, so that the Transformer model can obtain the processing result of the data to be processed by calculating the floating-point self-attention matrix.
In this embodiment, the calculation process of the floating-point self-attention matrix comprises:

Y_f = A V_f; formula (18)

wherein V_f represents the floating-point assignment matrix; A represents the self-attention score matrix; Y_f represents the floating-point self-attention matrix.
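Formula (18) is an ordinary matrix product; a minimal sketch with toy values:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.5, 0.5]])    # self-attention score matrix
V_f = np.array([[2.0, 4.0],
                [6.0, 8.0]])  # floating-point assignment matrix
Y_f = A @ V_f                 # formula (18): Y_f = A V_f
```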
In this embodiment, the floating-point self-attention matrix is a floating-point matrix, so the Transformer model obtains the processing result of the data to be processed by calculating the floating-point self-attention matrix.
It is worth noting that the self-attention-based data processing method is a novel attention mechanism based on bit operations; it is suitable for binary data and becomes the core operation of the Transformer model. Binary conversion is performed on the floating-point query matrix and the floating-point keyword matrix through the TIF conversion algorithm, and bit operations reduce the amount of computation while the global feature extraction capability is maintained, thereby reducing the demand for high-precision computing units. This ensures that the Transformer model enjoys high-performance expression while processing data, reduces computational complexity and difficulty, avoids large-scale floating-point operations, improves computing efficiency, reduces energy consumption, and improves computing accuracy, which is crucial for edge devices with limited energy resources and computing capability. In addition, the binary conversion process based on the TIF conversion algorithm of the invention represents the input floating-point number as a combination of T binary values, so that the whole calculation process stays within a quantifiable error range. Performing attention operations on low-precision devices thus trades a small but quantifiable amount of precision for significant benefits in reduced power consumption during computation and improved computing efficiency.
As shown in fig. 2A, a diagram of the novel attention mechanism architecture based on bit operations in an embodiment of the present invention is shown. The floating-point number matrix corresponding to an image is input and linearly processed; binary conversion is then performed based on the TIF conversion algorithm, the self-attention score matrix is calculated through the Hamming distance, and the self-attention score matrix is multiplied by the floating-point assignment matrix to obtain the floating-point self-attention matrix, from which the Transformer model calculates the processing result of the input image.
As shown in fig. 2B, a table comparing the text classification performance of an embodiment of the present invention with other models is shown. BitFormer denotes the method of the invention, a novel attention mechanism based on bit operations, whose performance is compared against the original Transformer and BiLSTM. According to the experimental results, the invention achieves better performance than the two common baseline models in 8 of the 15 tasks, such as sports, lottery, politics, constellation, society, technology, stock and finance. In fact, the invention achieves a significant 1.2-percentage-point improvement over the standard Transformer in the final average result.
As shown in fig. 2C, a table comparing the image classification performance of an embodiment of the present invention with other models is shown. The input data are the CIFAR datasets; the invention achieves 95.88% top-1 accuracy on CIFAR10, the highest among the SNN-based models. On CIFAR100, the accuracy of the invention reaches 80.13%, at least 8 percentage points higher than the other SNN-based models and only 0.89% lower than ViT, showing its ability to replace the Transformer.
As shown in fig. 2D, a table comparing the hardware performance of an embodiment of the present invention with other models is shown. The novel attention mechanism based on bit operations shows a significant 14.5% latency reduction, indicating faster processing, despite consuming 3.4% more BRAM resources than the standard Transformer. In addition, in terms of DSP, FF and LUT utilization, DSP resource consumption is reduced by 70.4%, and FF and LUT utilization efficiency is improved by 40.9% and 36.4%, respectively.
It is worth noting that, compared with the prior art, the self-attention-based data processing method, a novel attention mechanism based on bit operations, significantly reduces computational complexity and difficulty, avoids a large number of floating-point multiplications while ensuring high model performance, improves computing efficiency, reduces energy consumption, and improves computing accuracy, which is of great significance for edge devices with limited energy resources and computing capacity.
As shown in fig. 3, a schematic structural diagram of a self-attention-based data processing device in an embodiment of the present invention is shown. The apparatus 300 mainly comprises: the functions of the data preprocessing module 301, the floating-point matrix acquiring module 302, the binary matrix converting module 303, the self-attention score matrix calculating module 304, and the data processing output module 305 are described in detail below:
the data preprocessing module 301 is configured to obtain data to be processed and calculate a corresponding floating point number matrix.
In this embodiment, the data to be processed includes one or a combination of more of image data, video data, text data, and natural language data.
In this embodiment, encoding processing is performed on the obtained data to be processed, so as to calculate and obtain a floating point number matrix X after encoding the data to be processed.
The floating-point matrix obtaining module 302 is configured to input the floating-point number matrix into a Transformer model, and perform linear projection processing on the floating-point number matrix through a Transformer block of the Transformer model, so as to obtain a floating-point query matrix, a floating-point keyword matrix, and a floating-point assignment matrix corresponding to the floating-point number matrix.
In this embodiment, the floating-point number matrix is input into the Transformer model, and linear projection yields the floating-point query matrix Q'_f, the floating-point keyword matrix K'_f and the floating-point assignment matrix V_f required in the self-attention calculation; the subscript f indicates that the matrix is a floating-point matrix. The calculation formulas of Q'_f, K'_f and V_f are given in the foregoing formulas (1), (2) and (3) and are not repeated here.
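The linear projection step can be sketched as three matrix products; the weight matrices W_Q, W_K and W_V stand in for the learned parameters of formulas (1)-(3), which are not reproduced here, and every shape is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # encoded floating-point matrix (4 tokens, dim 8)
# Placeholder projection weights (assumptions, not the patent's values).
W_Q = rng.standard_normal((8, 8))
W_K = rng.standard_normal((8, 8))
W_V = rng.standard_normal((8, 8))
Q_prime_f = X @ W_Q   # floating-point query matrix Q'_f
K_prime_f = X @ W_K   # floating-point keyword matrix K'_f
V_f = X @ W_V         # floating-point assignment matrix V_f
```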
The binary matrix conversion module 303 is configured to perform binary conversion on the floating-point query matrix and the floating-point keyword matrix based on a binary conversion algorithm, so as to obtain a corresponding binary query matrix and a binary keyword matrix.
In this embodiment, the binary matrix conversion module 303 is further configured to perform the following steps:
(1) And carrying out linear quantization processing on the floating-point type query matrix and the floating-point type keyword matrix to obtain a corresponding floating-point type quantization query matrix and a floating-point type quantization keyword matrix.
In this embodiment, the floating-point query matrix Q'_f and the floating-point keyword matrix K'_f are linearly quantized to ensure that all values are positive and lie in the range [0, 1]. This preserves the distinguishability of the data in the calculation and keeps the accuracy loss of the Transformer model within an acceptable range, so that the model can guarantee the accuracy of the processing result while improving its inference speed when processing the data to be processed.
In this embodiment, the calculation process of the floating-point quantized query matrix Q_f and the floating-point quantized keyword matrix K_f is shown in the foregoing formulas (4) and (5) and is not repeated here.
(2) And respectively carrying out binary conversion on the floating-point type quantization query matrix and the floating-point type quantization keyword matrix based on a binary conversion algorithm, and calculating to obtain a plurality of binary query submatrices and binary keyword submatrices.
In this embodiment, the binary conversion algorithm includes a TIF conversion algorithm, and based on the TIF conversion algorithm, binary conversion is performed on the floating-point type quantized query matrix and the floating-point type quantized keyword matrix, so that T binary query sub-matrices and binary keyword sub-matrices can be calculated.
(3) And calculating to obtain a corresponding binary query matrix and a binary keyword matrix according to the plurality of binary query sub-matrices and the binary keyword sub-matrices.
In this embodiment, the binary conversion is performed based on the TIF conversion algorithm to obtain the binary query matrix and the binary keyword matrix, and the calculation process is shown in the foregoing formulas (6) to (13), which are not described herein.
The self-attention score matrix calculation module 304 is configured to calculate hamming distances between the binary query matrix and the sub-matrices at each corresponding position in the binary keyword matrix, so as to obtain a self-attention score matrix.
In this embodiment, the calculation process for obtaining the self-attention score matrix based on the hamming distance is shown in the foregoing formulas (14) to (17), and will not be described here.
The data processing output module 305 is configured to multiply the self-attention score matrix by the floating-point assignment matrix to obtain a floating-point self-attention matrix, so that the Transformer model can obtain the processing result of the data to be processed by calculating the floating-point self-attention matrix.
In this embodiment, the calculation process of the floating point type self-attention matrix is shown in the formula (18), and will not be described herein.
In an embodiment of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a self-attention-based data processing method as described above.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the method embodiments described above may be performed by computer program related hardware. The aforementioned computer program may be stored in a computer readable storage medium. The program, when executed, performs steps including the method embodiments described above; and the aforementioned storage medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.
In the embodiments provided herein, the computer-readable storage medium may include read-only memory, random-access memory, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, U-disk, removable hard disk, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. In addition, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable and data storage media do not include connections, carrier waves, signals, or other transitory media, but are intended to be directed to non-transitory, tangible storage media. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
As shown in fig. 4, which is a schematic structural diagram of an electronic terminal in an embodiment of the present application, an electronic terminal 400 provided in this example includes: a processor 401 and a memory 402; the memory 402 is connected to the processor 401 via a system bus and performs communication with each other, the memory 402 is used for storing a computer program, and the processor 401 is used for executing the computer program stored in the memory 402, so that the electronic terminal 400 performs the self-attention-based data processing method as described above.
Referring to fig. 4, an optional hardware structure diagram of an electronic terminal 400 according to an embodiment of the present invention is shown, where the terminal 400 may be a mobile phone, a computer device, a tablet device, a personal digital processing device, a factory background processing device, etc. The electronic terminal 400 includes: at least one processor 401, a memory 402, at least one network interface 404, and a user interface 406. The various components in the device are coupled together by a bus system 405. It is understood that the bus system 405 is used to enable connected communications between these components. The bus system 405 includes a power bus, a control bus, and a status signal bus in addition to a data bus. But for clarity of illustration the various buses are labeled as bus systems in fig. 3.
The user interface 406 may include, among other things, a display, keyboard, mouse, trackball, click gun, keys, buttons, touch pad, or touch screen, etc.
It is to be appreciated that memory 402 can be either volatile or non-volatile memory, and can include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM) or a programmable read-only memory (PROM). The volatile memory may be a random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM) and synchronous static random access memory (SSRAM). The memory described in the embodiments of the present invention is intended to comprise, without being limited to, these and any other suitable types of memory.
The memory 402 in the embodiment of the present invention is used to store various kinds of data to support the operation of the electronic terminal 400. Examples of such data include: any executable programs for operating on electronic terminal 400, such as operating system 4021 and application programs 4022; the operating system 4021 contains various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application programs 4022 may include various application programs such as a media player (MediaPlayer), a Browser (Browser), and the like for implementing various application services. The self-attention-based data processing method provided by the embodiment of the present invention may be included in the application 4022.
The method disclosed in the above embodiment of the present invention may be applied to the processor 401 or implemented by the processor 401. The processor 401 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 401 or by instructions in the form of software. The processor 401 may be a general-purpose processor, a digital signal processor (DSP), or another programmable logic device, discrete gate or transistor logic device, discrete hardware component, or the like. The processor 401 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the data processing method provided by the embodiments of the invention may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium; the processor reads the information in the memory and completes the steps of the method in combination with its hardware.
In an exemplary embodiment, the electronic terminal 400 may be implemented by one or more application specific integrated circuits (ASIC, application Specific Integrated Circuit), DSPs, programmable logic devices (PLDs, programmable Logic Device), complex programmable logic devices (CPLDs, complex Programmable LogicDevice) for performing the aforementioned methods.
In summary, the application provides a self-attention-based data processing method, device, medium and terminal. Binary conversion is performed on the floating-point query matrix and the floating-point keyword matrix through a TIF conversion algorithm, and the Hamming distance is used to measure and compare the similarity between each vector in the binary query matrix and each vector in the binary keyword matrix. This is a novel attention mechanism based on bit operations: bitwise operations reduce the amount of computation while the global feature extraction capability is maintained, reducing the demand for high-precision computing units and ensuring that the Transformer model enjoys high-performance expression while avoiding large-scale floating-point operations, improving computing efficiency, reducing energy consumption, and improving computing accuracy, which is crucial for edge devices with limited energy resources and computing capacity. In addition, the binary conversion process based on the TIF conversion algorithm of the application represents the input floating-point number as a combination of T binary values, so that the whole calculation process stays within a quantifiable error range. Performing attention operations on low-precision devices thus trades a small but quantifiable amount of precision for significant benefits in reduced power consumption during computation and improved computing efficiency. Therefore, the application effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments are merely illustrative of the principles of the present application and their effectiveness, and are not intended to limit the application. Modifications and variations may be made to the above-described embodiments by those of ordinary skill in the art without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications and variations which may be accomplished by persons skilled in the art without departing from the spirit and technical spirit of the disclosure be covered by the claims of this application.

Claims (10)

1. A self-attention based data processing method, comprising:
obtaining data to be processed and calculating to obtain a corresponding floating point number matrix;
inputting the floating point number matrix into a Transformer model, and performing linear projection processing on the floating point number matrix through a Transformer block of the Transformer model to obtain a floating-point query matrix, a floating-point keyword matrix and a floating-point assignment matrix corresponding to the floating point number matrix;
binary conversion is carried out on the floating point type query matrix and the floating point type keyword matrix based on a binary conversion algorithm so as to obtain a corresponding binary query matrix and a binary keyword matrix;
calculating the Hamming distance between the binary query matrix and the sub-matrix at each corresponding position in the binary keyword matrix to obtain a self-attention score matrix;
multiplying the self-attention score matrix by the floating-point assignment matrix to obtain a floating-point self-attention matrix, so that the Transformer model can obtain the processing result of the data to be processed by calculating the floating-point self-attention matrix.
2. The method for processing data based on self-attention as in claim 1, wherein said binary conversion algorithm binary converts said floating-point query matrix and floating-point keyword matrix to obtain corresponding binary query matrix and binary keyword matrix, and the binary conversion process comprises:
performing linear quantization processing on the floating-point type query matrix and the floating-point type keyword matrix to obtain a corresponding floating-point type quantization query matrix and a floating-point type quantization keyword matrix;
binary conversion is respectively carried out on the floating-point type quantization query matrix and the floating-point type quantization keyword matrix based on a binary conversion algorithm, and a plurality of binary query submatrices and binary keyword submatrices are obtained through calculation;
and calculating to obtain a corresponding binary query matrix and a binary keyword matrix according to the plurality of binary query sub-matrices and the binary keyword sub-matrices.
3. The method for processing data based on self-attention as claimed in claim 2, wherein the linear quantization processing is performed on the floating-point query matrix and the floating-point keyword matrix to obtain the corresponding floating-point quantized query matrix and the floating-point quantized keyword matrix, and the obtaining manner of the floating-point quantized query matrix and the floating-point quantized keyword matrix includes:
wherein:

Q_f = (Q'_f - min(Q'_f)) / (max(Q'_f) - min(Q'_f));

K_f = (K'_f - min(K'_f)) / (max(K'_f) - min(K'_f));

where Q_f represents the floating-point quantized query matrix; K_f represents the floating-point quantized keyword matrix; Q'_f represents the floating-point query matrix; K'_f represents the floating-point keyword matrix; min(Q'_f) and max(Q'_f) represent the minimum and maximum values in Q'_f; min(K'_f) and max(K'_f) represent the minimum and maximum values in K'_f.
4. The self-attention-based data processing method of claim 2, wherein the binary conversion algorithm comprises a TIF conversion algorithm; the calculation process of the TIF conversion algorithm comprises the following steps:
V_i = (V_{i-1} + Q_f)[1 - Θ(V_{i-1} + Q_f - 1)] + (V_{i-1} + Q_f - 1)[Θ(V_{i-1} + Q_f - 1)];

wherein V_i represents the cumulative value in the whole process of obtaining the i-th binary query sub-matrix Q_b^i; Q_f represents the floating-point quantized query matrix; V_{i-1} represents the cumulative value in the whole process of obtaining the (i-1)-th binary query sub-matrix Q_b^{i-1}; Θ(x) represents the Heaviside function; Q_b represents the binary query matrix; Q_b^i represents the i-th binary query sub-matrix of Q_b; T represents the time step.
5. The self-attention-based data processing method of claim 4, wherein the binary conversion algorithm comprises a TIF conversion algorithm; the calculation process of the TIF conversion algorithm comprises the following steps:
V'_i = (V'_{i-1} + K_f)[1 - Θ(V'_{i-1} + K_f - 1)] + (V'_{i-1} + K_f - 1)[Θ(V'_{i-1} + K_f - 1)];

wherein V'_i represents the cumulative value in the whole process of obtaining the i-th binary keyword sub-matrix K_b^i; K_f represents the floating-point quantized keyword matrix; V'_{i-1} represents the cumulative value in the whole process of obtaining the (i-1)-th binary keyword sub-matrix K_b^{i-1}; Θ(x) represents the Heaviside function; K_b represents the binary keyword matrix; K_b^i represents the i-th binary keyword sub-matrix of K_b; T represents the time step.
6. The method of claim 5, wherein calculating Hamming distances between the binary query matrix and the sub-matrices at each corresponding position in the binary keyword matrix to obtain the self-attention score matrix comprises:
wherein A_i represents the i-th self-attention matrix; A_i(m, n) represents the element in the m-th row and n-th column of the i-th self-attention matrix A_i; Q_b^i(m) represents the m-th row vector of the i-th binary query sub-matrix Q_b^i; K_b^i(n) represents the n-th row vector of the i-th binary keyword sub-matrix K_b^i; H(Q_b^i(m), K_b^i(n)) represents the Hamming distance between Q_b^i(m) and K_b^i(n); A represents the self-attention score matrix; T represents the number of time steps.
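Because both sub-matrices are binary, each score A_i(m, n) reduces to an XOR-and-count between two bit vectors, which needs no floating-point multiply. A sketch of claim 6, assuming the T per-step matrices are aggregated by summation (the claim's aggregation rule is not fully legible here):

```python
import numpy as np

def hamming_attention(Qb_subs: list, Kb_subs: list) -> np.ndarray:
    """Bit-operation attention scores (sketch of claim 6).

    A_i(m, n) is the Hamming distance between the m-th row of the i-th
    binary query sub-matrix and the n-th row of the i-th binary keyword
    sub-matrix; summing the T per-step matrices into A is an assumption.
    """
    A = 0
    for Qb, Kb in zip(Qb_subs, Kb_subs):
        # XOR count between every query row and every keyword row,
        # via broadcasting: (rows_q, 1, d) != (1, rows_k, d) -> (rows_q, rows_k)
        A_i = (Qb[:, None, :] != Kb[None, :, :]).sum(axis=-1)
        A = A + A_i
    return A  # self-attention score matrix A
```

On hardware, the elementwise `!=` plus sum corresponds to an XOR followed by a popcount, which is the bit-operation advantage the application claims for edge devices.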
7. A self-attention based data processing apparatus, comprising:
the data preprocessing module is used for acquiring data to be processed and calculating to obtain a corresponding floating point number matrix;
the floating-point matrix acquisition module is used for inputting the floating-point matrix into a Transformer model, and performing linear projection processing on the floating-point matrix through a Transformer block of the Transformer model to obtain a floating-point query matrix, a floating-point keyword matrix and a floating-point assignment matrix corresponding to the floating-point matrix;
the binary matrix conversion module is used for binary conversion of the floating-point type query matrix and the floating-point type keyword matrix based on a binary conversion algorithm so as to obtain a corresponding binary query matrix and a binary keyword matrix;
the self-attention score matrix calculation module is used for calculating the Hamming distance between the binary query matrix and the submatrices at each corresponding position in the binary keyword matrix to obtain a self-attention score matrix;
and the data processing output module is used for multiplying the self-attention score matrix by the floating-point assignment matrix to obtain a floating-point self-attention matrix, so that a processing result of the data to be processed is obtained by further calculation on the floating-point self-attention matrix.
8. The self-attention-based data processing device of claim 7, wherein the binary matrix conversion module is further configured to perform the steps of:
performing linear quantization processing on the floating-point query matrix and the floating-point keyword matrix to obtain a corresponding floating-point quantized query matrix and floating-point quantized keyword matrix;
performing binary conversion on the floating-point quantized query matrix and the floating-point quantized keyword matrix respectively based on a binary conversion algorithm, and calculating a plurality of binary query sub-matrices and binary keyword sub-matrices;
and calculating a corresponding binary query matrix and binary keyword matrix according to the plurality of binary query sub-matrices and binary keyword sub-matrices.
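Taken together, the modules of claims 7 and 8 form a single pipeline: linear projection, min-max quantization, TIF binarization over T time steps, per-step Hamming scores, and multiplication by the floating-point assignment (value) matrix. A compact end-to-end sketch; the weight matrices Wq/Wk/Wv, T = 4, the absence of a softmax, and summation over time steps are all illustrative assumptions not specified by the claims:

```python
import numpy as np

def binary_attention(X, Wq, Wk, Wv, T=4):
    """End-to-end sketch of the device pipeline in claims 7-8."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # floating-point projections
    norm = lambda m: (m - m.min()) / (m.max() - m.min())
    Qf, Kf = norm(Q), norm(K)                     # linear quantization to [0, 1]
    A = 0
    acc_q, acc_k = np.zeros_like(Qf), np.zeros_like(Kf)
    for _ in range(T):
        qb = acc_q + Qf >= 1.0                    # TIF spike (binary sub-matrix)
        acc_q += Qf - qb                          # soft-reset accumulator
        kb = acc_k + Kf >= 1.0
        acc_k += Kf - kb
        # Hamming distance between every query row and every keyword row
        A = A + (qb[:, None, :] != kb[None, :, :]).sum(-1)
    return A @ V                                  # floating-point self-attention matrix
```

Only the final multiplication by V touches floating point; everything between quantization and scoring is accumulate-compare-and-popcount work.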
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the self-attention-based data processing method according to any one of claims 1 to 6.
10. An electronic terminal, comprising: a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the computer program stored in the memory, to cause the terminal to execute the self-attention-based data processing method according to any one of claims 1 to 6.
CN202311762425.1A 2023-12-19 2023-12-19 Self-attention-based data processing method, device, medium and terminal Pending CN117688287A (en)

Publications (1)

Publication Number Publication Date
CN117688287A true CN117688287A (en) 2024-03-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination