CN116597824A - Imagination voice classification method and system based on attention-guided tensor network - Google Patents

Imagination voice classification method and system based on attention-guided tensor network

Info

Publication number
CN116597824A
Authority
CN
China
Prior art keywords
layer
data
attention
tensor
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310580969.XA
Other languages
Chinese (zh)
Inventor
孔万增
李昌盛
周文晖
王宇涵
莫良言
金宣妤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202310580969.XA
Publication of CN116597824A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/015 Input arrangements based on nervous system activity detection, e.g. brain waves [EEG] detection, electromyograms [EMG] detection, electrodermal response detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0499 Feedforward networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Dermatology (AREA)
  • Neurology (AREA)
  • Neurosurgery (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an imagined speech classification method and system based on an attention-guided tensor network. Imagined-speech EEG data and their corresponding labels are acquired; data augmentation is applied to the imagined-speech EEG data to construct a training data set; an attention-guided tensor network is constructed, trained with the augmented training set of the data set, and tested with the non-augmented test set of the data set; the trained and validated attention-guided tensor network is then used to classify imagined speech from the EEG. The method combines data augmentation with a tensor network guided by the classification token of an attention mechanism, achieving high-accuracy imagined-speech EEG classification.

Description

Imagination voice classification method and system based on attention-guided tensor network
Technical Field
The invention belongs to the field of brain-computer interfaces (BCIs) and relates to a method and system for imagined speech classification based on an attention-guided tensor network, in particular to a method for determining the imagined speech category by applying data augmentation and feature extraction to imagined-speech EEG data, using a tensor network guided by the classification token of an attention mechanism.
Background
It is hoped that future BCIs will be able to decode human imagined speech and output it into the real-world environment. Once an imagined word or dialogue is decoded by the BCI system, it can serve as a neural command: the imagined word can be output through speech synthesis, or robots and devices can be controlled based on it. The effectiveness and practicality of imagined speech decoding is therefore a non-negligible issue. To realize such BCIs, research into extracting relevant features of the imagined speech paradigm can improve how effectively speech-related brain activity is captured. Recently, researchers have studied various methods, in particular deep learning methods that have matured alongside natural language processing, to accurately capture phoneme-level speech from brain signals.
Imagined speech can be a key paradigm for developing intuitive systems that are easy for users to operate. Recognizing the user's intuitive intent and translating it into commands for the outside world is one of the key functions of a BCI. The imagined speech paradigm can significantly improve BCI communication, because it conveys the user's intent directly through the imagined speech or word itself rather than through the spelling of individual letters; the decoded result can likewise be applied to control external devices. Imagined speech is thus an emerging paradigm for transferring a user's intent to an external device, and it offers vital advantages over traditional BCI paradigms such as motor imagery (MI). For example, increasing the number of classes in MI depends on movements of body parts, which naturally overlap as more classes are needed, whereas the speech properties of different classes allow greater inter-class variation without such overlap. Furthermore, decoded imagined speech can directly match the interaction between the user's intent and device feedback in a real-world environment. Finally, these properties of the imagined speech paradigm can help develop more practical BCI systems that give the user a high degree of freedom, which is why BCI research increasingly leans toward decoding human imagined intent. However, the multi-class classification performance of imagined speech remains at a relatively low level compared with traditional BCI paradigms such as MI or ERP. Efficient feature selection or classification methods for imagined speech can help improve decoding performance, raising its multi-class performance toward the level of conventional BCI paradigms and thereby enabling simple communication through inner speech or control of the external environment.
Disclosure of Invention
The invention aims to address a deficiency of the prior art, namely that the multi-class classification performance of imagined speech remains at a relatively low level, and provides a method and system for imagined speech classification based on an attention-guided tensor network. Data augmentation introduces prior knowledge into the model so that it learns more robust features, mitigating the small number of samples in existing data sets. A multi-head attention mechanism efficiently extracts feature information of the data in the time dimension. A tensor network addresses the small-sample problem of the data set and improves the classification performance of the model.
In a first aspect, the present invention provides a method for imagined speech classification based on an attention-guided tensor network, comprising the following steps:
step S1: acquiring imagined-speech EEG data and their corresponding labels;
step S2: during model training, applying data augmentation to the imagined-speech EEG data and constructing a training data set;
step S3: constructing an attention-guided tensor network, training it with the augmented training set of the data set, and testing it with the non-augmented test set of the data set;
step S4: using the trained and validated attention-guided tensor network to classify imagined speech from the EEG.
In a second aspect, the present invention provides an imagined speech classification system comprising a trained and validated attention-guided tensor network.
The beneficial effects of the invention are as follows:
the invention uses a multi-headed self-care mechanism to focus on the information of different time steps and different channels in the EEG signal at the same time, thereby better capturing the time sequence and the spatial correlation in the EEG signal. The importance of each time step and channel is then calculated using these correlations as weights. This may help the model automatically learn important features in the EEG signal, thereby improving the performance of the model. Different feature representations may also be learned by a multi-headed self-care mechanism and combined to form the final representation. This may help the model better process different EEG signals, thereby improving the generalization ability of the model.
One potential problem with deep learning is the large number of parameters: fitting them requires many samples, training the model takes considerable time, and EEG samples are often scarce. The invention converts the weight matrix of the fully connected layer into a tensor format within a tensor learning network, greatly reducing the number of parameters while preserving the expressive power of the layer.
In summary, the invention provides an imagined speech classification method based on an attention-guided tensor network, combining data augmentation with a tensor network guided by the classification token of an attention mechanism. The network comprises a data acquisition module, a data augmentation module, a multi-head attention module and a classification module, responsible respectively for data acquisition, data augmentation, feature extraction and classification. The network also adopts random position encoding, adds a classification token, and uses a tensor network to achieve high-accuracy imagined-speech EEG classification.
Drawings
FIG. 1 is a flow chart of the imagined speech classification method;
FIG. 2 is a diagram of the experimental paradigm of the data set used;
FIG. 3 is a schematic diagram of feature extraction based on the multi-head attention mechanism;
FIG. 4 is a schematic diagram of the classification module based on the tensor network.
Detailed Description
The process according to the invention is described in detail below with reference to the accompanying drawings.
As shown in FIG. 1, the method for imagined speech classification based on an attention-guided tensor network comprises the following steps:
Step S1: acquiring imagined-speech EEG data and their corresponding labels.
The data used in this experiment come from a public data set in which EEG was recorded from 15 subjects (S1-S15; ages 20-30), as shown in FIG. 2. During the experiment, each subject sat in a comfortable chair facing a 24-inch LCD monitor. Subjects were asked to imagine silently uttering a given word or phrase, as if actually speaking but without moving the articulators or producing any sound, and were instructed not to perform any mental activity other than the given task. They were required not to move or blink while imagining or receiving cues. All imagination trials used a black screen so that subjects received no stimulus, avoiding other factors that might affect brain activity. An auditory cue representing one of the five words/phrases was presented randomly for 2 s, followed by a cross fixation mark for 0.8 s to 1.2 s. Subjects were asked to imagine the given cue immediately after the cross disappeared from the screen. Each random cue was followed by four successive cycles of a cross-fixation phase (0.8-1.2 s) and an imagined-speech phase (2 s). After the four imagined-speech phases, a 3 s relaxation phase let the subject clear their mind for the next word/phrase. EEG data were recorded with a signal amplifier (BrainAmp, Brain Products GmbH, Germany). Raw data were recorded using BrainVision (Brain Products GmbH, Germany) and MATLAB 2019a (The MathWorks Inc., USA), with 64 EEG electrodes following the international 10-20 configuration. The ground and reference channels were placed at Fpz and FCz, respectively. The impedance between every electrode and the scalp skin was kept below 15 kΩ.
The experiment recorded EEG for 5 classes of imagined words/phrases. The data are denoted $X \in \mathbb{R}^{T \times C}$, where T is the time dimension with size 795 and C is the channel dimension with size 64; i.e., each raw trial has size 795 × 64.
Step S2: during model training, data augmentation is applied to the imagined-speech EEG data and a training data set is constructed;
the method for enhancing the data by using Mixup linear interpolation specifically comprises the following steps:
wherein (X) i ,Y i ) And (X) j ,Y j ) Is two samples randomly extracted from training data, X i ,X j Is the original data input, Y i ,Y j For the single thermal coding of the corresponding class, lambda E [0,1 ]];
Mixup is a data augmentation technique used to improve model performance. It creates new training examples by randomly combining pairs of examples from different classes, making the model more robust, reducing over-fitting and improving generalization. Specifically, Mixup linearly interpolates two input samples $X_i, X_j$ with a random ratio $\lambda \in [0,1]$ to generate a new sample $\tilde{X}$, and interpolates their labels $Y_i, Y_j$ with the same ratio to generate a new label $\tilde{Y}$. The model can thus learn more features and the similarities between different classes, improving its generalization ability. After augmentation, the data size remains 795 × 64.
This approach introduces prior knowledge into the model: with this prior knowledge, the augmented data let the model learn more robust features and improve the generalization of the deep learning model.
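For illustration, a minimal NumPy sketch of the Mixup step described above follows; the Beta(α, α) sampling of λ is an assumption (the description only requires λ ∈ [0, 1]), and the function and argument names are illustrative rather than taken from the patent:

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=1.0):
    """Mixup a random pair of EEG trials and their one-hot labels.

    x_i, x_j : np.ndarray of shape (795, 64) -- raw EEG trials (T x C)
    y_i, y_j : np.ndarray of shape (5,)      -- one-hot class labels
    alpha    : Beta-distribution parameter (assumed; not stated in the text)
    """
    lam = np.random.beta(alpha, alpha)       # lambda in [0, 1]
    x_new = lam * x_i + (1.0 - lam) * x_j    # interpolated trial, still 795 x 64
    y_new = lam * y_i + (1.0 - lam) * y_j    # interpolated soft label
    return x_new, y_new, lam
```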
Step S3: constructing an attention-guided tensor network, training it with the augmented training set of the data set, and testing it with the non-augmented test set of the data set;
the attention guiding tensor network comprises a feature extraction module of a cascade multi-head attention mechanism and a classification module of a tensor learning network;
1) The feature extraction module of the cascade multi-head attention mechanism as shown in fig. 3 comprises an embedded layer, a classification identification bit Class Token layer, a position coding layer, a first LN regularization layer, a multi-head self-attention layer, a first residual error connection layer, a second LN regularization layer, a feedforward network layer, a second residual error connection layer and a third LN regularization layer which are sequentially connected in series;
1.1 The embedding layer up-samples the channel dimension of the 795 × 64 EEG data through a fully connected layer to increase the data dimension and extract finer-grained information, yielding 795 × 1024 data;
1.2 The Class Token layer generates a 1 × 1024 vector by random initialization and concatenates it to the head of the embedding layer output, so as to collect global feature information and reduce interference from local features; the data size then becomes 796 × 1024. The Class Token encodes statistics of the entire imagined-speech datum and is updated continuously as the network trains. It aggregates information from all other tokens (global feature aggregation) without depending on the data content, avoiding bias toward any specific token in the data. In addition, a fixed position code is used for the Class Token, effectively avoiding interference of the position encoding with the output.
1.3 The position encoding layer adopts random position encoding, specifically: a random number matrix with the same format as the input data is generated and added to the input data as the output of the position encoding layer. Random position encoding addresses the model's inability to capture positional relationships along the time dimension of the input sequence. The layer assigns a position code to each vector in the time dimension of each input sequence; adding this code to the time-dimension vector lets that vector carry information about its position in the input sequence, so the neural network can better understand the order of and relationships among time steps, further improving model performance. Concretely, the position encoding layer outputs a random number matrix in the same format as its input data and adds it to the input data, which then serves as the input of the multi-head self-attention layer.
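As a concrete illustration of steps 1.1-1.3, the following PyTorch sketch combines the channel up-sampling, the prepended Class Token and the random position encoding. The module name is illustrative, and fixing the random code at construction time (so that the Class Token's code stays constant) is an assumption based on the statement that a fixed position code is used for the Class Token:

```python
import torch
import torch.nn as nn

class EEGEmbedding(nn.Module):
    """Steps 1.1-1.3: channel up-sampling, prepended Class Token,
    and random position encoding (a minimal sketch)."""

    def __init__(self, n_channels=64, d_model=1024, seq_len=795):
        super().__init__()
        self.up = nn.Linear(n_channels, d_model)                   # 1.1: 64 -> 1024 per time step
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))  # 1.2: randomly initialised token
        # 1.3: random position codes for the 796 x 1024 sequence,
        # created once so the Class Token's code stays fixed
        self.register_buffer("pos", torch.randn(1, seq_len + 1, d_model))

    def forward(self, x):                    # x: (batch, 795, 64)
        x = self.up(x)                       # (batch, 795, 1024)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend token -> (batch, 796, 1024)
        return x + self.pos                  # add random position encoding
```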
1.4 The first LN regularization layer normalizes the output data of the position encoding layer;
1.5 The multi-head self-attention layer maps the LN regularization layer output into different subspaces and performs dot-product operations in each subspace to compute an attention vector; finally, the attention vectors computed in all subspaces are concatenated and mapped back to the original input space to obtain the final attention vector, capturing the feature correlations of the imagined speech data along the time dimension;
the multi-head self-attention mechanism layer has the functions of enhancing the understanding and expressing ability of the model to input data and improving the accuracy and generalization ability of the model. In particular, the multi-headed attention layer may multi-headed divide the input data, and each head may focus on a different portion of the input data, thereby extracting different characteristic information. These heads can be computed in parallel, thereby speeding up the training of the model. And finally, combining the calculation results of the plurality of heads to obtain a final output result. The expression of the multi-headed self-focusing layer is as follows (3):
wherein MultiHead (Q, K, V) represents the resulting output attention vector; concat represents a splice operation; wheree head i Representing the attention vector calculated in the ith subspace;
i represents different subspaces, a query vector Q, a key vector K and a value vector V are obtained from the output data of the first LN regularization layer through a full connection layer and serve as the input of the multi-head self-attention module, W i Q Mapping matrix for Q in different subspaces, W i K Mapping matrix for K in different subspaces, W i V Mapping matrix for V in different subspaces, W O By W in all subspaces i V Splicing to obtain the final product;
the calculation mode of the attention vector on the independent subspace is as follows in sequence: firstly, carrying out dot multiplication operation on a query vector Q and a key vector K, and dividing the query vector Q and the key vector K by the square root of the dimension of the key vector KObtaining a score matrix of the query vector Q, finally transmitting the result into a Softmax function, normalizing the result by using the Softmax function to obtain a weight matrix, multiplying the weight matrix by a value vector V to obtain a subspace attention vector, wherein the expression is as follows (4):
wherein the parameter matrix dimensions d of Q, K, V q ,d k And d v All 128, the number of attention head heads is 8, d model 1024.
By linear transformation, query vector Q is derived from d model Dimension map d q * head, key vector K from d model Dimension map d k * head, value vector V from d model Dimension map d v *head;
Implicitly increasing the number of attention heads without reducing the hidden dimension assigned to each head effectively extracts global features and improves classification accuracy.
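The following PyTorch sketch illustrates formulas (3)-(4) with the stated sizes ($d_{model}$ = 1024, 8 heads of dimension 128). It is a sketch, not the patent's implementation; in particular, $W^O$ is realized here as a learned output projection, the standard Transformer choice:

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Formulas (3)-(4): project the input into h = 8 subspaces, apply
    scaled dot-product attention in each, concatenate, and map back."""

    def __init__(self, d_model=1024, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads   # 8 heads x 128 dims
        self.w_q = nn.Linear(d_model, d_model)           # W^Q for all heads at once
        self.w_k = nn.Linear(d_model, d_model)           # W^K
        self.w_v = nn.Linear(d_model, d_model)           # W^V
        self.w_o = nn.Linear(d_model, d_model)           # W^O output projection

    def forward(self, x):                                # x: (batch, 796, 1024)
        b, t, _ = x.shape
        # project and split into heads: (batch, heads, time, d_k)
        q = self.w_q(x).view(b, t, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.h, self.d_k).transpose(1, 2)
        # formula (4): Softmax(Q K^T / sqrt(d_k)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        attn = torch.softmax(scores, dim=-1) @ v
        # formula (3): concatenate heads and map back to the input space
        attn = attn.transpose(1, 2).reshape(b, t, self.h * self.d_k)
        return self.w_o(attn)
```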
1.6 The first residual connection layer applies a residual connection to the multi-head self-attention layer output, improving the network's ability to represent imagined speech data and effectively alleviating vanishing and exploding gradients;
1.7 The second LN regularization layer normalizes the output data of the first residual connection layer;
1.8 The feed-forward network layer (Feed-Forward Network, FFN) consists of two feed-forward layers: the first maps the output of the second LN regularization layer from $d_{model}$ dimensions to $4 d_{model}$ dimensions with a GELU activation function, and the second maps the $4 d_{model}$ dimensions back to $d_{model}$ without an activation function;
The expression of the feed-forward network is given by formula (5):

$\mathrm{FFN}(x) = \mathrm{GELU}(x W_1 + b_1)\, W_2 + b_2$   (5)

where $W_1$ and $W_2$ are randomly initialized weight matrices, $b_1$ and $b_2$ are randomly initialized biases, and x is the output of the second LN regularization layer;
1.9 The second residual connection layer applies a residual connection to the feed-forward network layer output, improving the network's representation of imagined speech data;
1.10 The third LN regularization layer normalizes the output data of the second residual connection layer;
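Steps 1.4-1.10 can be summarized in one sketch of the cascaded block, reusing the MultiHeadSelfAttention module sketched above; the placement of the LN layers follows the ordering described in steps 1.4-1.10, and the module name is illustrative:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Steps 1.4-1.10: LN -> multi-head self-attention -> residual,
    LN -> feed-forward (1024 -> 4096, GELU, -> 1024) -> residual -> LN."""

    def __init__(self, d_model=1024):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)               # 1.4 first LN layer
        self.attn = MultiHeadSelfAttention(d_model)    # 1.5 (sketched above)
        self.ln2 = nn.LayerNorm(d_model)               # 1.7 second LN layer
        self.ffn = nn.Sequential(                      # 1.8 feed-forward network
            nn.Linear(d_model, 4 * d_model),           # d_model -> 4*d_model
            nn.GELU(),                                 # GELU on the first layer only
            nn.Linear(4 * d_model, d_model),           # back to d_model, no activation
        )
        self.ln3 = nn.LayerNorm(d_model)               # 1.10 third LN layer

    def forward(self, x):                              # x: (batch, 796, 1024)
        x = x + self.attn(self.ln1(x))                 # 1.6 first residual connection
        x = x + self.ffn(self.ln2(x))                  # 1.9 second residual connection
        return self.ln3(x)
```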
2) The classification module of the tensor learning network takes the data carrying the Class Token from the output of the feature extraction module of the cascaded multi-head attention mechanism and performs prediction and classification on it;
As shown in FIG. 4, the classification module of the tensor learning network comprises a tensor network, an activation layer and a fully connected layer connected in series;
The tensor network tensorizes the 1 × 1024 input data so as to extract linear-relationship features from the high-dimensional imagined speech data, specifically:
linearly transforming an N-dimensional input vector to obtain a mathematical expression represented by the formula (6)
y 1 =Wx 1 +b (6)
Wherein the method comprises the steps ofIs a weight matrix>For inputting data +.>Is biased;
wherein the element y (i) in y is represented by formula (7):
according to tensor learning thought, all y, W, x, b are converted into tensor representation, and marked as y, W, x, b; the method specifically comprises the following steps:
first, x ε R N*S Conversion to a 5-dimensional tensorDenoted as x (j) 1 ,...,j 5 ) Wherein N x S = S 1 *S 2 *S 3 *S 4 *S 5 I.e. converting the input Class token vector 1 x 1024 into a five-dimensional tensor with the size of 4 x 4;
via bijective function F (i) = (F) 1 (i),f 2 (i),f 3 (i),f 4 (i),f 5 (i))=(i 1 ,i 2 ,i 3 ,i 4 ,i 5 ) Vectors y, b and y (i) are indexed by index i 1 ,i 2 ,i 3 ,i 4 ,i 5 ),b(i 1 ,i 2 ,i 3 ,i 4 ,i 5 ) The five-dimensional tensor representation is related as shown in equation (8):
y(F(i))=y(i 1 ,i 2 ,i 3 ,i 4 ,i 5 )=y(i)
wherein y, b.epsilon.R Md= 5,i e 1,2,; y (i), b (i) is an element in y, b, y (i) 1 ,i 2 ,i 3 ,i 4 ,i 5 ) Five-dimensional tensors also of size 4 x 4;
the same applies to the weight matrixSee (9):
F(i)=(f 1 (i),f 2 (i),f 3 (i),f 4 (i),f 5 (i))=(i 1 ,i 2 ,i 3 ,i 4 ,i 5 )
G(j)=(g 1 (j),g 2 (j),g 3 (j),g 4 (j),g 5 (j))=(j 1 ,j 2 ,j 3 ,j 4 ,j 5 ) (9)
the weight matrix W may be associated with its corresponding tensor W and converted into a tensor column format Tensor Train Format (TT-format) as shown in equation (10):
wherein each g [ i ] k ,j k ]Denoted as i in the case of the same k k *j k *r k-1 *r k K e 1, 2..5, where r 0 =r 5 =1, where r k-1 *r k Called tensor rank TT-rank, TT-rank is [1,8,8,8,8,1 ]];
Finally, formula (6) can be converted into the tensor form of formula (11):

$Y(i_1, \ldots, i_5) = \sum_{j_1, \ldots, j_5} G_1[i_1, j_1] \cdots G_5[i_5, j_5]\, X(j_1, \ldots, j_5) + B(i_1, \ldots, i_5)$   (11)
the activation layer uses RELU activation function to transmit tensor network output data into full connection through the activation layer to obtain classification result;
the corresponding classification labels are output through the tensor network classification module, and the loss function is calculated by comparing the classification labels with the real labels, and the cross entropy loss function is adopted, and the specific formula is as follows:
wherein M is 1 Is the number of tests, N 1 Is the number of categories to be counted,represents the mth 1 True label of secondary test->Representing the representation class n 1 Mth m 1 Predictive probability of secondary trial. When trained in conjunction with the model, it was noted as criterion, and the loss was calculated as follows:
loss=λ*criterion(pred,Y i )+(1-λ)criterion(pred,Y j ) (13)
wherein the method comprises the steps of
In specific use of the invention, Adam, which converges quickly, is used as the optimizer; the initial learning rate is set to 8e-5 and the batch size to 8.
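A sketch of one Mixup-aware training step with these settings is given below; `model` stands for the full attention-guided tensor network and is assumed to be defined elsewhere, and integer class labels are used since nn.CrossEntropyLoss treats them equivalently to the one-hot form in the text:

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, x_i, x_j, y_i, y_j, lam):
    """One optimisation step on a mixed batch, formula (13)."""
    criterion = nn.CrossEntropyLoss()              # formula (12)
    x = lam * x_i + (1.0 - lam) * x_j              # mixed EEG batch (batch, 795, 64)
    pred = model(x)                                # (batch, 5) class logits
    loss = lam * criterion(pred, y_i) + (1.0 - lam) * criterion(pred, y_j)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# settings from the description: Adam optimizer, initial lr 8e-5, batch size 8
# optimizer = torch.optim.Adam(model.parameters(), lr=8e-5)
```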
Step S4: using the trained and validated attention-guided tensor network to classify imagined speech from the EEG.
Table 1: Imagined speech classification accuracy of the different subjects using the above method.
Claims (10)

1. A method for imagined speech classification based on an attention-guided tensor network, characterized in that the method comprises the following steps:
step S1: acquiring imagined-speech EEG data and their corresponding labels;
step S2: applying data augmentation to the imagined-speech EEG data to construct a training data set;
step S3: constructing an attention-guided tensor network, training it with the augmented training set of the data set, and testing it with the non-augmented test set of the data set;
the attention-guided tensor network comprises a feature extraction module based on a cascaded multi-head attention mechanism and a classification module based on a tensor learning network;
1) the feature extraction module of the cascaded multi-head attention mechanism comprises, connected in series: an embedding layer, a classification-token (Class Token) layer, a position encoding layer, a first LN regularization layer, a multi-head self-attention layer, a first residual connection layer, a second LN regularization layer, a feed-forward network layer, a second residual connection layer and a third LN regularization layer;
the multi-head self-attention layer maps the LN regularization layer output into different subspaces and performs dot-product operations in each subspace to compute an attention vector; finally, the attention vectors computed in all subspaces are concatenated and mapped back to the original input space to obtain the final attention vector, capturing the feature correlations of the imagined speech data along the time dimension;
2) the classification module of the tensor learning network takes the data carrying the Class Token from the output of the feature extraction module of the cascaded multi-head attention mechanism and performs prediction and classification on it;
the classification module of the tensor learning network comprises a tensor network, an activation layer and a fully connected layer connected in series;
the tensor network tensorizes the input data to extract linear-relationship features from the high-dimensional imagined speech data;
step S4: using the trained and validated attention-guided tensor network to classify imagined speech from the EEG.
2. The method according to claim 1, characterized in that the data augmentation in step S2 is specifically:

$\tilde{X} = \lambda X_i + (1-\lambda) X_j, \qquad \tilde{Y} = \lambda Y_i + (1-\lambda) Y_j$

where $(X_i, Y_i)$ and $(X_j, Y_j)$ are two samples randomly drawn from the training data, $X_i, X_j$ are the raw data inputs, $Y_i, Y_j$ are the one-hot encodings of the corresponding classes, and $\lambda \in [0,1]$.
3. The method according to claim 1, characterized in that in step S3, the embedding layer in the feature extraction module of the cascaded multi-head attention mechanism up-samples the channel dimension of the EEG data through a fully connected layer to increase the data dimension and extract finer-grained information, yielding 795 × 1024 data;
the Class Token layer generates a 1 × 1024 vector by random initialization and concatenates it to the head of the embedding layer output so as to collect global feature information and reduce interference from local features, the data size then being 796 × 1024;
the position encoding layer adopts random position encoding, specifically: generating a random number matrix with the same format as the input data and adding it to the input data as the output of the position encoding layer;
the first LN regularization layer normalizes the output data of the position encoding layer.
4. The method according to claim 1 or 3, characterized in that in step S3, the expression of the multi-head self-attention layer is given by formula (3):

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})$   (3)

where MultiHead(Q, K, V) is the resulting output attention vector; Concat denotes the concatenation operation; $\mathrm{head}_i$ is the attention vector computed in the i-th subspace, and i indexes the different subspaces; the query vector Q, key vector K and value vector V are obtained from the output data of the first LN regularization layer through fully connected layers and serve as the input of the multi-head self-attention module; $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ are the mapping matrices of Q, K and V in the different subspaces, and $W^{O}$ is obtained by concatenating the $W_i^{V}$ over all subspaces;
the attention vector in each independent subspace is computed as follows: first, the query vector Q and the key vector K undergo a dot-product operation, divided by the square root of the key vector dimension, $\sqrt{d_k}$, to obtain the score matrix of the query vector Q; the result is then passed into a Softmax function and normalized to obtain the weight matrix, which is multiplied by the value vector V to obtain the subspace attention vector, as in formula (4):

$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$   (4)

where the parameter matrix dimensions $d_q$, $d_k$ and $d_v$ of Q, K, V are all 128, the number of attention heads is 8, and $d_{model}$ is 1024;
through linear transformations, the query vector Q is mapped from $d_{model}$ dimensions to $d_q \times \mathrm{head}$, the key vector K from $d_{model}$ to $d_k \times \mathrm{head}$, and the value vector V from $d_{model}$ to $d_v \times \mathrm{head}$.
5. The method according to claim 4, characterized in that in step S3, the first residual connection layer applies a residual connection to the multi-head self-attention layer output so as to improve the network's representation of imagined speech data;
the second LN regularization layer normalizes the output data of the first residual connection layer;
the feed-forward network layer consists of two feed-forward layers: the first maps the output of the second LN regularization layer from $d_{model}$ dimensions to $4 d_{model}$ dimensions with a GELU activation function, and the second maps the $4 d_{model}$ dimensions back to $d_{model}$ without an activation function;
the second residual connection layer applies a residual connection to the feed-forward network layer output so as to improve the network's representation of imagined speech data;
the third LN regularization layer normalizes the output data of the second residual connection layer.
6. The method according to claim 5, characterized in that in step S3, the expression of the feed-forward network is given by formula (5):

$\mathrm{FFN}(x) = \mathrm{GELU}(x W_1 + b_1)\, W_2 + b_2$   (5)

where $W_1$ and $W_2$ are randomly initialized weight matrices, $b_1$ and $b_2$ are randomly initialized biases, and x is the output of the second LN regularization layer.
7. The method according to claim 1, characterized in that in step S3, the tensor network is specifically:
an N-dimensional input vector is transformed linearly, giving the mathematical expression of formula (6):

$y = W x + b$   (6)

where $W \in \mathbb{R}^{M \times N}$ is the weight matrix, $x \in \mathbb{R}^{N}$ is the input data, and $b \in \mathbb{R}^{M}$ is the bias;
the element y(i) of y is given by formula (7):

$y(i) = \sum_{j=1}^{N} W(i, j)\, x(j) + b(i)$   (7)

following the tensor learning idea, y, W, x, b are all converted into tensor representations, denoted Y, W, X, B, specifically:
first, $x \in \mathbb{R}^{N}$ is converted into a 5-dimensional tensor $X \in \mathbb{R}^{S_1 \times S_2 \times S_3 \times S_4 \times S_5}$, written $X(j_1, \ldots, j_5)$, where $N = S_1 S_2 S_3 S_4 S_5$; i.e., the input 1 × 1024 Class Token vector is converted into a five-dimensional tensor of size 4 × 4 × 4 × 4 × 4;
via the bijective function $F(i) = (f_1(i), f_2(i), f_3(i), f_4(i), f_5(i)) = (i_1, i_2, i_3, i_4, i_5)$, the vectors y and b and their elements y(i), b(i) are related to the five-dimensional tensor representations $Y(i_1, \ldots, i_5)$, $B(i_1, \ldots, i_5)$ through the index i, as shown in formula (8):

$y(F(i)) = Y(i_1, i_2, i_3, i_4, i_5) = y(i)$
$b(F(i)) = B(i_1, i_2, i_3, i_4, i_5) = b(i)$   (8)

where $y, b \in \mathbb{R}^{M}$; y(i) and b(i) are elements of y and b, and $Y(i_1, \ldots, i_5)$, $B(i_1, \ldots, i_5)$ are likewise five-dimensional tensors of size 4 × 4 × 4 × 4 × 4;
the same applies to the weight matrix $W \in \mathbb{R}^{M \times N}$, see formula (9):

$F(i) = (f_1(i), f_2(i), f_3(i), f_4(i), f_5(i)) = (i_1, i_2, i_3, i_4, i_5)$
$G(j) = (g_1(j), g_2(j), g_3(j), g_4(j), g_5(j)) = (j_1, j_2, j_3, j_4, j_5)$   (9)

the weight matrix W is associated with its corresponding tensor $\mathcal{W}$ and converted into the tensor-train format (TT-format), as shown in formula (10):

$\mathcal{W}((i_1, j_1), \ldots, (i_5, j_5)) = G_1[i_1, j_1]\, G_2[i_2, j_2]\, G_3[i_3, j_3]\, G_4[i_4, j_4]\, G_5[i_5, j_5]$   (10)

where each $G_k[i_k, j_k]$, for fixed k, is an $r_{k-1} \times r_k$ matrix (each core $G_k$ has size $i_k \times j_k \times r_{k-1} \times r_k$), $k \in \{1, 2, \ldots, 5\}$, with $r_0 = r_5 = 1$; the $r_k$ are called the tensor ranks (TT-rank), and the TT-rank here is [1, 8, 8, 8, 8, 1];
finally, formula (6) is converted into the tensor form of formula (11):

$Y(i_1, \ldots, i_5) = \sum_{j_1, \ldots, j_5} G_1[i_1, j_1] \cdots G_5[i_5, j_5]\, X(j_1, \ldots, j_5) + B(i_1, \ldots, i_5)$   (11)
8. The method according to claim 1 or 7, characterized in that the activation layer uses a ReLU activation function.
9. The method according to claim 1, characterized in that the loss function of the attention-guided tensor network adopts a cross-entropy loss function, with the following formula:

$\mathrm{CE} = -\frac{1}{M_1} \sum_{m_1=1}^{M_1} \sum_{n_1=1}^{N_1} y_{m_1 n_1} \log(p_{m_1 n_1})$   (12)

where $M_1$ is the number of trials, $N_1$ is the number of classes, $y_{m_1 n_1}$ is the true label indicator of class $n_1$ for the $m_1$-th trial, and $p_{m_1 n_1}$ is the predicted probability of class $n_1$ for the $m_1$-th trial; when trained in combination with the Mixup model, this criterion is denoted criterion, and the loss is computed as:

$\mathrm{loss} = \lambda \cdot \mathrm{criterion}(\mathrm{pred}, Y_i) + (1 - \lambda)\, \mathrm{criterion}(\mathrm{pred}, Y_j)$   (13)

where $\lambda \in [0, 1]$ is the Mixup interpolation ratio.
10. A classification system implementing the method according to any of claims 1-9, characterized by comprising a trained and validated attention-directed tensor network.
CN202310580969.XA 2023-05-19 2023-05-19 Imagination voice classification method and system based on attention-guided tensor network Pending CN116597824A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310580969.XA CN116597824A (en) 2023-05-19 2023-05-19 Imagination voice classification method and system based on attention-guided tensor network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310580969.XA CN116597824A (en) 2023-05-19 2023-05-19 Imagination voice classification method and system based on attention-guided tensor network

Publications (1)

Publication Number Publication Date
CN116597824A true CN116597824A (en) 2023-08-15

Family

ID=87605895

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310580969.XA Pending CN116597824A (en) 2023-05-19 2023-05-19 Imagination voice classification method and system based on attention-guided tensor network

Country Status (1)

Country Link
CN (1) CN116597824A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116821776A (en) * 2023-08-30 2023-09-29 福建理工大学 Heterogeneous graph network node classification method based on graph self-attention mechanism
CN116821776B (en) * 2023-08-30 2023-11-28 福建理工大学 Heterogeneous graph network node classification method based on graph self-attention mechanism
CN117473303A (en) * 2023-12-27 2024-01-30 小舟科技有限公司 Personalized dynamic intention feature extraction method and related device based on electroencephalogram signals
CN117473303B (en) * 2023-12-27 2024-03-19 小舟科技有限公司 Personalized dynamic intention feature extraction method and related device based on electroencephalogram signals
CN117851897A (en) * 2024-03-08 2024-04-09 国网山西省电力公司晋城供电公司 Multi-dimensional feature fusion oil immersed transformer online fault diagnosis method

Similar Documents

Publication Publication Date Title
Latif et al. Deep representation learning in speech processing: Challenges, recent advances, and future trends
CN116597824A (en) Imagination voice classification method and system based on attention-guided tensor network
CN111134666B (en) Emotion recognition method of multi-channel electroencephalogram data and electronic device
CN110060690B (en) Many-to-many speaker conversion method based on STARGAN and ResNet
Goodfellow et al. Deep learning
CN111461176B (en) Multi-mode fusion method, device, medium and equipment based on normalized mutual information
CN112800998B (en) Multi-mode emotion recognition method and system integrating attention mechanism and DMCCA
CN110060657B (en) SN-based many-to-many speaker conversion method
US20230101539A1 (en) Physiological electric signal classification processing method and apparatus, computer device and storage medium
Boloukian et al. Recognition of words from brain-generated signals of speech-impaired people: Application of autoencoders as a neural Turing machine controller in deep neural networks
CN111584069B (en) Psychosis recognition system based on speech deep-shallow feature stack sparse automatic coding
Kim et al. Automatic classification of the Korean triage acuity scale in simulated emergency rooms using speech recognition and natural language processing: a proof of concept study
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
Kim et al. Cross-modal distillation with audio–text fusion for fine-grained emotion classification using BERT and Wav2vec 2.0
Poncelet et al. Low resource end-to-end spoken language understanding with capsule networks
CN117608402B (en) Hidden Chinese language processing system and method based on Chinese character writing imagination
Duan et al. Dewave: Discrete encoding of eeg waves for eeg to text translation
CN116108856B (en) Emotion recognition method and system based on long and short loop cognition and latent emotion display interaction
Sunder et al. Handling class imbalance in low-resource dialogue systems by combining few-shot classification and interpolation
CN116701996A (en) Multi-modal emotion analysis method, system, equipment and medium based on multiple loss functions
CN116484885A (en) Visual language translation method and system based on contrast learning and word granularity weight
Ye et al. Attention bidirectional LSTM networks based mime speech recognition using sEMG data
CN115588486A (en) Traditional Chinese medicine diagnosis generating device based on Transformer and application thereof
Alrumiah et al. A Deep Diacritics-Based Recognition Model for Arabic Speech: Quranic Verses as Case Study
Jiang et al. Dual memory network for medical dialogue generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination