CN115062728A - Emotion identification method and system based on time-frequency domain feature level fusion - Google Patents
- Publication number
- CN115062728A (application number CN202210821466.2A)
- Authority
- CN
- China
- Prior art keywords
- feature map
- feature
- corrected
- time
- map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/16—Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
- A61B5/165—Evaluating the state of mind, e.g. depression, anxiety
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
- A61B5/7235—Details of waveform analysis
- A61B5/7264—Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
- G06F3/015—Input arrangements based on nervous system activity detection, e.g. brain waves [EEG] detection, electromyograms [EMG] detection, electrodermal response detection
Abstract
The application relates to the field of intelligent emotion recognition, and particularly discloses an emotion recognition method and system based on time-frequency domain feature level fusion. A deep neural network model encodes the plurality of time domain statistical features collected in each sampling window to extract a global high-dimensional implicit feature of each of the time domain statistical features, so that the features of each sampling window's data are correlated at the data level. Further, frequency domain feature analysis is performed on the electroencephalogram signal based on its power spectral density, and the time domain and frequency domain feature information is then classified jointly. In the course of feature fusion, the feature information is further given a probabilistic interpretation based on information distribution, so that the feature distribution extracted by the neural network can adapt itself over iterations and gradually approach the true information distribution, thereby improving the accuracy of emotion recognition.
Description
Technical Field
The invention relates to the field of intelligent emotion recognition, in particular to an emotion recognition method and system based on time-frequency domain feature level fusion.
Background
With the development of science and technology and the progress of society, Human-Computer Interaction (HCI) technology has attracted more and more attention. HCI has broad application prospects. In the field of man-machine dialogue, for example, emotion recognition enables a machine to perceive a person's emotional and psychological state, so that it understands its conversation partner better, gives more humanized answers, and improves the dialogue experience. Existing emotion recognition schemes are based either on physiological signals or on non-physiological signals, but the accuracy and real-time performance of these schemes often fail to meet the requirements of practical applications.
Therefore, an optimized emotion recognition scheme is desired.
Disclosure of Invention
The present application is proposed to solve the above technical problems. The embodiments of the application provide an emotion recognition method and system based on time-frequency domain feature level fusion. A deep neural network model encodes the plurality of time domain statistical features collected in each sampling window to extract a global high-dimensional implicit feature of each of the time domain statistical features, and the features of the sampling window data are correlated at the data level. Further, frequency domain feature analysis is performed on the electroencephalogram signal based on its power spectral density, and the time domain and frequency domain information is then classified jointly. In the course of feature fusion, the feature information is further given a probabilistic interpretation based on information distribution, so that the feature distribution extracted by the neural network can adapt itself over iterations, gradually approaching the true information distribution and improving the accuracy of emotion recognition.
According to one aspect of the application, a method for emotion recognition with time-frequency domain feature level fusion is provided, which includes:
acquiring an electroencephalogram signal of an object to be identified;
intercepting a plurality of sampling window data from the electroencephalogram signal along a time sequence dimension by using a preset sampling window;
extracting a plurality of time domain statistical characteristics in each sampling window data, wherein the time domain statistical characteristics comprise a mean value, a median, a maximum value, a minimum value, a standard deviation, a variance, a skewness and a kurtosis;
passing the plurality of time-domain statistical features of each of the sampling windows through a context encoder comprising an embedded layer to obtain a plurality of feature vectors, and concatenating the plurality of feature vectors to obtain a first feature vector corresponding to each of the sampling window data;
Acquiring a power spectral density distribution graph and a power spectral density sequence from the electroencephalogram signal by using a power spectral density analysis method;
passing the power spectral density sequence through a first convolutional neural network using a one-dimensional convolution kernel to obtain a first feature map;
passing the power spectral density profile through a second convolutional neural network using a three-dimensional convolution kernel to obtain a second feature map;
performing two-dimensional arrangement on the first feature vectors of the sampling window data to form a feature matrix, and then passing the feature matrix through a third convolutional neural network to obtain a third feature map;
performing information distribution-based probabilistic interpretation on the first feature map, the second feature map and the third feature map respectively to obtain a corrected first feature map, a corrected second feature map and a corrected third feature map, wherein the information distribution-based probabilistic interpretation on the feature maps is performed based on logarithmic values of Softmax function values of positions in the feature maps;
fusing the corrected first feature map, the corrected second feature map and the corrected third feature map to obtain a classification feature map; and
passing the classification feature map through a classifier with a plurality of classification labels to obtain a classification result, wherein the classification result is used for representing an emotion recognition result.
According to another aspect of the present application, there is provided an emotion recognition system with time-frequency domain feature level fusion, including:
the physiological information acquisition unit is used for acquiring an electroencephalogram signal of an object to be identified;
the sampling unit is used for intercepting a plurality of sampling window data from the electroencephalogram signal along a time sequence dimension by using a preset sampling window;
the time domain analysis unit is used for extracting a plurality of time domain statistical characteristics in each sampling window data, wherein the time domain statistical characteristics comprise a mean value, a median, a maximum value, a minimum value, a standard deviation, a variance, a skewness and a kurtosis;
a time domain semantic coding unit, configured to pass multiple time domain statistical features of each of the sampling windows through a context encoder including an embedded layer to obtain multiple feature vectors, and cascade the multiple feature vectors to obtain a first feature vector corresponding to each of the sampling window data;
the frequency domain analysis unit is used for acquiring a power spectral density distribution graph and a power spectral density sequence from the electroencephalogram signal by using a power spectral density analysis method;
a first frequency-domain feature encoding unit for passing the power spectral density sequence through a first convolutional neural network using a one-dimensional convolutional kernel to obtain a first feature map;
a second frequency-domain feature encoding unit for passing the power spectral density profile through a second convolutional neural network using a three-dimensional convolutional kernel to obtain a second feature map;
the time domain correlation feature coding unit is used for performing two-dimensional arrangement on the first feature vectors of the sampling window data to form a feature matrix and then obtaining a third feature map through a third convolutional neural network;
a probabilistic interpretation unit, configured to perform probabilistic interpretation based on information distribution on the first feature map, the second feature map, and the third feature map respectively to obtain a corrected first feature map, a corrected second feature map, and a corrected third feature map, where the probabilistic interpretation based on information distribution on the feature maps is performed based on logarithmic values of Softmax function values of respective positions in the feature maps;
a feature map fusion unit, configured to fuse the modified first feature map, the modified second feature map, and the modified third feature map to obtain a classification feature map; and
the emotion recognition result generation unit is used for passing the classification feature map through a classifier with a plurality of classification labels to obtain a classification result, wherein the classification result is used for representing the emotion recognition result.
Compared with the prior art, the emotion recognition method and system based on time-frequency domain feature level fusion provided by the application encode the plurality of time domain statistical features collected in each sampling window with a deep neural network model to extract a global high-dimensional implicit feature of each of the time domain statistical features, and correlate the features of the sampling window data at the data level. Further, frequency domain feature analysis is performed on the electroencephalogram signal based on its power spectral density, and the time domain and frequency domain information is then classified jointly. In the course of feature fusion, the feature information is further given a probabilistic interpretation based on information distribution, so that the feature distribution extracted by the neural network can adapt itself over iterations, gradually approaching the true information distribution and improving the accuracy of emotion recognition.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 is an application scene diagram of an emotion recognition method based on time-frequency domain feature level fusion according to an embodiment of the present application;
FIG. 2 is a flowchart of an emotion recognition method based on time-frequency domain feature level fusion according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a system architecture of an emotion recognition method based on time-frequency domain feature level fusion according to an embodiment of the present application;
FIG. 4 is a block diagram of an emotion recognition system with time-frequency domain feature level fusion according to an embodiment of the application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Overview of a scene
As described above, Human-Computer Interaction (HCI) technology has attracted more and more attention as science and technology develop and society progresses. HCI has broad application prospects. In the field of man-machine dialogue, for example, emotion recognition enables a machine to perceive a person's emotional and psychological state, so that it understands its conversation partner better, gives more humanized answers, and improves the dialogue experience. Existing emotion recognition schemes are based either on physiological signals or on non-physiological signals, but the accuracy and real-time performance of these schemes often fail to meet the requirements of practical applications.
Therefore, an optimized emotion recognition scheme is desired.
Correspondingly, in the technical scheme of the application, the electroencephalogram signal is used as input data to construct the emotion recognition scheme. Electroencephalogram signals contain rich human physiological characteristic information but also much noise interference, so the key to the technical scheme is how to encode the electroencephalogram signal so as to quickly and accurately extract the features useful for emotion pattern recognition.
In order to reduce the amount of calculation, the acquired electroencephalogram signal is subjected to sampling processing, and specifically, a plurality of sampling window data are intercepted from the electroencephalogram signal along a time sequence dimension by a preset sampling window.
Then, a plurality of time domain statistical features are extracted from each sampling window's data based on existing time domain statistical analysis methods, where the time domain statistical features include a mean, a median, a maximum, a minimum, a standard deviation, a variance, a skewness, and a kurtosis. It should be understood that the multiple time domain statistical features of each sampling window are associated with one another; therefore, in this embodiment of the present application, a context encoder performs global context-based semantic coding on the multiple time domain statistical features of each sampling window, with the whole of the plurality of time domain statistical features serving as the semantic background, so as to extract a high-dimensional implicit feature of each time domain statistical feature and obtain a plurality of feature vectors corresponding to the plurality of time domain statistical features. The feature vectors are then concatenated to obtain a first feature vector corresponding to each sampling window's data, where the first feature vector represents the time-domain high-dimensional implicit features of that window's data.
Therefore, in the embodiment of the present application, the first feature vectors of the sampling window data are further associated at the data level. Specifically, the first feature vectors of the sampling window data are arranged into a two-dimensional feature matrix along the sample dimension of the sampling window data. The feature matrix is then explicitly spatially encoded with a convolutional neural network model to extract its high-dimensional local implicit associated features, that is, the associated high-dimensional implicit features of the time domain features of the electroencephalogram sampling window data at each time point in the time sequence dimension, so as to obtain a third feature map.
At the same time, frequency domain feature analysis is performed on the electroencephalogram signal. Specifically, a power spectral density analysis method, an existing frequency domain analysis means, is used to acquire a power spectral density distribution map and a power spectral density sequence from the electroencephalogram signal. The power spectral density sequence is then passed through a first convolutional neural network using a one-dimensional convolution kernel to obtain a first feature map; that is, the sequence is one-dimensionally convolutionally encoded with the first convolutional neural network model to extract a high-dimensional implicit expression of the associated features of the electroencephalogram signal's power spectral density in the time sequence dimension. Considering that the power spectral density profile is, in terms of data structure, a four-dimensional tensor, in the embodiment of the present application the profile is passed through a second convolutional neural network using a three-dimensional convolution kernel to obtain a second feature map; that is, the profile is convolutionally encoded based on the three-dimensional convolution kernel to extract its high-dimensional implicit features.
For the first feature map containing frequency domain power spectral density sequence information, the second feature map containing frequency domain power spectral density distribution information, and the third feature map containing the contextual-semantic and time-sequence correlation information of the time domain statistical features, performing weighted fusion purely in terms of the feature distributions of the feature maps in the high-dimensional feature space makes the weighting coefficients, which serve as hyper-parameters, difficult to converge during training and yields low accuracy. Therefore, in the present application, the first feature map, the second feature map, and the third feature map are first given a probabilistic interpretation based on information distribution, and are then fused.
Specifically, the probabilistic interpretation of a feature map based on information distribution is expressed as:

$$\hat{f}_{i,j,k} = \log \frac{\exp(f_{i,j,k})}{\sum_{i,j,k} \exp(f_{i,j,k})}$$

where $f_{i,j,k}$ is the feature value at position $(i, j, k)$ of the feature map and $\hat{f}_{i,j,k}$ is the corrected feature value.
that is, by performing probabilistic interpretation on the feature values of the feature maps according to the positions, the feature distribution extracted by the neural network can be adapted by itself along with iteration, so as to gradually approach the true information distribution, and thus, because the true information distribution of the first feature map, the second feature map and the third feature map is closer than the feature distribution in the high-dimensional feature space, the compatibility of the corrected first feature map, the second feature map and the third feature map can be effectively improved, so that the training process of the weighting coefficients serving as the hyper-parameters becomes easier, and the result is more accurate.
Based on this, the application provides an emotion recognition method with time-frequency domain feature level fusion, which includes: acquiring an electroencephalogram signal of an object to be identified; intercepting a plurality of sampling window data from the electroencephalogram signal along a time sequence dimension with a preset sampling window; extracting a plurality of time domain statistical features in each sampling window data, wherein the time domain statistical features include a mean, a median, a maximum, a minimum, a standard deviation, a variance, a skewness, and a kurtosis; passing the plurality of time domain statistical features of each sampling window through a context encoder comprising an embedded layer to obtain a plurality of feature vectors, and concatenating the plurality of feature vectors to obtain a first feature vector corresponding to each sampling window data; acquiring a power spectral density distribution graph and a power spectral density sequence from the electroencephalogram signal by using a power spectral density analysis method; passing the power spectral density sequence through a first convolutional neural network using a one-dimensional convolution kernel to obtain a first feature map; passing the power spectral density profile through a second convolutional neural network using a three-dimensional convolution kernel to obtain a second feature map; performing two-dimensional arrangement on the first feature vectors of each sampling window data to obtain a feature matrix, and then passing the feature matrix through a third convolutional neural network to obtain a third feature map; performing information distribution-based probabilistic interpretation on the first feature map, the second feature map, and the third feature map respectively to obtain a corrected first feature map, a corrected second feature map, and a corrected third feature map, wherein the information distribution-based probabilistic interpretation of the feature maps is performed based on logarithmic values of Softmax function values of the positions in the feature maps; fusing the corrected first feature map, the corrected second feature map, and the corrected third feature map to obtain a classification feature map; and passing the classification feature map through a classifier with a plurality of classification labels to obtain a classification result, wherein the classification result is used for representing an emotion recognition result.
Fig. 1 illustrates an application scenario of an emotion recognition method based on time-frequency domain feature level fusion according to an embodiment of the present application. As shown in fig. 1, in this application scenario, first, an electroencephalogram signal of an object to be recognized (for example, P as illustrated in fig. 1) is acquired by an electroencephalogram sensor (for example, an electroencephalograph T as illustrated in fig. 1), and in this application scenario, the electroencephalograph acquires the electroencephalogram signal through electrode pads pasted on the head of the object to be recognized. Then, the obtained electroencephalogram signal of the object to be identified is input into a server (for example, S as illustrated in fig. 1) deploying an emotion identification algorithm with time-frequency domain feature level fusion, wherein the server can process the electroencephalogram signal of the object to be identified by the emotion identification algorithm with time-frequency domain feature level fusion to generate a classification result representing an emotion identification result.
Having described the general principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Exemplary method
FIG. 2 illustrates a flow chart of the emotion recognition method with time-frequency domain feature level fusion. As shown in fig. 2, the emotion recognition method based on time-frequency domain feature level fusion according to the embodiment of the present application includes: S110, acquiring an electroencephalogram signal of an object to be identified; S120, intercepting a plurality of sampling window data from the electroencephalogram signal along a time sequence dimension with a preset sampling window; S130, extracting a plurality of time domain statistical features in each sampling window data, wherein the time domain statistical features include a mean, a median, a maximum, a minimum, a standard deviation, a variance, a skewness, and a kurtosis; S140, passing the plurality of time domain statistical features of each sampling window through a context encoder comprising an embedded layer to obtain a plurality of feature vectors, and concatenating the plurality of feature vectors to obtain a first feature vector corresponding to each sampling window data; S150, acquiring a power spectral density distribution graph and a power spectral density sequence from the electroencephalogram signal by using a power spectral density analysis method; S160, passing the power spectral density sequence through a first convolutional neural network using a one-dimensional convolution kernel to obtain a first feature map; S170, passing the power spectral density distribution graph through a second convolutional neural network using a three-dimensional convolution kernel to obtain a second feature map; S180, performing two-dimensional arrangement on the first feature vectors of the sampling window data to form a feature matrix, and then obtaining a third feature map through a third convolutional neural network; S190, performing information distribution-based probabilistic interpretation on the first feature map, the second feature map, and the third feature map respectively to obtain a corrected first feature map, a corrected second feature map, and a corrected third feature map, wherein the information distribution-based probabilistic interpretation of the feature maps is performed based on logarithmic values of Softmax function values of all positions in the feature maps; S200, fusing the corrected first feature map, the corrected second feature map, and the corrected third feature map to obtain a classification feature map; and S210, passing the classification feature map through a classifier with a plurality of classification labels to obtain a classification result, wherein the classification result is used for representing an emotion recognition result.
FIG. 3 is a schematic diagram illustrating the architecture of the emotion recognition method with time-frequency domain feature level fusion according to an embodiment of the present application. As shown in fig. 3, in this architecture, a plurality of sampling window data (e.g., P1 as illustrated in fig. 3) are first intercepted from the acquired electroencephalogram signal (e.g., P as illustrated in fig. 3) along a time sequence dimension with a preset sampling window; then, a plurality of time domain statistical features (e.g., P2 as illustrated in fig. 3) are extracted from each of the sampling window data; next, the plurality of time domain statistical features of each of the sampling windows are passed through a context encoder (e.g., E as illustrated in fig. 3) comprising an embedded layer to obtain a plurality of feature vectors (e.g., VF1 as illustrated in fig. 3), which are concatenated to obtain a first feature vector (e.g., VF2 as illustrated in fig. 3) corresponding to each of the sampling window data; next, a power spectral density profile (e.g., Q1 as illustrated in fig. 3) and a power spectral density sequence (e.g., Q2 as illustrated in fig. 3) are acquired from the electroencephalogram signal using a power spectral density analysis method; then, the power spectral density sequence is passed through a first convolutional neural network (e.g., CNN1 as illustrated in fig. 3) using a one-dimensional convolution kernel to obtain a first feature map (e.g., F1 as illustrated in fig. 3); next, the power spectral density profile is passed through a second convolutional neural network (e.g., CNN2 as illustrated in fig. 3) using a three-dimensional convolution kernel to obtain a second feature map (e.g., F2 as illustrated in fig. 3); then, the first feature vectors of each of the sampling window data are two-dimensionally arranged into a feature matrix (e.g., MF as illustrated in fig. 3) and passed through a third convolutional neural network (e.g., CNN3 as illustrated in fig. 3) to obtain a third feature map (e.g., F3 as illustrated in fig. 3); next, information distribution-based probabilistic interpretation is performed on the first, second, and third feature maps to obtain a corrected first feature map (e.g., FC1 as illustrated in fig. 3), a corrected second feature map (e.g., FC2 as illustrated in fig. 3), and a corrected third feature map (e.g., FC3 as illustrated in fig. 3); then, the corrected feature maps are fused to obtain a classification feature map (e.g., F as illustrated in fig. 3); and finally, the classification feature map is passed through a classifier having a plurality of classification labels (e.g., the classifier as illustrated in fig. 3) to obtain a classification result, which is used to represent an emotion recognition result.
In step S110 and step S120, an electroencephalogram signal of an object to be identified is acquired, and a plurality of sampling window data are intercepted from the electroencephalogram signal along a time sequence dimension with a preset sampling window. As described above, in the technical solution of the present application, it is desirable to construct an emotion recognition scheme using the electroencephalogram signal as input data. However, the electroencephalogram signal contains rich human physiological characteristic information as well as much noise interference; therefore, how to encode the electroencephalogram signal so as to quickly and accurately extract the features useful for emotion pattern recognition is the key to the technical solution of the present application.
Specifically, in the technical solution of the present application, firstly, an electroencephalogram sensor acquires an electroencephalogram signal of an object to be identified, and in a specific example, the electroencephalograph acquires the electroencephalogram signal through an electrode patch attached to a head of the object to be identified. It should be understood that, in order to reduce the amount of calculation, the acquired electroencephalogram signal is further subjected to sampling processing, and specifically, a plurality of sampling window data are intercepted from the electroencephalogram signal along a time sequence dimension by a preset sampling window.
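A minimal sketch of this sampling step, assuming a single-channel EEG trace stored as a NumPy array; the sampling rate, window length, and stride below are illustrative placeholders rather than values fixed by the present application:

```python
import numpy as np

def sliding_windows(eeg: np.ndarray, win_len: int, stride: int) -> np.ndarray:
    """Intercept overlapping sampling windows from a 1-D EEG signal along time."""
    starts = range(0, len(eeg) - win_len + 1, stride)
    return np.stack([eeg[s:s + win_len] for s in starts])

fs = 128                                  # assumed sampling rate (Hz)
eeg = np.random.randn(60 * fs)            # stand-in for 60 s of recorded EEG
windows = sliding_windows(eeg, win_len=2 * fs, stride=fs)
print(windows.shape)                      # (59, 256): 2 s windows with 1 s overlap
```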
In steps S130 and S140, a plurality of time-domain statistical features including a mean, a median, a maximum, a minimum, a standard deviation, a variance, a skewness, and a kurtosis in each of the sampling window data are extracted, and the plurality of time-domain statistical features of each of the sampling windows are passed through a context encoder including an embedding layer to obtain a plurality of feature vectors, and the plurality of feature vectors are concatenated to obtain a first feature vector corresponding to each of the sampling window data. That is, in the technical solution of the present application, a plurality of time domain statistical features in each of the sampling window data are further extracted based on an existing time domain feature statistical analysis method, where the time domain statistical features include a mean, a median, a maximum, a minimum, a standard deviation, a variance, a skewness, and a kurtosis.
It should be understood that, considering the correlation among the multiple time domain features of each of the sampling windows, and inspired by the semantic coding of text sequences, the technical solution of the present application regards the multiple time domain statistical features as a text-like sequence carrying high-dimensional semantic information and uses a context encoder to extract a high-dimensional semantic expression of each time domain statistical feature. Specifically, the context encoder performs context semantic-based coding on the plurality of time domain statistical features of each sampling window, with the whole of the plurality of time domain statistical features serving as the semantic background, so as to extract a high-dimensional implicit feature of each time domain statistical feature and obtain a plurality of feature vectors corresponding to the plurality of time domain statistical features. The plurality of feature vectors are then concatenated to obtain a first feature vector corresponding to each of the sampling window data, where the first feature vector represents the time-domain high-dimensional implicit features of each of the sampling window data.
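Continuing the sketch above, the eight statistics can be computed per window with NumPy and SciPy; `windows` is the array produced by the previous sketch:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def time_domain_features(window: np.ndarray) -> np.ndarray:
    """Return the eight time domain statistics named in the application."""
    return np.array([
        window.mean(),      # mean
        np.median(window),  # median
        window.max(),       # maximum
        window.min(),       # minimum
        window.std(),       # standard deviation
        window.var(),       # variance
        skew(window),       # skewness
        kurtosis(window),   # kurtosis
    ])

stats = np.stack([time_domain_features(w) for w in windows])  # (59, 8)
```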
Specifically, in this embodiment of the present application, the process of passing the plurality of time domain statistical features of each of the sampling windows through a context encoder comprising an embedded layer to obtain a plurality of feature vectors includes the following. First, the plurality of time domain statistical features of each sampling window are respectively converted into embedded vectors through the embedded layer of the context encoder to obtain a sequence of embedded vectors. It should be understood that the embedded layer may be constructed from a knowledge graph of the time domain statistical features of emotion recognition, so that prior knowledge from the knowledge graph can be introduced during embedding to increase the information content of each time domain statistical feature. Then, global context-based semantic encoding is performed on the sequence of embedded vectors using a Bert model of the context encoder to obtain the plurality of feature vectors. In particular, based on the intrinsic mask structure of the transformer, the Bert model performs global context coding on each vector in the sequence of embedded vectors, with the global context of the sequence serving as the semantic background, to obtain the plurality of feature vectors. Each of the plurality of feature vectors corresponds to one embedded vector, that is, to the high-dimensional feature representation of one time domain statistical feature.
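A hedged PyTorch sketch of one way such a context encoder could look: each scalar statistic is mapped to an embedded vector, and a small transformer encoder stands in for the Bert-style global-context coding (the actual model, and the knowledge-graph construction of the embedded layer, are not specified here); the encoder outputs are concatenated into the first feature vector:

```python
import torch
import torch.nn as nn

class StatContextEncoder(nn.Module):
    """Embed eight time domain statistics, encode them against their global
    context, and concatenate the outputs into the first feature vector."""
    def __init__(self, num_stats: int = 8, dim: int = 32):
        super().__init__()
        self.embed = nn.Linear(1, dim)  # stand-in for the embedded layer
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, stats: torch.Tensor) -> torch.Tensor:
        x = self.embed(stats.unsqueeze(-1))  # (batch, 8) -> (batch, 8, dim)
        x = self.encoder(x)                  # global context-based semantic coding
        return x.flatten(1)                  # concatenation -> first feature vector

encoder = StatContextEncoder()
first_vecs = encoder(torch.tensor(stats, dtype=torch.float32))  # (59, 256)
```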
In steps S150 and S160, a power spectral density analysis method is used to obtain a power spectral density distribution graph and a power spectral density sequence from the electroencephalogram signal, and the power spectral density sequence is passed through a first convolutional neural network using a one-dimensional convolution kernel to obtain a first feature map. That is, in the technical solution of the present application, in order to improve the accuracy of classification, it is further necessary to perform frequency domain feature analysis on the electroencephalogram signals. Specifically, in the technical solution of the present application, a power spectral density analysis method is used to obtain a power spectral density distribution graph and a power spectral density sequence from the electroencephalogram signal, where the power spectral density analysis method is an existing frequency domain analysis means. Then, the power spectral density sequence is passed through a first convolutional neural network using a one-dimensional convolutional kernel to obtain a first feature map, that is, the power spectral density sequence is subjected to one-dimensional convolutional encoding using a first convolutional neural network model to extract a high-dimensional implicit expression of the associated features of the power spectral density of the electroencephalogram signal in a time sequence dimension.
Specifically, in this embodiment of the present application, the process of passing the power spectral density sequence through a first convolutional neural network using a one-dimensional convolution kernel to obtain a first feature map includes: performing one-dimensional convolutional encoding on the power spectral density sequence with the first convolutional neural network according to the following formula to obtain the first feature map;

wherein the formula is:

$$F_{\text{out}}(a) = F \ast G(a) = \sum_{x=1}^{w} F(x) \cdot G(a + x - 1)$$

wherein a indexes the output position along the x direction, F is the parameter vector of the convolution kernel, G is the local vector matrix on which the convolution kernel operates, and w is the size of the convolution kernel.
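A sketch of this frequency-domain branch under stated assumptions: Welch's method is used here as one standard power spectral density estimator (the application does not fix a particular estimator), and a small stack of one-dimensional convolutions stands in for the first convolutional neural network; `eeg` and `fs` continue the earlier sketch:

```python
import torch
import torch.nn as nn
from scipy.signal import welch

freqs, psd = welch(eeg, fs=fs, nperseg=256)   # power spectral density sequence

cnn1 = nn.Sequential(                          # first CNN: 1-D convolution kernels
    nn.Conv1d(1, 16, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool1d(2),
    nn.Conv1d(16, 32, kernel_size=5, padding=2),
    nn.ReLU(),
)
x = torch.tensor(psd, dtype=torch.float32).view(1, 1, -1)
first_feature_map = cnn1(x)                    # (1, 32, len(psd) // 2)
```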
In step S170, the power spectral density profile is passed through a second convolutional neural network using a three-dimensional convolution kernel to obtain a second feature map. That is, considering that the power spectral density profile is, in terms of data structure, a four-dimensional tensor, in the technical solution of the present application the profile is further passed through a second convolutional neural network using a three-dimensional convolution kernel to obtain a second feature map; in other words, the power spectral density profile is convolutionally encoded by the second convolutional neural network based on the three-dimensional convolution kernel to extract its high-dimensional implicit features.
Specifically, in the embodiment of the present application, the process of passing the power spectral density profile through the second convolutional neural network using a three-dimensional convolution kernel to obtain a second feature map includes: each layer of the second convolutional neural network performs the following operations on the input data in its forward pass: performing three-dimensional convolution processing based on the three-dimensional convolution kernel on the input data to obtain a convolution feature map; pooling the convolution feature map to obtain a pooled feature map; and performing nonlinear activation on the pooled feature map to obtain an activated feature map; wherein the activated feature map output by the last layer of the second convolutional neural network is the second feature map.
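The layer operations just described map directly onto standard deep learning primitives; a minimal PyTorch sketch of one such layer, with the 4-D shape of the power spectral density profile assumed purely for illustration:

```python
import torch
import torch.nn as nn

# One layer of the second CNN: 3-D convolution -> pooling -> nonlinear activation
layer = nn.Sequential(
    nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1),
    nn.MaxPool3d(kernel_size=2),
    nn.ReLU(),
)

# Assumed (batch, channel, depth, height, width) shape for the 4-D PSD profile;
# the real dimensions depend on how the profile is computed.
psd_profile = torch.randn(1, 1, 8, 16, 16)
second_feature_map = layer(psd_profile)        # (1, 8, 4, 8, 8)
```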
In step S180, the first feature vectors of each of the sampling window data are two-dimensionally arranged into a feature matrix and then passed through a third convolutional neural network to obtain a third feature map. It should be understood that, since a person's emotion pattern is expressed more in the fluctuation pattern of the electroencephalogram signal, in the embodiment of the present application the first feature vectors of the sampling window data are further associated at the data level. Specifically, the first feature vectors of the sampling window data are arranged into a two-dimensional feature matrix along the sample dimension of the sampling window data. The feature matrix is then explicitly spatially encoded with a convolutional neural network model to extract its high-dimensional local implicit associated features, that is, the associated high-dimensional implicit features of the time domain features of the electroencephalogram sampling window data at all time points in the time sequence dimension, so as to obtain the third feature map.
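A sketch of this step, continuing the earlier sketches: the 59 first feature vectors are stacked row-wise into a two-dimensional feature matrix and encoded by an ordinary 2-D convolutional network standing in for the third convolutional neural network:

```python
import torch
import torch.nn as nn

# Stack the per-window first feature vectors row-wise into a 2-D feature matrix
feature_matrix = first_vecs.view(1, 1, 59, 256)  # (batch, channel, windows, dim)

cnn3 = nn.Sequential(                            # third CNN: 2-D spatial encoding
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)
third_feature_map = cnn3(feature_matrix)         # (1, 16, 29, 128)
```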
In step S190, information distribution-based probabilistic interpretation is performed on the first feature map, the second feature map, and the third feature map, based on logarithmic values of the Softmax function values of the positions in the feature maps, to obtain a corrected first feature map, a corrected second feature map, and a corrected third feature map, respectively. It should be understood that, for the first feature map containing frequency domain power spectral density sequence information, the second feature map containing frequency domain power spectral density distribution information, and the third feature map containing the contextual-semantic and time-sequence correlation information of the time domain statistical features, performing weighted fusion purely in terms of the feature distributions of the feature maps in the high-dimensional feature space makes the weighting coefficients, which serve as hyper-parameters, difficult to converge during training, and the accuracy is low. Therefore, in the technical solution of the present application, the first, second, and third feature maps are first given a probabilistic interpretation based on information distribution and are then fused.
Specifically, in this embodiment of the present application, a process of performing probabilistic interpretation based on information distribution on the first feature map, the second feature map, and the third feature map to obtain a corrected first feature map, a corrected second feature map, and a corrected third feature map respectively includes: performing probabilistic interpretation based on information distribution on the first feature map, the second feature map and the third feature map respectively according to the following formulas to obtain the corrected first feature map, the corrected second feature map and the corrected third feature map;
wherein the formula is:

$$\hat{f}_{i,j,k} = \log \frac{\exp(f_{i,j,k})}{\sum_{i,j,k} \exp(f_{i,j,k})}$$

wherein $\exp(f_{i,j,k})$ denotes the natural exponential function raised to the power of the feature value at each position of the first to third feature maps, and $\sum_{i,j,k} \exp(f_{i,j,k})$ denotes the sum of these exponential values over all positions of the respective feature map. It should be understood that by probabilistically interpreting the feature values position by position, the feature distribution extracted by the neural network can adapt itself over iterations and gradually approach the true information distribution. Because the true information distributions of the first, second, and third feature maps are closer to each other than their feature distributions in the high-dimensional feature space, the compatibility of the corrected first, second, and third feature maps is effectively improved, making the training of the weighting coefficients, which serve as hyper-parameters, easier and the result more accurate.
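In other words, the correction is a log-Softmax taken over all positions of a feature map, so that each feature value becomes the log-probability of its position under the map's global information distribution; a minimal sketch:

```python
import torch

def probabilistic_interpretation(fmap: torch.Tensor) -> torch.Tensor:
    """Replace each feature value with log(Softmax over all positions of the map)."""
    corrected = torch.log_softmax(fmap.flatten(), dim=0)  # log(exp(f) / sum exp(f))
    return corrected.view_as(fmap)

corrected_first = probabilistic_interpretation(first_feature_map)
corrected_second = probabilistic_interpretation(second_feature_map)
corrected_third = probabilistic_interpretation(third_feature_map)
```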
In step S200 and step S210, the corrected first feature map, the corrected second feature map, and the corrected third feature map are fused to obtain a classification feature map, and the classification feature map is passed through a classifier having a plurality of classification labels to obtain a classification result, where the classification result is used to represent an emotion recognition result. That is, in the technical solution of the present application, the three corrected feature maps are fused into a classification feature map, which is then passed through a classifier having a plurality of classification labels to obtain a classification result representing the emotion recognition result. Accordingly, in one specific example, the classification feature map is processed with the classifier according to the following formula to obtain the classification result:

$$\text{softmax}\{(W_n, B_n) : \cdots : (W_1, B_1) \mid \text{Project}(F)\}$$

where $\text{Project}(F)$ represents projecting the classification feature map as a vector, $W_1$ to $W_n$ are the weight matrices of the fully connected layers, and $B_1$ to $B_n$ are the bias matrices of the fully connected layers.
Specifically, in this embodiment of the present application, the process of fusing the corrected first feature map, the corrected second feature map, and the corrected third feature map to obtain the classification feature map includes: fusing the corrected first feature map, the corrected second feature map and the corrected third feature map to obtain the classification feature map according to the following formula;
wherein the formula is:

$$F_c = \lambda F_1 + \alpha F_2 + \beta F_3$$

wherein $F_c$ is the classification feature map, $F_1$ is the corrected first feature map, $F_2$ is the corrected second feature map, $F_3$ is the corrected third feature map, "+" denotes position-wise addition of the corrected first, second, and third feature maps, and $\lambda$, $\alpha$, and $\beta$ are weighting parameters controlling the balance among the three corrected feature maps in the classification feature map.
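A sketch of the final fusion and classification, assuming for illustration that the three corrected feature maps have been brought to a common shape (which the position-wise addition requires) and that λ, α, and β are learnable scalars; the number of emotion labels is a placeholder:

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """F_c = lambda*F1 + alpha*F2 + beta*F3, then fully connected layer + softmax."""
    def __init__(self, feat_dim: int, num_labels: int = 4):
        super().__init__()
        self.lam = nn.Parameter(torch.tensor(1.0))    # weighting parameter lambda
        self.alpha = nn.Parameter(torch.tensor(1.0))  # weighting parameter alpha
        self.beta = nn.Parameter(torch.tensor(1.0))   # weighting parameter beta
        self.fc = nn.Linear(feat_dim, num_labels)     # one (W, B) classifier layer

    def forward(self, f1, f2, f3):
        fc_map = self.lam * f1 + self.alpha * f2 + self.beta * f3  # position-wise sum
        logits = self.fc(fc_map.flatten(1))           # Project(F), then linear layer
        return torch.softmax(logits, dim=-1)          # probabilities over the labels

head = FusionClassifier(feat_dim=1024)
probs = head(torch.randn(1, 1024), torch.randn(1, 1024), torch.randn(1, 1024))
```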
In summary, the emotion recognition method based on time-frequency domain feature level fusion of the embodiments of the present application has been illustrated. It encodes the plurality of time domain statistical features collected in each sampling window with a deep neural network model to extract a global high-dimensional implicit feature of each of the time domain statistical features, and correlates the features of the sampling window data at the data level. Further, frequency domain feature analysis is performed on the electroencephalogram signal based on its power spectral density, and the time domain and frequency domain information is then classified jointly. In the course of feature fusion, the feature information is further given a probabilistic interpretation based on information distribution, so that the feature distribution extracted by the neural network can adapt itself over iterations, gradually approaching the true information distribution and improving the accuracy of emotion recognition.
Exemplary System
FIG. 4 illustrates a block diagram of an emotion recognition system with time-frequency domain feature level fusion according to an embodiment of the present application. As shown in fig. 4, the emotion recognition system 400 with time-frequency domain feature level fusion according to the embodiment of the present application includes: a physiological information acquisition unit 410 for acquiring an electroencephalogram signal of an object to be identified; a sampling unit for intercepting a plurality of sampling window data from the electroencephalogram signal along a time sequence dimension with a preset sampling window; a time domain analysis unit 420, configured to extract a plurality of time domain statistical features in each of the sampling window data, where the time domain statistical features include a mean, a median, a maximum, a minimum, a standard deviation, a variance, a skewness, and a kurtosis; a time domain semantic encoding unit 430, configured to pass the multiple time domain statistical features of each of the sampling windows through a context encoder comprising an embedded layer to obtain multiple feature vectors, and concatenate the multiple feature vectors to obtain a first feature vector corresponding to each of the sampling window data; a frequency domain analysis unit 440, configured to obtain a power spectral density distribution graph and a power spectral density sequence from the electroencephalogram signal by using a power spectral density analysis method; a first frequency-domain feature encoding unit 450, configured to pass the power spectral density sequence through a first convolutional neural network using a one-dimensional convolution kernel to obtain a first feature map; a second frequency-domain feature encoding unit 460, configured to pass the power spectral density profile through a second convolutional neural network using a three-dimensional convolution kernel to obtain a second feature map; a time domain correlation feature encoding unit 470, configured to two-dimensionally arrange the first feature vectors of the sampling window data into a feature matrix and then obtain a third feature map through a third convolutional neural network; a probabilistic interpretation unit 480, configured to perform information distribution-based probabilistic interpretation on the first feature map, the second feature map, and the third feature map to obtain a corrected first feature map, a corrected second feature map, and a corrected third feature map, respectively, wherein the information distribution-based probabilistic interpretation of the feature maps is performed based on logarithmic values of Softmax function values of the positions in the feature maps; a feature map fusion unit 490, configured to fuse the corrected first feature map, the corrected second feature map, and the corrected third feature map to obtain a classification feature map; and an emotion recognition result generation unit 500, configured to pass the classification feature map through a classifier having a plurality of classification labels to obtain a classification result, where the classification result is used to represent an emotion recognition result.
In an example, in the emotion recognition system 400 with time-frequency domain feature level fusion, the time domain semantic encoding unit 430 includes: an embedding subunit, configured to convert the plurality of time domain statistical features of each of the sampling windows into embedded vectors through the embedded layer of the context encoder to obtain a sequence of embedded vectors; and a context semantic coding subunit, configured to perform global context-based semantic coding on the sequence of embedded vectors using a transformer-based Bert model of the context encoder to obtain the plurality of feature vectors.
In an example, in the emotion recognition system 400 with time-frequency domain feature level fusion, the first frequency-domain feature encoding unit 450 is further configured to: perform one-dimensional convolutional encoding on the power spectral density sequence with the first convolutional neural network according to the following formula to obtain the first feature map;

wherein the formula is:

$$F_{\text{out}}(a) = F \ast G(a) = \sum_{x=1}^{w} F(x) \cdot G(a + x - 1)$$

wherein a indexes the output position along the x direction, F is the parameter vector of the convolution kernel, G is the local vector matrix on which the convolution kernel operates, and w is the size of the convolution kernel.
In an example, in the emotion recognition system 400 with time-frequency domain feature level fusion, the second frequency-domain feature encoding unit 460 is further configured to: each layer of the second convolutional neural network using the three-dimensional convolutional kernel performs the following operations on input data in forward transfer of the layer: performing three-dimensional convolution processing on the input data based on the three-dimensional convolution kernel to obtain a convolution characteristic diagram; pooling the convolution characteristic map to obtain a pooled characteristic map; and performing nonlinear activation on the pooled feature map to obtain an activated feature map; wherein the activation feature map output by the last layer of the second convolutional neural network is the second feature map.
In an example, in the emotion recognition system 400 with time-frequency domain feature level fusion, the probabilistic interpretation unit 480 is further configured to: respectively performing information distribution-based probabilistic interpretation on the first feature map, the second feature map and the third feature map by the following formulas to obtain the corrected first feature map, the corrected second feature map and the corrected third feature map;
wherein the formula is:

$$\hat{f}_{i,j,k} = \log \frac{\exp(f_{i,j,k})}{\sum_{i,j,k} \exp(f_{i,j,k})}$$

wherein $\exp(f_{i,j,k})$ denotes the natural exponential function raised to the power of the feature value at each position of the first to third feature maps, and $\sum_{i,j,k} \exp(f_{i,j,k})$ denotes the sum of these exponential values over all positions of the respective feature map.
In an example, in the emotion recognition system 400 with time-frequency domain feature level fusion, the feature map fusion unit 490 is further configured to: fusing the corrected first feature map, the corrected second feature map and the corrected third feature map to obtain the classification feature map according to the following formula;
wherein the formula is:

$$F_c = \lambda F_1 + \alpha F_2 + \beta F_3$$

wherein $F_c$ is the classification feature map, $F_1$ is the corrected first feature map, $F_2$ is the corrected second feature map, $F_3$ is the corrected third feature map, "+" denotes position-wise addition of the corrected first, second, and third feature maps, and $\lambda$, $\alpha$, and $\beta$ are weighting parameters controlling the balance among the three corrected feature maps in the classification feature map.
In an example, in the emotion recognition system 400 with time-frequency domain feature level fusion, the emotion recognition result generation unit 500 is configured to: process the classification feature map with the classifier according to the following formula to obtain the classification result;

wherein the formula is:

$$\text{softmax}\{(W_n, B_n) : \cdots : (W_1, B_1) \mid \text{Project}(F)\}$$

where $\text{Project}(F)$ represents projecting the classification feature map as a vector, $W_1$ to $W_n$ are the weight matrices of the fully connected layers, and $B_1$ to $B_n$ are the bias matrices of the fully connected layers.
Here, those skilled in the art will understand that the specific functions and operations of the respective units and modules in the emotion recognition system 400 with time-frequency domain feature level fusion described above have been described in detail in the description of the emotion recognition method with time-frequency domain feature level fusion given above with reference to FIGS. 1 to 3, and a repeated description is therefore omitted.
As described above, the emotion recognition system 400 based on time-frequency domain feature level fusion according to the embodiment of the present application can be implemented in various terminal devices, for example, a server running an emotion recognition algorithm based on time-frequency domain feature level fusion. In one example, the emotion recognition system 400 according to the embodiment of the present application can be integrated into a terminal device as a software module and/or a hardware module. For example, the emotion recognition system 400 may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, the emotion recognition system 400 can also be one of the hardware modules of the terminal device.
Alternatively, in another example, the emotion recognition system 400 with time-frequency domain feature level fusion and the terminal device may be separate devices, in which case the emotion recognition system 400 may be connected to the terminal device through a wired and/or wireless network and exchange interaction information with it in an agreed data format.
The foregoing describes the general principles of the present application in conjunction with specific embodiments. However, the advantages and effects mentioned in the present application are merely examples, not limitations, and should not be considered essential to the various embodiments of the present application. The specific details disclosed above are provided only for purposes of illustration and ease of understanding; the application is not limited to practice with those details.
The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that connections, arrangements, or configurations must be made in the manner shown in the block diagrams. As those skilled in the art will appreciate, these devices, apparatuses, and systems may be connected, arranged, or configured in any manner. Words such as "including", "comprising", and "having" are open-ended terms that mean "including but not limited to" and may be used interchangeably with that phrase. The word "or" as used herein means, and is used interchangeably with, "and/or", unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.
Claims (10)
1. A method for emotion recognition by means of time-frequency domain feature level fusion is characterized by comprising the following steps:
acquiring an electroencephalogram signal of an object to be identified;
intercepting a plurality of sampling window data from the electroencephalogram signal along a time sequence dimension by using a preset sampling window;
extracting a plurality of time domain statistical features from each sampling window data, wherein the time domain statistical features comprise a mean value, a median, a maximum value, a minimum value, a standard deviation, a variance, a skewness and a kurtosis;
passing the plurality of time domain statistical features of each of the sampling windows through a context encoder comprising an embedded layer to obtain a plurality of feature vectors, and concatenating the plurality of feature vectors to obtain a first feature vector corresponding to each of the sampling window data;
Acquiring a power spectral density distribution graph and a power spectral density sequence from the electroencephalogram signal by using a power spectral density analysis method;
passing the power spectral density sequence through a first convolutional neural network using a one-dimensional convolution kernel to obtain a first feature map;
passing the power spectral density profile through a second convolutional neural network using a three-dimensional convolution kernel to obtain a second feature map;
performing two-dimensional arrangement on the first feature vectors of each sampling window data to obtain a feature matrix, and then passing the feature matrix through a third convolutional neural network to obtain a third feature map;
respectively performing information distribution-based probabilistic interpretation on the first feature map, the second feature map and the third feature map to obtain a corrected first feature map, a corrected second feature map and a corrected third feature map, wherein the information distribution-based probabilistic interpretation on the feature maps is performed on the basis of logarithmic values of Softmax function values of all positions in the feature maps;
fusing the corrected first feature map, the corrected second feature map and the corrected third feature map to obtain a classification feature map; and
passing the classification feature map through a classifier with a plurality of classification labels to obtain a classification result, wherein the classification result is used for representing an emotion recognition result.
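As a non-limiting sketch of the first steps of claim 1 (intercepting sampling window data along the time dimension and extracting the eight recited time domain statistics), the following may serve; the window and step lengths and the single-channel EEG input are assumptions:

```python
import numpy as np
from scipy.stats import kurtosis, skew

def window_statistics(eeg: np.ndarray, window: int = 256, step: int = 128):
    """Slide a preset sampling window over a single-channel EEG signal
    and extract mean, median, max, min, standard deviation, variance,
    skewness and kurtosis for each window."""
    features = []
    for start in range(0, len(eeg) - window + 1, step):
        w = eeg[start:start + window]
        features.append([
            np.mean(w), np.median(w), np.max(w), np.min(w),
            np.std(w), np.var(w), skew(w), kurtosis(w),
        ])
    return np.asarray(features)  # shape: (num_windows, 8)

stats = window_statistics(np.random.randn(2048))
```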
2. The method for emotion recognition with time-frequency domain feature level fusion according to claim 1, wherein the passing of the plurality of time domain statistical features of each sampling window through a context encoder including an embedded layer to obtain a plurality of feature vectors comprises:
respectively converting the plurality of time domain statistical features of each sampling window into embedded vectors through an embedded layer of the context encoder to obtain a sequence of embedded vectors; and
performing global context-based semantic encoding on the sequence of embedded vectors using a transformer-based Bert model of the context encoder to obtain the plurality of feature vectors.
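The context encoder of claim 2 is not specified beyond an embedded layer and a transformer-based Bert model; a hedged sketch using a generic transformer encoder as a stand-in for the Bert model is shown below, with the embedding dimension, head count and depth all assumed:

```python
import torch
import torch.nn as nn

class StatContextEncoder(nn.Module):
    """Embedding layer plus a transformer encoder standing in for the
    Bert model; produces globally context-aware feature vectors from
    the eight per-window statistics."""

    def __init__(self, dim: int = 32):
        super().__init__()
        self.embed = nn.Linear(1, dim)  # scalar statistic -> embedded vector
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, stats: torch.Tensor) -> torch.Tensor:
        x = self.embed(stats.unsqueeze(-1))  # (batch, 8, dim)
        return self.encoder(x)               # (batch, 8, dim)

vectors = StatContextEncoder()(torch.randn(4, 8))
first_feature_vector = vectors.flatten(start_dim=1)  # concatenate per window
```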
3. The emotion recognition method of time-frequency domain feature level fusion according to claim 2, wherein the step of passing the power spectral density sequence through a first convolutional neural network using a one-dimensional convolution kernel to obtain a first feature map comprises:
performing one-dimensional convolutional encoding on the power spectral density sequence by using the first convolutional neural network according to the following formula to obtain the first feature map;
wherein the formula is:

$$F_{\text{out}}(a)=\sum_{x=0}^{w-1}F(x)\cdot G(a+x)$$

wherein $a$ is the width of the convolution kernel in the $x$ direction, $F$ is the parameter vector of the convolution kernel, $G$ is the local vector matrix operated on by the convolution kernel function, and $w$ is the size of the convolution kernel.
4. The emotion recognition method of time-frequency domain feature level fusion according to claim 3, wherein the passing of the power spectral density profile through a second convolutional neural network using a three-dimensional convolution kernel to obtain a second feature map comprises: each layer of the second convolutional neural network using the three-dimensional convolution kernel performs the following operations on input data in the forward pass of the layer:
performing three-dimensional convolution processing on the input data based on the three-dimensional convolution kernel to obtain a convolution feature map;
pooling the convolution feature map to obtain a pooled feature map; and
performing nonlinear activation on the pooled feature map to obtain an activation feature map;
wherein the activation feature map output by the last layer of the second convolutional neural network is the second feature map.
5. The emotion recognition method based on time-frequency domain feature level fusion as claimed in claim 4, wherein performing probabilistic interpretation based on information distribution on the first feature map, the second feature map and the third feature map respectively to obtain the corrected first feature map, the corrected second feature map and the corrected third feature map comprises:
respectively performing information distribution-based probabilistic interpretation on the first feature map, the second feature map and the third feature map by the following formulas to obtain the corrected first feature map, the corrected second feature map and the corrected third feature map;
wherein the formula is:

$$\tilde{f}_{i,j,k}=\log\frac{\exp(f_{i,j,k})}{\sum_{i,j,k}\exp(f_{i,j,k})}$$

wherein $\exp(f_{i,j,k})$ denotes the natural exponential function value raised to the power of the feature value at each position in the first to third feature maps, and $\sum_{i,j,k}\exp(f_{i,j,k})$ denotes the sum of the natural exponential function values raised to the powers of the feature values of the respective positions in the first to third feature maps.
6. The emotion recognition method based on time-frequency domain feature level fusion as claimed in claim 5, wherein the step of fusing the corrected first feature map, the corrected second feature map and the corrected third feature map to obtain a classification feature map comprises:
fusing the corrected first feature map, the corrected second feature map and the corrected third feature map to obtain the classification feature map according to the following formula;
wherein the formula is:

$$F_c=\lambda F_1+\alpha F_2+\beta F_3$$

wherein $F_c$ is the classification feature map, $F_1$ is the corrected first feature map, $F_2$ is the corrected second feature map, $F_3$ is the corrected third feature map, "+" denotes the addition of elements at the corresponding positions of the corrected first, second and third feature maps, and $\lambda$, $\alpha$ and $\beta$ are weighting parameters for controlling the balance among the corrected first, second and third feature maps in the classification feature map.
7. The emotion recognition method of time-frequency domain feature level fusion according to claim 6, wherein passing the classification feature map through a classifier having a plurality of classification labels to obtain a classification result, the classification result being used for representing an emotion recognition result, comprises:
processing the classification feature map by using the classifier according to the following formula to obtain the classification result;

wherein the formula is:

$$\mathrm{softmax}\{(W_n,B_n):\cdots:(W_1,B_1)\mid \mathrm{Project}(F)\}$$

wherein $\mathrm{Project}(F)$ denotes the projection of the classification feature map into a vector, $W_1$ to $W_n$ are the weight matrices of the fully connected layers of the classifier, and $B_1$ to $B_n$ are the bias matrices of the fully connected layers.
8. An emotion recognition system for time-frequency domain feature level fusion, characterized by comprising:
the physiological information acquisition unit is used for acquiring an electroencephalogram signal of an object to be identified;
the sampling unit is used for intercepting a plurality of sampling window data from the electroencephalogram signal along a time sequence dimension by using a preset sampling window;
the time domain analysis unit is used for extracting a plurality of time domain statistical features from each sampling window data, wherein the time domain statistical features comprise a mean value, a median, a maximum value, a minimum value, a standard deviation, a variance, a skewness and a kurtosis;
a time domain semantic coding unit, configured to pass multiple time domain statistical features of each of the sampling windows through a context encoder including an embedded layer to obtain multiple feature vectors, and cascade the multiple feature vectors to obtain a first feature vector corresponding to each of the sampling window data;
the frequency domain analysis unit is used for acquiring a power spectral density distribution graph and a power spectral density sequence from the electroencephalogram signal by using a power spectral density analysis method;
a first frequency-domain feature encoding unit, configured to pass the power spectral density sequence through a first convolutional neural network using a one-dimensional convolution kernel to obtain a first feature map;
a second frequency-domain feature encoding unit, configured to pass the power spectral density profile through a second convolutional neural network using a three-dimensional convolution kernel to obtain a second feature map;
a time domain correlation feature encoding unit, configured to perform two-dimensional arrangement on the first feature vectors of each sampling window data to obtain a feature matrix, and then pass the feature matrix through a third convolutional neural network to obtain a third feature map;
a probabilistic interpretation unit, configured to perform probabilistic interpretation based on information distribution on the first feature map, the second feature map, and the third feature map respectively to obtain a corrected first feature map, a corrected second feature map, and a corrected third feature map, where the probabilistic interpretation based on information distribution on the feature maps is performed based on logarithmic values of Softmax function values of respective positions in the feature maps;
a feature map fusion unit, configured to fuse the corrected first feature map, the corrected second feature map, and the corrected third feature map to obtain a classification feature map; and
an emotion recognition result generation unit, configured to pass the classification feature map through a classifier with a plurality of classification labels to obtain a classification result, the classification result being used for representing the emotion recognition result.
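The frequency domain analysis unit recited in claim 8 yields both a power spectral density sequence and a power spectral density distribution map; as a sketch only, Welch's method and a spectrogram are plausible (but assumed) realizations of the two outputs:

```python
import numpy as np
from scipy import signal

def psd_features(eeg: np.ndarray, fs: float = 256.0):
    """Welch estimate as the power spectral density sequence and a
    spectrogram as the density distribution map; both estimators are
    illustrative choices, not taken from the claim."""
    _, psd_sequence = signal.welch(eeg, fs=fs, nperseg=512)
    _, _, density_map = signal.spectrogram(eeg, fs=fs, nperseg=256)
    return psd_sequence, density_map

psd_seq, psd_map = psd_features(np.random.randn(4096))
```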
9. The emotion recognition system of time-frequency domain feature level fusion according to claim 8, wherein the time domain semantic coding unit comprises:
an embedding subunit, configured to convert, by an embedding layer of the context encoder, the plurality of time-domain statistical features of each of the sampling windows into an embedding vector, respectively, to obtain a sequence of embedding vectors; and
a context semantic coding subunit, configured to perform global context-based semantic coding on the sequence of embedded vectors using a transformer-based Bert model of the context encoder to obtain the plurality of feature vectors.
10. The emotion recognition system of time-frequency domain feature level fusion of claim 9, wherein the probabilistic interpretation unit is further configured to: perform probabilistic interpretation based on information distribution on the first feature map, the second feature map and the third feature map respectively according to the following formulas to obtain the corrected first feature map, the corrected second feature map and the corrected third feature map;
wherein the formula is:

$$\tilde{f}_{i,j,k}=\log\frac{\exp(f_{i,j,k})}{\sum_{i,j,k}\exp(f_{i,j,k})}$$

wherein $\exp(f_{i,j,k})$ denotes the natural exponential function value raised to the power of the feature value at each position in the first to third feature maps, and $\sum_{i,j,k}\exp(f_{i,j,k})$ denotes the sum of the natural exponential function values raised to the powers of the feature values of the respective positions in the first to third feature maps.
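For the time domain correlation feature encoding unit of claim 8 (two-dimensional arrangement of the per-window first feature vectors followed by a third convolutional neural network), a minimal sketch is given below; the number of windows, vector length and network sizes are assumptions:

```python
import torch
import torch.nn as nn

# Stack the first feature vectors of all sampling windows row by row
# into a two-dimensional feature matrix, then encode it with a third
# convolutional neural network (all sizes assumed).
third_cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
)

num_windows, vector_dim = 15, 256
feature_matrix = torch.randn(num_windows, vector_dim)
third_feature_map = third_cnn(feature_matrix.unsqueeze(0).unsqueeze(0))
```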
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202210821466.2A (CN115062728A) | 2022-07-12 | 2022-07-12 | Emotion identification method and system based on time-frequency domain feature level fusion |
Publications (1)

| Publication Number | Publication Date |
| --- | --- |
| CN115062728A (en) | 2022-09-16 |
Family

Family ID: 83207107

Family Applications (1)

| Application Number | Status | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202210821466.2A (CN115062728A, en) | Withdrawn | 2022-07-12 | 2022-07-12 |
Country Status (1)

| Country | Link |
| --- | --- |
| CN | CN115062728A (en) |
Cited By (2)

| Publication Number | Priority Date | Publication Date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN116015274A | 2023-01-06 | 2023-04-25 | 温州艾玛朵电气有限公司 | Intelligent switch based on wireless control |
| CN116015274B | 2023-01-06 | 2023-12-22 | 温州艾玛朵电气有限公司 | Intelligent switch based on wireless control |
Legal Events

| Date | Code | Title | Description |
| --- | --- | --- | --- |
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | WW01 | Invention patent application withdrawn after publication | Application publication date: 20220916 |