CN110301920B - Multi-modal fusion method and device for psychological stress detection - Google Patents

Multi-modal fusion method and device for psychological stress detection

Info

Publication number
CN110301920B
CN110301920B (application CN201910567398.XA)
Authority
CN
China
Prior art keywords
feature matrix
matrix
attention
text
feature
Prior art date
Legal status
Active
Application number
CN201910567398.XA
Other languages
Chinese (zh)
Other versions
CN110301920A (en)
Inventor
冯铃
张慧君
曹檑
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201910567398.XA
Publication of CN110301920A
Application granted
Publication of CN110301920B
Status: Active
Anticipated expiration

Classifications

    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/16 Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B5/165 Evaluating the state of mind, e.g. depression, anxiety
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/72 Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B5/7235 Details of waveform analysis
    • A61B5/7264 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Psychiatry (AREA)
  • Public Health (AREA)
  • Surgery (AREA)
  • Veterinary Medicine (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Pathology (AREA)
  • Biomedical Technology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Medical Informatics (AREA)
  • Child & Adolescent Psychology (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Fuzzy Systems (AREA)
  • Physiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Developmental Disabilities (AREA)
  • Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Educational Technology (AREA)
  • Social Psychology (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The embodiment of the invention provides a multi-modal fusion method and device for psychological stress detection. The method obtains attention-enhanced feature matrices for the six directed modality pairs physiological data→text, physiological data→picture, text→physiological data, text→picture, picture→physiological data and picture→text, and then, through a feedforward fully-connected neural network, obtains fusion feature matrices for the text, the picture and the physiological data. A fused representation matrix of the three modalities is then obtained from the importance weight values of the text, picture and physiological data together with their fusion feature matrices. Finally, a stress classification vector reflecting the psychological stress problem is obtained from the fused representation of the three modalities through a feedforward fully-connected network. By fusing text and picture data with physiological data, the invention compensates for the shortcomings caused by the subjectivity of text and picture data and addresses some inherent problems of physiological data.

Description

Multi-modal fusion method and device for psychological stress detection
Technical Field
The invention relates to the field of computer technology, and in particular to a multi-modal fusion method and device for psychological stress detection.
Background
As social competition increases, psychological stress among teenagers has gradually become a serious problem. Excessive psychological stress can cause many physiological and psychological problems, which makes psychological stress detection more and more important.
Existing work on psychological stress detection from social media focuses only on text and picture content; however, such content is subjective and sometimes fails to express the user's real mental state.
Some work on physiological signals, such as heart rate variability, electrocardiogram, galvanic skin response, electroencephalogram, blood pressure and electromyogram, has proven their effectiveness for detecting psychological stress. However, physiological data have inherent problems: for example, the signals measured in a state of extreme excitement and in a state of extreme stress are very similar, so physiological data alone cannot fully express the real psychological state.
As can be seen from the above, there is currently no sufficiently effective method or device for psychological stress detection.
Disclosure of Invention
To solve the problems in the prior art, embodiments of the present invention provide a multi-modal fusion method and apparatus for psychological stress detection.
In a first aspect, an embodiment of the present invention provides an attention weight correspondence method for feature-interactive fusion of two modality data, including:
obtaining, based on the feature matrices of the two modality data and using matrix multiplication, an association matrix reflecting the information association between different features of the two modality data;
obtaining, based on the association matrix and a feedforward fully-connected network model, an influence weight matrix of the feature matrix of one modality's data on the feature matrix of the other modality's data;
and obtaining, based on the influence weight matrices and the feature matrices of the two modality data and using element-wise matrix multiplication and residual connections, attention-enhanced feature matrices that contain the mutual influence weights of the feature matrices of the two modality data.
In a second aspect, based on the attention weight correspondence method of the first aspect, an embodiment of the present invention provides a multi-modal fusion method for psychological stress detection, including:
respectively obtaining a physiological data related feature matrix reflecting the user's physiological state, and a text feature matrix and a picture feature matrix reflecting the user's mental activity;
obtaining, from the physiological data related feature matrix, the text feature matrix and the picture feature matrix using the attention weight correspondence method: a first attention-enhanced feature matrix containing the influence weight of the physiological data related feature matrix on the text feature matrix; a second attention-enhanced feature matrix containing the influence weight of the physiological data related feature matrix on the picture feature matrix; a third attention-enhanced feature matrix containing the influence weight of the text feature matrix on the physiological data related feature matrix; a fourth attention-enhanced feature matrix containing the influence weight of the text feature matrix on the picture feature matrix; a fifth attention-enhanced feature matrix containing the influence weight of the picture feature matrix on the physiological data related feature matrix; and a sixth attention-enhanced feature matrix containing the influence weight of the picture feature matrix on the text feature matrix;
obtaining a text fusion feature matrix, a picture fusion feature matrix and a physiological data fusion feature matrix from the six attention-enhanced feature matrices through a feedforward fully-connected neural network;
obtaining text, picture and physiological data feature values from the physiological data related feature matrix, the text feature matrix and the picture feature matrix through a feedforward fully-connected neural network;
obtaining importance weight values for the text, the picture and the physiological data from these feature values using vector concatenation and an attention mechanism;
obtaining a fused representation matrix of the three modalities from the importance weight values and the three fusion feature matrices;
and obtaining a stress classification vector reflecting the psychological stress problem from the fused representation of the three modalities through a feedforward fully-connected network.
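The steps above can be sketched end to end in NumPy. This is a hedged illustration, not the patented implementation: all matrix sizes, the ReLU activation, the softmax importance weighting and the two-class output are assumptions, and all trained parameters are replaced by random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 8   # shared feature dimension (assumed)
n = 5   # rows per modality feature matrix (assumed)

def fc(x, w):
    # one feedforward fully-connected layer; ReLU activation is an assumption
    return np.maximum(x @ w, 0.0)

def f_amm(a, b):
    # pairwise attention weight correspondence; weights are random stand-ins
    c = a @ b.T                                            # association matrix
    a_t = (c @ rng.standard_normal((b.shape[0], k))) * a + a
    b_t = (c.T @ rng.standard_normal((a.shape[0], k))) * b + b
    return a_t, b_t

# toy feature matrices: text T, picture P, physiological data S
T, P, S = (rng.standard_normal((n, k)) for _ in range(3))

# six attention-enhanced feature matrices from the three modality pairs
T_s, S_t = f_amm(T, S)   # physiological->text (1st), text->physiological (3rd)
P_s, S_p = f_amm(P, S)   # physiological->picture (2nd), picture->physiological (5th)
T_p, P_t = f_amm(T, P)   # picture->text (6th), text->picture (4th)

# per-modality fusion: concatenate both enhanced matrices, map back to k dims
T_fused = fc(np.concatenate([T_s, T_p], axis=1), rng.standard_normal((2 * k, k)))
P_fused = fc(np.concatenate([P_s, P_t], axis=1), rng.standard_normal((2 * k, k)))
S_fused = fc(np.concatenate([S_t, S_p], axis=1), rng.standard_normal((2 * k, k)))

# modality feature values, then softmax importance weights (attention, assumed)
vals = np.array([fc(M, rng.standard_normal((k, 1))).mean() for M in (T, P, S)])
w = np.exp(vals) / np.exp(vals).sum()

# importance-weighted fusion of the three modalities, then stress classification
fused = w[0] * T_fused + w[1] * P_fused + w[2] * S_fused
logits = fused.mean(axis=0) @ rng.standard_normal((k, 2))  # 2 classes (assumed)
print(logits.shape)  # prints (2,)
```

The pooling into `vals` and the final `mean(axis=0)` are placeholders for whatever aggregation the trained model uses; only the data flow of the claimed steps is preserved.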
In a third aspect, an embodiment of the present invention further provides an attention weight correspondence apparatus for feature-interactive fusion of two modality data, including:
a first acquisition module, configured to obtain, based on the feature matrices of the two modality data and using matrix multiplication, an association matrix reflecting the information association between different features of the two modality data;
a second acquisition module, configured to obtain, based on the association matrix and a feedforward fully-connected network model, an influence weight matrix of the feature matrix of one modality's data on the feature matrix of the other modality's data;
and a third acquisition module, configured to obtain, based on the influence weight matrices and the feature matrices of the two modality data and using element-wise matrix multiplication and residual connections, attention-enhanced feature matrices containing the mutual influence weights of the feature matrices of the two modality data.
In a fourth aspect, based on the attention weight correspondence apparatus of the third aspect, an embodiment of the present invention further provides a multi-modal fusion apparatus for psychological stress detection, including:
a fourth acquisition module, configured to respectively obtain a physiological data related feature matrix reflecting the user's physiological state, and a text feature matrix and a picture feature matrix reflecting the user's mental activity;
a fifth acquisition module, configured to obtain, from the physiological data related feature matrix, the text feature matrix and the picture feature matrix using the attention weight correspondence method: a first attention-enhanced feature matrix containing the influence weight of the physiological data related feature matrix on the text feature matrix; a second attention-enhanced feature matrix containing the influence weight of the physiological data related feature matrix on the picture feature matrix; a third attention-enhanced feature matrix containing the influence weight of the text feature matrix on the physiological data related feature matrix; a fourth attention-enhanced feature matrix containing the influence weight of the text feature matrix on the picture feature matrix; a fifth attention-enhanced feature matrix containing the influence weight of the picture feature matrix on the physiological data related feature matrix; and a sixth attention-enhanced feature matrix containing the influence weight of the picture feature matrix on the text feature matrix;
a sixth acquisition module, configured to obtain a text fusion feature matrix, a picture fusion feature matrix and a physiological data fusion feature matrix from the six attention-enhanced feature matrices through a feedforward fully-connected neural network;
a seventh acquisition module, configured to obtain text, picture and physiological data feature values from the physiological data related feature matrix, the text feature matrix and the picture feature matrix through a feedforward fully-connected neural network;
an eighth acquisition module, configured to obtain importance weight values for the text, the picture and the physiological data from these feature values using vector concatenation and an attention mechanism;
a ninth acquisition module, configured to obtain a fused representation matrix of the three modalities from the importance weight values and the three fusion feature matrices;
and a tenth acquisition module, configured to obtain a stress classification vector reflecting the psychological stress problem from the fused representation of the three modalities through a feedforward fully-connected network.
In a fifth aspect, embodiments of the present invention further provide an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the program to implement the steps of the attention weight correspondence method for feature interactive fusion of two modality data according to the first aspect, and/or the steps of the multi-modality fusion method for mental stress detection according to the second aspect.
In a sixth aspect, the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the attention weight correspondence method for feature interactive fusion of two modality data as described in the first aspect, and/or the steps of the multi-modality fusion method for mental stress detection as described in the second aspect.
It can be seen from the above technical solutions that the attention weight correspondence method and apparatus of the embodiments of the present invention obtain, from the feature matrices of two modality data and using matrix multiplication, an association matrix reflecting the information association between their different features; obtain, from the association matrix and a feedforward fully-connected network model, an influence weight matrix of the feature matrix of one modality on that of the other; and finally obtain, using element-wise multiplication and residual connections, attention-enhanced feature matrices containing the mutual influence weights of the two feature matrices. On this basis, another embodiment of the present invention provides a multi-modal fusion method and apparatus for psychological stress detection: six attention-enhanced feature matrices (physiological data→text, physiological data→picture, text→physiological data, text→picture, picture→physiological data and picture→text) are obtained from the physiological data related feature matrix, the text feature matrix and the picture feature matrix using the attention weight correspondence method; a text fusion feature matrix, a picture fusion feature matrix and a physiological data fusion feature matrix are then obtained from these six matrices through a feedforward fully-connected neural network; text, picture and physiological data feature values are obtained from the three original feature matrices through a feedforward fully-connected neural network, and importance weight values are obtained from these feature values using vector concatenation and an attention mechanism; a fused representation matrix of the three modalities is obtained from the importance weight values and the three fusion feature matrices; and finally a stress classification vector reflecting the psychological stress problem is obtained from the fused representation through a feedforward fully-connected network.
By fusing text and picture data with physiological data, the embodiments of the invention compensate for the shortcomings caused by the subjectivity of text and picture data, address some inherent problems of physiological data (for example, the physiological data in a state of extreme excitement and in a state of extreme stress are very similar), and to some extent bridge the detection blind window caused by missing data in a single modality.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an attention weight correspondence method for feature interactive fusion of two modality data according to an embodiment of the present invention;
fig. 2 is a model structure diagram of the attention weight correspondence method for feature interactive fusion of two modality data according to an embodiment of the present invention;
FIG. 3 is a flowchart of a multi-modal fusion method for mental stress detection according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a text feature extraction process according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a physiological feature extraction process provided by an embodiment of the invention;
FIG. 6 is a block diagram of a fusion method for multi-modal detection of psychological stress problems for text, pictures and physiological related data according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a multi-modal fusion apparatus for mental stress detection according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an attention weight correspondence apparatus for feature interactive fusion of two modality data according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Before describing the scheme provided by the embodiments of the present invention, the origin of the invention is briefly explained. When a teenager is under psychological stress, daily activity levels (such as step counts) and sleep (difficulty falling asleep, waking early, etc.) often become abnormal. On the other hand, what teenagers write and the pictures they post can largely reflect their mental state and daily activities. The embodiments of the invention therefore detect teenagers' psychological stress by fusing text data, picture data and physiological data. Because this requires solving a multi-modal fusion problem, the embodiments first provide an attention weight correspondence method that enables feature-interactive fusion of two modality data, and then, building on it, a multi-modal fusion method for detecting psychological stress from text, picture and physiological data. Both are described in detail through the specific embodiments below.
Fig. 1 shows a flowchart of an attention weight correspondence method for feature interactive fusion of two modality data according to an embodiment of the present invention. As shown in fig. 1, the attention weight correspondence method for feature interactive fusion of two modality data according to the embodiment of the present invention includes the following steps:
step 101: based on the feature matrix of the two modal data, an incidence relation matrix reflecting the information incidence between different features of the two modal data is obtained by matrix multiplication.
In this embodiment, modality data refers to text data used for detecting teenagers' psychological stress (such as text questionnaires, diaries, essays and compositions), picture data used for the same purpose (such as picture questionnaires, favorite cartoons and hand scribbles), or physiologically relevant data such as exercise and sleep records.
In this embodiment, the two modality data may be text data and picture data, text data and physiological data, or picture data and physiological data.
Step 102: based on the association matrix and a feedforward fully-connected network model, obtain an influence weight matrix of the feature matrix of one modality's data on the feature matrix of the other modality's data.
Step 103: based on the influence weight matrices and the feature matrices of the two modality data, obtain, using element-wise multiplication and residual connections, attention-enhanced feature matrices containing the mutual influence weights of the two feature matrices.
The main purpose of this embodiment is to obtain the association between the feature matrices of the two modalities' data and map that association back onto the original feature matrices, so that the processed feature matrix of each modality contains information about the other modality's influence on it. As shown in fig. 2, the details are as follows:
assume the feature matrices of the two modality data are A and B, where
Figure BDA0002109908360000071
Multiplying the rank conversion matrixes of A and B by matrix multiplication to obtain a correlation relation matrix containing each characteristic in A and each characteristic in B
Figure BDA0002109908360000072
Figure BDA0002109908360000073
Through a layer of fully connected network, will
Figure BDA0002109908360000074
Is mapped back to
Figure BDA0002109908360000075
Of vector space, get AB→A
Figure BDA0002109908360000076
Figure BDA0002109908360000077
AB→ARepresents the weight of influence of mode B on mode A, W1Representing a first preset training parameter.
Using dot product operation to get AB→AMultiplying the obtained result by A and obtaining an attention-strengthening feature matrix after residual connection
Figure BDA0002109908360000078
Figure BDA0002109908360000079
Contains the information of B and the influence of B on A.
Figure BDA00021099083600000710
By the same way, the attention-strengthening feature matrix can be obtained
Figure BDA00021099083600000711
Figure BDA00021099083600000712
The information of A and the influence of A on B are included:
Figure BDA00021099083600000713
Figure BDA00021099083600000714
to facilitate the later embodiments to invoke this method, use fAMMTo express the method that
Figure BDA00021099083600000715
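The attention weight correspondence described above (association matrix, fully-connected mapping, element-wise product with residual connection) can be sketched in NumPy. This is a hedged illustration: the row counts m = 4 and n = 6 are arbitrary, W1 and W2 stand in for the trained parameters, and no normalization (e.g. softmax) is applied since the description does not specify one.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 4, 6, 8   # arbitrary row counts and shared feature dimension

def f_amm(A, B, W1, W2):
    # association matrix between each feature of A and each feature of B
    C = A @ B.T                  # shape (m, n)
    # one fully-connected layer maps C back to each modality's vector space
    A_w = C @ W1                 # influence weight of B on A, shape (m, k)
    B_w = C.T @ W2               # influence weight of A on B, shape (n, k)
    # element-wise product plus residual connection
    A_tilde = A_w * A + A
    B_tilde = B_w * B + B
    return A_tilde, B_tilde

A = rng.standard_normal((m, k))   # e.g. text feature matrix
B = rng.standard_normal((n, k))   # e.g. picture feature matrix
W1 = rng.standard_normal((n, k))  # stand-in for the first training parameter
W2 = rng.standard_normal((m, k))  # stand-in for the second training parameter
A_t, B_t = f_amm(A, B, W1, W2)
print(A_t.shape, B_t.shape)       # prints (4, 8) (6, 8)
```

Note that the residual connection guarantees each output keeps the shape of its input, which is what lets the later embodiments chain this method across the three modality pairs.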
It should be noted that the two modality data in this embodiment may be text data and picture data, text data and physiological data, or picture data and physiological data, each used for detecting psychological stress. Through the above processing, the embodiment of the present invention implements the attention weight correspondence method for feature-interactive fusion of each of these pairs. As a result, the processed text data contains the association influence of the picture data and of the physiological data on it, the processed picture data contains the association influence of the text data and of the physiological data on it, and the processed physiological data contains the association influence of the text data and of the picture data on it. That is, the processed feature matrix of each modality contains information about the influence of the other modalities, so that fusing the multi-modal feature data yields their combined effect.
Based on this method, the following embodiments provide a multi-modal fusion method and apparatus for psychological stress detection, which fuse text data, picture data and physiological data. This compensates for the shortcomings caused by the subjectivity of text and picture data, addresses some inherent problems of physiological data (for example, the physiological data in a state of extreme excitement and in a state of extreme stress are very similar), and to some extent bridges the detection blind window caused by missing data.
As can be seen from the above technical solution, the attention weight correspondence method for feature-interactive fusion of two modality data provided in the embodiment of the present invention aims to obtain the association between the two modalities' feature matrices and map it back onto the original feature matrices, so that the processed feature matrix of each modality contains information about the other modality's influence on it. The processing is as follows: based on the feature matrices of the two modality data, an association matrix reflecting the degree of information association between their different features is obtained using matrix multiplication; based on the association matrix and a feedforward fully-connected network model, an influence weight matrix of one modality's feature matrix on the other's is obtained; and finally, based on the influence weight matrices and the two feature matrices, attention-enhanced feature matrices containing their mutual influence weights are obtained using element-wise multiplication and residual connections. Applied to the pairs text/picture, text/physiological and picture/physiological, this makes the processed data of each modality contain the association influence of the other two modalities, so that fusing the multi-modal feature data yields their combined effect.
Further, based on the content of the foregoing embodiment, in the present embodiment, the foregoing step 101 may be implemented as follows:
acquiring an association matrix reflecting the degree of information association between different features of the two modality data by using the following first relation model:

M_{AB} = A · B^T

wherein M_{AB} represents the association matrix, A represents the feature matrix of the data of one modality, B represents the feature matrix of the data of the other modality, A, B ∈ ℝ^{k×1}, M_{AB} ∈ ℝ^{k×k}, ℝ represents real space, k represents the feature dimension of the two modality data, and B^T represents the transpose of B. Multiplying the feature matrix A by the transpose of the feature matrix B using matrix multiplication yields the association matrix M_{AB}, which contains the degree of association between each feature in the feature matrix A and each feature in the feature matrix B.
Further, based on the content of the foregoing embodiment, in this embodiment, the foregoing step 102 may be implemented as follows:
and acquiring the influence weight matrix of the feature matrix B on the feature matrix A by using the following second relation model:

A_{B→A} = softmax(M_{AB} W_1)

and acquiring the influence weight matrix of the feature matrix A on the feature matrix B by using the following third relation model:

B_{A→B} = softmax(M_{AB}^T W_2)

wherein A_{B→A} represents the influence weight matrix of the feature matrix B on the feature matrix A, B_{A→B} represents the influence weight matrix of the feature matrix A on the feature matrix B, A_{B→A}, B_{A→B} ∈ ℝ^{k×1}, softmax denotes the normalized exponential function, W_1 represents the first preset training parameter of the first class of training parameters, and W_2 represents the second preset training parameter of the first class of training parameters. Through a layer of fully connected network, the association matrix M_{AB} ∈ ℝ^{k×k} is mapped back to ℝ^{k×1}, yielding the influence weight matrix A_{B→A} of the feature matrix B on the feature matrix A and the influence weight matrix B_{A→B} of the feature matrix A on the feature matrix B.
Further, based on the content of the foregoing embodiment, in the present embodiment, the foregoing step 103 can be implemented as follows:

obtaining the attention-enhanced feature matrix Â by using the following fourth relation model:

Â = A_{B→A} ⊙ A + A

and obtaining the attention-enhanced feature matrix B̂ by using the following fifth relation model:

B̂ = B_{A→B} ⊙ B + B

wherein ⊙ indicates the dot product operation. Using the dot product operation, A_{B→A} is multiplied by A, and the attention-enhanced feature matrix Â is obtained after a residual connection; Â contains the information of B and the influence of B on A. Likewise, B_{A→B} is multiplied by B, and the attention-enhanced feature matrix B̂ is obtained after a residual connection; B̂ contains the information of A and the influence of A on B.

Wherein (Â, B̂) = f_{AMM}(A, B), and f_{AMM} represents the process of obtaining the attention-enhanced feature matrices Â and B̂ from the feature matrix A and the feature matrix B, namely the process of processing A and B using the first to fifth relation models.
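To make the data flow of steps 101 to 103 concrete, the following NumPy sketch re-implements the first to fifth relation models as one f_AMM function. It is a minimal illustration, not the patented implementation: the k×1 feature shape, the exact placement of W_1/W_2, and the transpose used when computing B_{A→B} are assumptions inferred from the dimensions stated above.

```python
import numpy as np

def softmax(z, axis=0):
    # normalized exponential function, numerically stable
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def f_amm(A, B, W1, W2):
    """First to fifth relation models for two modality feature
    matrices A, B of shape (k, 1); returns (A_hat, B_hat)."""
    M = A @ B.T                  # first model: association matrix, (k, k)
    A_w = softmax(M @ W1)        # second model: influence weight of B on A
    B_w = softmax(M.T @ W2)      # third model: influence weight of A on B
    A_hat = A_w * A + A          # fourth model: dot product + residual
    B_hat = B_w * B + B          # fifth model: dot product + residual
    return A_hat, B_hat

k = 8
rng = np.random.default_rng(0)
A = rng.normal(size=(k, 1))
B = rng.normal(size=(k, 1))
# training parameters drawn from U(-0.001, 0.001) as stated later in the text
W1 = rng.uniform(-0.001, 0.001, size=(k, 1))
W2 = rng.uniform(-0.001, 0.001, size=(k, 1))
A_hat, B_hat = f_amm(A, B, W1, W2)
print(A_hat.shape, B_hat.shape)  # (8, 1) (8, 1)
```

The residual connection guarantees that each enhanced matrix keeps its own modality's information even when the learned influence weights are near zero.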
Fig. 3 shows a flowchart of a multi-modal fusion method for mental stress detection according to an embodiment of the present invention. As shown in fig. 3, the multi-modal fusion method for detecting mental stress according to the embodiment of the present invention is implemented based on the attention weight correspondence method for performing feature interactive fusion on two modality data according to the foregoing embodiment, and the multi-modal fusion method for detecting mental stress according to the embodiment of the present invention includes the following steps:
Step 201: respectively acquiring a physiological-data-related feature matrix reflecting the physiological state of the user, and a text feature matrix and a picture feature matrix reflecting the psychological activity state of the user.
In this step, a feature matrix of text, pictures and physiologically relevant data needs to be obtained. For the process of acquiring the text feature matrix, see the schematic diagram of the acquiring process shown in fig. 4. For the acquisition process of the physiological data related feature matrix, refer to the schematic diagram of the acquisition process shown in fig. 5. The process of obtaining the feature matrix of text, picture and physiological related data is described in detail below.
For text, each text is denoted by w, w = {w_1, w_2, ···, w_n}, where w_i represents a word. For example, a pre-trained 300-dimensional vector from Chinese Word Vectors is selected as the initial word vector of each word, so the text is represented as X = {x_1, x_2, ···, x_n}, where x_i is a 1×300 vector representing the meaning of a word.
The LSTM (Long Short-Term Memory) network layer aims to compute a text representation that expresses context information: since a model cannot directly understand natural language, a representation the model can process must first be computed, here in the form of a matrix H. The text representation X = {x_1, x_2, ···, x_n} enters the LSTM layer as input, where n denotes the number of words contained in the text (n is taken as 20 in the present invention). The hidden-layer outputs H_f and H_b of the two LSTMs are obtained through a forward LSTM and a backward LSTM respectively, and the hidden-layer outputs at corresponding positions are added to obtain the text representation matrix H:

H = H_f + H_b

An attention mechanism is applied to obtain the contribution distribution weight of the text representation matrix H:

Attn_T = softmax(H W_3 + b_1)

wherein Attn_T is a contribution distribution weight vector representing the distribution of contribution weights over the representation of each word. Attn_T is multiplied by H and connected through a residual to obtain the reweighted text representation matrix H̃:

H̃ = Attn_T ⊙ H + H

Through a layer of fully connected network, H̃ is mapped to a k×1 vector space to obtain the text feature matrix F_T:

F_T = ReLU(W_4 H̃ + b_2)
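The attention-and-residual step above can be sketched as follows. This is an illustrative NumPy fragment, not the trained model: the BiLSTM hidden outputs are replaced with random placeholders, and the hidden size d, target dimension k, and the flattening used by the final fully connected layer are assumptions.

```python
import numpy as np

def softmax(z, axis=0):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d, k = 20, 64, 16   # words per text; assumed LSTM hidden size; assumed k
rng = np.random.default_rng(1)

# Stand-ins for the forward/backward LSTM hidden outputs; the real model
# computes these from the 1x300 word vectors of the text.
H_fwd = rng.normal(size=(n, d))
H_bwd = rng.normal(size=(n, d))
H = H_fwd + H_bwd                       # add corresponding positions

W3 = rng.uniform(-0.001, 0.001, size=(d, 1))
b1 = np.zeros((n, 1))
Attn_T = softmax(H @ W3 + b1)           # per-word contribution weights, (n, 1)
H_tilde = Attn_T * H + H                # reweighting + residual connection

W4 = rng.uniform(-0.001, 0.001, size=(k, n * d))
b2 = np.zeros((k, 1))
F_T = np.maximum(0, W4 @ H_tilde.reshape(-1, 1) + b2)   # ReLU, (k, 1)
print(F_T.shape)  # (16, 1)
```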
Second, for pictures, each picture is represented as a 32×32×3 tensor (the number of channels is 3). A 4×4×512 feature map is obtained through the first three stages of a pre-trained ResNet network; a convolution layer with 1×1 kernels then yields a 4×4×32 feature map; the 4×4×32 map is flattened into a vector C of length 512, representing the initial picture feature; and one fully connected layer maps the picture feature to an n×1 vector space, yielding the picture feature matrix F_V:

F_V = ReLU(W_5 C + b_3).
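The shape flow of the picture branch can be sketched as below. The pre-trained ResNet trunk is replaced with a random 4×4×512 placeholder, and the 1×1 convolution is written as the per-position channel map it is mathematically equivalent to; n = 20 is an assumption matching the text length used above.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20   # target dimension, matching the per-text word count used above

# Stand-in for the 4x4x512 feature map produced by the first three stages
# of a pre-trained ResNet from a 32x32x3 picture.
feat = rng.normal(size=(4, 4, 512))

# A 1x1 convolution is a per-position linear map over channels: 512 -> 32.
kernel = rng.uniform(-0.001, 0.001, size=(512, 32))
reduced = feat @ kernel                  # (4, 4, 32)

C = reduced.reshape(-1, 1)               # flatten 4*4*32 = 512 -> picture feature C
W5 = rng.uniform(-0.001, 0.001, size=(n, 512))
b3 = np.zeros((n, 1))
F_V = np.maximum(0, W5 @ C + b3)         # F_V = ReLU(W5 C + b3), (n, 1)
print(C.shape, F_V.shape)  # (512, 1) (20, 1)
```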
Third, for physiologically relevant data, sleep data and motion data can be collected through a wristband; feature extraction is performed on the sleep data and the motion data, and the resulting sleep feature vector and motion feature vector are concatenated as the feature vector of the physiologically relevant data. For example, considering the work-rest schedule of teenagers, the sleep condition from 8:00 pm to 10:00 am the next morning is taken, and 9 features are extracted: the sleep onset segment, sleep end segment, sleep segments, deep sleep segments, deep sleep ratio, total sleep amount, sleep amount per unit segment, sleep fluctuation amount, and number of awakenings during sleep. For the measurement of time features, every 15 minutes is taken as one segment; for example, 20:00-20:15 is segment 1, 20:15-20:30 is segment 2, and so on. The segment set is denoted by T,

T = {t_1, t_2, ···, t_56}, where t_i ∈ T represents the sleep amount of the i-th segment.
Sleep onset segment: the earliest segment in the sleep interval that starts at least 4 consecutive segments whose sleep data are all greater than 0, i.e., when t_i · t_{i+1} · t_{i+2} · t_{i+3} > 0 with t_i, t_{i+1}, t_{i+2}, t_{i+3} ∈ T, the sleep onset segment is taken as the segment with the minimum such i.
Sleep end segment: the latest segment in the sleep interval that terminates at least 4 consecutive sleep segments, i.e., t_i · t_{i-1} · t_{i-2} · t_{i-3} > 0 with t_i, t_{i-1}, t_{i-2}, t_{i-3} ∈ T, and the sleep end segment is taken as the segment with the maximum such i.
sleep segment: and the number of fragments with the sleep quantity larger than 0 in the sleep metering interval.
Deep sleep segment: when the sleep amount in the segment is higher than the threshold value theta, the segment is a deep sleep segment, the value of theta is generally 230, the threshold value is a bracelet parameter, and the value is variable according to different bracelets.
Deep sleep ratio: ratio of deep sleep segment to sleep segment.
Total sleep amount: sum of sleep amount between sleep onset segment and sleep end segment.
Sleep amount per unit segment: the ratio of the total sleep amount to the sleep segments is the unit segment sleep amount.
Amount of sleep fluctuation: the standard deviation of the amount of sleep between the sleep onset section and the sleep termination section is taken as the amount of sleep fluctuation.
Number of awakenings during sleep: the number of segments between the sleep onset segment and the sleep end segment whose sleep amount is less than the threshold β, where β takes the value 25; a sleep amount below 25 between the sleep onset and sleep end segments indicates an awakening.
Regarding the motion feature vector, 5 motion features are extracted: the number of steps per day, the calorie consumption per day, the motion distance per day, the motion duration per day, and the active motion duration per day. The steps, calorie consumption, motion distance, and motion duration per day can be obtained directly from the wristband. Active motion duration per day: each day (24 hours) is equally divided into 96 segments; a segment in which the step count, calorie consumption, motion distance, and motion duration are all higher than the per-segment averages of the corresponding items is an active motion segment, and the total number of active motion segments per day is the active motion duration per day. The 9 sleep features and 5 motion features are concatenated into a 14×1 physiologically relevant data feature E_S.
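The segment-based sleep feature definitions above can be turned into code directly. The sketch below implements them under the stated thresholds; the "awakenings" rule is read as counting segments below β between onset and end, and the toy night used for the demo is invented for illustration.

```python
import numpy as np

THETA, BETA = 230, 25   # wristband-dependent thresholds given in the text

def sleep_features(T):
    """Extract the sleep features from T, the 56 per-15-minute sleep
    amounts covering 20:00 to 10:00 the next morning."""
    T = np.asarray(T, dtype=float)
    pos = T > 0
    # sleep onset segment: earliest i starting 4 consecutive sleep segments
    onset = next(i for i in range(len(T) - 3) if pos[i:i + 4].all())
    # sleep end segment: latest i terminating 4 consecutive sleep segments
    end = max(i for i in range(3, len(T)) if pos[i - 3:i + 1].all())
    sleep_segs = int(pos.sum())
    deep_segs = int((T > THETA).sum())
    span = T[onset:end + 1]
    return {
        "onset": onset, "end": end,
        "sleep_segments": sleep_segs,
        "deep_segments": deep_segs,
        "deep_ratio": deep_segs / sleep_segs,
        "total_sleep": float(span.sum()),
        "per_segment_sleep": float(span.sum()) / sleep_segs,
        "fluctuation": float(span.std()),
        "awakenings": int((span < BETA).sum()),
    }

# Toy night: awake for the first 8 segments, asleep for 40, awake for 8.
night = [0] * 8 + [240] * 40 + [0] * 8
f = sleep_features(night)
print(f["onset"], f["end"], f["deep_ratio"])  # 8 47 1.0
```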
A better physiologically relevant data representation matrix E is obtained through a two-layer fully connected network:

E = ReLU(W_7(ReLU(W_6 E_S + b_4) + b_5))

An attention mechanism is applied to obtain the contribution distribution weight of the physiologically relevant data representation matrix E:

Attn_E = softmax(W_8 E + b_6)

wherein Attn_E is a contribution distribution weight vector representing the distribution of contribution weights over each physiological feature representation. Attn_E is multiplied by E and connected through a residual to obtain the reweighted representation matrix Ẽ:

Ẽ = Attn_E ⊙ E + E

Through a layer of fully connected network, Ẽ is mapped to a k×1 vector space to obtain the physiological data related feature matrix F_E:

F_E = ReLU(W_9 Ẽ + b_7)
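The physiological branch mirrors the text branch. The following NumPy fragment sketches it end to end under stated assumptions: E_S is a random 14×1 placeholder rather than real wristband features, the hidden width of the two fully connected layers is assumed equal to 14, and k = 16 is assumed.

```python
import numpy as np

def softmax(z, axis=0):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
d, k = 14, 16            # E_S is 14x1 (9 sleep + 5 motion); k is assumed

E_S = rng.normal(size=(d, 1))   # placeholder physiological feature vector

W6, W7, W8 = (rng.uniform(-0.001, 0.001, size=(d, d)) for _ in range(3))
b4, b5, b6 = (np.zeros((d, 1)) for _ in range(3))

# two-layer fully connected network: E = ReLU(W7(ReLU(W6 E_S + b4) + b5))
E = np.maximum(0, W7 @ (np.maximum(0, W6 @ E_S + b4) + b5))

Attn_E = softmax(W8 @ E + b6)    # per-feature contribution weights, (d, 1)
E_tilde = Attn_E * E + E         # reweighting + residual connection

W9 = rng.uniform(-0.001, 0.001, size=(k, d))
b7 = np.zeros((k, 1))
F_E = np.maximum(0, W9 @ E_tilde + b7)   # F_E = ReLU(W9 E_tilde + b7)
print(F_E.shape)  # (16, 1)
```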
It should be noted that the above example is only an illustration, the physiological related data is not limited to the sleep data and the exercise data, and may be data such as blood pressure, pulse, galvanic skin response, electrocardiogram, electromyogram, etc. according to actual needs, which is not limited by the present invention.
Step 202: based on the physiological-data-related feature matrix, the text feature matrix and the picture feature matrix, using the attention weight correspondence method, obtain a first attention-enhanced feature matrix containing the influence weight of the physiological-data-related feature matrix on the text feature matrix, a second attention-enhanced feature matrix containing the influence weight of the physiological-data-related feature matrix on the picture feature matrix, a third attention-enhanced feature matrix containing the influence weight of the text feature matrix on the physiological-data-related feature matrix, a fourth attention-enhanced feature matrix containing the influence weight of the text feature matrix on the picture feature matrix, a fifth attention-enhanced feature matrix containing the influence weight of the picture feature matrix on the physiological-data-related feature matrix, and a sixth attention-enhanced feature matrix containing the influence weight of the picture feature matrix on the text feature matrix.
In this step, the attention weight correspondence method is applied between every two of the three feature matrices to obtain six attention-enhanced feature matrices. The first attention-enhanced feature matrix is the physiological-data→text attention-enhanced feature matrix F̂_{E→T}; the second is the physiological-data→picture attention-enhanced feature matrix F̂_{E→V}; the third is the text→physiological-data attention-enhanced feature matrix F̂_{T→E}; the fourth is the text→picture attention-enhanced feature matrix F̂_{T→V}; the fifth is the picture→physiological-data attention-enhanced feature matrix F̂_{V→E}; and the sixth is the picture→text attention-enhanced feature matrix F̂_{V→T}.
Thus, for each modality, two attention-enhanced feature matrices are obtained, which together contain the association information of the other two modalities.
Step 203: based on the first attention-enhanced feature matrix, the second attention-enhanced feature matrix, the third attention-enhanced feature matrix, the fourth attention-enhanced feature matrix, the fifth attention-enhanced feature matrix and the sixth attention-enhanced feature matrix, acquiring a text fusion feature matrix, a picture fusion feature matrix and a physiological data fusion feature matrix based on a feedforward fully-connected neural network.
In this step, the two attention-enhanced feature matrices of each modality are further merged into one fused feature matrix through a layer of fully connected network; each fused feature matrix contains the association and influence information of the other two modalities:

F̃_T = ReLU(W_10 F̂_{E→T} + W_11 F̂_{V→T} + b_8)

F̃_V = ReLU(W_12 F̂_{E→V} + W_13 F̂_{T→V} + b_9)

F̃_E = ReLU(W_14 F̂_{T→E} + W_15 F̂_{V→E} + b_10)
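One such merge can be sketched as follows, for the text modality. Note that the exact form of the fully connected merge is a reconstruction: giving each of the two inputs its own weight matrix is assumed because it matches the W_10 ~ W_15 and b_8 ~ b_10 parameter counts listed later; k = 16 and the inputs are placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
k = 16   # assumed feature dimension

# two attention-enhanced feature matrices for one modality (text here):
F_ET = rng.normal(size=(k, 1))   # physiological data -> text
F_VT = rng.normal(size=(k, 1))   # picture -> text

# One fully connected layer merging both inputs; each input gets its own
# weight matrix (reconstruction matching the listed parameter counts).
W10, W11 = (rng.uniform(-0.001, 0.001, size=(k, k)) for _ in range(2))
b8 = np.zeros((k, 1))
F_T_fused = np.maximum(0, W10 @ F_ET + W11 @ F_VT + b8)
print(F_T_fused.shape)  # (16, 1)
```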
Step 204: obtaining the feature values of the text, the picture and the physiological data based on the physiological-data-related feature matrix, the text feature matrix and the picture feature matrix, using a feedforward fully connected neural network.
In this step, the text, picture and physiologically relevant feature matrices are mapped into (0, 1), and the text, picture and physiological feature values S_T, S_V and S_E are then obtained through one fully connected layer:

S_T = ReLU(W_16 softmax(F_T) + b_11)

S_V = ReLU(W_17 softmax(F_V) + b_12)

S_E = ReLU(W_18 softmax(F_E) + b_13)
Step 205: acquiring the importance weight values of the text, the picture and the physiological data based on the text, picture and physiological data feature values, using vector concatenation and an attention mechanism.
In this step, the text, picture and physiological data feature values S_T, S_V and S_E are concatenated, and the importance weight values weight_T, weight_V and weight_E of the text, picture and physiological data are obtained through an attention mechanism:

(weight_T, weight_V, weight_E) = softmax([S_T, S_V, S_E] W_19)
Step 206: acquiring the fused representation matrix of the three modalities based on the importance weight values of the text, the picture and the physiological data, and on the text fusion feature matrix, the picture fusion feature matrix and the physiological data fusion feature matrix.
In this step, the three modalities are the text data modality, the picture data modality and the physiologically-relevant-data modality. The importance weight values weight_T, weight_V and weight_E are multiplied by the corresponding fused feature matrices F̃_T, F̃_V and F̃_E, and the results are added to obtain the fused representation matrix R_W of the three modalities:

R_W = weight_T · F̃_T + weight_V · F̃_V + weight_E · F̃_E
Step 207: acquiring a pressure classification vector reflecting the psychological pressure problem based on the fused representation matrix of the three modalities and a feedforward fully connected network.
In this step, a linear classifier is used to obtain a 1×2 pressure classification vector y indicating whether mental pressure exists; the two dimensions represent the presence and absence of pressure respectively, and the meaning corresponding to the position with the larger value is taken as the final classification result. For example, the pressure classification vector y may be obtained by the following model:

y = softmax(W_23 R_W + b_14)
wherein W_1 ~ W_23 represent the first to twenty-third preset training parameters of the first class of training parameters, and b_1 ~ b_14 represent the first to fourteenth preset training parameters of the second class of training parameters. Both classes of training parameters are initialized from the uniform distribution U(-0.001, 0.001), and the first to twenty-third preset training parameters of the first class and the first to fourteenth preset training parameters of the second class are set according to actual needs.
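Steps 204 to 207 form a short pipeline from per-modality feature matrices to the final classification, which can be sketched as below. All feature matrices are random placeholders, k = 16 is assumed, and the reading of dimension 0 as "pressure present" follows the ordering stated above.

```python
import numpy as np

def softmax(z, axis=0):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(5)
k = 16
mods = ("T", "V", "E")                               # text, picture, physiological

raw = {m: rng.normal(size=(k, 1)) for m in mods}     # F_T, F_V, F_E (placeholders)
fused = {m: rng.normal(size=(k, 1)) for m in mods}   # fused feature matrices

# step 204: scalar feature values S_T, S_V, S_E from the raw feature matrices
W_s = {m: rng.uniform(-0.001, 0.001, size=(1, k)) for m in mods}
S = {m: np.maximum(0, W_s[m] @ softmax(raw[m])).item() for m in mods}

# step 205: importance weights via attention over the concatenated values
W19 = rng.uniform(-0.001, 0.001, size=(3, 3))
weights = softmax(np.array([[S["T"], S["V"], S["E"]]]) @ W19, axis=1).ravel()

# step 206: weighted sum -> fused representation matrix of the three modalities
R_W = sum(w * fused[m] for w, m in zip(weights, mods))

# step 207: linear classifier -> 1x2 pressure classification vector
W23 = rng.uniform(-0.001, 0.001, size=(2, k))
b14 = np.zeros((2, 1))
y = softmax(W23 @ R_W + b14)
# dimension 0 = pressure present, dimension 1 = pressure absent (per the text)
print("stressed" if y[0, 0] > y[1, 0] else "not stressed")
```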
Referring to fig. 6, which shows the model structure of the fusion method for multi-modal detection of the psychological stress problem on text, picture and physiologically relevant data, the multi-modal fusion method for mental stress detection provided in the embodiment of the present invention proceeds as follows. Based on the physiological-data-related feature matrix, the text feature matrix and the picture feature matrix, the attention weight correspondence method is used to obtain the six attention-enhanced feature matrices: the first, containing the influence weight of the physiological-data-related feature matrix on the text feature matrix; the second, containing its influence weight on the picture feature matrix; the third, containing the influence weight of the text feature matrix on the physiological-data-related feature matrix; the fourth, containing the influence weight of the text feature matrix on the picture feature matrix; the fifth, containing the influence weight of the picture feature matrix on the physiological-data-related feature matrix; and the sixth, containing the influence weight of the picture feature matrix on the text feature matrix.

From these six attention-enhanced feature matrices, the text fusion feature matrix, the picture fusion feature matrix and the physiological data fusion feature matrix are obtained through a feedforward fully connected neural network. Then, based on the physiological-data-related feature matrix, the text feature matrix and the picture feature matrix, the feature values of the text, the picture and the physiological data are obtained through a feedforward fully connected neural network, and from these feature values the importance weight values of the text, the picture and the physiological data are obtained by vector concatenation and an attention mechanism. Next, the fused representation matrix of the three modalities is obtained from the importance weight values and the three fusion feature matrices; and finally, a pressure classification vector reflecting the psychological pressure problem is obtained from the fused representation matrix through a feedforward fully connected network. By fusing the text and picture data with the physiologically relevant data, the embodiment of the present invention compensates for the deficiency caused by the subjectivity of the text and picture data, mitigates some inherent problems of the physiologically relevant data (for example, physiologically relevant data in an extremely excited state and an extremely stressed state are very similar), and bridges, to some extent, the mental-detection window period caused by the loss of some data.
Further, based on the content of the foregoing embodiment, in the present embodiment, the foregoing step 201 may be implemented as follows:
acquiring a text feature matrix reflecting the psychological activity state of the user by using the following sixth processing model:

H = H_f + H_b

Attn_T = softmax(H W_3 + b_1)

H̃ = Attn_T ⊙ H + H

F_T = ReLU(W_4 H̃ + b_2)

wherein F_T represents the text feature matrix, H represents the text representation matrix, H̃ represents the text representation matrix readjusted by the weight distribution, and Attn_T represents the contribution distribution weight vector of the text representation matrix H. The text representation X = {x_1, x_2, ···, x_n} enters the long short-term memory (LSTM) layer as input; the hidden-layer outputs H_f and H_b of the two LSTMs are obtained through the forward LSTM and the backward LSTM respectively, and the hidden-layer outputs at corresponding positions are added to obtain the text representation matrix H. The attention mechanism is applied to obtain the contribution distribution weight vector Attn_T of H: Attn_T = softmax(H W_3 + b_1), where Attn_T represents the distribution of contribution weights over the representation of each word. Attn_T is multiplied by H and connected through a residual to obtain the readjusted text representation matrix H̃ = Attn_T ⊙ H + H. Through a layer of fully connected network, H̃ is mapped to a k×1 vector space to obtain the text feature matrix F_T = ReLU(W_4 H̃ + b_2).

Wherein W_3 represents the third preset training parameter of the first class of training parameters, W_4 represents the fourth preset training parameter of the first class of training parameters, b_1 represents the first preset training parameter of the second class of training parameters, b_2 represents the second preset training parameter of the second class of training parameters, ReLU represents the activation function, and softmax represents the normalized exponential function; in the text representation X = {x_1, x_2, ···, x_n}, x_i is a vector representing the meaning of a word, and n represents the number of words contained in the text;
and acquiring a picture feature matrix reflecting the psychological activity state of the user by using the following seventh processing model:

F_V = ReLU(W_5 C + b_3)

wherein F_V represents the picture feature matrix and C represents the picture feature; one fully connected layer maps the picture feature C to an n×1 vector space to obtain the picture feature matrix F_V. W_5 represents the fifth preset training parameter of the first class of training parameters, and b_3 represents the third preset training parameter of the second class of training parameters;
and acquiring a physiological data related feature matrix reflecting the physiological state of the user by using the following eighth processing model:

E = ReLU(W_7(ReLU(W_6 E_S + b_4) + b_5))

Attn_E = softmax(W_8 E + b_6)

Ẽ = Attn_E ⊙ E + E

F_E = ReLU(W_9 Ẽ + b_7)

wherein F_E represents the physiological data related feature matrix, E_S represents the feature matrix of the physiologically relevant data and contains the preset physiological features, E represents the physiologically relevant data representation matrix obtained by passing E_S through a two-layer fully connected network, Attn_E represents the contribution distribution weight vector of E, and Ẽ represents the representation matrix readjusted by the weight distribution. E_S is passed through the two-layer fully connected network: E = ReLU(W_7(ReLU(W_6 E_S + b_4) + b_5)); the attention mechanism is applied to obtain Attn_E = softmax(W_8 E + b_6), which represents the distribution of contribution weights over each physiological feature representation; Attn_E is multiplied by E and connected through a residual to obtain Ẽ = Attn_E ⊙ E + E; and through a layer of fully connected network, Ẽ is mapped to a k×1 vector space to obtain the physiological data related feature matrix F_E = ReLU(W_9 Ẽ + b_7).

Wherein W_6 ~ W_9 represent the sixth to ninth preset training parameters of the first class of training parameters, and b_4 ~ b_7 represent the fourth to seventh preset training parameters of the second class of training parameters.
Further, based on the content of the foregoing embodiment, in the present embodiment, the foregoing step 202 may be implemented as follows:
based on the physiological data related feature matrix and the text feature matrix, applying the attention weight corresponding method for performing feature interactive fusion on two modal data, to obtain a first attention-enhanced feature matrix including the influence weight of the physiological data related feature matrix on the text feature matrix; the first attention-enhancing feature matrix is physiological data->Text attention-enhancing feature matrix
Figure BDA0002109908360000191
Figure BDA0002109908360000192
Based on the physiological data related feature matrix and the picture feature matrix, the above attention weight correspondence method for feature interactive fusion of two modal data is applied to obtain a second attention-enhanced feature matrix including the influence weight of the physiological data related feature matrix on the picture feature matrix; the second attention-enhanced feature matrix is the physiological data→picture attention-enhanced feature matrix F̃_{E→V}, obtained as (F̃_{E→V}, F̃_{V→E}) = f_{AMM}(F_V, F_E).
Based on the physiological data related feature matrix and the text feature matrix, the same method is applied to obtain a third attention-enhanced feature matrix including the influence weight of the text feature matrix on the physiological data related feature matrix; the third attention-enhanced feature matrix is the text→physiological data attention-enhanced feature matrix F̃_{T→E}, obtained as (F̃_{T→E}, F̃_{E→T}) = f_{AMM}(F_E, F_T).
Based on the text feature matrix and the picture feature matrix, the same method is applied to obtain a fourth attention-enhanced feature matrix including the influence weight of the text feature matrix on the picture feature matrix; the fourth attention-enhanced feature matrix is the text→picture attention-enhanced feature matrix F̃_{T→V}, obtained as (F̃_{T→V}, F̃_{V→T}) = f_{AMM}(F_V, F_T).
Based on the physiological data related feature matrix and the picture feature matrix, the same method is applied to obtain a fifth attention-enhanced feature matrix including the influence weight of the picture feature matrix on the physiological data related feature matrix; the fifth attention-enhanced feature matrix is the picture→physiological data attention-enhanced feature matrix F̃_{V→E}, obtained as (F̃_{V→E}, F̃_{E→V}) = f_{AMM}(F_E, F_V).
Based on the text feature matrix and the picture feature matrix, the same method is applied to obtain a sixth attention-enhanced feature matrix including the influence weight of the picture feature matrix on the text feature matrix; the sixth attention-enhanced feature matrix is the picture→text attention-enhanced feature matrix F̃_{V→T}, obtained as (F̃_{V→T}, F̃_{T→V}) = f_{AMM}(F_T, F_V).
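As a minimal NumPy sketch (the equation images in the source are illegible, so shapes, parameter names, and the f_amm helper are assumptions consistent with the relation models described in this document), three pairwise applications of the attention weight correspondence method yield all six attention-enhanced feature matrices:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def f_amm(A, B, W1, W2):
    # Incidence matrix -> influence weights -> dot multiplication + residual,
    # following the first to fifth relation models of the embodiment.
    M = A @ B.T                          # association between features of A and B
    A_enh = softmax(M @ W1) * A + A      # contains B's influence on A
    B_enh = softmax(M.T @ W2) * B + B    # contains A's influence on B
    return A_enh, B_enh

k = 4                                    # feature dimension (illustrative)
rng = np.random.default_rng(0)
F_E, F_T, F_V = (rng.standard_normal((k, 1)) for _ in range(3))  # physio/text/picture
Ws = [rng.standard_normal((k, 1)) for _ in range(6)]             # hypothetical weights

F_EtoT, F_TtoE = f_amm(F_T, F_E, Ws[0], Ws[1])  # first and third matrices
F_EtoV, F_VtoE = f_amm(F_V, F_E, Ws[2], Ws[3])  # second and fifth matrices
F_TtoV, F_VtoT = f_amm(F_V, F_T, Ws[4], Ws[5])  # fourth and sixth matrices
```

Each enhanced matrix keeps the shape of the feature matrix it strengthens, which is what allows the later fusion layers to combine them.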
Further, based on the content of the foregoing embodiment, in this embodiment, the foregoing step 203 may be implemented as follows:
acquiring a text fusion feature matrix F̄_T through one layer of fully-connected network, based on the first attention-enhanced feature matrix F̃_{E→T} and the sixth attention-enhanced feature matrix F̃_{V→T}, by using the following ninth processing model:
F̄_T = ReLU(W_{10}F̃_{E→T} + W_{11}F̃_{V→T} + b_8)
acquiring a picture fusion feature matrix F̄_V through one layer of fully-connected network, based on the second attention-enhanced feature matrix F̃_{E→V} and the fourth attention-enhanced feature matrix F̃_{T→V}, by using the following tenth processing model:
F̄_V = ReLU(W_{12}F̃_{E→V} + W_{13}F̃_{T→V} + b_9)
acquiring a physiological data fusion feature matrix F̄_E through one layer of fully-connected network, based on the third attention-enhanced feature matrix F̃_{T→E} and the fifth attention-enhanced feature matrix F̃_{V→E}, by using the following eleventh processing model:
F̄_E = ReLU(W_{14}F̃_{T→E} + W_{15}F̃_{V→E} + b_{10})
wherein W_{10}~W_{15} represent the tenth to fifteenth preset training parameters in the first class of training parameters, and b_8~b_{10} represent the eighth to tenth preset training parameters in the second class of training parameters.
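A sketch of the ninth to eleventh processing models (one fully-connected layer combining the two attention-enhanced matrices that target each modality; the exact weight shapes and the way the two inputs are combined are assumptions, since the source equation images are illegible):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

k = 4
rng = np.random.default_rng(1)
# Stand-ins for the six attention-enhanced feature matrices.
F_EtoT, F_VtoT, F_EtoV, F_TtoV, F_TtoE, F_VtoE = (
    rng.standard_normal((k, 1)) for _ in range(6))
W = {i: rng.standard_normal((k, k)) for i in range(10, 16)}  # W10..W15 (assumed k x k)
b = {i: rng.standard_normal((k, 1)) for i in range(8, 11)}   # b8..b10

F_T_fus = relu(W[10] @ F_EtoT + W[11] @ F_VtoT + b[8])   # text fusion matrix
F_V_fus = relu(W[12] @ F_EtoV + W[13] @ F_TtoV + b[9])   # picture fusion matrix
F_E_fus = relu(W[14] @ F_TtoE + W[15] @ F_VtoE + b[10])  # physiological fusion matrix
```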
Further, based on the content of the foregoing embodiment, in the present embodiment, the foregoing step 204 may be implemented as follows:
acquiring characteristic values of texts, pictures and physiological data by using the following twelfth processing model:
S_T = ReLU(W_{16} softmax(F_T) + b_{11})
S_V = ReLU(W_{17} softmax(F_V) + b_{12})
S_E = ReLU(W_{18} softmax(F_E) + b_{13})
wherein softmax maps the physiological data related feature matrix F_E, the text feature matrix F_T and the picture feature matrix F_V into (0,1), and one layer of full connection then yields the feature values S_T, S_V and S_E of the text, picture and physiological data; W_{16}~W_{18} represent the sixteenth to eighteenth preset training parameters in the first class of training parameters, and b_{11}~b_{13} represent the eleventh to thirteenth preset training parameters in the second class of training parameters.
Further, based on the content of the foregoing embodiment, in this embodiment, the foregoing step 205 may be implemented as follows:
using the following thirteenth processing model, the text, picture and physiological data feature values S_T, S_V and S_E are spliced together, and the importance weight values weight_T, weight_V and weight_E of the text, picture and physiological data are obtained through an attention mechanism:
(weight_T, weight_V, weight_E) = softmax([S_T, S_V, S_E]W_{19})
wherein W_{19} represents the nineteenth preset training parameter in the first class of training parameters.
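The twelfth and thirteenth processing models can be sketched as follows (the 1×k shape for W16–W18 and the 3×3 shape for W19 are assumptions chosen so that each modality reduces to a scalar feature value and the spliced vector produces three importance weights):

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

relu = lambda x: np.maximum(x, 0.0)

k = 4
rng = np.random.default_rng(2)
F_T, F_V, F_E = (rng.standard_normal((k, 1)) for _ in range(3))
W16, W17, W18 = (rng.standard_normal((1, k)) for _ in range(3))
b11, b12, b13 = rng.standard_normal(3)
W19 = rng.standard_normal((3, 3))

# Twelfth model: softmax maps each feature matrix into (0, 1), then one
# fully-connected layer yields a feature value per modality.
S_T = relu(W16 @ softmax(F_T) + b11)
S_V = relu(W17 @ softmax(F_V) + b12)
S_E = relu(W18 @ softmax(F_E) + b13)

# Thirteenth model: splice the feature values, apply attention via softmax.
S = np.concatenate([S_T, S_V, S_E], axis=1)          # shape (1, 3)
weight_T, weight_V, weight_E = softmax(S @ W19, axis=1).ravel()
```

The three importance weights are a probability distribution over the modalities, which is what the fourteenth model uses to mix the fusion matrices.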
Further, based on the content of the foregoing embodiment, in the present embodiment, the foregoing step 206 may be implemented as follows:
utilizing the following fourteenth processing model, the importance weight values weight_T, weight_V and weight_E of the text, picture and physiological data are correspondingly multiplied with the text fusion feature matrix F̄_T, the picture fusion feature matrix F̄_V and the physiological data fusion feature matrix F̄_E, and the products are added, obtaining the fusion representation matrix R_W of the three modalities:
R_W = weight_T·W_{20}F̄_T + weight_V·W_{21}F̄_V + weight_E·W_{22}F̄_E
wherein W_{20}~W_{22} represent the twentieth to twenty-second preset training parameters in the first class of training parameters.
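A sketch of the fourteenth processing model (the placement of W20–W22 as per-modality projections is an assumption; the weighted values are illustrative):

```python
import numpy as np

k = 4
rng = np.random.default_rng(3)
F_T_fus, F_V_fus, F_E_fus = (rng.standard_normal((k, 1)) for _ in range(3))
W20, W21, W22 = (rng.standard_normal((k, k)) for _ in range(3))
weight_T, weight_V, weight_E = 0.5, 0.3, 0.2   # importance weights (illustrative)

# Corresponding multiplication and addition: each projected fusion matrix is
# scaled by its modality's importance weight, then the three are summed.
R_W = (weight_T * (W20 @ F_T_fus)
       + weight_V * (W21 @ F_V_fus)
       + weight_E * (W22 @ F_E_fus))
```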
Fig. 7 is a schematic structural diagram illustrating an attention weight correspondence apparatus for feature interactive fusion of two modality data according to an embodiment of the present invention. As shown in fig. 7, the attention weight correspondence apparatus for feature interactive fusion of two modality data according to the embodiment of the present invention includes: a first obtaining module 11, a second obtaining module 12 and a third obtaining module 13, wherein:
the first obtaining module 11 is configured to obtain, based on a feature matrix of two types of modal data, an incidence relation matrix reflecting information relevance between different features of the two types of modal data by using matrix multiplication;
a second obtaining module 12, configured to obtain, based on the incidence relation matrix and the feedforward full-connection network model, a weight matrix of influence of a feature matrix of one modal data on a feature matrix of another modal data;
a third obtaining module 13, configured to obtain an attention-enhanced feature matrix including mutual influence weights of feature matrices of the two modality data by using matrix dot multiplication and residual connection based on the influence weight matrix and the feature matrices of the two modality data.
The attention weight corresponding device for performing feature interactive fusion on two modality data provided in the embodiment of the present invention may be used to execute the attention weight corresponding method for performing feature interactive fusion on two modality data described in the above embodiment, and the working principle and the beneficial effect are similar, so detailed description is omitted here, and specific contents may refer to the description of the above embodiment.
Fig. 8 is a schematic structural diagram of a multi-modal fusion apparatus for mental stress detection according to an embodiment of the present invention. As shown in fig. 8, the multi-modal fusion apparatus for mental stress detection according to an embodiment of the present invention is implemented based on the attention weight correspondence apparatus for feature interactive fusion of two modality data described in the above embodiment, and includes: a fourth obtaining module 21, a fifth obtaining module 22, a sixth obtaining module 23, a seventh obtaining module 24, an eighth obtaining module 25, a ninth obtaining module 26, and a tenth obtaining module 27, wherein:
a fourth obtaining module 21, configured to obtain a physiological data related feature matrix reflecting the physiological state of the user, and a text feature matrix and a picture feature matrix reflecting the psychological activity state of the user, respectively;
a fifth obtaining module 22, configured to obtain, based on the physiological data related feature matrix, the text feature matrix and the picture feature matrix and by using the attention weight correspondence method, a first attention-enhanced feature matrix including the influence weight of the physiological data related feature matrix on the text feature matrix, a second attention-enhanced feature matrix including the influence weight of the physiological data related feature matrix on the picture feature matrix, a third attention-enhanced feature matrix including the influence weight of the text feature matrix on the physiological data related feature matrix, a fourth attention-enhanced feature matrix including the influence weight of the text feature matrix on the picture feature matrix, a fifth attention-enhanced feature matrix including the influence weight of the picture feature matrix on the physiological data related feature matrix, and a sixth attention-enhanced feature matrix including the influence weight of the picture feature matrix on the text feature matrix;
a sixth obtaining module 23, configured to obtain a text fusion feature matrix, a picture fusion feature matrix and a physiological data fusion feature matrix through a feedforward fully-connected neural network, based on the first to sixth attention-enhanced feature matrices;
a seventh obtaining module 24, configured to obtain the text, picture and physiological data feature values through a feedforward fully-connected neural network, based on the physiological data related feature matrix, the text feature matrix and the picture feature matrix;
an eighth obtaining module 25, configured to obtain importance weight values of the text, the picture and the physiological data through vector splicing and an attention mechanism, based on the text, picture and physiological data feature values;
a ninth obtaining module 26, configured to obtain a fusion expression matrix of three modalities based on the importance weight values of the text, the picture and the physiological data, and the text fusion feature matrix, the picture fusion feature matrix and the physiological data fusion feature matrix;
a tenth obtaining module 27, configured to obtain a pressure classification vector reflecting a psychological pressure problem based on the fusion representation matrix of the three modalities and the feedforward full-connection network.
Since the multi-modal fusion apparatus for psychological stress detection provided by the embodiment of the present invention can be used to perform the multi-modal fusion method for psychological stress detection described in the above embodiment, and the operation principle and the beneficial effect are similar, detailed descriptions are not provided herein, and specific contents can be referred to the description of the above embodiment.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device, which specifically includes the following components, with reference to fig. 9: a processor 301, a memory 302, a communication interface 303, and a communication bus 304;
the processor 301, the memory 302 and the communication interface 303 complete mutual communication through the communication bus 304;
the processor 301 is configured to call a computer program in the memory 302, and when the processor executes the computer program, the processor implements the above-mentioned attention weight corresponding method for feature interactive fusion of two modality data, and/or all the steps of a multi-modality fusion method for mental stress detection, for example, when the processor executes the computer program, the processor implements the following processes:
based on the feature matrix of the two modal data, acquiring an incidence relation matrix reflecting information incidence between different features of the two modal data by utilizing matrix multiplication; acquiring an influence weight matrix of the characteristic matrix of one modal data to the characteristic matrix of the other modal data based on the incidence relation matrix and the feedforward full-connection network model; and acquiring an attention-strengthening feature matrix containing the mutual influence weight of the feature matrices of the two modal data by utilizing matrix dot multiplication and residual connection based on the influence weight matrix and the feature matrices of the two modal data.
As another example, the processor, when executing the computer program, implements the following:
respectively acquiring a physiological data related feature matrix reflecting the physiological state of the user, and a text feature matrix and a picture feature matrix reflecting the psychological activity state of the user; based on the physiological data related feature matrix, the text feature matrix and the picture feature matrix, obtaining, by using the attention weight correspondence method, a first attention-enhanced feature matrix including the influence weight of the physiological data related feature matrix on the text feature matrix, a second attention-enhanced feature matrix including the influence weight of the physiological data related feature matrix on the picture feature matrix, a third attention-enhanced feature matrix including the influence weight of the text feature matrix on the physiological data related feature matrix, a fourth attention-enhanced feature matrix including the influence weight of the text feature matrix on the picture feature matrix, a fifth attention-enhanced feature matrix including the influence weight of the picture feature matrix on the physiological data related feature matrix, and a sixth attention-enhanced feature matrix including the influence weight of the picture feature matrix on the text feature matrix; acquiring a text fusion feature matrix, a picture fusion feature matrix and a physiological data fusion feature matrix through a feedforward fully-connected neural network, based on the first to sixth attention-enhanced feature matrices; acquiring the text, picture and physiological data feature values through a feedforward fully-connected neural network, based on the physiological data related feature matrix, the text feature matrix and the picture feature matrix; acquiring importance weight values of the text, the picture and the physiological data through vector splicing and an attention mechanism, based on the text, picture and physiological data feature values; acquiring a fusion representation matrix of the three modalities, based on the importance weight values of the text, the picture and the physiological data and on the text fusion feature matrix, the picture fusion feature matrix and the physiological data fusion feature matrix; and acquiring a pressure classification vector reflecting the psychological pressure problem, based on the fusion representation matrix of the three modalities and a feedforward fully-connected network.
Based on the same inventive concept, yet another embodiment of the present invention provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor implements the above-mentioned attention weight correspondence method for feature interactive fusion of two modality data, and/or all steps of a multimodal fusion method for psychological stress detection, for example, the processor implements the following processes when executing the computer program:
based on the feature matrix of the two modal data, acquiring an incidence relation matrix reflecting information incidence between different features of the two modal data by utilizing matrix multiplication; acquiring an influence weight matrix of the characteristic matrix of one modal data to the characteristic matrix of the other modal data based on the incidence relation matrix and the feedforward full-connection network model; and acquiring an attention-strengthening feature matrix containing the mutual influence weight of the feature matrices of the two modal data by utilizing matrix dot multiplication and residual connection based on the influence weight matrix and the feature matrices of the two modal data.
As another example, the processor, when executing the computer program, implements the following:
respectively acquiring a physiological data related feature matrix reflecting the physiological state of the user, and a text feature matrix and a picture feature matrix reflecting the psychological activity state of the user; based on the physiological data related feature matrix, the text feature matrix and the picture feature matrix, obtaining, by using the attention weight correspondence method, a first attention-enhanced feature matrix including the influence weight of the physiological data related feature matrix on the text feature matrix, a second attention-enhanced feature matrix including the influence weight of the physiological data related feature matrix on the picture feature matrix, a third attention-enhanced feature matrix including the influence weight of the text feature matrix on the physiological data related feature matrix, a fourth attention-enhanced feature matrix including the influence weight of the text feature matrix on the picture feature matrix, a fifth attention-enhanced feature matrix including the influence weight of the picture feature matrix on the physiological data related feature matrix, and a sixth attention-enhanced feature matrix including the influence weight of the picture feature matrix on the text feature matrix; acquiring a text fusion feature matrix, a picture fusion feature matrix and a physiological data fusion feature matrix through a feedforward fully-connected neural network, based on the first to sixth attention-enhanced feature matrices; acquiring the text, picture and physiological data feature values through a feedforward fully-connected neural network, based on the physiological data related feature matrix, the text feature matrix and the picture feature matrix; acquiring importance weight values of the text, the picture and the physiological data through vector splicing and an attention mechanism, based on the text, picture and physiological data feature values; acquiring a fusion representation matrix of the three modalities, based on the importance weight values of the text, the picture and the physiological data and on the text fusion feature matrix, the picture fusion feature matrix and the physiological data fusion feature matrix; and acquiring a pressure classification vector reflecting the psychological pressure problem, based on the fusion representation matrix of the three modalities and a feedforward fully-connected network.
In addition, the logic instructions in the memory may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the above technical solutions may be essentially or partially implemented in the form of software products, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and include instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the multi-modal fusion method for mental stress detection according to the various embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. A multi-modal fusion method for psychological stress detection based on an attention weight correspondence method for feature interactive fusion of two modal data,
the attention weight corresponding method for carrying out feature interactive fusion on two modal data comprises the following steps:
based on the feature matrix of the two modal data, acquiring an incidence relation matrix reflecting information incidence between different features of the two modal data by utilizing matrix multiplication;
acquiring an influence weight matrix of the characteristic matrix of one modal data to the characteristic matrix of the other modal data based on the incidence relation matrix and the feedforward full-connection network model;
based on the influence weight matrix and the feature matrices of the two modal data, acquiring an attention-enhanced feature matrix containing the mutual influence weights of the feature matrices of the two modal data by utilizing matrix dot multiplication and residual connection;
accordingly, the multi-modal fusion method for psychological stress detection comprises the following steps:
respectively acquiring a physiological data related feature matrix reflecting the physiological state of the user, and a text feature matrix and a picture feature matrix reflecting the psychological activity state of the user;
based on the physiological data related feature matrix, the text feature matrix and the picture feature matrix, obtaining, by using the attention weight correspondence method, a first attention-enhanced feature matrix including the influence weight of the physiological data related feature matrix on the text feature matrix, a second attention-enhanced feature matrix including the influence weight of the physiological data related feature matrix on the picture feature matrix, a third attention-enhanced feature matrix including the influence weight of the text feature matrix on the physiological data related feature matrix, a fourth attention-enhanced feature matrix including the influence weight of the text feature matrix on the picture feature matrix, a fifth attention-enhanced feature matrix including the influence weight of the picture feature matrix on the physiological data related feature matrix, and a sixth attention-enhanced feature matrix including the influence weight of the picture feature matrix on the text feature matrix;
acquiring a text fusion feature matrix, a picture fusion feature matrix and a physiological data fusion feature matrix based on the first attention enhancement feature matrix, the second attention enhancement feature matrix, the third attention enhancement feature matrix, the fourth attention enhancement feature matrix, the fifth attention enhancement feature matrix and the sixth attention enhancement feature matrix and based on a feedforward full-connection neural network;
based on the physiological data related feature matrix, the text feature matrix and the picture feature matrix, obtaining a text, a picture and a physiological data feature value based on a feedforward full-connection neural network;
based on the text, the picture and the physiological data characteristic value, acquiring importance weight values of the text, the picture and the physiological data based on vector splicing and attention mechanism;
acquiring fusion expression matrixes of three modes based on the importance weight values of the text, the picture and the physiological data and the text fusion characteristic matrix, the picture fusion characteristic matrix and the physiological data fusion characteristic matrix;
and acquiring a pressure classification vector reflecting the psychological pressure problem based on the fusion expression matrix of the three modes and the feedforward full-connection network.
2. The multi-modal fusion method for mental stress detection according to claim 1, wherein the obtaining, based on the feature matrices of the two modal data, of an incidence relation matrix reflecting the information correlation degree between different features of the two modal data by using matrix multiplication specifically comprises:
acquiring the incidence relation matrix reflecting the information correlation degree between different features of the two modal data by using the following first relation model:
M = AB^T
wherein M ∈ ℝ^{k×k} represents the incidence relation matrix, A represents the feature matrix of data of one modality, B represents the feature matrix of data of the other modality, A, B ∈ ℝ^{k×1}, ℝ represents real space, k represents the feature dimension of the two modal data, and B^T represents the transpose of B; the feature matrix A is multiplied by the transpose of the feature matrix B by using matrix multiplication, obtaining the incidence relation matrix M, which contains the correlation between each feature in the feature matrix A and each feature in the feature matrix B.
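A minimal NumPy sketch of the first relation model (k = 4 and the random feature matrices are purely illustrative):

```python
import numpy as np

k = 4
rng = np.random.default_rng(4)
A = rng.standard_normal((k, 1))   # feature matrix of one modality
B = rng.standard_normal((k, 1))   # feature matrix of the other modality

# Matrix multiplication of A with the transpose of B gives a k x k incidence
# relation matrix; entry (i, j) couples feature i of A with feature j of B.
M = A @ B.T
```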
3. The multi-modal fusion method for mental stress detection according to claim 2, wherein the obtaining of the influence weight matrix of the feature matrix of one modal data on the feature matrix of the other modal data based on the incidence relation matrix and the feedforward full-connection network model specifically comprises:
acquiring the influence weight matrix of the feature matrix B on the feature matrix A by using the following second relation model:
A_{B→A} = softmax(MW_1)
and acquiring the influence weight matrix of the feature matrix A on the feature matrix B by using the following third relation model:
B_{A→B} = softmax(M^TW_2)
wherein A_{B→A} represents the influence weight matrix of the feature matrix B on the feature matrix A, B_{A→B} represents the influence weight matrix of the feature matrix A on the feature matrix B, A_{B→A}, B_{A→B} ∈ ℝ^{k×1}, softmax denotes the normalized exponential function, W_1 represents the first preset training parameter in the first class of training parameters, and W_2 represents the second preset training parameter in the first class of training parameters; through one layer of fully-connected network, the incidence relation matrix M ∈ ℝ^{k×k} is mapped back to ℝ^{k×1}, obtaining the influence weight matrix A_{B→A} of the feature matrix B on the feature matrix A and the influence weight matrix B_{A→B} of the feature matrix A on the feature matrix B.
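Continuing the sketch for the second and third relation models (one fully-connected layer maps M back to k×1; using M for one direction and its transpose for the other is an assumption consistent with the claim):

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

k = 4
rng = np.random.default_rng(5)
M = rng.standard_normal((k, k))    # incidence relation matrix from claim 2
W1 = rng.standard_normal((k, 1))   # first preset training parameter
W2 = rng.standard_normal((k, 1))   # second preset training parameter

A_BtoA = softmax(M @ W1)     # influence weight matrix of B's features on A
B_AtoB = softmax(M.T @ W2)   # influence weight matrix of A's features on B
```

Because of the softmax, each influence weight matrix is a k×1 probability distribution over features.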
4. The multi-modal fusion method for mental stress detection according to claim 3, wherein the obtaining of the attention-enhanced feature matrices containing the mutual influence weights of the feature matrices of the two modal data by using matrix dot multiplication and residual connection, based on the influence weight matrices and the feature matrices of the two modal data, specifically comprises:
obtaining the attention-enhanced feature matrix Ã by using the following fourth relation model:
Ã = A_{B→A} ⊙ A + A
and obtaining the attention-enhanced feature matrix B̃ by using the following fifth relation model:
B̃ = B_{A→B} ⊙ B + B
wherein ⊙ denotes the dot product operation; using the dot product operation, A_{B→A} is multiplied with A and, after residual connection, the attention-enhanced feature matrix Ã is obtained, which contains the information of B and the influence of B on A; using the dot product operation, B_{A→B} is multiplied with B and, after residual connection, the attention-enhanced feature matrix B̃ is obtained, which contains the information of A and the influence of A on B;
wherein (Ã, B̃) = f_{AMM}(A, B), f_{AMM} representing the process of obtaining the attention-enhanced feature matrices Ã and B̃ from the feature matrix A and the feature matrix B, specifically: processing the feature matrix A and the feature matrix B by using the first to fifth relation models to obtain the attention-enhanced feature matrices Ã and B̃.
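The first to fifth relation models together (f_{AMM}) can be sketched as a single helper (NumPy; shapes illustrative):

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def f_amm(A, B, W1, W2):
    """Attention weight correspondence for two modality feature matrices."""
    M = A @ B.T                       # claim 2: incidence relation matrix
    A_BtoA = softmax(M @ W1)          # claim 3: influence of B on A
    B_AtoB = softmax(M.T @ W2)        # claim 3: influence of A on B
    A_tilde = A_BtoA * A + A          # claim 4: dot multiplication + residual
    B_tilde = B_AtoB * B + B
    return A_tilde, B_tilde

k = 4
rng = np.random.default_rng(6)
A, B = rng.standard_normal((k, 1)), rng.standard_normal((k, 1))
W1, W2 = rng.standard_normal((k, 1)), rng.standard_normal((k, 1))
A_tilde, B_tilde = f_amm(A, B, W1, W2)
```

Since the influence weights lie in (0, 1), the residual connection scales each feature of A by a factor between 1 and 2, preserving its sign.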
5. The multi-modal fusion method for detecting psychological stress according to claim 1, wherein the obtaining of the physiological data related feature matrix reflecting the physiological status of the user and the text feature matrix and the picture feature matrix reflecting the psychological activity status of the user respectively comprises:
acquiring a text feature matrix reflecting the psychological activity state of the user by using the following sixth processing model:

H = h_f + h_b

Attn_T = softmax(H·W_3 + b_1)

H~ = Attn_T ⊗ H + H

F_T = ReLU(W_4·H~ + b_2)

wherein F_T represents the text feature matrix, H represents the text representation matrix, H~ represents the text representation matrix readjusted by the weight distribution, and Attn_T represents the contribution-degree distribution weight vector of the text representation matrix H; the text representation X = {x_1, x_2, …, x_n} enters a long short-term memory network (LSTM) layer as input, and the forward LSTM and the backward LSTM respectively produce the two hidden-layer outputs h_f and h_b; the hidden-layer outputs at corresponding positions are added to obtain the text representation matrix H; an attention mechanism is applied to obtain the contribution-degree distribution weight vector Attn_T of the text representation matrix H: Attn_T = softmax(H·W_3 + b_1), where Attn_T represents the distribution of contribution weights over the representation of each word; Attn_T is multiplied element-wise by H and connected through a residual to obtain the text representation matrix readjusted by the weight distribution: H~ = Attn_T ⊗ H + H; through one layer of fully connected network, H~ is mapped to a k×1 vector space to obtain the text feature matrix F_T = ReLU(W_4·H~ + b_2);

wherein W_3 represents a third preset training parameter of the first class of training parameters, W_4 represents a fourth preset training parameter of the first class of training parameters, b_1 represents a first preset training parameter of the second class of training parameters, b_2 represents a second preset training parameter of the second class of training parameters, ReLU represents the activation function, and softmax represents the normalized exponential function; the text is represented as X = {x_1, x_2, …, x_n}, where x_i is a vector representing the meaning of a word and n represents the number of words contained in the text;
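The sixth processing model above can be sketched in a few lines of numpy. This is a minimal illustration only: the BiLSTM step is replaced by random hidden states, and all shapes, seeds, and parameter values are assumptions chosen for demonstration, not values from the patent.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
n, d, k = 5, 8, 4                          # words, hidden size, output size (assumed)

h_fw = rng.normal(size=(n, d))             # stand-in for forward-LSTM hidden states
h_bw = rng.normal(size=(n, d))             # stand-in for backward-LSTM hidden states
H = h_fw + h_bw                            # text representation matrix H

W3 = rng.normal(size=(d, 1)); b1 = 0.0
attn_T = softmax(H @ W3 + b1, axis=0)      # one contribution weight per word
H_tilde = attn_T * H + H                   # element-wise reweight + residual

W4 = rng.normal(size=(k, n * d)); b2 = np.zeros(k)
F_T = relu(W4 @ H_tilde.reshape(-1) + b2)  # text feature matrix mapped to k x 1
```

The flattening before the final fully connected layer is one possible reading of "mapping to a k×1 vector space"; the claim does not pin down the intermediate shapes.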
and acquiring a picture feature matrix reflecting the psychological activity state of the user by using the following seventh processing model:

F_V = ReLU(W_5·C + b_3)

wherein F_V represents the picture feature matrix and C represents the picture feature; a fully connected layer maps the dimension of the picture feature C to an n×1 vector space to obtain the picture feature matrix F_V, where W_5 represents a fifth preset training parameter of the first class of training parameters and b_3 represents a third preset training parameter of the second class of training parameters;
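The seventh processing model is a single fully connected layer. A minimal numpy sketch follows; the input picture feature and all sizes are illustrative assumptions (e.g. C standing in for a CNN encoder output).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(1)
m, n = 16, 5                        # picture-feature size and target size (assumed)
C = rng.normal(size=m)              # picture feature, e.g. from a CNN encoder
W5 = rng.normal(size=(n, m)); b3 = np.zeros(n)
F_V = relu(W5 @ C + b3)             # picture feature matrix F_V, n x 1
```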
and acquiring a physiological-data-related feature matrix reflecting the physiological state of the user by using the following eighth processing model:

E = ReLU(W_7·(ReLU(W_6·E_S + b_4) + b_5))

Attn_E = softmax(W_8·E + b_6)

E~ = Attn_E ⊗ E + E

F_E = ReLU(W_9·E~ + b_7)

wherein F_E represents the physiological-data-related feature matrix; E_S represents the physiologically relevant data feature matrix, which contains a plurality of preset physiological features; E represents the physiological-data representation matrix obtained by passing E_S through a two-layer fully connected network; Attn_E represents the contribution-degree distribution weight vector of the physiological-data representation matrix E; and E~ represents the physiological-data representation matrix readjusted by the weight distribution. E_S is passed through the two-layer fully connected network to obtain E = ReLU(W_7·(ReLU(W_6·E_S + b_4) + b_5)); an attention mechanism is applied to obtain the contribution-degree distribution weight vector Attn_E of the physiological-data representation matrix E: Attn_E = softmax(W_8·E + b_6); Attn_E is multiplied element-wise by E and connected through a residual to obtain the readjusted physiological-data representation matrix E~ = Attn_E ⊗ E + E; through one layer of fully connected network, E~ is mapped to a k×1 vector space to obtain the physiological-data-related feature matrix F_E = ReLU(W_9·E~ + b_7);

wherein Attn_E represents the distribution of contribution weights over each physiological feature representation; W_6, W_7, W_8 and W_9 represent the sixth to ninth preset training parameters of the first class of training parameters, and b_4, b_5, b_6 and b_7 represent the fourth to seventh preset training parameters of the second class of training parameters.
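The eighth processing model (two-layer encoder, attention reweighting with a residual, then one fully connected layer) can be sketched as below. All shapes and values are illustrative assumptions; note that the claim places b_5 inside the outer product with W_7, and the sketch follows that literally.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(2)
p, d, k = 6, 8, 4                    # physiological features, hidden size, output size

E_S = rng.normal(size=p)             # preset physiological features
W6 = rng.normal(size=(d, p)); b4 = np.zeros(d)
W7 = rng.normal(size=(d, d)); b5 = np.zeros(d)
E = relu(W7 @ (relu(W6 @ E_S + b4) + b5))   # two-layer FC representation E

W8 = rng.normal(size=(d, d)); b6 = np.zeros(d)
attn_E = softmax(W8 @ E + b6)               # contribution weight per dimension
E_tilde = attn_E * E + E                    # reweight + residual

W9 = rng.normal(size=(k, d)); b7 = np.zeros(k)
F_E = relu(W9 @ E_tilde + b7)               # physiological feature matrix, k x 1
```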
6. The multi-modal fusion method for psychological stress detection as set forth in claim 5, wherein the obtaining, by using the attention weight correspondence method, of a first attention-strengthened feature matrix containing the influence weight of the physiological-data-related feature matrix on the text feature matrix, a second attention-strengthened feature matrix containing the influence weight of the physiological-data-related feature matrix on the picture feature matrix, a third attention-strengthened feature matrix containing the influence weight of the text feature matrix on the physiological-data-related feature matrix, a fourth attention-strengthened feature matrix containing the influence weight of the text feature matrix on the picture feature matrix, a fifth attention-strengthened feature matrix containing the influence weight of the picture feature matrix on the physiological-data-related feature matrix, and a sixth attention-strengthened feature matrix containing the influence weight of the picture feature matrix on the text feature matrix specifically comprises:
based on the physiological-data-related feature matrix and the text feature matrix, applying the attention weight correspondence method for feature interactive fusion of two modal data to obtain a first attention-strengthened feature matrix containing the influence weight of the physiological-data-related feature matrix on the text feature matrix; the first attention-strengthened feature matrix is the physiological-data→text attention-strengthened feature matrix, denoted F_{E→T};

based on the physiological-data-related feature matrix and the picture feature matrix, applying the attention weight correspondence method for feature interactive fusion of two modal data to obtain a second attention-strengthened feature matrix containing the influence weight of the physiological-data-related feature matrix on the picture feature matrix; the second attention-strengthened feature matrix is the physiological-data→picture attention-strengthened feature matrix, denoted F_{E→V};

based on the physiological-data-related feature matrix and the text feature matrix, applying the attention weight correspondence method for feature interactive fusion of two modal data to obtain a third attention-strengthened feature matrix containing the influence weight of the text feature matrix on the physiological-data-related feature matrix; the third attention-strengthened feature matrix is the text→physiological-data attention-strengthened feature matrix, denoted F_{T→E};

based on the text feature matrix and the picture feature matrix, applying the attention weight correspondence method for feature interactive fusion of two modal data to obtain a fourth attention-strengthened feature matrix containing the influence weight of the text feature matrix on the picture feature matrix; the fourth attention-strengthened feature matrix is the text→picture attention-strengthened feature matrix, denoted F_{T→V};

based on the physiological-data-related feature matrix and the picture feature matrix, applying the attention weight correspondence method for feature interactive fusion of two modal data to obtain a fifth attention-strengthened feature matrix containing the influence weight of the picture feature matrix on the physiological-data-related feature matrix; the fifth attention-strengthened feature matrix is the picture→physiological-data attention-strengthened feature matrix, denoted F_{V→E};

and based on the text feature matrix and the picture feature matrix, applying the attention weight correspondence method for feature interactive fusion of two modal data to obtain a sixth attention-strengthened feature matrix containing the influence weight of the picture feature matrix on the text feature matrix; the sixth attention-strengthened feature matrix is the picture→text attention-strengthened feature matrix, denoted F_{V→T}.
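One such cross-modal strengthening step can be sketched following the three modules the apparatus claim describes for the attention weight correspondence method: an association matrix by matrix multiplication, a fully connected layer producing influence weights, then multiplication with a residual connection. The exact formulation is defined in earlier claims not shown here, so every shape, the softmax placement, and the residual form below are assumptions.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
k = 4
F_E = rng.normal(size=(k, 1))        # physiological-data-related feature matrix
F_T = rng.normal(size=(k, 1))        # text feature matrix

assoc = F_E @ F_T.T                  # association relation matrix (k x k)
W = rng.normal(size=(k, k)); b = 0.0
influence = softmax(W @ assoc + b, axis=0)   # influence weight matrix (assumed FC)
F_ET = influence @ F_T + F_T                 # weighted combination + residual
```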
7. The multi-modal fusion method for mental stress detection according to claim 6, wherein the obtaining of the text fusion feature matrix, the picture fusion feature matrix and the physiological data fusion feature matrix through a feedforward fully connected neural network, based on the first to sixth attention-strengthened feature matrices, specifically comprises:

acquiring the text fusion feature matrix F~_T through one layer of fully connected network, based on the first attention-strengthened feature matrix F_{E→T} and the sixth attention-strengthened feature matrix F_{V→T}, by using the following ninth processing model:

F~_T = ReLU(W_10·F_{E→T} + W_11·F_{V→T} + b_8)

acquiring the picture fusion feature matrix F~_V through one layer of fully connected network, based on the second attention-strengthened feature matrix F_{E→V} and the fourth attention-strengthened feature matrix F_{T→V}, by using the following tenth processing model:

F~_V = ReLU(W_12·F_{E→V} + W_13·F_{T→V} + b_9)

acquiring the physiological data fusion feature matrix F~_E through one layer of fully connected network, based on the third attention-strengthened feature matrix F_{T→E} and the fifth attention-strengthened feature matrix F_{V→E}, by using the following eleventh processing model:

F~_E = ReLU(W_14·F_{T→E} + W_15·F_{V→E} + b_10)

wherein W_10 to W_15 represent the tenth to fifteenth preset training parameters of the first class of training parameters, and b_8 to b_10 represent the eighth to tenth preset training parameters of the second class of training parameters.
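A sketch of one of these fusion layers, assuming the summed linear combination implied by the parameter list (two weight matrices and one bias per model); shapes are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(4)
k = 4
F_ET = rng.normal(size=(k, 1))     # physiological->text strengthened matrix
F_VT = rng.normal(size=(k, 1))     # picture->text strengthened matrix
W10 = rng.normal(size=(k, k)); W11 = rng.normal(size=(k, k)); b8 = np.zeros((k, 1))
F_T_fused = relu(W10 @ F_ET + W11 @ F_VT + b8)   # text fusion feature matrix
```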
8. The multi-modal fusion method for mental stress detection according to claim 7, wherein the obtaining of the text, picture and physiological data feature values through a feedforward fully connected neural network, based on the physiological-data-related feature matrix, the text feature matrix and the picture feature matrix, specifically comprises:

acquiring the feature values of the text, the picture and the physiological data by using the following twelfth processing model:

S_T = ReLU(W_16·softmax(F_T) + b_11)

S_V = ReLU(W_17·softmax(F_V) + b_12)

S_E = ReLU(W_18·softmax(F_E) + b_13)

wherein the softmax function maps the physiological-data-related feature matrix F_E, the text feature matrix F_T and the picture feature matrix F_V into the interval (0, 1), and one fully connected layer then yields the feature values S_T, S_V and S_E of the text, the picture and the physiological data; W_16 to W_18 represent the sixteenth to eighteenth preset training parameters of the first class of training parameters, and b_11 to b_13 represent the eleventh to thirteenth preset training parameters of the second class of training parameters.
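The twelfth processing model reduces each modality's feature matrix to a scalar feature value. A minimal sketch for the text branch, with assumed sizes:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(5)
k = 4
F_T = rng.normal(size=k)                 # text feature matrix
W16 = rng.normal(size=(1, k)); b11 = 0.0
S_T = relu(W16 @ softmax(F_T) + b11)     # scalar feature value for the text modality
```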
9. The multi-modal fusion method for mental stress detection according to claim 8, wherein the obtaining of the importance weight values of the text, the picture and the physiological data through vector splicing and an attention mechanism, based on the text, picture and physiological data feature values, specifically comprises:

splicing the text, picture and physiological data feature values S_T, S_V and S_E together and, through an attention mechanism, obtaining the importance weight values weight_T, weight_V and weight_E of the text, the picture and the physiological data by using the following thirteenth processing model:

(weight_T, weight_V, weight_E) = softmax([S_T, S_V, S_E]·W_19)

wherein W_19 represents a nineteenth preset training parameter of the first class of training parameters.
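The thirteenth processing model can be sketched as follows; the example feature values and the 3×3 shape of W_19 are assumptions (the claim only requires the softmax over the spliced vector to yield three weights).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(6)
S = np.array([0.7, 0.2, 0.5])      # example feature values [S_T, S_V, S_E]
W19 = rng.normal(size=(3, 3))      # assumed attention parameter shape
weight_T, weight_V, weight_E = softmax(S @ W19)
```

By construction the three importance weights are positive and sum to one, so they can directly scale the per-modality fusion matrices.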
10. The multi-modal fusion method for mental stress detection according to claim 9, wherein the obtaining of a fusion representation matrix of the three modalities, based on the importance weight values of the text, the picture and the physiological data and on the text fusion feature matrix, the picture fusion feature matrix and the physiological data fusion feature matrix, specifically comprises:

multiplying the importance weight values weight_T, weight_V and weight_E of the text, the picture and the physiological data correspondingly by the text fusion feature matrix F~_T, the picture fusion feature matrix F~_V and the physiological data fusion feature matrix F~_E, and adding the results to obtain the fusion representation matrix R_W of the three modalities, by using the following fourteenth processing model:

R_W = weight_T·W_20·F~_T + weight_V·W_21·F~_V + weight_E·W_22·F~_E

wherein W_20 to W_22 represent the twentieth to twenty-second preset training parameters of the first class of training parameters.
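The fourteenth processing model is a weighted sum of the three fusion feature matrices. A minimal sketch, with assumed sizes and example importance weights:

```python
import numpy as np

rng = np.random.default_rng(7)
k = 4
wT, wV, wE = 0.5, 0.2, 0.3              # importance weight values (example)
FT_f = rng.normal(size=(k, 1))          # text fusion feature matrix
FV_f = rng.normal(size=(k, 1))          # picture fusion feature matrix
FE_f = rng.normal(size=(k, 1))          # physiological fusion feature matrix
W20 = rng.normal(size=(k, k))
W21 = rng.normal(size=(k, k))
W22 = rng.normal(size=(k, k))
# corresponding multiplication and addition across the three modalities
R_W = wT * (W20 @ FT_f) + wV * (W21 @ FV_f) + wE * (W22 @ FE_f)
```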
11. A multi-modal fusion apparatus for mental stress detection based on an attention weight correspondence apparatus for performing feature interactive fusion on two modal data, wherein the attention weight correspondence apparatus for performing feature interactive fusion on two modal data comprises:
the first acquisition module is used for acquiring an incidence relation matrix reflecting information incidence between different characteristics of the two modal data by utilizing matrix multiplication based on the characteristic matrix of the two modal data;
the second acquisition module is used for acquiring an influence weight matrix of the characteristic matrix of one modal data to the characteristic matrix of the other modal data based on the incidence relation matrix and the feedforward full-connection network model;
a third obtaining module, configured to obtain an attention-strengthened feature matrix containing the mutual influence weights of the feature matrices of the two modal data by using matrix dot multiplication and residual connection, based on the influence weight matrix and the feature matrices of the two modal data;

accordingly, the multi-modal fusion device for psychological stress detection comprises:
the fourth acquisition module is used for respectively acquiring a physiological data related characteristic matrix reflecting the physiological state of the user and a text characteristic matrix and an image characteristic matrix reflecting the psychological activity state of the user;
a fifth obtaining module, configured to obtain, by using the attention weight correspondence method and based on the physiological-data-related feature matrix, the text feature matrix and the picture feature matrix, a first attention-strengthened feature matrix containing the influence weight of the physiological-data-related feature matrix on the text feature matrix, a second attention-strengthened feature matrix containing the influence weight of the physiological-data-related feature matrix on the picture feature matrix, a third attention-strengthened feature matrix containing the influence weight of the text feature matrix on the physiological-data-related feature matrix, a fourth attention-strengthened feature matrix containing the influence weight of the text feature matrix on the picture feature matrix, a fifth attention-strengthened feature matrix containing the influence weight of the picture feature matrix on the physiological-data-related feature matrix, and a sixth attention-strengthened feature matrix containing the influence weight of the picture feature matrix on the text feature matrix;
a sixth obtaining module, configured to obtain a text fusion feature matrix, an image fusion feature matrix, and a physiological data fusion feature matrix based on a feedforward fully-connected neural network based on the first attention-enhanced feature matrix, the second attention-enhanced feature matrix, the third attention-enhanced feature matrix, the fourth attention-enhanced feature matrix, the fifth attention-enhanced feature matrix, and the sixth attention-enhanced feature matrix;
a seventh obtaining module, configured to obtain a text, a picture, and a physiological data feature value based on the physiological data related feature matrix, the text feature matrix, and the picture feature matrix, and based on a feedforward fully-connected neural network;
the eighth acquiring module is used for acquiring importance weight values of the text, the picture and the physiological data based on the text, the picture and the physiological data characteristic values and based on a vector splicing and attention mechanism;
a ninth obtaining module, configured to obtain a fusion expression matrix of three modalities based on the importance weight values of the text, the picture and the physiological data, and the text fusion feature matrix, the picture fusion feature matrix and the physiological data fusion feature matrix;
and the tenth acquisition module is used for acquiring a pressure classification vector reflecting the psychological pressure problem based on the fusion expression matrix of the three modes and the feedforward full-connection network.
12. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the multimodal fusion method for mental stress detection as claimed in any one of claims 1 to 10.
CN201910567398.XA 2019-06-27 2019-06-27 Multi-mode fusion method and device for psychological pressure detection Active CN110301920B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910567398.XA CN110301920B (en) 2019-06-27 2019-06-27 Multi-mode fusion method and device for psychological pressure detection


Publications (2)

Publication Number Publication Date
CN110301920A CN110301920A (en) 2019-10-08
CN110301920B true CN110301920B (en) 2020-06-02

Family

ID=68076687


Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837390A (en) * 2020-06-23 2021-12-24 华为技术有限公司 Modal information completion method, device and equipment
CN112155577B (en) * 2020-10-15 2023-05-05 深圳大学 Social pressure detection method and device, computer equipment and storage medium
CN112861945B (en) * 2021-01-28 2022-05-13 清华大学 Multi-mode fusion lie detection method
CN112998652B (en) * 2021-02-23 2022-07-19 华南理工大学 Photoelectric volume pulse wave pressure identification method and system
CN113241178B (en) * 2021-05-28 2023-06-27 温州康宁医院股份有限公司 Device for determining severity of depression of tested person
CN113704502B (en) * 2021-08-27 2023-04-21 电子科技大学 Multi-mode information fusion account number position identification method based on social media
CN113940638B (en) * 2021-10-22 2023-09-19 上海理工大学 Pulse wave signal identification and classification method based on frequency domain dual-feature fusion
CN114201041B (en) * 2021-11-09 2024-01-26 北京电子工程总体研究所 Man-machine interaction command method and device based on brain-computer interface

Citations (4)

Publication number Priority date Publication date Assignee Title
CN103126690A (en) * 2013-01-28 2013-06-05 周万荣 Human emotion recognition and control method, device and system based on applications
CN103838836A (en) * 2014-02-25 2014-06-04 中国科学院自动化研究所 Multi-modal data fusion method and system based on discriminant multi-modal deep confidence network
CN106250855A (en) * 2016-08-02 2016-12-21 南京邮电大学 A kind of multi-modal emotion identification method based on Multiple Kernel Learning
CN109801706A (en) * 2018-12-12 2019-05-24 清华大学 The cognitive method and device of psychological pressure problem

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US10105089B2 (en) * 2014-06-18 2018-10-23 Hong Kong Applied Science And Technology Research Institute Co., Ltd. Systems and methods for blood pressure measurement with psychological status validation
JP6986680B2 (en) * 2016-08-29 2021-12-22 パナソニックIpマネジメント株式会社 Stress management system and stress management method


Non-Patent Citations (2)

Title
Multimodal information fusion based on broad learning; Jia Chen et al.; CAAI Transactions on Intelligent Systems; 2019-01-31; Vol. 14, No. 1, pp. 150-157 *
A survey of multimodal deep learning; Liu Jianwei et al.; Application Research of Computers; 2019-04-26; Vol. 37, No. 6, pp. 2-19 *


Similar Documents

Publication Publication Date Title
CN110301920B (en) Multi-mode fusion method and device for psychological pressure detection
Jiang et al. Probing the visual representation of faces with adaptation: A view from the other side of the mean
US11113890B2 (en) Artificial intelligence enabled mixed reality system and method
Barata et al. Internet of Things based on electronic and mobile health systems for blood glucose continuous monitoring and management
CN109801706B (en) Psychological stress problem sensing method and device
KR102239163B1 (en) Method and apparatus for predicting the presence or absence of diseases using artificial neural networks
CN114386528B (en) Model training method and device, computer equipment and storage medium
Huttunen et al. Assessment of obstructive sleep apnea-related sleep fragmentation utilizing deep learning-based sleep staging from photoplethysmography
WO2019128515A1 (en) Method, device, and equipment for information alert
CN108888277A (en) Psychological test method, system and terminal device
AU2021206060A1 (en) Dynamic user response data collection method
JP2022523631A (en) Heart rate measurement system
Gerger et al. It felt fluent but I did not like it: Fluency effects in faces versus patterns
Biancardi et al. A computational model for managing impressions of an embodied conversational agent in real-time
EP3856012B1 (en) Visualized virtual agent
CN116807476A (en) Multi-mode psychological health assessment system and method based on interface type emotion interaction
JP2008242534A (en) Healing system, server device, information processor and program
US20220284649A1 (en) Virtual Representation with Dynamic and Realistic Behavioral and Emotional Responses
McTear et al. Affective conversational interfaces
CN114424934A (en) Apnea event screening model training method and device and computer equipment
Sengupta Stress Detection: A Predictive Analysis
CN113748441A (en) Virtual agent team
KR102427093B1 (en) A method of creating a blending ratio of protein donuts considering the user's constitution
Das et al. A deep cnn framework for distress detection using facial expression
Younis et al. Machine learning for human emotion recognition: a comprehensive review

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant