CN116030271A - Depression emotion prediction system based on deep learning and bimodal data - Google Patents


Info

Publication number
CN116030271A
Authority
CN
China
Prior art keywords
text
features
user
depression
picture
Prior art date
Legal status
Pending
Application number
CN202310146956.1A
Other languages
Chinese (zh)
Inventor
蔡莉 (Cai Li)
沈先发 (Shen Xianfa)
杨文洁 (Yang Wenjie)
刘俊晖 (Liu Junhui)
Current Assignee
Yunnan University YNU
Original Assignee
Yunnan University YNU
Priority date
Filing date
Publication date
Application filed by Yunnan University YNU filed Critical Yunnan University YNU
Priority to CN202310146956.1A
Publication of CN116030271A
Legal status: Pending


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a depression emotion prediction system based on deep learning and bimodal data, relating to the field of data processing, comprising: a to-be-predicted bimodal data acquisition module for determining a plurality of image-text pairs from user bimodal data acquired from social media, each image-text pair comprising a text and a picture; and a depression emotion prediction model connected with the to-be-predicted bimodal data acquisition module and used for: extracting features of the plurality of image-text pairs to obtain text features, picture features and user global information; fusing the feature information of each image-text pair of the user by a cross-modal attention mechanism; fusing the feature information of the plurality of image-text pairs by an adaptive graph convolutional network to obtain fusion features; fusing the user global information and the fusion features by a multi-head self-attention mechanism to obtain a feature representation; and inputting the feature representation into a classifier to obtain the depression probability of the user. The invention improves the accuracy of depression emotion prediction.

Description

Depression emotion prediction system based on deep learning and bimodal data
Technical Field
The invention relates to the technical field of data processing, in particular to a depression emotion prediction system based on deep learning and bimodal data.
Background
Depression is one of the most common mental disorders. Accurate diagnosis is a prerequisite for treating a depressive patient, but the patient must actively contact a mental health professional to be diagnosed. In clinical diagnosis, psychologists and doctors typically refer to standard diagnostic guidelines for the disease and conduct face-to-face interviews. While this is the most effective method of diagnosing depression, more than 70% of early-stage depression patients go untreated and their condition worsens, because most people lack medical knowledge and are unaware of the risk of the disease.
Related studies show that the use of social media websites correlates with users' mental illness, which to some extent offers patients an opportunity for early depressed emotion prediction. While the actual social activity of a depressed patient may be reduced, a person in a depressed state may maintain a closely related network whose contacts exhibit similar emotional or behavioral patterns, such as stronger negative emotion, increased self-focus, more interpersonal problems and more religious ideas. Thus, depression-related topics on social media, together with the texts and pictures posted by users with a tendency to depression, provide effective material for early depression research. At present, existing methods do not fully consider the correspondence between images and texts, and the accuracy of depression emotion prediction still needs to be improved.
Disclosure of Invention
The invention aims to provide a depression emotion prediction system based on deep learning and bimodal data, which improves the accuracy of depression emotion prediction.
In order to achieve the above object, the present invention provides the following solutions:
a depression mood prediction system based on deep learning and bimodal data, comprising:
the to-be-predicted bimodal data acquisition module is used for determining a plurality of image-text pairs according to user bimodal data acquired from social media, wherein each image-text pair comprises a text and a picture;
the depression emotion prediction model is connected with the bimodal data acquisition module to be predicted and is used for:
extracting features of a plurality of image-text pairs to obtain text features, picture features and user global information;
fusing the characteristic information of each image-text pair of the user by adopting a cross-modal attention mechanism;
adopting an adaptive graph convolutional network to perform feature fusion on the feature information of the plurality of image-text pairs to obtain fusion features;
fusing the user global information and the fused features by adopting a multi-head self-attention mechanism to obtain a feature representation;
and inputting the feature representation into a classifier to obtain the depression probability of the user.
Optionally, the depression emotion prediction model includes a feature extraction module, where the feature extraction module is configured to perform feature extraction on each image-text pair to obtain the text features, the picture features and the user global information;
the feature extraction module comprises a BERT encoder, and the BERT encoder is used for extracting features of text data to obtain text features.
Optionally, the feature extraction module further comprises a residual network, a Faster R-CNN and a splicing module;
the residual error network is used for extracting the characteristics of the picture to obtain the deep characteristics of the picture;
the fast R-CNN is used for carrying out target detection on the picture to obtain picture region representation;
the splicing module is used for carrying out linear transformation after splicing the deep features of the picture and the picture region representation to obtain the picture features.
Optionally, the user global information includes behavioral characteristics, language characteristics and text summaries;
the behavior characteristics comprise the number of fan-shaped users, the number of attention of users, the average number of points and praise of posts, the average number of comments of posts, the number of posts, the standard deviation of posting time and the number of pictures;
the language features include the number of occurrences of depression-related words, the readability of each text, and the emotion of each text.
Optionally, the feature extraction module is further configured to calculate the readability of the text using a Chinese text readability formula or the Dale-Chall readability formula.
Optionally, the feature extraction module is further used for calculating emotion degrees of the text by adopting the SnowNLP Chinese text processing library or the TextBlob library.
Optionally, the feature extraction module is further configured to extract a text abstract from the text using a TextRank algorithm.
Optionally, the classifier is used to calculate the probability of depression using a Softmax function.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
according to the invention, the text and the picture information of the user are fused and associated through the depression emotion prediction model, the corresponding relation between the image and the text is considered, effective information is provided for emotion prediction, and the global information of the user is adopted, so that the accuracy of depression emotion prediction is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of a depression emotion prediction system based on deep learning and bimodal data according to an embodiment of the present invention;
fig. 2 is a schematic workflow diagram of a depression emotion prediction system based on deep learning and bimodal data according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a model for predicting depressed emotion provided by an embodiment of the present invention;
fig. 4 is a schematic diagram of an ablation experiment result on a chinese social media data set according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an ablation experiment result on an english social media dataset according to an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a depression emotion prediction system based on deep learning and bimodal data, which improves the accuracy of depression emotion prediction.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
As shown in fig. 1, the present embodiment provides a depression emotion prediction system based on deep learning and bimodal data, including:
the to-be-predicted bimodal data acquisition module is used for determining a plurality of image-text pairs according to user bimodal data acquired from social media, and each image-text pair comprises a text and a picture.
The bimodal data includes text data and picture data.
The to-be-predicted bimodal data acquisition module specifically performs data preprocessing on the user bimodal data acquired from social media. If a plurality of pictures correspond to one text, the pictures are synthesized into a single picture by superposing and averaging them, so as to form one image-text pair.
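For illustration, a minimal sketch of this superpose-and-average preprocessing, assuming the Pillow and NumPy packages; merge_pictures is a hypothetical helper name and the target size is an arbitrary choice:

```python
# Sketch of the picture-merging preprocessing: several pictures attached to
# one text are resized and averaged pixel-wise into a single picture.
from PIL import Image
import numpy as np

def merge_pictures(paths, size=(224, 224)):
    arrays = []
    for p in paths:
        img = Image.open(p).convert("RGB").resize(size)
        arrays.append(np.asarray(img, dtype=np.float32))
    mean = np.mean(np.stack(arrays), axis=0)       # superposition + averaging
    return Image.fromarray(mean.astype(np.uint8))  # back to one picture

# merged = merge_pictures(["p1.jpg", "p2.jpg", "p3.jpg"])  # one image-text pair
```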
The depression emotion prediction model is connected with the bimodal data acquisition module to be predicted and is used for:
and extracting the characteristics of the plurality of image-text pairs to obtain text characteristics, picture characteristics and user global information.
And fusing each image-text pair characteristic information of the user by adopting a Cross-modal attention mechanism (Cross-modal attention).
And adopting an adaptive graph convolutional network (Graph Convolutional Network, GCN) to perform feature fusion on the feature information of the plurality of image-text pairs to obtain fusion features.
And fusing the user global information and the fused features by adopting a multi-head self-attention mechanism to obtain the feature representation.
And inputting the feature representation into a classifier to obtain the depression probability of the user.
The depression emotion prediction model is based on a cross-modal graph convolution fusion network model (Cross-modal Graph convolution Fusion Network, CGFNet).
The depression emotion prediction model comprises a feature extraction module, wherein the feature extraction module is used for carrying out feature extraction on each image-text pair to obtain text features, picture features and user global information.
The feature extraction module comprises a BERT (Bidirectional Encoder Representations from Transformers) encoder, and the BERT encoder is used for extracting features of text data to obtain text features.
The feature extraction module further comprises a residual network (ResNet), a Faster R-CNN and a splicing module.
And the residual error network is used for extracting the characteristics of the picture to obtain the deep characteristics of the picture.
The Faster R-CNN is used for carrying out target detection on the picture to obtain the picture region representation; specifically, Faster R-CNN identifies the object categories in the picture and obtains the region representations corresponding to those categories, thereby yielding the picture region representation.
The splicing module is used for carrying out linear transformation after splicing the deep features of the picture and the picture region representation to obtain the picture features.
The user global information includes behavioral characteristics, language characteristics and text summaries.
The behavior characteristics comprise the user's number of followers, the number of users the user follows, the average number of likes per post, the average number of comments per post, the number of posts published, the standard deviation of posting time and the number of pictures posted.
The language features include the number of occurrences of depression-related words, the readability of each text, and the emotion of each text.
The feature extraction module is also used for calculating the readability of the text by adopting a Chinese text readability formula or a Dale-Chall readability formula.
The feature extraction module is also used for calculating the emotion degree of the text by adopting the SnowNLP Chinese text processing library or the TextBlob library. The emotion degree is the emotion score.
The feature extraction module is also used for extracting a text abstract from the text by adopting a TextRank algorithm.
The classifier is used to calculate the probability of depression using a Softmax function.
The structure of the depression emotion prediction model is shown in fig. 3, where Feed Forward denotes a feed-forward layer, Add denotes residual addition, and Norm denotes normalization.
As shown in fig. 2, a workflow of a depression emotion prediction system based on deep learning and bimodal data of the present invention comprises the following steps.
Step 1: and acquiring information of the social media user, wherein the information of the user comprises texts and pictures.
Social media includes microblog (Sina Weibo) and Twitter.
The screening principle for ordinary users is to exclude marketing accounts, public-figure accounts, studio accounts, news media accounts and the like on the microblog platform, and to select microblog content biased toward everyday user groups as much as possible. After user screening is completed, the invention uses crawler technology to acquire the users' text and picture data from social media.
Step 2: feature extraction of text and picture information, given text T, text feature H is extracted using BERT encoder T . Given picture V, the output vector H of the last convolutional layer of the pretrained 152 ResNet is first V1 As a deep feature of the picture, the FasterR-CNN object detection algorithm is then used to detect object classes and obtain their regional representations. Let H V2 =fast R-CNN (V) represents the region representation of the picture. Subsequently H is taken up V1 ⊕H V2 Splicing to obtain H V3 Then H is obtained by linear transformation V As final picture features:
H V =W T H V3 (1)
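A minimal sketch of this bimodal feature extraction, assuming the HuggingFace transformers and torchvision packages; the bert-base-chinese checkpoint, the region-feature dimension (1024) and the projection size are assumptions, and the Faster R-CNN region representation H_V2 is passed in as a stub, since ROI-feature extraction is model-specific:

```python
# Sketch of equation (1): BERT text features, ResNet-152 deep picture
# features, concatenation with region features, and a linear projection.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
from torchvision.models import resnet152

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

resnet = resnet152(weights="DEFAULT")
backbone = nn.Sequential(*list(resnet.children())[:-1])  # pooled conv output

W = nn.Linear(2048 + 1024, 768, bias=False)  # assumed dims for H_V1 ⊕ H_V2

def extract_features(text, image_tensor, region_repr):
    # H_T: BERT token representations of the text
    tokens = tokenizer(text, return_tensors="pt", truncation=True)
    H_T = bert(**tokens).last_hidden_state      # (1, seq_len, 768)
    # H_V1: deep picture feature from pretrained ResNet-152
    H_V1 = backbone(image_tensor).flatten(1)    # (1, 2048)
    # H_V3 = H_V1 ⊕ H_V2, then the linear map of equation (1) gives H_V
    H_V3 = torch.cat([H_V1, region_repr], dim=-1)  # region_repr: (1, 1024), stub
    return H_T, W(H_V3)                         # H_V: (1, 768)
```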
Next, by analyzing the microblog and Twitter datasets, global features of the user are constructed, including the user's behavioral features, language features and text summary, as shown in Table 1:
Table 1. User global features

Behavioral features: number of followers; number of users followed; average number of likes per post; average number of comments per post; number of posts; standard deviation of posting time; number of pictures posted.
Language features: number of occurrences of depression-related words; readability of each text; emotion score of each text.
Text summary: key sentences extracted from the user's texts.
The readability scores of the microblog users' and Twitter users' texts are calculated using a Chinese text readability formula and the Dale-Chall readability formula, respectively, as shown below:
Y_Chinese = -11.946 + 0.123X_1 + 0.198X_2 + 0.811X_3    (2)

Y_English = 0.1579 × (DifficultWords / Words × 100) + 0.0496 × (Words / Sentences)    (3)

where Y_Chinese is the readability score calculated by the Chinese text readability formula, in which X_1 is the average sentence length and X_2 is the number of difficult words; Y_English is the readability score calculated by the Dale-Chall readability formula, in which Words is the number of words, Sentences is the number of sentences and DifficultWords is the number of difficult words.
The emotion score of each microblog user's text is calculated using the SnowNLP Chinese text processing library: the sentiment function of SnowNLP returns, for each input text, an emotion probability between [0, 1]; the closer to 1, the more positive the emotion, and the closer to 0, the more negative. For Twitter users, the TextBlob library is used for the corresponding emotion score calculation. Finally, the TextRank algorithm is used to extract the user's text summary. TextRank is a graph-based ranking algorithm for keyword extraction and document summarization that extracts the key sentences of a given text; its calculation formula is:

WS(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] × WS(V_j)    (4)

where WS(V_i) is the weight of sentence i, the summation on the right represents the contribution of each adjacent sentence to sentence i, w_ji is the similarity of sentences j and i, w_jk is the similarity of sentences j and k, WS(V_j) is the weight of sentence j from the previous iteration, sentence i is regarded as node V_i, In(V_i) is the set of nodes pointing to V_i, Out(V_j) is the set of nodes that V_j points to, and d is the damping coefficient, typically 0.85. According to formula (4), the weights are propagated iteratively to calculate the scores of all sentences, and a set number of the most important sentences are extracted as candidate summary sentences.
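As a concrete illustration, a minimal sketch of these language features, assuming the snownlp and textblob packages and a precomputed sentence-similarity matrix; the textrank helper is a hypothetical name that implements formula (4) directly:

```python
# Sketch of the language features: emotion scores (SnowNLP for Chinese,
# TextBlob for English) and TextRank sentence weights per formula (4).
import numpy as np
from snownlp import SnowNLP      # Chinese sentiment, in [0, 1]
from textblob import TextBlob    # English sentiment polarity

def sentiment_scores(texts, chinese=True):
    if chinese:
        return [SnowNLP(t).sentiments for t in texts]
    return [TextBlob(t).sentiment.polarity for t in texts]

def textrank(sim, d=0.85, iters=50):
    """sim[j][i]: similarity w_ji between sentences j and i."""
    n = sim.shape[0]
    ws = np.ones(n)
    out_sum = sim.sum(axis=1)     # sum over Out(V_j) of w_jk, per row j
    for _ in range(iters):
        # WS(V_i) = (1 - d) + d * sum_j [w_ji / sum_k w_jk] * WS(V_j)
        ws = (1 - d) + d * (sim / np.maximum(out_sum[:, None], 1e-9)).T @ ws
    return ws                     # highest-weight sentences form the summary
```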
Step 3: cross-modal attention mechanism feature fusion, the embodiment provides a multi-modal interaction module for learning image-aware word representations and word-aware image representations.
Image-aware word representation: the invention applies an m-head cross-modal attention layer, which takes the image representation H_V as the query vector Q, and treats the context representation H_T as the queried vector K and the content vector V, as follows:

CA_i(H_V, H_T) = softmax( (W_qi H_V)(W_ki H_T)^T / √(d/m) ) (W_vi H_T)    (5)

MH_CA(H_V, H_T) = W'[CA_1(H_V, H_T), ..., CA_m(H_V, H_T)]^T    (6)

where CA_i denotes the i-th head of cross-modal attention and CA_m the m-th head, {W_qi, W_ki, W_vi} ∈ R^{(d/m)×d} are the learnable parameters corresponding to Q, K and V respectively, W' is the weight matrix of the multi-head attention, d is the word-vector dimension (d = 768), m is the number of heads in the multi-head attention mechanism, and MH_CA(H_V, H_T) denotes the connection of the m cross-modal attention heads. The final image-aware word representation is then obtained through two normalization layers and a feed-forward neural network:

H_VT' = LN(H_T + MH_CA(H_V, H_T))    (7)

H_VT = LN(H_VT' + FFN(H_VT'))    (8)

where H_T is the original context representation, FFN is the feed-forward neural network and LN is the normalization layer. H_T first passes through the cross-modal attention layer to obtain MH_CA(H_V, H_T); the result is added to H_T and passed through a normalization layer to obtain H_VT'. H_VT' then undergoes a linear change through the feed-forward neural network, is added back residually, and passes through another normalization layer to yield the final image-aware word representation H_VT.
Word-aware image representation: the m-head cross-modal attention layer is used again, now regarding H_T as Q and H_V as K and V, with a feed-forward neural network layer and normalization layers added. The formulas are consistent with those of the image-aware word representation, differing only in the roles of Q, K and V:

CA_i(H_T, H_V) = softmax( (W_qi H_T)(W_ki H_V)^T / √(d/m) ) (W_vi H_V)    (9)

MH_CA(H_T, H_V) = W'[CA_1(H_T, H_V), ..., CA_m(H_T, H_V)]^T    (10)

H_TV' = LN(H_V + MH_CA(H_T, H_V))    (11)

H_TV = LN(H_TV' + FFN(H_TV'))    (12)

H = H_VT ⊕ H_TV    (13)

where H_TV is obtained analogously to H_VT and is not described again here. To integrate the image-aware word representation and the word-aware image representation, H_VT and H_TV are concatenated to obtain the final hidden representation H of a single image-text pair of the user.
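A minimal PyTorch sketch of this cross-modal fusion, with nn.MultiheadAttention standing in for the m-head cross-modal attention layer; treating both modalities as pooled single-vector sequences is an assumption made so that the residual additions in equations (7) and (11) type-check:

```python
# Sketch of equations (5)-(13): two cross-modal attention blocks plus
# concatenation of the resulting representations.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, d=768, m=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, m, batch_first=True)
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, q, kv):
        # eq. (7)/(11): H' = LN(kv + MH_CA(q, kv))
        h = self.ln1(kv + self.attn(q, kv, kv, need_weights=False)[0])
        # eq. (8)/(12): H = LN(H' + FFN(H'))
        return self.ln2(h + self.ffn(h))

H_T = torch.randn(1, 1, 768)          # pooled text representation
H_V = torch.randn(1, 1, 768)          # pooled picture representation
H_VT = CrossModalBlock()(H_V, H_T)    # image-aware word repr, eqs. (5)-(8)
H_TV = CrossModalBlock()(H_T, H_V)    # word-aware image repr, eqs. (9)-(12)
H = torch.cat([H_VT, H_TV], dim=-1)   # eq. (13): H = H_VT ⊕ H_TV
```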
Step 4: if the image-text pair H of the user is regarded as a node, the connected edges represent the similarity between the image-text pairs, namely the weight, so as to construct an adjacent matrix of the image-text pairs. To better capture hidden relationships between user pairs of images, an adaptive adjacency matrix graph packing network is used to automatically learn the relationships between user pairs of images, adaptive adjacency matrix A adp The following is shown:
A_adp = Softmax(ReLU(E_1 E_2^T))    (14)
where the parameter E_1 is the source node embedding and the parameter E_2 is the target node embedding, both learned for the different image-text pairs H. Multiplying E_1 by E_2^T yields the spatial dependency weights between source and target nodes; the ReLU activation function eliminates weak connections, and finally the Softmax function normalizes the adaptive adjacency matrix. The normalized adaptive adjacency matrix can be regarded as the transition matrix of a hidden diffusion process.
By combining predefined spatial dependencies with self-learned hidden graph dependencies, the graph convolutional layer is defined as follows:
Z = Σ_{k=0}^{K} ( P_f^k X W_{k1} + P_b^k X W_{k2} + A_adp^k X W_{k3} )    (15)
where P_f = A/rowsum(A) is the forward transition matrix, P_b = A^T/rowsum(A^T) is the backward transition matrix, k indexes the power series and K is a constant upper bound, X is the input value, Z is the output value, W_{k1}, W_{k2}, W_{k3} are matrix parameters of the model, and A is the adjacency matrix. Assuming each user publishes n image-text pairs, the set of hidden representations of the user's image-text pairs H' = [H_1; H_2; ...; H_n] is obtained through the cross-modal attention feature fusion module, and the final fusion feature Z of the user's multiple texts and pictures is obtained through the GCN fusion module.
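A minimal PyTorch sketch of the adaptive graph convolution of equations (14)-(15); the node-embedding dimension and the use of nn.Linear for the W_{k} parameter matrices are assumptions:

```python
# Sketch of a diffusion-style GCN layer with a self-learned adjacency:
# eq. (14) builds A_adp; the loop accumulates the powers of eq. (15).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGCN(nn.Module):
    def __init__(self, n_nodes, d, emb_dim=10, K=2):
        super().__init__()
        self.E1 = nn.Parameter(torch.randn(n_nodes, emb_dim))  # source embeddings
        self.E2 = nn.Parameter(torch.randn(n_nodes, emb_dim))  # target embeddings
        self.K = K
        self.W1 = nn.ModuleList([nn.Linear(d, d) for _ in range(K + 1)])
        self.W2 = nn.ModuleList([nn.Linear(d, d) for _ in range(K + 1)])
        self.W3 = nn.ModuleList([nn.Linear(d, d) for _ in range(K + 1)])

    def forward(self, X, A):
        # eq. (14): A_adp = Softmax(ReLU(E_1 E_2^T))
        A_adp = F.softmax(F.relu(self.E1 @ self.E2.T), dim=-1)
        P_f = A / A.sum(dim=1, keepdim=True).clamp(min=1e-9)        # forward
        P_b = A.T / A.T.sum(dim=1, keepdim=True).clamp(min=1e-9)    # backward
        Z, Xf, Xb, Xa = 0, X, X, X
        for k in range(self.K + 1):
            # eq. (15): P_f^k X W_k1 + P_b^k X W_k2 + A_adp^k X W_k3
            Z = Z + self.W1[k](Xf) + self.W2[k](Xb) + self.W3[k](Xa)
            Xf, Xb, Xa = P_f @ Xf, P_b @ Xb, A_adp @ Xa  # next matrix power
        return Z   # fusion feature over the user's n image-text pairs
```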
Step 5: for the user text abstract, the invention firstly uses a TextRank algorithm to extract a plurality of key sentences of the user, then uses a BERT model to generate sentence vector representation for each key sentence, and then generates final user text abstract representation by a weighted summation method; for the extracted user behavior and language features, firstly, normalization processing is carried out, and the purpose is to hope that features with different dimensions have similar value ranges, so that an optimal solution can be found out through gradient descent more quickly, and then the optimal solution is connected with a text abstract to form global information G of a user. And finally, fusing the user global information G and the fusion characteristic Z obtained by the self-adaptive GCN characteristic fusion module by using a multi-head self-attention mechanism to obtain the final user characteristic representation U.
Step 6: result prediction. The user feature representation U is input into the classifier, and the probabilities of depression and normal are calculated using the Softmax function, with the following formula:
P=Softmax(W p U+b p ) (16);
where W_p and b_p are the weight parameter and the bias parameter; Softmax maps the outputs of multiple neurons to values in the interval (0, 1) and normalizes them so that all elements sum to 1. Since the depressed emotion prediction task is a binary classification problem, only the probabilities of depression and normal are output.
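A minimal sketch of the classifier of equation (16), assuming U has dimension 768:

```python
# Sketch of equation (16): P = Softmax(W_p U + b_p), two output classes.
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(768, 2),      # W_p U + b_p
    nn.Softmax(dim=-1),     # probabilities summing to 1
)
# P = classifier(U)  ->  [depression probability, normal probability]
```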
Step 7: comparing the final model with a plurality of reference models under the Twitter dataset and the microblog dataset respectively; the effectiveness of the different modules of CGFNet was verified, and the ablation experiment is shown in fig. 4.
The invention adopts the Faster R-CNN object detection algorithm to identify the object categories in pictures; these correspond to the objects described in the text, providing more semantic information for the subsequent fusion of image-text pairs.
The invention utilizes the cross-modal attention mechanism to learn image-aware word representations and word-aware image representations, effectively fusing the bimodal information of texts and pictures, and uses the graph convolutional network to learn the relations among all image-text pairs.
The invention proposes the user's text summary representation and user behavior features, eliminating redundant information irrelevant to depressed emotion in the depression corpus and providing the model with the user's global information. Experimental results show that CGFNet makes good use of the user's global information and multi-modal information, thereby improving the model's accuracy in predicting depressed emotion.
To verify the effectiveness of CGFNet, the Twitter and microblog datasets were used and a variety of reference methods were compared, with precision (P), recall (R), the harmonic mean of precision and recall (F1) and accuracy (Acc) as evaluation indexes for depression emotion prediction; the results are shown in Tables 2 and 3:
table 2 experimental results of methods on microblog datasets
Table 3 experimental results of each method on the Twitter dataset
When only text information is used, three baseline methods are compared: SVM+User_Behaviour, RandomForest+User_Behaviour and BERT+LSTM, all of which use only the text single-mode information. It can be found that the BERT-based BERT+LSTM improves to a certain extent over the traditional SVM and random forest methods on both datasets, which shows that BERT can mine the user's text feature information more effectively. If depression prediction is performed using only picture information, the ResNet-152 model works better than the VGG16 model; therefore, this embodiment uses ResNet-152 to extract the deep features of pictures. In addition, the effect of both the VGG16 model and the ResNet-152 model improves after user behavior features are added, which shows that user behavior features help improve performance on the depressed emotion prediction task. Meanwhile, the text-only experimental effect is significantly better than the picture-only one, because the number of texts in the dataset is significantly larger than the number of pictures and the texts contain more useful information, so text can classify depression more effectively than the picture-only mode.
When the compared methods use both text and picture information, it can be found that on both datasets the performance of methods such as EF-LSTM and MTAL is significantly better than the single-mode experimental results, which shows that pictures also contain important information and that multi-modal data can effectively improve the performance of depressed emotion prediction. Compared with the baseline methods, CGFNet shows the best performance on both datasets. On the microblog dataset, its P value, R value, F1-score and accuracy are respectively 2.7%, 3.1%, 2.2% and 2.1% higher than those of the DDSM method, and 1.9%, 3.1%, 2.6% and 2.5% higher than those of the MulHierDR method; on the Twitter dataset, each index likewise improves over the DDSM and MulHierDR methods. This shows that the cross-modal graph convolution fusion network model of the invention can fully mine the user's text and picture information, remove redundant text information, and, by means of the cross-modal attention mechanism and the graph convolutional neural network, fuse multi-modal features well to improve the performance of depressed emotion prediction.
To further explore the effectiveness of the different modules of CGFNet, 5 variant models were designed, and ablation experiments were performed on the microblog dataset and the Twitter dataset respectively. The descriptions of these 5 models are given in Table 4:
table 4 description of various models in ablation experiments
The results of the ablation experiments are shown in figs. 4 and 5. After the GCN is added, all indexes of the model improve significantly on both datasets, which means that the GCN feature fusion module can effectively fuse a user's multiple text and picture features. Meanwhile, after the user text summary and user behavior features are added, the performance of the model improves to a certain extent, indicating that they provide additional useful information for the depressed emotion prediction task. Finally, on the basis of Base+GCN, after the user text summary and user behavior features are added, the model's P, R, F1 and Acc values are all optimal, which shows that combining the user's image-text pair information with the user's global information can significantly improve model performance and better predict depressed emotion.
In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts between the embodiments, reference may be made to one another.
The principles and embodiments of the present invention have been described herein with reference to specific examples, whose description is intended only to assist in understanding the method of the present invention and its core ideas; meanwhile, a person of ordinary skill in the art may, according to the idea of the present invention, make changes to the specific implementation and application scope. In view of the foregoing, this description should not be construed as limiting the invention.

Claims (8)

1. A depression mood prediction system based on deep learning and bimodal data, comprising:
the to-be-predicted bimodal data acquisition module is used for determining a plurality of image-text pairs according to user bimodal data acquired from social media, wherein each image-text pair comprises a text and a picture;
the depression emotion prediction model is connected with the bimodal data acquisition module to be predicted and is used for:
extracting features of a plurality of image-text pairs to obtain text features, picture features and user global information;
fusing the characteristic information of each image-text pair of the user by adopting a cross-modal attention mechanism;
adopting an adaptive graph convolutional network to perform feature fusion on the feature information of the plurality of image-text pairs to obtain fusion features;
fusing the user global information and the fused features by adopting a multi-head self-attention mechanism to obtain a feature representation;
and inputting the feature representation into a classifier to obtain the depression probability of the user.
2. The depressed emotion prediction system based on deep learning and bimodal data according to claim 1, wherein the depressed emotion prediction model comprises a feature extraction module for feature extraction of each image-text pair to obtain text features, picture features and user global information;
the feature extraction module comprises a BERT encoder, and the BERT encoder is used for extracting features of text data to obtain text features.
3. The deep learning and bimodal data based depression mood prediction system as claimed in claim 2, wherein the feature extraction module further comprises a residual network, a Faster R-CNN and a splicing module;
the residual error network is used for extracting the characteristics of the picture to obtain the deep characteristics of the picture;
the FasterR-CNN is used for carrying out target detection on the picture to obtain picture region representation;
the splicing module is used for carrying out linear transformation after splicing the deep features of the picture and the picture region representation to obtain the picture features.
4. The deep learning and bimodal data based depression mood prediction system as claimed in claim 1 wherein the user global information comprises behavioral characteristics, linguistic characteristics and text summaries;
the behavior characteristics comprise the user's number of followers, the number of users the user follows, the average number of likes per post, the average number of comments per post, the number of posts, the standard deviation of posting time and the number of pictures;
the language features include the number of occurrences of depression-related words, the readability of each text, and the emotion of each text.
5. The deep learning and bimodal data based depression emotion prediction system of claim 4, wherein the feature extraction module is further configured to calculate text readability using a Chinese text readability formula or the Dale-Chall readability formula.
6. The deep learning and bimodal data based depressed emotion prediction system of claim 4, wherein said feature extraction module is further configured to calculate emotion degrees of text using the SnowNLP Chinese text processing library or the TextBlob library.
7. The deep learning and bimodal data based depression emotion prediction system of claim 4 wherein the feature extraction module is further configured to extract a text summary from the text using TextRank algorithm.
8. The deep learning and bimodal data based depression mood prediction system as claimed in claim 1 wherein the classifier is used to calculate the probability of depression using a Softmax function.
CN202310146956.1A 2023-02-22 2023-02-22 Depression emotion prediction system based on deep learning and bimodal data Pending CN116030271A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310146956.1A CN116030271A (en) 2023-02-22 2023-02-22 Depression emotion prediction system based on deep learning and bimodal data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310146956.1A CN116030271A (en) 2023-02-22 2023-02-22 Depression emotion prediction system based on deep learning and bimodal data

Publications (1)

Publication Number Publication Date
CN116030271A true CN116030271A (en) 2023-04-28

Family

ID=86081399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310146956.1A Pending CN116030271A (en) 2023-02-22 2023-02-22 Depression emotion prediction system based on deep learning and bimodal data

Country Status (1)

Country Link
CN (1) CN116030271A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116978409A (en) * 2023-09-22 2023-10-31 苏州复变医疗科技有限公司 Depression state evaluation method, device, terminal and medium based on voice signal



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination