CN116258147A - Multimodal comment emotion analysis method and system based on heterogeneous graph convolution - Google Patents

Multimodal comment emotion analysis method and system based on heterogeneous graph convolution

Info

Publication number
CN116258147A
Authority
CN
China
Prior art keywords
vector
knowledge
comment
word
image
Legal status
Pending
Application number
CN202310083964.6A
Other languages
Chinese (zh)
Inventor
陈羽中
万宇杰
Current Assignee
Fuzhou University
Original Assignee
Fuzhou University
Application filed by Fuzhou University
Priority to CN202310083964.6A
Publication of CN116258147A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/02: Knowledge representation; Symbolic representation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a multimodal comment emotion analysis method based on heterogeneous graph convolution, which comprises the following steps. Step A: collect user comments and related images, extract the aspect words of the products or services involved in the user comments, and label the emotion polarity of each user comment with respect to a specific aspect of the product or service, so as to construct a training set DB. Step B: use the training set DB to train a deep learning network model DLM based on a knowledge graph and a heterogeneous graph convolutional network, which analyzes the emotion polarity of user comments and related images toward specific aspects of products or services. Step C: input a user comment, its related image, and the aspect word of the product or service in question into the trained deep learning network model to obtain the emotion polarity of the user comment and related image with respect to the specific aspect of the product or service. The method and system help to improve the accuracy of emotion classification.

Description

Multimodal comment emotion analysis method and system based on heterogeneous graph convolution
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a multimodal comment emotion analysis method and system based on heterogeneous graph convolution.
Background
Emotion analysis (Sentiment Analysis, SA), also known as opinion mining, is a fundamental task in the field of Natural Language Processing (NLP). Its basic goal is to mine the emotion information in a given text and analyze its emotion polarity. Facing ever-growing volumes of comment text, however, the collection and analysis of massive amounts of information on the network cannot be completed manually, which has drawn the attention of research institutions to emotion analysis technology. According to classification granularity, the task is divided into document-level, sentence-level, and aspect-level emotion analysis. Early emotion analysis research focused mainly on the document level and the sentence level, i.e., it assumed that emotion is expressed toward only one entity in a document or sentence. Although document-level and sentence-level tasks have been widely studied in the emotion analysis field, due to their inherent limitations, conventional document-level or sentence-level emotion analysis models can only analyze an entire document or sentence to identify its emotion polarity, which cannot meet the requirements of practical applications. With the rapid development of the Internet, online social media platforms and online shopping platforms present a large number of comment texts; when one comment sentence involves multiple aspects whose emotion polarities differ, document-level or sentence-level emotion analysis models obviously cannot correctly interpret the emotion information in the comment. Thus, fine-grained emotion analysis for specific entity aspects has become a major research problem of current emotion analysis tasks. Aspect-level emotion analysis (Aspect-level Sentiment Analysis) aims to judge the emotion polarity corresponding to a specific aspect of a target in comment text; the task involves core natural language processing problems such as lexical semantics, coreference resolution, and opinion extraction, and therefore has strong theoretical research significance and application value.
In addition, in an era of rapid network development, people tend to express their views and moods in the form of text-image combinations or videos. Multimodal language data offers a richer and more attractive form of expression, which has given it an overwhelming advantage on major social media websites while providing ample data resources for research on multimodal language computing. In recent years, multimodal emotion analysis has become a key task in the emotion analysis field; emotion analysis in multimodal contexts brings machines closer to realistic human emotion processing. For aspect-level emotion analysis tasks, image information is generally as indicative as text information. On the one hand, in multimodal data, both text and images are highly relevant to aspect emotions; furthermore, different aspects may be associated with different portions of each modality's data. In other words, a customer may write different text or attach different images for different aspects. On the other hand, text and image information can promote and complement each other, enhancing the analysis of emotion toward a particular aspect. In summary, various correlations exist in multimodal data for aspect-level emotion analysis.
In recent years, with the rise of deep learning, this technology has been widely applied to aspect-level emotion analysis tasks. The most common models are Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). CNNs perform well in capturing semantic information from text, while RNNs, especially Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRU), can better extract aspect-specific emotion features for emotion classification based on context information. However, these neural network models ignore the syntactic dependencies between a given aspect and its context words; such dependencies between the words of a comment sentence are particularly important for correctly judging the emotion polarity of an aspect. More recently, scholars have used Graph Neural Networks (GNNs) and their variants to consider syntactic information when learning aspect-level representations for emotion classification. Zhang et al. combine a Graph Convolutional Network (GCN) with an attention mechanism to obtain the semantic correlation between context information and aspects. Hang et al. encode sentences using Bi-LSTM and then extract the dependencies between context words using a Graph Attention Network (GAT). However, most existing aspect-level emotion analysis models ignore the implicit emotion information in text data. Some samples in existing datasets carry implicit emotion information; if such data are not handled specifically, a model may fail to recognize irony or implicit expressions, which hinders further improvement of model performance. Moreover, because models trained on smaller datasets can hardly learn such patterns of implicit emotion expression, most efforts to solve the implicit emotion problem assist the model in identifying the implicit emotion information in text by introducing external knowledge. However, most of their knowledge selection algorithms are based on rules or simple attention mechanisms and do not comprehensively take context semantics into account.
With the popularity of multimodal user-generated content (e.g., text, images, speech, or video), emotion analysis has moved beyond traditional text-based analysis. Multimodal emotion analysis is an emerging research field that integrates textual and non-textual information into user emotion analysis. Text-image pairs are the most common form of multimodal data. With the development of deep learning, several neural-network-based models have been proposed for multimodal emotion analysis and have made significant progress. Yu et al. train a logistic regression model on multimodal features obtained by pre-training a text CNN and an image CNN to extract feature representations from text and images, respectively. To fully capture visual semantic information, Xu et al. extract scene and object features from images and use these visual semantic features to attend to text words, simulating the influence of images on text; this also shows that text and images can promote and complement each other in emotion analysis tasks and improve model performance. Accordingly, Xu and Chen et al. [38] propose a co-memory attention mechanism that interactively models the interactions between text and images. Their models take into account the impact of one modality on the other (i.e., text-to-image and image-to-text) and achieve better performance than other related methods. However, at the intersection of aspect-level and multimodal emotion analysis, existing work remains scarce, the above model structures are relatively simple, and the relationship between the image modality and the text modality is not yet fully exploited.
Disclosure of Invention
The invention aims to provide a multimodal comment emotion analysis method and system based on heterogeneous graph convolution, which help to improve the accuracy of emotion classification.
In order to achieve the above purpose, the invention adopts the following technical scheme: a multimodal comment emotion analysis method based on heterogeneous graph convolution, comprising the following steps:
Step A: collect user comments and related images, extract the aspect words of the products or services involved in the user comments, and label the emotion polarity of each user comment with respect to a specific aspect of the product or service, so as to construct a training set DB;
Step B: train a deep learning network model DLM based on a knowledge graph and a heterogeneous graph convolutional network using the training set DB, for analyzing the emotion polarity of user comments and related images toward specific aspects of products or services;
Step C: input a user comment, its related image, and the aspect word of the product or service in question into the trained deep learning network model to obtain the emotion polarity of the user comment and related image with respect to the specific aspect of the product or service.
Further, the step B specifically includes the following steps:
Step B1: encode each training sample in the training set DB to obtain the comment semantic representation vector $X_r$, the aspect semantic representation vector $X_a$, the syntactic dependency adjacency matrix $A_r$, and the image region representation vector $X_{im}$;
Step B2: select the set $Set_{skt}$ of knowledge triples relevant to the comment context from the knowledge graph according to a dynamic knowledge selection mechanism, then encode the knowledge words to obtain the knowledge word representation vector $X_{kg}$ of $Set_{skt}$;
Step B3: from the aspect semantic representation vector $X_a$ and the image region representation vector $X_{im}$, obtain the aspect-dependent image region representation vector $X_{air}$ using an interactive attention mechanism; obtain the set $Set_{sit}$ of tag t-tuples relevant to the comment context through an image tag selection mechanism;
Step B4: average-pool $X_a$ to obtain the aspect average representation vector $\bar{X}_a$; apply position encoding to $X_r$ to obtain the position-enhanced comment representation vector $X_{pw}$; concatenate $X_{pw}$ and $\bar{X}_a$ to obtain the representation vector $C_{sd,0}$; generate the text-knowledge-image heterogeneous graph TKIHG according to a text-knowledge-image heterogeneous graph construction strategy, obtain the adjacency matrix $A_{hg}$ of the TKIHG, and then encode the nodes using a cross-modal attention mechanism to obtain the node representation vector $C_{hg,0}$ of the heterogeneous graph TKIHG;
Step B5: input the representation vector $C_{sd,0}$ and the node representation vector $C_{hg,0}$ of the heterogeneous graph TKIHG into two different L-layer graph convolutional networks, denoted the comment text graph convolutional network RGCN and the text-knowledge-image heterogeneous graph convolutional network HGCN respectively, to learn and extract the syntactic dependencies and the heterogeneous information of context semantics, image tags, and external knowledge, obtaining the text graph convolution representation vector $C_{sd,L}$ and the heterogeneous graph convolution representation vector $C_{hg,L}$;
Step B6: apply an aspect masking operation to the text graph convolution representation vector $C_{sd,L}$ to obtain the comment's text graph convolution mask representation vector $C_{mask,L}$; together with the comment semantic representation vector $X_r$, use an interactive attention mechanism to further enhance the context representation with aspect information that aggregates syntactic information, obtaining the aspect-enhanced representation vector $X_{ea}$ of comment r;
Step B7: combine the heterogeneous graph convolution representation vector $C_{hg,L}$ with the aspect-enhanced representation vector $X_{ea}$ of comment r and with the aspect-dependent image region representation vector $X_{air}$ respectively, using a cross-modal attention mechanism to exploit heterogeneous information such as image tags and external knowledge to strengthen the learned emotion features of the text modality and the image modality, obtaining the heterogeneity-enhanced text representation vector $X_{hm}$ and image representation vector $X_{hair}$; finally concatenate $X_{hm}$ and $X_{hair}$ to obtain the final representation vector $X_{fin}$;
Step B8: input the final representation vector $X_{fin}$ into the final prediction layer, compute the gradient of each parameter in the deep learning network model by back-propagation according to the target loss function, and update each parameter by stochastic gradient descent;
Step B9: terminate the training of the deep learning network model when the change of the loss value between iterations is smaller than a given threshold or the maximum number of iterations is reached.
Further, the step B1 specifically includes the following steps:
Step B11: traverse the training set DB, perform word segmentation on the user comments and aspects in it, remove stop words, and adjust each image related to a comment to 3 × 224 × 224 pixels; each training sample in DB is expressed as d = (r, im, a, p),
where r is a user comment, im is the adjusted image related to the user comment, a is an aspect word or phrase of the product or service involved in the user comment, extracted from comment r, and p ∈ {positive, negative, neutral} is the emotion polarity of the comment toward that aspect;
comment r is expressed as $r = \{w_1^r, w_2^r, \dots, w_n^r\}$, where $w_i^r$ is the i-th word in comment r, i = 1, 2, …, n, and n is the number of words in comment r;
aspect a is expressed as $a = \{w_1^a, w_2^a, \dots, w_m^a\}$, where $w_i^a$ is the i-th word in aspect a, i = 1, 2, …, m, and m is the number of words in aspect a;
Step B12: encode the comment $r = \{w_1^r, w_2^r, \dots, w_n^r\}$ obtained in step B11 with the pre-trained model BERT and reduce the dimension with a fully connected layer to obtain the semantic representation vector of comment r, $X_r = \{x_1^r, x_2^r, \dots, x_n^r\}$, where $x_i^r \in \mathbb{R}^d$ is the semantic representation vector corresponding to the i-th word $w_i^r$ of the comment and d denotes the output dimension;
Step B13: encode the aspect $a = \{w_1^a, w_2^a, \dots, w_m^a\}$ obtained in step B11 with the pre-trained model BERT and reduce the dimension with a fully connected layer to obtain the semantic representation vector of aspect a, $X_a = \{x_1^a, x_2^a, \dots, x_m^a\}$, where $x_i^a \in \mathbb{R}^d$ is the word vector corresponding to the i-th word $w_i^a$ of the aspect and d denotes the output dimension;
Step B14: encode and flatten the adjusted image with the pre-trained model ResNet-152, then obtain the image region representation vector $X_{im} \in \mathbb{R}^{z \times d}$ after dimension reduction by a fully connected layer, where z denotes the size of the output feature map and d the output dimension.
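To make step B14 concrete, here is a minimal sketch of extracting image region features with torchvision's ResNet-152, dropping the pooling and classification head so the 7 × 7 feature map yields z = 49 region vectors; the projection dimension d = 128 is an assumption for illustration.

```python
# Image region encoding sketch for step B14 (d = 128 is an assumption).
import torch
import torch.nn as nn
from torchvision import models
from torchvision.models import ResNet152_Weights

backbone = models.resnet152(weights=ResNet152_Weights.IMAGENET1K_V2)
encoder = nn.Sequential(*list(backbone.children())[:-2]).eval()  # drop avgpool and fc
proj = nn.Linear(2048, 128)                                      # fully connected reduction

img = torch.randn(1, 3, 224, 224)                                # adjusted image (step B11)
with torch.no_grad():
    fmap = encoder(img)                                          # (1, 2048, 7, 7)
X_im = proj(fmap.flatten(2).transpose(1, 2))                     # (1, 49, 128) region vectors
```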
Step B15: perform syntactic dependency parsing on comment r to obtain the syntactic dependency tree SDT, whose edges $(w_i^r, w_j^r)$ indicate that a syntactic dependency exists between word $w_i^r$ and word $w_j^r$ of the comment;
Step B16: encode the syntactic dependency tree SDT obtained in step B15 into the n-order adjacency matrix $A_r = (a_{ij})_{n \times n}$, where $a_{ij} = 1$ indicates that a syntactic dependency exists between words $w_i^r$ and $w_j^r$ of the comment, and $a_{ij} = 0$ indicates that no syntactic dependency exists between them.
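A minimal sketch of steps B15 and B16, assuming spaCy as the dependency parser (the patent does not name a specific parsing tool):

```python
# Dependency tree to adjacency matrix sketch for steps B15-B16.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

def dependency_adjacency(comment):
    """Parse the comment and encode its dependency tree as a symmetric n-order matrix A_r."""
    doc = nlp(comment)
    n = len(doc)
    A = np.zeros((n, n), dtype=np.float32)
    for tok in doc:
        if tok.i != tok.head.i:                                  # skip the root's self-head
            A[tok.i, tok.head.i] = A[tok.head.i, tok.i] = 1.0    # dependency edge
    return A

A_r = dependency_adjacency("the food is great but service is slow")
```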
Further, the step B2 specifically includes the following steps:
Step B21: select the context words that have an edge connection with the aspect words from the syntactic dependency tree SDT obtained in step B15, and combine them with aspect a to form the seed node set of length m′, $Set_{sn} = \{w_1^{sn}, w_2^{sn}, \dots, w_{m'}^{sn}\}$, where $w_i^{sn}$ is an aspect word or a context word with an edge connection, and m ≤ m′ ≤ n;
Step B22: for each node in the seed node set, select its emotion polarity word and 5 words semantically similar to it from the knowledge graph to form the seed node's 5 candidate knowledge triples CKT, expressed as:
CKT = ⟨$w_{sn}$, $w_{sp}$, $w_{ss}$⟩
where $w_{sn}$ is the seed node word, $w_{sp}$ is the emotion polarity word, and $w_{ss}$ is a semantically similar word;
Step B23: construct each candidate knowledge triple CKT into a candidate knowledge sentence $r_{cks}$, input it into the pre-trained model BERT, and take the average to obtain the average semantic representation vector $X_{cks}$; $r_{cks}$ is expressed as:
$r_{cks}$ = "the word '$w_{sn}$' has emotion polarity '$w_{sp}$', and the word most semantically similar to it is '$w_{ss}$'"
Step B24: average the semantic representation vector $X_r$ of the comment to obtain the average semantic representation vector $X_{rm}$, then compute the cosine similarity between $X_{rm}$ and $X_{cks}$ to obtain the similarity score between comment text r and candidate knowledge sentence $r_{cks}$:
Similarity_Score(r, $r_{cks}$) = CosineSimilarity($X_{rm}$, $X_{cks}$)
where $X_{rm} \in \mathbb{R}^{d}$;
Step B25: compute, as in step B24, the similarity scores of the candidate knowledge sentences constructed from all candidate knowledge triples CKT, and select the original candidate knowledge triples of the top k candidate knowledge sentences with the highest scores to form the knowledge triple set of the seed node, serving as the seed node's external knowledge most relevant to the context;
Step B26: repeat the above steps for the seed node set $Set_{sn}$ to obtain the set containing the knowledge triple sets of all seed nodes, $Set_{skt} = \{SKT_1, SKT_2, \dots, SKT_{m'}\}$, where $SKT_i$ is the knowledge triple set of the i-th seed node $w_i^{sn}$;
Step B27: encode all emotion polarity words and semantically similar words in the knowledge triple set $Set_{skt}$ by knowledge graph embedding to obtain their node representation vectors $X_{kg} = \{x_1^{kg}, x_2^{kg}, \dots\}$, where $x_i^{kg}$ is the knowledge representation vector corresponding to the i-th emotion polarity word or semantically similar word, obtained by looking it up in a pre-trained knowledge word vector matrix $E \in \mathbb{R}^{|V| \times d}$, where d denotes the dimension of the knowledge word vectors and |V| the size of the knowledge word embedding vocabulary.
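The following sketch illustrates the dynamic knowledge selection of steps B22 through B25. The `knowledge_graph.candidates` interface, the `embed` sentence encoder (e.g., mean-pooled BERT), and the value of k are assumptions for illustration, not interfaces defined in the patent.

```python
# Dynamic knowledge selection sketch for steps B22-B25 (assumed interfaces).
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def select_triples(comment_vec, seed_word, knowledge_graph, embed, k=3):
    """Rank a seed node's candidate triples by similarity to the comment context."""
    scored = []
    for w_sp, w_ss in knowledge_graph.candidates(seed_word):   # 5 candidates per seed
        # Candidate knowledge sentence r_cks built from the triple <w_sn, w_sp, w_ss>
        r_cks = (f"the word '{seed_word}' has emotion polarity '{w_sp}', "
                 f"and the word most semantically similar to it is '{w_ss}'")
        score = cosine(comment_vec, embed(r_cks))
        scored.append((score, (seed_word, w_sp, w_ss)))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [triple for _, triple in scored[:k]]                # keep the top-k triples
```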
Further, the step B3 specifically includes the following steps:
Step B31: from the aspect semantic representation vector $X_a$ and the image region representation vector $X_{im}$, obtain the aspect-dependent image region representation vector $X_{air}$ using an interactive attention mechanism; $X_{air}$ is computed by three formulas [shown only as images in the source], in which $(\cdot)^T$ denotes the transpose operation;
Step B32: input the image im into the pre-trained model ResNet-152 to obtain 10 related image tags Tag = {$tag_1$, …, $tag_i$, …, $tag_{10}$}; then input these together with each seed node word of the seed node set $Set_{sn}$ obtained in step B21 into the pre-trained model BERT to obtain their semantic representation vectors, where $x_i^{sn}$ is the semantic representation vector of seed node $w_i^{sn}$ and $x_i^{tag}$ is the semantic representation vector of the i-th image tag $tag_i$;
Step B33: compute the cosine similarity between the semantic representation vector of image tag $tag_i$ and the semantic representation vector of seed node $w_j^{sn}$, obtaining the similarity score between seed node $w_j^{sn}$ and image tag $tag_i$:
Similarity_Score($w_j^{sn}$, $tag_i$) = CosineSimilarity($x_j^{sn}$, $x_i^{tag}$)
Step B34: from the similarity scores between seed node $w_i^{sn}$ and the 10 tags computed in step B33, select the top t image tags with the highest scores and combine them with seed node $w_i^{sn}$ into the tag t-tuple $TT_i = \langle w_i^{sn}, tag_1, \dots, tag_t \rangle$;
Step B35: repeat the above steps for the seed node set $Set_{sn}$ to obtain the set containing the tag t-tuples of all seed nodes, $Set_{sit} = \{TT_1, \dots, TT_i, \dots, TT_{m'}\}$, where $TT_i$ is the tag t-tuple of the i-th seed node $w_i^{sn}$.
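A rough sketch of the image tag selection of steps B32 through B35, using torchvision's ResNet-152 ImageNet class labels as the tag inventory; `embed_word` (a BERT word encoder returning a 1-D vector) and t = 2 are assumptions for illustration.

```python
# Image tag selection sketch for steps B32-B35 (assumed tag inventory and encoder).
import torch
from torchvision import models
from torchvision.models import ResNet152_Weights

weights = ResNet152_Weights.IMAGENET1K_V2
resnet = models.resnet152(weights=weights).eval()
preprocess = weights.transforms()

def top10_tags(pil_image):
    """The 10 most probable class labels for the image (the candidate tags)."""
    with torch.no_grad():
        logits = resnet(preprocess(pil_image).unsqueeze(0))
    top = logits.softmax(-1)[0].topk(10).indices
    return [weights.meta["categories"][i] for i in top]

def tag_t_tuple(seed_word, tags, embed_word, t=2):
    """Pair a seed node with its t most semantically similar image tags."""
    sv = embed_word(seed_word)
    scored = sorted(tags,
                    key=lambda g: -torch.cosine_similarity(sv, embed_word(g), dim=0).item())
    return (seed_word, *scored[:t])
```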
Further, the step B4 specifically includes the following steps:
Step B41: apply average pooling to the aspect semantic representation vector $X_a$ to obtain $\bar{X}_a$:

$$\bar{X}_a = \frac{1}{m}\sum_{i=1}^{m} x_i^a$$

where $\bar{X}_a \in \mathbb{R}^{d}$;
Step B42: apply position encoding to the comment semantic representation vector $X_r$ to obtain the position-enhanced comment representation vector $X_{pw} = \{pw_1 \cdot x_1^r, \dots, pw_n \cdot x_n^r\}$, where $pw_i \cdot x_i^r$ is the position-enhanced representation vector corresponding to the i-th word in comment r, "·" denotes the multiplication of a real number and a vector, and $pw_i$, the position weight corresponding to the i-th word in comment r, is computed by a formula [shown only as an image in the source] in which θ and θ+m−1 denote the start and end positions of aspect a in comment r;
Step B43: concatenate the aspect average representation vector $\bar{X}_a$ obtained in step B41 with the $X_{pw}$ obtained in step B42 to obtain the representation vector $C_{sd,0} = \{c_1^{sd,0}, \dots, c_n^{sd,0}\}$, where $c_i^{sd,0} = [pw_i \cdot x_i^r ; \bar{X}_a]$, i = 1, 2, …, n, is the representation vector corresponding to the i-th word of comment r that is input into the graph convolutional network, and ";" denotes the vector concatenation operation; a sketch of these two steps follows below.
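A sketch of the position weighting and concatenation of steps B42 and B43. The exact weight formula appears only as an image in the source; the linear decay with distance from the aspect span used below is a common choice in aspect-level models and is an assumption, not the patent's verbatim formula.

```python
# Position weighting and concatenation sketch for steps B42-B43 (assumed decay formula).
import torch

def position_weights(n, theta, m):
    """Weight 1 inside the aspect span, decaying linearly with distance outside (assumption)."""
    w = torch.ones(n)
    for i in range(n):
        if i < theta:
            w[i] = 1 - (theta - i) / n
        elif i > theta + m - 1:
            w[i] = 1 - (i - (theta + m - 1)) / n
    return w

def build_C_sd0(X_r, X_a_avg, theta, m):
    """Position-enhance X_r, then append the aspect average to every word vector."""
    pw = position_weights(X_r.shape[0], theta, m).unsqueeze(1)
    X_pw = pw * X_r
    return torch.cat([X_pw, X_a_avg.expand_as(X_pw)], dim=-1)   # C_{sd,0}

X_r = torch.randn(10, 128)                                      # 10-word comment, d = 128
X_a_avg = torch.randn(1, 128)
C_sd0 = build_C_sd0(X_r, X_a_avg, theta=3, m=2)                 # (10, 256)
```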
Step B44: based on the knowledge triple set of each seed node in the $Set_{skt}$ obtained in step B26, take in turn the emotion polarity word $w_{sp}$ and the semantically similar word $w_{ss}$ of each knowledge triple as knowledge expansion nodes, and add an edge connecting each of them to the relevant seed node in the syntactic dependency tree SDT obtained in step B15; similarly, based on the tag t-tuple of each seed node in the $Set_{sit}$ obtained in step B35, take each image tag in turn as an image tag expansion node and add an edge connecting it to the relevant seed node in the SDT; to avoid redundancy, if a new knowledge expansion node or image tag expansion node already exists in the graph, only an edge is added between it and the relevant seed node; finally, the text-knowledge-image heterogeneous graph TKIHG is obtained, whose edge set contains four types of edges: $(w_i^r, w_j^r)$, indicating that a syntactic dependency exists between comment words $w_i^r$ and $w_j^r$; $(w_i^r, w_{sp})$, indicating that the emotion polarity word of comment word $w_i^r$ is $w_{sp}$; $(w_i^r, w_{ss})$, indicating that the semantically similar word of comment word $w_i^r$ is $w_{ss}$; and $(w_i^r, tag_j)$, indicating that image tag $tag_j$ is related to comment word $w_i^r$;
Step B45: encode the text-knowledge-image heterogeneous graph TKIHG obtained in step B44 into the u-order adjacency matrix $A_{hg} = (a_{ij})_{u \times u}$, where $a_{ij} = 1$ indicates that a connection relationship exists between two words, $a_{ij} = 0$ indicates that no connection relationship exists between them, and u = n + m′ × (t + k); the first n nodes of the graph are the words of the comment text, nodes n+1 through n+m′×t are the words of the related image tag nodes, and nodes n+m′×t+1 through u are the words of the related knowledge expansion nodes.
Step B46: splice the image tag node words of the text-knowledge-image heterogeneous graph TKIHG in sequence onto the comment text r to form a new sequence, input it into the pre-trained model BERT, and reduce the dimension with a fully connected layer to obtain the contextual semantic representation vector of the text modality, $X_{rt} \in \mathbb{R}^{(n + m' \times t) \times d}$;
Step B47: from the contextual semantic representation vector $X_{rt}$ of the text modality and the representation vector $X_{kg}$ of the knowledge expansion nodes obtained in step B27, obtain the semantically guided external knowledge representation vector $X_{kg'}$ using a cross-modal attention mechanism, computed by a formula [shown only as an image in the source] in which $W_1$, $W_2$ and $W_3$ are learnable weight matrices;
Step B48: map the contextual semantic representation vector $X_{rt}$ of the text modality into the same feature space as $X_{kg'}$ using a self-attention mechanism, obtaining the transformed contextual semantic representation vector $X_{rt'}$, computed by a formula [shown only as an image in the source] in which $X_{rt'}$ is the mapped contextual semantic representation vector of the text modality and $W_4$, $W_5$ and $W_6$ are learnable weight matrices;
Step B49: concatenate the transformed contextual semantic representation vector $X_{rt'}$ with the semantically guided external knowledge representation vector $X_{kg'}$ to obtain the node representation vector of the text-knowledge-image heterogeneous graph TKIHG, $C_{hg,0} = [X_{rt'} ; X_{kg'}]$, where $C_{hg,0} \in \mathbb{R}^{u \times d}$.
Further, the step B5 specifically includes the following steps:
Step B51: for the comment text graph convolutional network RGCN, input the representation vector $C_{sd,0}$ obtained in step B43 into the first graph convolution layer, update the representation vector of each word node using the adjacency matrix $A_r$, and output $C_{sd,1}$, which serves as the input of the next graph convolution layer;
here $C_{sd,1} = \{c_1^{sd,1}, \dots, c_n^{sd,1}\}$, where $c_i^{sd,1}$, the output of the i-th node in the first graph convolution layer, is computed as:

$$\tilde{c}_i^{\,sd,1} = \sum_{j=1}^{n} A^{r}_{ij} W^{sd,1} c_j^{sd,0}, \qquad c_i^{sd,1} = \mathrm{ReLU}\left(\frac{\tilde{c}_i^{\,sd,1}}{d_i + 1} + b^{sd,1}\right)$$

where $W^{sd,1}$ is a learnable weight matrix and $b^{sd,1}$ is a bias vector; ReLU is the activation function; the i-th node of the graph convolutional network corresponds to the i-th word $w_i^r$ of comment r; $d_i$ denotes the degree of the i-th node, and $d_i + 1$ prevents calculation errors when the degree of the i-th node is 0;
Step B52: for the text-knowledge-image heterogeneous graph convolutional network HGCN, input the node representation vector $C_{hg,0}$ of the heterogeneous graph TKIHG obtained in step B49 into the first graph convolution layer, update the representation vector of each node using the adjacency matrix $A_{hg}$, and output $C_{hg,1}$, which serves as the input of the next graph convolution layer;
here $C_{hg,1} = \{c_1^{hg,1}, \dots, c_u^{hg,1}\}$, where $c_i^{hg,1}$, the output of the i-th node in the first graph convolution layer, is computed as:

$$\tilde{c}_i^{\,hg,1} = \sum_{j=1}^{u} A^{hg}_{ij} W^{hg,1} c_j^{hg,0}, \qquad c_i^{hg,1} = \mathrm{ReLU}\left(\frac{\tilde{c}_i^{\,hg,1}}{d_i + 1} + b^{hg,1}\right)$$

where $W^{hg,1}$ is a learnable weight matrix and $b^{hg,1}$ is a bias vector; ReLU is the activation function; an edge between two nodes of the graph convolutional network indicates that a connection relationship exists between them; $d_i$ denotes the degree of the i-th node, and $d_i + 1$ prevents calculation errors when the degree of the i-th node is 0;
Step B53: input $C_{sd,1}$ and $C_{hg,1}$ into the next graph convolution layers of RGCN and HGCN respectively, and repeat steps B51 and B52;
for the comment text graph convolutional network RGCN, the output $C_{sd,l}$ of the l-th graph convolution layer serves as the input of the (l+1)-th layer, and the text graph convolution representation vector $C_{sd,L}$ is obtained when the iteration ends; for the text-knowledge-image heterogeneous graph convolutional network HGCN, the output $C_{hg,l}$ of the l-th graph convolution layer serves as the input of the (l+1)-th layer, and the heterogeneous graph convolution representation vector $C_{hg,L}$ is obtained when the iteration ends; L is the number of graph convolution layers, and 1 ≤ l ≤ L.
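A minimal sketch of one degree-normalized graph convolution layer as described in steps B51 and B52, with a learnable weight matrix and bias, ReLU activation, and division by d_i + 1; the stacking into L layers mirrors step B53. Dimensions and the toy inputs are assumptions.

```python
# Graph convolution layer sketch for steps B51-B53 (assumed dimensions).
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # learnable weight matrix
        self.b = nn.Parameter(torch.zeros(dim))    # bias vector

    def forward(self, C, A):
        # C: (num_nodes, dim) node representations; A: (num_nodes, num_nodes) adjacency
        agg = A @ self.W(C)                        # sum over connected neighbors
        deg = A.sum(dim=1, keepdim=True)           # node degree d_i
        return torch.relu(agg / (deg + 1) + self.b)  # d_i + 1 guards against degree 0

# Stacking L layers for the RGCN branch (the HGCN branch is identical, with A_hg):
L, d = 2, 128
rgcn = nn.ModuleList(GCNLayer(d) for _ in range(L))
C = torch.randn(10, d)        # C_{sd,0} for a 10-word comment
A = torch.eye(10)             # stand-in adjacency A_r
for layer in rgcn:
    C = layer(C, A)           # C_{sd,L} after the loop
```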
Further, the step B6 specifically includes the following steps:
Step B61: apply an aspect masking operation to the text graph convolution representation vector $C_{sd,L}$ obtained in step B53, masking the text graph convolution outputs that do not belong to aspect words, to obtain the text graph convolution aspect representation vector $C_{mask,L}$ of comment r:

$$C_{mask,L} = \{0, \dots, c_{\theta}^{sd,L}, \dots, c_{\theta+m-1}^{sd,L}, \dots, 0\}$$

where θ denotes the start position and θ+m−1 the end position of the aspect in the comment sentence, $c_i^{sd,L}$ denotes the graph convolution aspect representation vector corresponding to the i-th word of the comment, and 0 denotes a zero vector of dimension d;
Step B62: input the semantic representation vector $X_r$ of comment r obtained in step B12 and the graph convolution aspect representation vector $C_{mask,L}$ of comment r obtained in step B61 into an interactive attention network; through the interactive attention mechanism, the context representation is enhanced with aspect information that aggregates syntactic information, giving the aspect-enhanced representation vector $X_{ea}$ of comment r; the three formulas of this computation [shown only as images in the source] use the transpose operation $(\cdot)^T$ and the attention weight $\beta_i$ of the i-th word in comment r.
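A sketch of the aspect masking and retrieval-style interactive attention of steps B61 and B62. The attention equations are shown only as images in the source; the dot-product form below is a common instantiation and an assumption, not the patent's verbatim formula.

```python
# Aspect masking and interactive attention sketch for steps B61-B62 (assumed attention form).
import torch

def aspect_mask(C_sd_L, theta, m):
    """Zero out every node representation outside the aspect span [theta, theta+m-1]."""
    mask = torch.zeros_like(C_sd_L)
    mask[theta:theta + m] = 1.0
    return C_sd_L * mask                               # C_{mask,L}

def aspect_enhanced(X_r, C_mask_L):
    """Score each context word against the masked aspect nodes, softmax-normalize
    (beta_i), and pool the context representations into X_ea."""
    scores = (X_r @ C_mask_L.T).sum(dim=1)             # one scalar per context word
    beta = torch.softmax(scores, dim=0)                # attention weights beta_i
    return beta @ X_r                                  # X_{ea}

X_r = torch.randn(10, 128)                             # comment of 10 words, d = 128
C = torch.randn(10, 128)                               # C_{sd,L}
X_ea = aspect_enhanced(X_r, aspect_mask(C, theta=3, m=2))
```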
Further, the step B7 specifically includes the following steps:
Step B71: from the heterogeneous graph convolution representation vector $C_{hg,L}$ obtained in step B53 and the aspect-enhanced representation vector $X_{ea}$ of comment r, use a cross-modal attention mechanism to further exploit heterogeneous information such as image tags and external knowledge to strengthen the learned text modality, obtaining the heterogeneity-enhanced text representation vector $X_{hm}$, computed by a formula [shown only as an image in the source] in which $W_9$, $W_{10}$ and $W_{11}$ are learnable weight matrices;
Step B72: from the heterogeneous graph convolution representation vector $C_{hg,L}$ and the aspect-dependent image region representation vector $X_{air}$, use a cross-modal attention mechanism to further exploit heterogeneous information such as image tags and external knowledge to strengthen the learned emotion features of the image modality, obtaining the heterogeneity-enhanced image representation vector $X_{hair}$, computed by a formula [shown only as an image in the source] in which $W_{12}$, $W_{13}$ and $W_{14}$ are learnable weight matrices;
Step B73: concatenate the heterogeneity-enhanced text representation vector $X_{hm}$ and image representation vector $X_{hair}$ to obtain the final representation vector:

$$X_{fin} = [X_{hm} ; X_{hair}]$$
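A sketch of the cross-modal attention of steps B71 and B72 together with the final concatenation of step B73 and a prediction layer as in step B8. The scaled dot-product form with three learnable projections is an assumed instantiation of the attention equations, which the source shows only as images.

```python
# Cross-modal attention and fusion sketch for steps B71-B73 and B8 (assumed attention form).
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.Wq, self.Wk, self.Wv = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, query, heterograph_nodes):
        # query: (Lq, d) text or image features; heterograph_nodes: (u, d) C_{hg,L}
        q, k, v = self.Wq(query), self.Wk(heterograph_nodes), self.Wv(heterograph_nodes)
        attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v                          # heterogeneity-enhanced features

d, u = 128, 30
attn_text, attn_img = CrossModalAttention(d), CrossModalAttention(d)
classifier = nn.Linear(2 * d, 3)                 # positive / negative / neutral

C_hg_L = torch.randn(u, d)
X_ea, X_air = torch.randn(1, d), torch.randn(1, d)
X_hm = attn_text(X_ea, C_hg_L)                   # (1, d)
X_hair = attn_img(X_air, C_hg_L)                 # (1, d)
X_fin = torch.cat([X_hm, X_hair], dim=-1)        # final representation vector
logits = classifier(X_fin)                       # emotion polarity scores
```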
The invention also provides a multimodal comment emotion analysis system adopting the above method, comprising:
a data collection module, configured to extract user comments and related images, the aspect words in the comments and their position information, label the emotion polarity of each aspect, and construct the training set;
a preprocessing module, configured to preprocess the training samples in the training set, including word segmentation, stop-word removal, image size adjustment, syntactic dependency parsing, selection of the related knowledge triple set and image tag set, and generation of the text-knowledge-image heterogeneous graph;
an encoding module, configured to look up the word vectors of the knowledge words of the knowledge triple set in the pre-trained knowledge graph word vector matrix to obtain the knowledge word representation vectors of the knowledge triple set;
a network training module, configured to input the processed user comments, related images, aspects, text-knowledge-image heterogeneous graphs, and knowledge word representation vectors of the knowledge triple set into the deep learning network to obtain the final representation vectors of the multimodal comments, use the probability that a representation vector belongs to a certain category together with the labels in the training set as the loss, and train the whole deep learning network with loss minimization as the objective, obtaining the deep learning network model based on the knowledge graph and the heterogeneous graph convolutional network; and
an emotion analysis module, configured to extract the aspects in an input user comment using an NLP tool, then analyze the input comment, image, and aspect with the trained deep learning network model based on the knowledge graph and the heterogeneous graph convolutional network, and output the emotion polarity of the user comment and related image with respect to the specific aspect.
Compared with the prior art, the invention has the following beneficial effects: the method and the system first encode the comment sentence, the product aspect, and the image with pre-trained models, and then obtain the knowledge nodes related to the comment sentence using a knowledge graph and a dynamic knowledge selection mechanism; next, aspect-related image tags are obtained with an image tag selection mechanism, and a text-knowledge-image heterogeneous graph is constructed from the knowledge information and the image tag information; the comment sentence representation is then position-weighted with position information, two GCNs learn the syntactic dependencies and the heterogeneous information in the multimodal comment, and finally a cross-modal attention mechanism further exploits heterogeneous information such as image tags and external knowledge to strengthen the learned emotion features of the text modality and the image modality, thereby improving the accuracy of the model's emotion classification.
Drawings
FIG. 1 is a flow chart of a method implementation of an embodiment of the present invention;
FIG. 2 is a schematic diagram of a deep learning network model in an embodiment of the invention;
FIG. 3 is a schematic diagram of a system structure according to an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
As shown in FIG. 1, this embodiment provides a multimodal comment emotion analysis method based on heterogeneous graph convolution, which includes the following steps:
Step A: collect user comments and related images, extract the aspect words of the products or services involved in the user comments, and label the emotion polarity of each user comment with respect to a specific aspect of the product or service, so as to construct the training set DB.
Step B: use the training set DB to train a deep learning network model DLM based on a knowledge graph and a heterogeneous graph convolutional network, for analyzing the emotion polarity of user comments and related images toward specific aspects of products or services. The architecture of the deep learning network model is shown in FIG. 2.
In this embodiment, the step B specifically includes the following steps:
step B1: coding each training sample in the training set DB to obtain a comment semantic representation vector X r Semantic representation vector X of aspects a Syntax dependency adjacency matrix A r Image region characterization vector X im
In this embodiment, the step B1 specifically includes the following steps:
step B11: traversing the training set DB, performing word segmentation processing on user comments and aspects in the training set DB, removing stop words, and adjusting an image related to the comments into an image with 3 multiplied by 224 pixels, wherein each training sample in the DB is expressed as d= (r, im, a, p).
Wherein r is a user comment, im is an adjusted image related to the user comment, a is an aspect word or phrase of a product or service related to the user comment extracted from the user comment r, and p epsilon (positive, negative, neutral) is the emotion polarity of the comment on the aspect.
Comment r is expressed as:
Figure SMS_117
wherein ,
Figure SMS_118
for the i-th word in comment r, i=1, 2, …, n, n is the number of words in comment r.
Aspect a is represented as:
Figure SMS_119
wherein ,
Figure SMS_120
for the i-th word in aspect a, i=1, 2, …, m, m is the number of words of aspect a.
Step B12: comment on step B11
Figure SMS_121
Coding by using a pre-training model BERT and reducing the dimension by using a full-connection layer to obtain a semantic representation vector X of the comment r r Expressed as:
Figure SMS_122
wherein ,
Figure SMS_123
for comment the i-th word->
Figure SMS_124
The corresponding semantic token vector, d, represents the output dimension.
Step B13: aspect obtained for step B11
Figure SMS_125
Encoding by using a pre-training model BERT and reducing dimension by using a full-connection layer to obtain a semantic representation vector X of the aspect a a Expressed as:
Figure SMS_126
wherein ,
Figure SMS_127
representing aspect the i-th word->
Figure SMS_128
The corresponding word vector, d, represents the output dimension.
Step B14: pre-training for adjusted imagesCoding and smoothing a model ResNet-152, and obtaining an image region characterization vector after full-connection layer dimension reduction
Figure SMS_129
Where z represents the feature map size of the output and d represents the output dimension.
Step B15: and carrying out syntactic dependency analysis on the comment r to obtain a syntactic dependency tree SDT.
Figure SMS_130
wherein ,
Figure SMS_131
representing words ∈in comments>
Figure SMS_132
He word->
Figure SMS_133
There is a syntactic dependency between them.
Step B16: encoding the syntax-dependent tree SDT obtained in the step B15 into an n-order adjacency matrix A r ,A r Expressed as:
Figure SMS_134
wherein ,
Figure SMS_135
a1 indicates the word ++in the comment>
Figure SMS_136
He word->
Figure SMS_137
There is a syntactic dependency between->
Figure SMS_138
A0 indicates the word->
Figure SMS_139
He word->
Figure SMS_140
There is no syntactic dependency between them.
Step B2: selecting a Set of knowledge triples from a knowledge graph that are relevant to a comment context according to a dynamic knowledge selection mechanism skt Then coding the knowledge words to obtain a Set skt Knowledge word characterization vector X kg
In this embodiment, the step B2 specifically includes the following steps:
step B21: selecting a context word with edge connection with the aspect word from the syntax dependency tree SDT obtained in the step B15, and combining the context word with the aspect a to form a seed node Set with the length of m' sn Expressed as:
Figure SMS_141
wherein ,
Figure SMS_142
is an aspect word or a context word with edge connection, and m is less than or equal to m' is less than or equal to n.
Step B22: for each node in the seed node set, selecting its emotion polarity word and 5 words similar to its semantic meaning from the knowledge graph to respectively form 5 candidate knowledge triples CKT of the seed node, which are expressed as:
CKT=<w sn ,w sp ,w ss )
wherein ,wsn Is a seed node word, w sp Is emotion polar word, w ss Is a semantically similar word.
Step B23: constructing each candidate knowledge triplet CKT into one candidate knowledge sentence r cks Inputting a pre-training model BERT and taking an average value to obtain an average semantic representation vector X cks ;r cks Expressed as:
r cks = "word' w sn ' emotion electrodeSex is' w sp 'the word most similar to its semantics is' w ss ”’
Step B24: semantic representation vector X of comments r Average semantic representation vector X is obtained after averaging rm Then calculate X rm and Xcks Cosine similarity between the two to obtain comment text r and candidate knowledge sentences r cks The similarity score of (2) is calculated as follows:
Similarity_Score(r,r cks )=CosineSimilarity(X rm ,X cks )
wherein ,Xrm
Figure SMS_143
Step B25: and B24, calculating similarity scores of candidate knowledge sentences constructed by all candidate knowledge triples CKT, and selecting the original candidate knowledge triples of the top k candidate knowledge sentences with the highest scores to form a knowledge triplet set of the seed node as the external knowledge of the seed node which is most relevant to the context.
Step B26: for seed node Set sn Repeating the steps to obtain a Set containing knowledge triples of all seed nodes skt Expressed as:
Figure SMS_144
wherein ,
Figure SMS_145
is the i seed node->
Figure SMS_146
Is a knowledge triplet set.
Step B27: set of knowledge triples by knowledge graph embedding skt All emotion polarity words and semantic similar words of the tree are encoded to obtain node characterization vectors of the tree as follows
Figure SMS_147
wherein ,/>
Figure SMS_148
The knowledge representation vector corresponding to the ith emotion polarity word or semantic similarity word is obtained by training a knowledge word vector matrix in advance
Figure SMS_149
Where d represents the dimension of the knowledge word vector and V is the number of words in which the knowledge word is embedded.
Step B3: representing vector X by semantics of aspects a And image region characterization vector X im Obtaining aspect-dependent image region characterization vector X using interactive attention mechanisms air The method comprises the steps of carrying out a first treatment on the surface of the Acquiring a Set of tag t tuples related to comment context through an image tag selection mechanism sit
In this embodiment, the step B3 specifically includes the following steps:
step B31: semantic representation vector X for a pair of aspects a And image region characterization vector X im Obtaining aspect-dependent image region characterization vector X using interactive attention mechanisms air ,X air The calculation process of (2) is as follows:
Figure SMS_150
Figure SMS_151
Figure SMS_152
wherein ,
Figure SMS_153
(·) T representing the transpose operation.
Step B32: inputting the image im into a pre-training model ResNet-152 to obtain 10 relevant graphs Like tag= { tag 1 ,…,tag i ,…,tag 10 And then adding the same to the seed node Set obtained in the step 21 sn Each seed node word in the tree is respectively input into a pre-training model BERT to obtain semantic characterization vectors of the seed node words
Figure SMS_154
and />
Figure SMS_155
Figure SMS_156
wherein ,/>
Figure SMS_157
Is seed node->
Figure SMS_158
Semantic token vector of>
Figure SMS_159
Is the ith image tag i Is described.
Step B33: calculating an image tag i Semantic characterization vectors and seed nodes of (a)
Figure SMS_160
Cosine similarity of semantic representation vectors of (2) to obtain seed node +.>
Figure SMS_161
And image tag i The similarity score between the two is calculated as follows:
Figure SMS_162
step B34: calculating seed nodes according to step B33
Figure SMS_163
And similarity scores of 10 labels, and selecting top t image labels with highest scores and seed nodes +.>
Figure SMS_164
Forming a tag t tuple TT together; expressed as:
Figure SMS_165
step B35: for seed node Set sn Repeating the steps to obtain a Set containing the label t-tuple of all the seed nodes sit Expressed as:
Set sit ={TT 1 ,…,TT i ,…,TT m′ }
wherein ,TTi Is the ith seed node
Figure SMS_166
Is included in the tag t tuple of (c).
Step B4: for X a Averaging pooling to obtain aspect average characterization vector
Figure SMS_167
For X r Performing position coding to obtain a comment characterization vector X with enhanced positions pw By connection X pw and />
Figure SMS_168
Obtaining a characterization vector C sd,0 The method comprises the steps of carrying out a first treatment on the surface of the Generating a text-knowledge-image heterogeneous graph TKIHG according to a text-knowledge-image heterogeneous composition strategy to obtain an adjacency matrix A of the text-knowledge-image heterogeneous graph TKIHG hg Then the nodes are encoded by using a cross-modal attention mechanism to obtain a node characterization vector C of the heterogeneous graph TKIHG hg,0
In this embodiment, the step B4 specifically includes the following steps:
step B41: semantic representation vector X for a pair of aspects a Carrying out average pooling operation to obtain
Figure SMS_169
The calculation formula is as follows:
Figure SMS_170
wherein ,
Figure SMS_171
step B42: semantic representation vector X for comments r Performing position coding to obtain a comment position-enhanced characterization vector X pw ,X pw Expressed as:
Figure SMS_172
Figure SMS_173
wherein ,
Figure SMS_174
reinforcing a characterization vector for the position corresponding to the ith word in the comment r, wherein "·" represents that real numbers and the vector are multiplied, and pw i The calculation mode of the position weight corresponding to the i-th word in the comment r is as follows:
Figure SMS_175
/>
where θ and θ+m-1 represent the position of the start and end of aspect a in the comment r, respectively.
Step B43: averaging the aspect characterization vectors obtained in step B41
Figure SMS_176
X obtained in step B42 pw Connecting to obtain a characterization vector C sd,0 ,/>
Figure SMS_177
Expressed as:
Figure SMS_178
wherein ,
Figure SMS_179
i=1, 2, …, n, ", which is a token vector corresponding to the i-th word in comment r that is input into the graph convolution network; "means vector join operation.
Step B44: based on the Set obtained in step 26 skt Knowledge triplet sets of each seed node in the tree, and emotion polarity w in each knowledge triplet in turn sp And semantically similar words w ss Respectively serving as knowledge expansion nodes and adding an edge to be connected with the relevant seed nodes in the syntax dependency tree SDT obtained in the step B15; similarly, a Set is obtained based on step 35 sit The label t tuple of each seed node in the tree is used for sequentially connecting each image label serving as an image label expansion node and adding an edge with the seed node related to the syntax dependency tree SDT; to avoid redundancy, if a new knowledge expansion node or image tag expansion node already exists in the graph, only one edge is added between it and the relevant seed node; finally, a text-knowledge-image-different composition TKIHG is obtained.
Figure SMS_180
wherein ,
Figure SMS_183
representing words ∈in comments>
Figure SMS_186
He word->
Figure SMS_188
There is a syntactic dependency between->
Figure SMS_182
Representing words ∈in comments>
Figure SMS_185
The emotion polar word is w sp ,/>
Figure SMS_187
Representing words ∈in comments>
Figure SMS_189
The semantically similar word is w ss
Figure SMS_181
Word ∈in representation and comment>
Figure SMS_184
The relevant image label is tag j
Step B45: encoding the text-knowledge-image iso-composition TKIHG obtained in the step B44 into a u-order adjacency matrix A hg ,A hg Expressed as:
Figure SMS_190
wherein ,
Figure SMS_191
1 indicates that a connection relationship exists between two words, < ->
Figure SMS_192
If 0, no connection relationship exists between the two words, and u=n+m' × (t+k); the first n nodes in the graph are words in comment texts, the (n+1) th to (n+m '. Times.t) th nodes are words of related image label nodes, and the (n+m'. Times.t+1) th to (u) th nodes are words of related knowledge expansion nodes.
Step B46: image tag node words in a text-knowledge-image heterograph TKIHG are spliced to comment text r in sequence to form a new sequence, and then a pre-training model BERT is input to perform dimension reduction by using a full-connection layer to obtain an upper and lower Wen Yuyi characterization vector X of a text mode rt Expressed as:
Figure SMS_193
wherein
Figure SMS_194
/>
Step B47: representation vector X for upper and lower Wen Yuyi of text modality rt And step 27, obtaining a characterization vector X of the knowledge expansion node kg Obtaining semantic guided external knowledge representation vector X by using cross-modal attention mechanism kg′ The calculation process is as follows:
Figure SMS_195
wherein ,
Figure SMS_196
W 1 ,W 2 and />
Figure SMS_197
Is a learnable weight matrix.
Step B48: representation vector X for upper and lower Wen Yuyi of text modality rt Mapping it to an X using a self-attention mechanism kg′ Under the same feature space, a transformed upper and lower Wen Yuyi characterization vector X is obtained rt′ The calculation process is as follows:
Figure SMS_198
wherein ,
Figure SMS_199
is the upper and lower Wen Yuyi characterization vector of the mapped text mode, W 4 ,W 5 and />
Figure SMS_200
Is a learnable weight matrix.
Step B49: characterizing the transformed upper and lower Wen Yuyi to vector X rt′ And semantic indexingGuided extrinsic knowledge characterization vector X kg′ Connecting to obtain a node characterization vector C of the text-knowledge-image heterograph TKIHG hg,0 It is represented as follows:
Figure SMS_201
wherein
Figure SMS_202
Step B5: will characterize vector C sd,0 And node characterization vector C of heterogeneous graph TKIHG hg,0 Respectively inputting the text graph convolution vector C into two different L-layer graph convolution networks, respectively recording the text graph convolution network RGCN and the text-knowledge-image heterograph convolution network HGCN as comments, respectively learning and extracting syntactic dependency relationship and heterogeneous information of context semantics, image labels and external knowledge to obtain a text graph convolution characterization vector C sd,L And a isomerous map volume characterization vector C hg,L
In this embodiment, the step B5 specifically includes the following steps:
step B51: for the comment text graph rolling network RGCN, the characterization vector C obtained in the step B43 is obtained sd,0 Input first layer graph rolling network utilizing adjacency matrix A r Updating the characterization vector of each word node, and outputting C sd,1 And serves as input to the next layer of graph rolling network.
wherein ,Csd,1 Expressed as:
Figure SMS_203
wherein ,
Figure SMS_204
is the output of the ith node in the first layer graph roll-up network,/for>
Figure SMS_205
The calculation formula of (2) is as follows:
Figure SMS_206
Figure SMS_207
wherein ,
Figure SMS_208
is a weight matrix that can be learned, +.>
Figure SMS_209
Is a bias vector; reLU is an activation function; the ith node in the graph roll-up network and the ith word in comment r +.>
Figure SMS_210
Correspondingly, d i Representing the degree of the ith node, d i +1 is to prevent the calculation error from being caused when the degree of the i-th node is 0.
Step B52: for a text-knowledge-image heterograph convolution network HGCN, the node characterization vector C of the heterograph TKIHG obtained in the step B49 is calculated hg,0 Input first layer graph rolling network utilizing adjacency matrix A hg Updating the characterization vector of each node, and outputting C hg,1 And serves as input to the next layer of graph rolling network.
wherein ,Chg,1 Expressed as:
Figure SMS_211
wherein ,
Figure SMS_212
is the output of the ith node in the first layer graph roll-up network,/for>
Figure SMS_213
The calculation formula of (2) is as follows:
Figure SMS_214
Figure SMS_215
wherein
Figure SMS_216
Is a weight matrix that can be learned, +.>
Figure SMS_217
Is a bias vector; reLU is an activation function; edges between nodes in a graph rolling network represent that connection relations exist between the nodes, d i Representing the degree of the ith node, d i +1 is to prevent the calculation error from being caused when the degree of the i-th node is 0.
Step B53: respectively C sd,1 and Chg,1 The next layer of graph rolling network input to RGCN and HGCN, repeat steps B51 and B52.
Wherein, for comment text graph rolling network RGCN, the output of the first layer graph rolling network
Figure SMS_218
As input of the layer 1 (l+1) graph convolution network, obtaining a text graph convolution token vector ++after iteration is finished>
Figure SMS_219
For the text-knowledge-image heterograph convolution network HGCN, the output of the layer I graph convolution network is +.>
Figure SMS_220
As input of the layer 1 graph convolution network, obtaining the isomerous graph convolution characterization vector +.>
Figure SMS_221
L is the layer number of the graph rolling network, and L is more than or equal to 1 and less than or equal to L.
Step B6: convolving the text map with a token vector C sd,L Performing aspect shielding operation to obtain text graph convolution shielding characterization of commentsVector C mask,L Semantic representation vector X of remarks r Using the interactive attention mechanism, further enhancing the context representation with aspect information of the aggregate syntax information, obtaining an aspect enhanced token vector X of comment r ea
In this embodiment, the step B6 specifically includes the following steps:
step B61: the text chart obtained in the step B53 is rolled to represent the vector C sd,L Performing aspect masking operation, masking text graph convolution output which does not belong to aspect words, and obtaining a text graph convolution aspect characterization vector C of comments r mask,L The calculation process is as follows:
Figure SMS_222
where θ represents the start position of the aspect in the comment sentence, θ+m-1 represents the end position of the aspect in the comment sentence,
Figure SMS_223
representing a graph convolution aspect characterization vector corresponding to an ith word in the comment, and 0 represents a zero vector with a dimension d.
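A minimal sketch of the aspect masking operation of step B61; the 0-based index convention and the function name are assumptions of this sketch.

```python
import torch

def aspect_mask(c_sd_L: torch.Tensor, theta: int, m: int) -> torch.Tensor:
    """Zero out every row of the graph-convolution output that does not belong
    to the aspect words (positions theta .. theta+m-1, 0-based here)."""
    mask = torch.zeros_like(c_sd_L)
    mask[theta:theta + m] = 1.0
    return c_sd_L * mask  # C_mask,L: only the aspect rows survive

c = torch.randn(10, 64)                  # 10 words, d = 64
c_masked = aspect_mask(c, theta=3, m=2)  # aspect occupies words 3 and 4
print(c_masked[0].abs().sum().item(), c_masked[3].abs().sum().item())  # 0 vs non-zero
```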
Step B62: the semantic characterization vector X of the comment r obtained in the step B12 is obtained r And the graph convolution aspect characterization vector C of comment r obtained in step B61 mask,L Inputting an interactive attention network, enhancing the context representation by using aspect information of aggregation syntax information through an interactive attention mechanism, and obtaining an aspect enhancement characterization vector X of comments r ea The calculation formula is as follows:
Figure SMS_224
Figure SMS_225
Figure SMS_226
wherein ,(·)T Representing the transpose operation, beta i Is the attention weight of the i-th word in comment r.
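A minimal sketch of the retrieval-style interactive attention reconstructed for step B62; the function name is an assumption.

```python
import torch

def aspect_enhanced(x_r: torch.Tensor, c_mask: torch.Tensor) -> torch.Tensor:
    """Each word of the comment is scored by its dot products with the masked
    aspect rows; the softmax weights beta then re-weight the semantic vectors."""
    gamma = (x_r @ c_mask.T).sum(dim=1)  # gamma_i = sum_j (x_i)^T h_j^mask
    beta = torch.softmax(gamma, dim=0)   # attention weight of each word
    return beta @ x_r                    # X_ea: aspect-enhanced representation

x_r = torch.randn(10, 64)
c_mask = torch.randn(10, 64)
x_ea = aspect_enhanced(x_r, c_mask)
print(x_ea.shape)                        # torch.Size([64])
```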
Step B7: wrapping the isomerism map by a token vector C hg,L Enhancing the token vector X with respect to comment r, respectively ea Aspect-dependent image region characterization vector X air Using a cross-mode attention mechanism, further utilizing heterogeneous information such as image labels and external knowledge to strengthen learning text modes and emotion characteristics of the image modes, and obtaining a heterogeneous enhanced text characterization vector X hm And image characterization vector X hair Finally connect X hm and Xhair Obtaining a final characterization vector X fin
In this embodiment, the step B7 specifically includes the following steps:
Step B71: the volume characterization vector C of the isomerism map obtained in the step B53 hg,L And comment r aspect enhancement characterization vector X ea By using a cross-modal attention mechanism, the heterogeneous information such as image labels, external knowledge and the like is further utilized to enhance the learning text mode, and the heterogeneous enhanced text characterization vector X is obtained hm The calculation process is as follows:
Figure SMS_227
wherein ,
Figure SMS_228
W 9 ,W 10 and />
Figure SMS_229
A learnable weight matrix.
Step B72: characterizing vector C for a volume of heterogeneous images hg,L Aspect-dependent image region characterization vector X air Using a cross-mode attention mechanism, further utilizing heterogeneous information such as image labels, external knowledge and the like to strengthen emotion characteristics of a learning image mode, and obtaining a heterogeneous enhanced image characterization vector X hair The calculation process is as follows:
Figure SMS_230
wherein ,
Figure SMS_231
W 12 ,W 13 and />
Figure SMS_232
A learnable weight matrix.
Step B73: text token vector X for heterogeneous enhancement hm And image characterization vector X hair Performing connection operation to obtain final representation final characterization vector X fin The calculation process is as follows:
X fin =[X hm ;X hair ]。
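A minimal sketch covering the symmetric cross-modal attention of steps B71 and B72 and the concatenation of step B73, under the single-head form assumed above; Wq/Wk/Wv stand in for (W 9 , W 10 , W 11 ) on the text branch and (W 12 , W 13 , W 14 ) on the image branch.

```python
import math
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """A query modality attends over the heterogeneous graph nodes C_hg,L.
    Single-head scaled dot-product form is an assumption of this sketch."""
    def __init__(self, d: int):
        super().__init__()
        self.Wq = nn.Linear(d, d, bias=False)
        self.Wk = nn.Linear(d, d, bias=False)
        self.Wv = nn.Linear(d, d, bias=False)
        self.d = d

    def forward(self, query: torch.Tensor, c_hg: torch.Tensor) -> torch.Tensor:
        q = self.Wq(query)                    # X_ea or X_air as the query
        k, v = self.Wk(c_hg), self.Wv(c_hg)   # heterogeneous graph node vectors
        attn = torch.softmax(q @ k.T / math.sqrt(self.d), dim=-1)
        return attn @ v                       # X_hm or X_hair

d = 64
cma = CrossModalAttention(d)
x_hm = cma(torch.randn(1, d), torch.randn(20, d))    # text branch (B71)
x_hair = cma(torch.randn(1, d), torch.randn(20, d))  # image branch (B72)
x_fin = torch.cat([x_hm, x_hair], dim=-1)            # step B73 concatenation
print(x_fin.shape)                                   # torch.Size([1, 128])
```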
step C: inputting the user comments and related images and related aspect words of the product or service into a trained deep learning network model to obtain the emotion polarity of the user comments and related images aiming at the specific aspect in the product or service.
Step B8: will ultimately characterize vector X fin And inputting a final prediction layer, calculating the gradient of each parameter in the deep learning network model by using a back propagation method according to the target loss function loss, and updating each parameter by using a random gradient descent method.
Step B9: and when the iteration change of the loss value generated by the deep learning network model is smaller than a given threshold value or the maximum iteration number is reached, terminating the training process of the deep learning network model.
As shown in fig. 3, this embodiment further provides a multi-mode comment emotion analysis system adopting the above method, including: the system comprises a data collection module, a preprocessing module, a coding module, a network training module and an emotion analysis module.
The data collection module is used for extracting user comments and related images, aspect words in the comments and position information of the aspect words, marking emotion polarities of the aspects and constructing a training set.
The preprocessing module is used for preprocessing training samples in a training set, and comprises word segmentation processing, stop word removal, image size adjustment, syntactic dependency analysis, selection of a related knowledge triplet set and an image tag set and generation of a text-knowledge-image heterogram.
The encoding module is used for searching word vectors of knowledge words of the knowledge triplet set in the pre-trained knowledge map word vector matrix to obtain knowledge word characterization vectors of the knowledge triplet set.
The network training module is used for inputting the processed user comments, related images, aspects, text-knowledge-image heterogeneous graphs and the knowledge word representation vectors of the knowledge triple set into the deep learning network to obtain the final representation vectors of the multi-mode comments, computing the loss from the probability that a representation vector belongs to each category and the labels in the training set, and training the whole deep learning network with loss minimization as the objective to obtain the deep learning network model based on the knowledge graph and the heterogeneous graph convolution network.
The emotion analysis module extracts aspects in the input user comments by using an NLP tool, analyzes and processes the input comments, images and aspects by using a trained deep learning network model based on a knowledge graph and a heterogeneous graph convolution network, and outputs emotion evaluation polarities related to specific aspects in the user comments and related images.
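For illustration, a structural sketch of how the five modules could hand data to one another; every class and method name here is hypothetical, chosen only to mirror the module descriptions above.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    comment: str
    image_path: str
    aspect: str
    polarity: str | None = None  # labelled only in the training set

class SentimentAnalysisSystem:
    """Hypothetical wiring of the five modules described above."""
    def __init__(self, collector, preprocessor, encoder, trainer, analyzer):
        self.collector = collector        # data collection module
        self.preprocessor = preprocessor  # segmentation, resizing, parsing, graph building
        self.encoder = encoder            # knowledge word vector lookup
        self.trainer = trainer            # network training module
        self.analyzer = analyzer          # emotion analysis module

    def train(self):
        samples = self.collector.build_training_set()
        batches = [self.preprocessor.process(s) for s in samples]
        vectors = [self.encoder.encode_knowledge(b) for b in batches]
        return self.trainer.fit(batches, vectors)

    def predict(self, sample: Sample, model):
        aspect = self.analyzer.extract_aspect(sample.comment)
        return self.analyzer.classify(model, sample, aspect)
```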
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention in any way; any person skilled in the art may modify or alter the disclosed technical content into equivalent embodiments. However, any simple modification, equivalent change or variation of the above embodiments made according to the technical substance of the present invention still falls within the protection scope of the technical solution of the present invention.

Claims (10)

1. The multi-mode comment emotion analysis method based on heterogram convolution is characterized by comprising the following steps of:
step A: collecting user comments and related images, extracting aspect words of products or services related to the user comments, and labeling emotion polarities of the user comments aiming at specific aspects of the products or services so as to construct a training set DB;
And (B) step (B): training a deep learning network model DLM based on a knowledge graph and a heterogeneous graph convolution network by using a training set DB, wherein the training set DB is used for analyzing emotion polarities of user comments and related images on specific aspects of products or services;
step C: inputting the user comments and related images and related aspect words of the product or service into a trained deep learning network model to obtain the emotion polarity of the user comments and related images aiming at the specific aspect in the product or service.
2. The multi-modal comment emotion analysis method based on heterographic convolution according to claim 1, wherein the step B specifically includes the steps of:
step B1: coding each training sample in the training set DB to obtain a comment semantic representation vector X r Semantic representation vector X of aspects a Syntax dependency adjacency matrix A r Image region characterization vector X im
Step B2: selecting a Set of knowledge triples from a knowledge graph that are relevant to a comment context according to a dynamic knowledge selection mechanism skt Then coding the knowledge words to obtain a Set skt Knowledge word characterization vector X kg
Step B3: representing vector X by semantics of aspects a And image region characterization vector X im Obtaining aspect-dependent image region characterization vector X using interactive attention mechanisms air The method comprises the steps of carrying out a first treatment on the surface of the Acquiring a Set of tag t tuples related to comment context through an image tag selection mechanism sit
Step B4: for X a Averaging pooling to obtain aspect average characterization vector
Figure FDA0004068450380000011
For X r Performing position coding to obtain a comment characterization vector X with enhanced positions pw By connection X pw and />
Figure FDA0004068450380000012
Obtaining a characterization vector C sd,O The method comprises the steps of carrying out a first treatment on the surface of the Generating a text-knowledge-image heterogeneous graph TKIHG according to a text-knowledge-image heterogeneous composition strategy to obtain an adjacency matrix A of the text-knowledge-image heterogeneous graph TKIHG hg Then the nodes are encoded by using a cross-modal attention mechanism to obtain a node characterization vector C of the heterogeneous graph TKIHG hg,0
Step B5: will characterize vector C sd,0 And node characterization vector C of heterogeneous graph TKIHG hg,0 Respectively inputting the text graph convolution vector C into two different L-layer graph convolution networks, respectively recording the text graph convolution network RGCN and the text-knowledge-image heterograph convolution network HGCN as comments, respectively learning and extracting syntactic dependency relationship and heterogeneous information of context semantics, image labels and external knowledge to obtain a text graph convolution characterization vector C sd,L And a isomerous map volume characterization vector C hg,L
Step B6: convolving the text map with a token vector C sd,L Performing aspect masking operation to obtain a text graph convolution mask characterization vector C of comments mask,L Semantic representation vector X of remarks r Using the interactive attention mechanism, further enhancing the context representation with aspect information of the aggregate syntax information, obtaining an aspect enhanced token vector X of comment r ea
Step B7: wrapping the isomerism map by a token vector C hg,L Enhancing the token vector X with respect to comment r, respectively ea Aspect-dependent image region characterization vector X air Using a cross-mode attention mechanism, further utilizing heterogeneous information such as image labels and external knowledge to strengthen learning text modes and emotion characteristics of the image modes, and obtaining a heterogeneous enhanced text characterization vector X hm And image characterization vector X hair Finally connect X hm and Xhair Obtaining a final characterization vector X fin
Step B8: will ultimately characterize vector X fin Inputting a final prediction layer, calculating the gradient of each parameter in the deep learning network model by using a back propagation method according to a target loss function loss, and updating each parameter by using a random gradient descent method;
step B9: and when the iteration change of the loss value generated by the deep learning network model is smaller than a given threshold value or the maximum iteration number is reached, terminating the training process of the deep learning network model.
3. The multi-modal comment emotion analysis method based on heterographic convolution according to claim 2, wherein the step B1 specifically includes the steps of:
step B11: traversing the training set DB, performing word segmentation on the user comments and aspects therein, removing stop words, and resizing the image related to each comment to a 3 × 224 × 224 pixel image, wherein each training sample in DB is expressed as d = (r, im, a, p);

where r is a user comment, im is the resized image related to the user comment, a is an aspect word or phrase of the product or service concerned, extracted from the user comment r, and p ∈ {positive, negative, neutral} is the sentiment polarity of the comment with respect to the aspect;
comment r is expressed as:

$$r = \left(w_1^{r}, w_2^{r}, \ldots, w_n^{r}\right)$$

where $w_i^{r}$ is the i-th word in comment r, i = 1, 2, …, n, and n is the number of words in comment r;

aspect a is expressed as:

$$a = \left(w_1^{a}, w_2^{a}, \ldots, w_m^{a}\right)$$

where $w_i^{a}$ is the i-th word in aspect a, i = 1, 2, …, m, and m is the number of words in aspect a;
step B12: comment on step B11
Figure FDA0004068450380000025
Coding by using a pre-training model BERT and reducing the dimension by using a full-connection layer to obtain a semantic representation vector X of the comment r r Expressed as:
Figure FDA0004068450380000026
wherein ,
Figure FDA0004068450380000027
For comment the i-th word->
Figure FDA0004068450380000028
The corresponding semantic representation vector, d, represents the output dimension;
step B13: aspect obtained for step B11
Figure FDA0004068450380000029
Encoding by using a pre-training model BERT and reducing dimension by using a full-connection layer to obtain a semantic representation vector X of the aspect a a Expressed as:
Figure FDA00040684503800000210
wherein ,
Figure FDA00040684503800000211
representing aspect the i-th word->
Figure FDA00040684503800000212
The corresponding word vector, d, represents the output dimension;
step B14: the adjusted image is encoded and smoothed by using a pre-training model ResNet-152, and an image region representation vector is obtained after full-connection layer dimension reduction
Figure FDA0004068450380000031
Wherein z represents the feature map size of the output and d represents the output dimension;
step B15: carrying out syntactic dependency analysis on the comment r to obtain a syntactic dependency tree SDT;
Figure FDA0004068450380000032
wherein ,
Figure FDA0004068450380000033
representing words ∈in comments>
Figure FDA0004068450380000034
He word->
Figure FDA0004068450380000035
A syntactic dependency exists between the two;
step B16: encoding the syntax-dependent tree SDT obtained in the step B15 into an n-order adjacency matrix A r ,A r Expressed as:
Figure FDA0004068450380000036
wherein ,
Figure FDA0004068450380000037
a1 indicates the word ++in the comment>
Figure FDA0004068450380000038
He word->
Figure FDA0004068450380000039
There is a syntactic dependency between->
Figure FDA00040684503800000310
A0 indicates the word->
Figure FDA00040684503800000311
He word->
Figure FDA00040684503800000312
There is no syntactic dependency between them.
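For illustration of steps B15 and B16, a minimal sketch using spaCy (assuming the en_core_web_sm model is installed); treating the dependency tree as undirected is an assumption of this sketch, as the claim does not specify edge direction.

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed parser; any dependency parser works

def dependency_adjacency(comment: str) -> np.ndarray:
    """Parse the comment and encode the dependency tree as an n-order 0/1 matrix."""
    doc = nlp(comment)
    n = len(doc)
    adj = np.zeros((n, n), dtype=np.float32)
    for token in doc:
        if token.i != token.head.i:           # every dependency edge
            adj[token.i, token.head.i] = 1.0
            adj[token.head.i, token.i] = 1.0  # treat the tree as undirected
    return adj

print(dependency_adjacency("The battery life of this phone is great"))
```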
4. The multi-modal comment emotion analysis method based on heterographic convolution according to claim 3, wherein said step B2 specifically includes the steps of:
Step B21: selecting a context word with edge connection with the aspect word from the syntax dependency tree SDT obtained in the step B15, and combining the context word with the aspect a to form a seed node Set with the length of m' sn Expressed as:
Figure FDA00040684503800000313
wherein ,
Figure FDA00040684503800000314
is an aspect word or a context word with edge connection, m is less than or equal to m' is less than or equal to n;
step B22: for each node in the seed node set, selecting its emotion polarity word and 5 words similar to its semantic meaning from the knowledge graph to respectively form 5 candidate knowledge triples CKT of the seed node, which are expressed as:
CKT=<w sn ,w sp ,w ss >
wherein ,wsn Is a seed node word, w sp Is emotion polar word, w ss Is a semantically similar word;
step B23: constructing each candidate knowledge triplet CKT into one candidate knowledge sentence r cks Inputting a pre-training model BERT and taking an average value to obtain an average semantic representation vector X cks ;r cks Expressed as:
r cks = "word' w sn 'emotional polarity is' w sp 'the word most similar to its semantics is' w ss ”’
Step B24: semantic representation vector X of comments r Average semantic representation vector X is obtained after averaging rm Then calculate X rm and Xcks Cosine similarity between the two to obtain comment text r and candidate knowledge sentences r cks The similarity score of (2) is calculated as follows:
Similarity_Score(r,r cks )=CosineSimilarity(X rm ,X cks )
wherein ,
Figure FDA0004068450380000041
step B25: b24, calculating similarity scores of candidate knowledge sentences constructed by all candidate knowledge triples CKT, and selecting the original candidate knowledge triples of the top k candidate knowledge sentences with the highest scores to form a knowledge triplet set of the seed node as the external knowledge of the seed node most relevant to the context;
step B26: for seed node Set sn Repeating the steps to obtain a Set containing knowledge triples of all seed nodes skt Expressed as:
Figure FDA0004068450380000042
wherein ,
Figure FDA0004068450380000043
is the i seed node->
Figure FDA0004068450380000044
A knowledge triplet set;
step B27: set of knowledge triples by knowledge graph embedding skt All emotion polarity words and semantic similar words of the tree are encoded to obtain node characterization vectors of the tree as follows
Figure FDA0004068450380000045
wherein ,/>
Figure FDA0004068450380000046
The knowledge representation vector corresponding to the ith emotion polarity word or semantic similarity word is obtained by training a knowledge word vector matrix in advance
Figure FDA0004068450380000047
Where d represents the dimension of the knowledge word vector and V is the number of words in which the knowledge word is embedded.
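A minimal sketch of the scoring of steps B23-B25, assuming the average comment vector X rm and the candidate-sentence vectors X cks have already been computed by averaging BERT outputs as described; function and variable names are illustrative.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # CosineSimilarity with a small epsilon against zero norms
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def top_k_triples(x_rm: np.ndarray, candidates, k: int):
    """Score each candidate knowledge sentence vector against X_rm and keep
    the original triples of the top-k highest-scoring sentences (step B25)."""
    scored = [(cosine(x_rm, x_cks), triple) for triple, x_cks in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [triple for _, triple in scored[:k]]

x_rm = np.random.randn(768)  # average comment vector (toy)
cands = [(("battery", "positive", "power"), np.random.randn(768)) for _ in range(5)]
print(top_k_triples(x_rm, cands, k=2))
```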
5. The multi-modal comment emotion analysis method based on heterographic convolution according to claim 4, wherein said step B3 specifically includes the steps of:
Step B31: semantic representation vector X for a pair of aspects a And image region characterization vector X im Obtaining aspect-dependent image region characterization vector X using interactive attention mechanisms air ,X air The calculation process of (2) is as follows:
Figure FDA0004068450380000048
/>
Figure FDA0004068450380000049
Figure FDA00040684503800000410
wherein ,
Figure FDA00040684503800000411
(·) T representing a transpose operation;
step B32: inputting the image im into a pre-training model ResNet-152 to obtain 10 related image tags tag= { tag 1 ,…,tag i ,…,tag 10 And then adding the same to the seed node Set obtained in the step 21 sn Each seed node word in the tree is respectively input into a pre-training model BERT to obtain semantic characterization vectors of the seed node words
Figure FDA0004068450380000051
and />
Figure FDA0004068450380000052
Figure FDA0004068450380000053
wherein ,/>
Figure FDA0004068450380000054
Is seed node->
Figure FDA0004068450380000055
Semantic token vector of>
Figure FDA0004068450380000056
Is the ith image tag i Semantic token vectors of (a);
step B33: calculating an image tag i Semantic characterization vectors and seed nodes of (a)
Figure FDA0004068450380000057
Cosine similarity of semantic representation vectors of (2) to obtain seed node +.>
Figure FDA0004068450380000058
And image tag i The similarity score between the two is calculated as follows:
Figure FDA0004068450380000059
step B34: calculating seed nodes according to step B33
Figure FDA00040684503800000510
And similarity scores of 10 labels, and selecting top t image labels with highest scores and seed nodes +.>
Figure FDA00040684503800000511
Forming a tag t tuple TT together; expressed as:
Figure FDA00040684503800000512
step B35: for seed node Set sn Repeating the steps to obtain a Set containing the label t-tuple of all the seed nodes sit Expressed as:
Set sit ={TT 1 ,…,TT i ,…,TT m′ }
wherein ,TTi Is the ith seed node
Figure FDA00040684503800000513
Is included in the tag t tuple of (c).
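A minimal sketch of the tag selection of steps B33 and B34; the vectors are assumed to be precomputed BERT semantic representations, and the function name is illustrative.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def tag_t_tuple(seed_word: str, x_sn: np.ndarray, tags, x_tags, t: int):
    """Rank the 10 candidate image tags by cosine similarity with the seed
    node vector and keep the top-t, forming the tag t-tuple TT."""
    scores = [(cosine(x_sn, x_tag), tag) for tag, x_tag in zip(tags, x_tags)]
    scores.sort(reverse=True, key=lambda s: s[0])
    return (seed_word, *[tag for _, tag in scores[:t]])

tags = [f"tag{i}" for i in range(10)]
x_tags = [np.random.randn(768) for _ in tags]
print(tag_t_tuple("battery", np.random.randn(768), tags, x_tags, t=3))
```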
6. The multi-modal comment emotion analysis method based on heterographic convolution according to claim 5, wherein said step B4 specifically includes the steps of:
step B41: semantic representation vector X for a pair of aspects a Carrying out average pooling operation to obtain
Figure FDA00040684503800000514
The calculation formula is as follows:
Figure FDA00040684503800000515
wherein ,
Figure FDA00040684503800000516
step B42: semantic representation vector X for comments r Performing position coding to obtain a comment position-enhanced characterization vector X pw ,X pw Expressed as:
Figure FDA00040684503800000517
Figure FDA00040684503800000518
wherein ,
Figure FDA00040684503800000519
reinforcing a characterization vector for the position corresponding to the ith word in the comment r, wherein "·" represents that real numbers and the vector are multiplied, and pw i The calculation mode of the position weight corresponding to the i-th word in the comment r is as follows:
Figure FDA0004068450380000061
wherein θ and θ+m-1 represent the starting and ending positions of aspect a in the comment r, respectively;
step B43: averaging the aspect characterization vectors obtained in step B41
Figure FDA0004068450380000062
X obtained in step B42 pw Connecting to obtain a characterization vector C sd,0 ,/>
Figure FDA0004068450380000063
Expressed as:
Figure FDA0004068450380000064
wherein ,
Figure FDA0004068450380000065
i=1, 2, …, n, ", which is a token vector corresponding to the i-th word in comment r that is input into the graph convolution network; "means vector join operations;
step B44: based on the Set obtained in step 26 skt Knowledge triplet sets of each seed node in the tree, and emotion polarity w in each knowledge triplet in turn sp And semantically similar words w ss Respectively serving as knowledge expansion nodes and adding an edge to be connected with the relevant seed nodes in the syntax dependency tree SDT obtained in the step B15; similarly, a Set is obtained based on step 35 sit The label t tuple of each seed node in the tree is used for sequentially connecting each image label serving as an image label expansion node and adding an edge with the seed node related to the syntax dependency tree SDT; to avoid redundancy, if a new knowledge expansion node or image tag expansion node already exists in the graph, only one edge is added between it and the relevant seed node; finally, obtaining a text-knowledge-image heterogeneous graph TKIHG;
Figure FDA0004068450380000066
wherein ,
Figure FDA0004068450380000067
representing words ∈in comments>
Figure FDA0004068450380000068
He word->
Figure FDA0004068450380000069
There is a syntactic dependency between->
Figure FDA00040684503800000610
Representing words ∈in comments>
Figure FDA00040684503800000611
The emotion polar word is w sp ,/>
Figure FDA00040684503800000612
Representing words ∈in comments>
Figure FDA00040684503800000613
The semantically similar word is w ss ,/>
Figure FDA00040684503800000614
Word ∈in representation and comment>
Figure FDA00040684503800000615
The relevant image label is tag j
Step B45: encoding the text-knowledge-image iso-composition TKIHG obtained in the step B44 into a u-order adjacency matrix A hg ,A hg Expressed as:
Figure FDA00040684503800000616
wherein ,
Figure FDA00040684503800000617
1 indicates that a connection relationship exists between two words, < ->
Figure FDA00040684503800000618
If 0, no connection relationship exists between the two words, and u=n+m' × (t+k); the first n nodes in the graph are words in comment texts, the (n+1) th to (n+m '. Times.t) th nodes are words of related image label nodes, and the (n+m'. Times.t+1) th to (u) th nodes are words of related knowledge expansion nodes; />
Step B46: image tag node words in a text-knowledge-image heterograph TKIHG are spliced to comment text r in sequence to form a new sequence, and then a pre-training model BERT is input to perform dimension reduction by using a full-connection layer to obtain an upper and lower Wen Yuyi characterization vector X of a text mode rt Expressed as:
Figure FDA0004068450380000071
wherein
Figure FDA0004068450380000072
Step B47: representation vector X for upper and lower Wen Yuyi of text modality rt And step 27, obtaining a characterization vector X of the knowledge expansion node kg Obtaining semantic guided external knowledge representation vector X by using cross-modal attention mechanism kg′ The calculation process is as follows:
Figure FDA0004068450380000073
wherein ,
Figure FDA0004068450380000074
W 1 ,W 2 and />
Figure FDA0004068450380000075
Is a learnable weight matrix;
step B48: representation vector X for upper and lower Wen Yuyi of text modality rt Mapping it to an X using a self-attention mechanism kg′ Under the same feature space, a transformed upper and lower Wen Yuyi characterization vector X is obtained rt′ The calculation process is as follows:
Figure FDA0004068450380000076
wherein ,
Figure FDA0004068450380000077
is the upper and lower Wen Yuyi characterization vector of the mapped text mode, W 4 ,W 5 And
Figure FDA0004068450380000078
is a learnable weight matrix;
step B49: characterizing the transformed upper and lower Wen Yuyi to vector X rt′ And semantically guided external knowledge representation vector X kg′ Connecting to obtain a node characterization vector C of the text-knowledge-image heterograph TKIHG hg,0 It is represented as follows:
Figure FDA0004068450380000079
wherein
Figure FDA00040684503800000710
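For illustration of the position weighting in step B42 above, a minimal sketch; the piecewise form (zero weight inside the aspect span) is the reconstruction assumed above, not a confirmed detail of the source.

```python
import numpy as np

def position_weights(n: int, theta: int, m: int) -> np.ndarray:
    """Position weights decaying with distance from the aspect span
    [theta, theta+m-1] (1-based positions, as in the formulas above)."""
    pw = np.zeros(n)
    for i in range(1, n + 1):
        if i < theta:
            pw[i - 1] = 1 - (theta - i) / n
        elif i <= theta + m - 1:
            pw[i - 1] = 0.0  # aspect words themselves carry no weight (assumed)
        else:
            pw[i - 1] = 1 - (i - (theta + m - 1)) / n
    return pw

# X_pw: each word vector scaled by its position weight
x_r = np.random.randn(10, 64)
pw = position_weights(n=10, theta=4, m=2)
x_pw = pw[:, None] * x_r
print(np.round(pw, 2))
```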
7. The multi-modal comment emotion analysis method based on heterographic convolution according to claim 6, wherein said step B5 specifically includes the steps of:
step B51: for the comment text graph rolling network RGCN, the characterization vector C obtained in the step B43 is obtained sd,0 Input first layer graph rolling network utilizing adjacency matrix A r Updating the characterization vector of each word node, and outputting C sd,1 And is used as the input of the next layer graph rolling network;
wherein ,Csd,1 Expressed as:
Figure FDA0004068450380000081
wherein ,
Figure FDA0004068450380000082
is the output of the ith node in the first layer graph roll-up network,/for>
Figure FDA0004068450380000083
The calculation formula of (2) is as follows:
Figure FDA0004068450380000084
/>
Figure FDA0004068450380000085
wherein ,
Figure FDA0004068450380000086
is a weight matrix that can be learned, +.>
Figure FDA0004068450380000087
Is a bias vector; reLU is an activation function; the ith node in the graph roll-up network and the ith word in comment r +.>
Figure FDA0004068450380000088
Correspondingly, d i Representing the degree of the ith node, d i +1 is to prevent calculation errors when the degree of the ith node is 0;
step B52: for a text-knowledge-image heterograph convolution network HGCN, the node characterization vector C of the heterograph TKIHG obtained in the step B49 is calculated hg,0 Input first layer graph rolling network utilizing adjacency matrix A hg Updating the characterization vector of each node, and outputting C hg,1 And is used as the input of the next layer graph rolling network;
wherein ,Chg,1 Expressed as:
Figure FDA0004068450380000089
wherein ,
Figure FDA00040684503800000810
is the output of the ith node in the first layer graph roll-up network,/for>
Figure FDA00040684503800000811
The calculation formula of (2) is as follows:
Figure FDA00040684503800000812
Figure FDA00040684503800000813
wherein
Figure FDA00040684503800000814
Is a weight matrix that can be learned, +.>
Figure FDA00040684503800000815
Is a bias vector; reLU is an activation function; edge generation between nodes in graph rolling networkThe table nodes have a connection relationship, d i Representing the degree of the ith node, d i +1 is to prevent calculation errors when the degree of the ith node is 0;
step B53: respectively C sd,1 and Chg,1 Inputting to the next layer graph rolling network of RGCN and HGCN, repeating steps B51 and B52;
wherein, for comment text graph rolling network RGCN, the output of the first layer graph rolling network
Figure FDA00040684503800000816
As input of the layer 1 (l+1) graph convolution network, obtaining a text graph convolution token vector ++after iteration is finished>
Figure FDA0004068450380000091
For the text-knowledge-image heterograph convolution network HGCN, the output of the layer I graph convolution network is +. >
Figure FDA0004068450380000092
As input of the layer 1 graph convolution network, obtaining the isomerous graph convolution characterization vector +.>
Figure FDA0004068450380000093
L is the layer number of the graph rolling network, and L is more than or equal to 1 and less than or equal to L.
8. The multi-modal comment emotion analysis method based on heterographic convolution according to claim 7, wherein said step B6 specifically includes the steps of:
step B61: the text chart obtained in the step B53 is rolled to represent the vector C sd,L Performing aspect masking operation, masking text graph convolution output which does not belong to aspect words, and obtaining a text graph convolution aspect characterization vector C of comments r mask,L The calculation process is as follows:
Figure FDA0004068450380000094
where θ represents the start position of the aspect in the comment sentence, θ+m-1 represents the end position of the aspect in the comment sentence,
Figure FDA0004068450380000095
representing a graph convolution aspect characterization vector corresponding to an ith word in the comment, wherein 0 represents a zero vector with a dimension d;
step B62: the semantic characterization vector X of the comment r obtained in the step B12 is obtained r And the graph convolution aspect characterization vector C of comment r obtained in step B61 mask,L Inputting an interactive attention network, enhancing the context representation by using aspect information of aggregation syntax information through an interactive attention mechanism, and obtaining an aspect enhancement characterization vector X of comments r ea The calculation formula is as follows:
Figure FDA0004068450380000096
Figure FDA0004068450380000097
Figure FDA0004068450380000098
wherein ,(·)T Representing the transpose operation, beta i Is the attention weight of the i-th word in comment r.
9. The multi-modal comment emotion analysis method based on heterographic convolution according to claim 8, wherein said step B7 specifically includes the steps of:
step B71: the volume characterization vector C of the isomerism map obtained in the step B53 hg,L And comment r aspect enhancement characterization vector X ea Further utilizing image tags and external knowledge and other heterogeneous forms by using a cross-modal attention mechanismInformation is used for enhancing the learning text mode to obtain a heterogeneous enhanced text characterization vector X hm The calculation process is as follows:
Figure FDA0004068450380000101
wherein ,
Figure FDA0004068450380000102
W 9 ,W 10 and />
Figure FDA0004068450380000103
A learnable weight matrix;
step B72: characterizing vector C for a volume of heterogeneous images hg,L Aspect-dependent image region characterization vector X air Using a cross-mode attention mechanism, further utilizing heterogeneous information such as image labels, external knowledge and the like to strengthen emotion characteristics of a learning image mode, and obtaining a heterogeneous enhanced image characterization vector X hair The calculation process is as follows:
Figure FDA0004068450380000104
wherein ,
Figure FDA0004068450380000105
W 12 ,W 13 and />
Figure FDA0004068450380000106
A learnable weight matrix;
step B73: text token vector X for heterogeneous enhancement hm And image characterization vector X hair Performing connection operation to obtain final representation final characterization vector X fin The calculation process is as follows:
X fin =[X hm ;X hair ]。
10. A multi-modal comment emotion analysis system employing the method according to any one of claims 1-9, comprising:
the data collection module is used for extracting user comments and related images, aspect words in the comments and position information of the aspect words, marking emotion polarities of the aspects and constructing a training set;
the preprocessing module is used for preprocessing training samples in a training set, and comprises word segmentation processing, stop word removal, image size adjustment, syntactic dependency analysis, selection of a related knowledge triplet set and an image tag set and generation of a text-knowledge-image heterogram;
the coding module is used for searching word vectors of knowledge words of the knowledge triplet set in the pre-trained knowledge map word vector matrix to obtain knowledge word characterization vectors of the knowledge triplet set;
the network training module is used for inputting the processed user comments, related images, aspects, text-knowledge-image heterogeneous graphs and the knowledge word representation vectors of the knowledge triple set into the deep learning network to obtain the final representation vectors of the multi-modal comments, computing the loss from the probability that a representation vector belongs to each category and the labels in the training set, and training the whole deep learning network with loss minimization as the objective to obtain the deep learning network model based on the knowledge graph and the heterogeneous graph convolution network; and
And the emotion analysis module is used for extracting aspects in the input user comments by using an NLP tool, then analyzing and processing the input comments, images and aspects by using a trained deep learning network model based on the knowledge graph and the heterogeneous graph convolution network, and outputting emotion evaluation polarities related to specific aspects in the user comments and related images.
CN202310083964.6A 2023-01-31 2023-01-31 Multimode comment emotion analysis method and system based on heterogram convolution Pending CN116258147A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310083964.6A CN116258147A (en) 2023-01-31 2023-01-31 Multimode comment emotion analysis method and system based on heterogram convolution

Publications (1)

Publication Number Publication Date
CN116258147A true CN116258147A (en) 2023-06-13

Family

ID=86680360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310083964.6A Pending CN116258147A (en) 2023-01-31 2023-01-31 Multimode comment emotion analysis method and system based on heterogram convolution

Country Status (1)

Country Link
CN (1) CN116258147A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117573988A (en) * 2023-10-17 2024-02-20 广东工业大学 Offensive comment identification method based on multi-modal deep learning
CN117573988B (en) * 2023-10-17 2024-05-14 广东工业大学 Offensive comment identification method based on multi-modal deep learning



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination