CN111259141A - Social media corpus emotion analysis method based on multi-model fusion - Google Patents
- Publication number
- CN111259141A (application CN202010030785.2A)
- Authority
- CN
- China
- Prior art keywords
- text
- image
- data
- corpus
- emotion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/55—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a social media corpus sentiment analysis method based on multi-model fusion. Data are collected from social media with the pyspider crawler framework, and the crawled data set is processed and split into three categories: text only, images only, and both text and images. The invention processes the corpus with a cross-media approach. For the text in the corpus, the SO-PMI algorithm is used to build a sentiment dictionary and to classify the pointwise mutual information as positive, neutral, or negative; because SO-PMI alone handles Chinese words and phrases poorly, a similarity distance between words replaces PMI and a new formula is constructed. For image or video corpora, a joint visual-text modeling method is used to parse the meaning of the image and derive its sentiment. Finally, the text-based result and the vision-based result are fused with weights to obtain the final sentiment analysis result.
Description
Technical Field
The invention belongs to the field of sentiment analysis and relates to a social media corpus sentiment analysis method based on multi-model fusion.
Background Art
In recent years, a large number of social platforms and applications have emerged, such as Weibo, WeChat, and QQ, and they have greatly enriched people's lives. More and more people actively share information and express their opinions and feelings on these platforms, so each platform gradually accumulates a large amount of corpus data such as images, text, and video. Analyzing the emotions hidden in this information can benefit online marketing, crisis public relations, monitoring of public opinion and illegal behavior, and detection of warning signs such as potential depression or suicide risk. Sentiment analysis of platform content classifies a user's corpus into three emotional tendencies: positive, negative, and neutral. Many methods have already achieved good results on single-modality analysis of either images or text. However, single-feature sentiment analysis has clear limitations: platforms with large user bases such as Weibo, Facebook, and Twitter all support posts that combine images and text, and most existing methods cannot analyze such mixed posts comprehensively, which leads to misjudgments. For the diverse corpus data on social platforms, the accuracy and comprehensiveness of sentiment analysis still need to be improved.
The multi-model fusion social media corpus sentiment analysis method of the present invention avoids the shortcomings of single-feature sentiment analysis by analyzing images and text jointly, making the analysis more accurate and more widely applicable. Semantic analysis of social media information over this dual corpus improves the accuracy and comprehensiveness of sentiment analysis.
Summary of the Invention
The purpose of the present invention is to propose a social media corpus sentiment analysis method based on multi-model fusion. The experimental data are obtained from social media with the pyspider crawler framework; the crawled data set is processed and split into three categories: text only, images only, and both text and images. The invention focuses on the case where both text and image information are present, and the corpora of the other two cases serve to verify the robustness of the method. First, the corpus is identified and assigned to one of the three categories above; regardless of the category, each item is processed as a corpus containing both text and image information, which ensures that sentiment analysis remains reasonable for any kind of user corpus and guarantees the robustness of the model. For the text in the corpus, the SO-PMI algorithm (semantic orientation pointwise mutual information) is used to build a sentiment dictionary and to classify the corpus as positive, neutral, or negative; because SO-PMI alone cannot handle Chinese words and phrases flexibly, a similarity distance between words is used instead to build a new sentiment dictionary. Second, for images (including collections of pictures and video frames), a joint visual-text modeling algorithm parses the meaning of the image and derives its emotional tendency. Finally, the text analysis result and the image analysis result are fused with weights to obtain the final sentiment analysis result.
To achieve the above purpose, the technical solution adopted by the present invention is a social media corpus sentiment analysis method based on multi-model fusion, which comprises the following steps:
Step 1: Data preprocessing
The data are obtained from social platforms such as Sina Weibo by a crawler. Irrelevant data such as advertisements are filtered out, and only subjective blog posts written by users are retained. The filtered text is segmented with the jieba tokenizer. The segmented text still contains many meaningless tokens, so, to ease later model training, they are filtered with a stop-word list, namely the Harbin Institute of Technology stop-word list, which yields the preprocessed text. To simplify processing of the picture data, each picture is normalized to 256 x 256 pixels.
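A minimal preprocessing sketch along these lines is shown below; it assumes jieba and Pillow are installed, and the stop-word file name and input paths are hypothetical placeholders for the crawled data.

```python
# A minimal sketch of the text/image preprocessing described above.
# Assumptions: jieba and Pillow are installed; "hit_stopwords.txt" and the
# input paths are placeholders for the crawled data.
import jieba
from PIL import Image

def load_stopwords(path="hit_stopwords.txt"):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def preprocess_text(post, stopwords):
    # Segment the post with jieba, then drop stop words and empty tokens.
    tokens = jieba.lcut(post)
    return [t for t in tokens if t.strip() and t not in stopwords]

def normalize_image(src_path, dst_path, size=(256, 256)):
    # Resize every crawled picture to 256 x 256 pixels.
    Image.open(src_path).convert("RGB").resize(size).save(dst_path)

if __name__ == "__main__":
    sw = load_stopwords()
    print(preprocess_text("今天的天气真好，心情很愉快！", sw))
    normalize_image("raw/post_001.jpg", "clean/post_001.jpg")
```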
Step 2: SO-PMI model training on the text corpus
The words of the text obtained in step 1 are labeled with sentiment, again in three categories: positive, negative, and neutral. 70% of the data are used for model training and 30% for testing and validation. First, the 70% of segmented, stop-word-filtered sentiment words are fed to the Word2vec tool to obtain an expanded sentiment dictionary. The semantic orientation pointwise mutual information algorithm (SO-PMI) then uses the distances between words and the sentiment dictionary to decide which class a word belongs to. Finally, the influence of negation words, degree adverbs, interjections, rhetorical sentences, and emoticons is taken into account, all factors are weighed, and the sentiment tendency of the text content is computed to obtain the classification result.
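A sketch of SO-PMI-style scoring with the similarity-distance substitution follows. It assumes a gensim Word2Vec model trained on the segmented corpus; the seed words, threshold, and function names are illustrative assumptions, not values taken from the patent.

```python
# Sketch of SO-PMI-style scoring where cosine similarity from a Word2Vec
# model stands in for the PMI term, as described above. Seed lists and
# thresholds are illustrative assumptions.
from gensim.models import Word2Vec

POS_SEEDS = ["开心", "高兴", "喜欢"]   # example positive seed words
NEG_SEEDS = ["难过", "讨厌", "生气"]   # example negative seed words

def so_score(model, word):
    """Sum of similarities to positive seeds minus negative seeds."""
    def sim(seed):
        try:
            return model.wv.similarity(word, seed)
        except KeyError:          # out-of-vocabulary word
            return 0.0
    return sum(sim(s) for s in POS_SEEDS) - sum(sim(s) for s in NEG_SEEDS)

def classify_text(model, tokens, eps=0.05):
    score = sum(so_score(model, t) for t in tokens)
    if score > eps:
        return 1      # positive
    if score < -eps:
        return -1     # negative
    return 0          # neutral

# sentences = list of token lists produced by the preprocessing step
# model = Word2Vec(sentences, vector_size=100, window=5, min_count=2)
```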
Step 3: CNN+LSTM model training on the picture data
On top of the picture data set, a textual description of each picture's sentiment is added, and the two modalities together provide a higher-precision, fine-grained classification: a convolutional network (CNN) is used for image classification, a CNN+LSTM is used for text classification, and the two classification results are combined to obtain an interpretation of the emotional meaning of the image. The image branch uses a CNN model composed of convolutional layers and fully connected layers; the text branch uses a deep structured joint embedding method that jointly embeds images and fine-grained visual descriptions. The method learns a compatibility function between images and text and can be seen as an extension of multimodal structured joint embedding. Instead of a bilinear compatibility function, an inner product of the features generated by deep neural encoders is used, maximizing the compatibility between a description and its matching images while minimizing its compatibility with images of other classes. Given data D = {(v_n, t_n, y_n), n = 1, ..., N}, where v ∈ V denotes the visual information, t ∈ T the text, and y ∈ Y the class label, the image and text classifier functions f_v: V → Y and f_t: T → Y are learned by minimizing the empirical risk under the 0-1 loss, and the compatibility function F is defined through learnable encoder functions θ(v) for images and Φ(t) for text, where N is the number of samples, V the image set, T the text set, and Y the label (mapping) space. The three formulas below describe the multi-model fusion method mathematically: formula (1.3) is the image-text compatibility function, where F(v, t) is the compatibility score, θ(v)^T the image encoding, and Φ(t) the text encoding; formula (1.1) classifies an image v by the label y ∈ Y whose text descriptions t maximize the expected compatibility F(v, t); formula (1.2) classifies a text t by the label y ∈ Y whose images v maximize the expected compatibility F(v, t).
f_v(v) = argmax_{y ∈ Y} E_{t ~ T(y)} [F(v, t)]    (1.1)

f_t(t) = argmax_{y ∈ Y} E_{v ~ V(y)} [F(v, t)]    (1.2)

F(v, t) = θ(v)^T Φ(t)    (1.3)
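A small sketch of formulas (1.1)-(1.3) follows. It assumes the image features θ(v) and text features Φ(t) have already been produced by the CNN and CNN+LSTM encoders (random placeholders stand in for them here), so it only illustrates the compatibility score and the argmax classification rules, not the actual encoder training.

```python
# Sketch of the compatibility function (1.3) and the classification rules
# (1.1)/(1.2). The encoder outputs are random placeholders standing in for
# the CNN image encoder theta(v) and the CNN+LSTM text encoder phi(t).
import numpy as np

def compatibility(theta_v, phi_t):
    # F(v, t) = theta(v)^T * phi(t), formula (1.3)
    return float(np.dot(theta_v, phi_t))

def classify_image(theta_v, text_feats_by_class):
    # f_v(v) = argmax_y E_{t~T(y)}[F(v, t)], formula (1.1):
    # average compatibility with each class's text descriptions.
    scores = {y: np.mean([compatibility(theta_v, p) for p in phis])
              for y, phis in text_feats_by_class.items()}
    return max(scores, key=scores.get)

def classify_text(phi_t, image_feats_by_class):
    # f_t(t) = argmax_y E_{v~V(y)}[F(v, t)], formula (1.2)
    scores = {y: np.mean([compatibility(th, phi_t) for th in thetas])
              for y, thetas in image_feats_by_class.items()}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 128
    classes = [1, 0, -1]                       # positive / neutral / negative
    text_bank = {y: [rng.normal(size=d) for _ in range(5)] for y in classes}
    print(classify_image(rng.normal(size=d), text_bank))
```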
Step 4: Multi-model fusion
Steps 2 and 3 yield the sentiment classification results of the two branches, and the final classification is determined by combining the two parts with weights. The final classification result is y = a·m + b·n, where m is the class-distance similarity judged from the plain text and n is the class-distance similarity judged from the text derived from the image; the weights a and b are then solved with the genetic algorithm of the MATLAB toolbox.
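The sketch below illustrates the weighted fusion y = a·m + b·n together with a search for a and b. The patent uses MATLAB's genetic algorithm; here a toy evolutionary loop in Python stands in for it, and the validation arrays M, N_img, and Y_true are placeholders.

```python
# Weighted fusion y = a*m + b*n with a toy evolutionary search for (a, b).
# This is a stand-in for the MATLAB genetic algorithm mentioned in the text;
# M, N_img and Y_true are placeholder validation arrays.
import numpy as np

def fuse(a, b, m, n):
    return a * m + b * n

def to_label(y, eps=0.5):
    # Map the fused score to 1 (positive), -1 (negative) or 0 (neutral).
    return 1 if y > eps else (-1 if y < -eps else 0)

def fitness(a, b, M, N_img, Y_true):
    preds = [to_label(fuse(a, b, m, n)) for m, n in zip(M, N_img)]
    return np.mean(np.array(preds) == np.array(Y_true))   # accuracy

def evolve_weights(M, N_img, Y_true, pop=50, gens=100, seed=0):
    rng = np.random.default_rng(seed)
    best, best_fit = (0.5, 0.5), -1.0
    population = rng.uniform(0, 1, size=(pop, 2))
    for _ in range(gens):
        fits = [fitness(a, b, M, N_img, Y_true) for a, b in population]
        order = np.argsort(fits)[::-1]
        if fits[order[0]] > best_fit:
            best, best_fit = tuple(population[order[0]]), fits[order[0]]
        parents = population[order[: pop // 2]]
        children = parents + rng.normal(0, 0.05, size=parents.shape)  # mutate
        population = np.vstack([parents, children])
    return best, best_fit
```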
Step 5: Final sentiment analysis result
Step 4 yields the values of a and b in y = a·m + b·n. Given the text class similarity and the image-text similarity as input, the fused classification value y is output as 1, -1, or 0, where 1 denotes a positive, -1 a negative, and 0 a neutral classification result.
Compared with the prior art, the technical advantages of the present invention are mainly as follows:
(1) The invention processes the corpus with a cross-media method. For the text in the corpus, the SO-PMI algorithm is used to build a sentiment dictionary and to classify the pointwise mutual information as positive, neutral, or negative. Because this method cannot handle Chinese words and phrases flexibly, a similarity distance between words replaces PMI and a new formula is constructed.
(2) For image or video corpora (a video can be regarded as a collection of images), a joint visual-text modeling method is used to obtain and parse the meaning of the image and thus derive the meaning of the image or video.
(3) Finally, the plain-text analysis result and the visual analysis result are fused with weights to obtain the final sentiment analysis result.
Brief Description of the Drawings
Figure 1 is a sample of the corpus used in the present invention.
Figure 2 is the overall structure diagram of the multi-model fusion social media corpus sentiment analysis.
Figure 3 shows the result after word segmentation is completed in the present invention.
Figure 4 shows the stop-word list.
Figure 5 is a sample obtained after the processing of step 1.
Figure 6 shows the SO-PMI model training process.
Figure 7 is a sub-diagram of training the CNN+LSTM model in the present invention.
Detailed Description of the Embodiments
The present invention is described in detail below with reference to the accompanying drawings and embodiments.
The technical solution adopted by the present invention is a social media corpus sentiment analysis method based on multi-model fusion. The specific analysis process is as follows.
(1) Chinese word segmentation
Chinese word segmentation is the process of recombining a continuous character sequence into a word sequence according to certain rules, dividing the text into individual words as they are understood in Chinese. In the implementation, the jieba segmentation tool is used to segment the text; a segmented sentence is shown in Figure 3, where the sentence has been split into individual words.
(2) Stop-word removal
A normal Chinese paragraph or sentence usually contains special symbols such as commas, periods, and semicolons; once segmentation is complete, these punctuation marks are no longer needed. In addition, sentences contain words that have little influence on their meaning, such as 的, 不仅, 而且, and 了, which are not needed in subsequent steps, so they are deleted during preprocessing.
(3) Word vector construction
From the large amount of data processed in steps (1) and (2), word vectors are extracted with the Word2Vec tool, which reduces the data dimensionality and yields an expanded data dictionary.
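A minimal gensim sketch of this step is shown below, assuming `sentences` is the list of token lists produced by steps (1) and (2); the hyperparameters and file name are illustrative assumptions, not values specified by the patent.

```python
# Train word vectors on the segmented, stop-word-filtered corpus with gensim.
# The hyperparameters are illustrative; `sentences` comes from steps (1)-(2).
from gensim.models import Word2Vec

sentences = [
    ["天气", "真好", "心情", "愉快"],
    ["电影", "太", "失望", "难过"],
]  # placeholder corpus

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the word vectors
    window=5,          # context window size
    min_count=1,       # keep rare words in this tiny example
    workers=4,
)

# Expand the sentiment dictionary with the nearest neighbours of seed words.
print(model.wv.most_similar("愉快", topn=5))
model.save("weibo_word2vec.model")
```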
(4) Training the SO-PMI model
The expanded sentiment dictionary obtained from the text data processed in steps (1), (2), and (3) is then used by the SO-PMI algorithm, which determines the class of each word from the distances between words, to build the SO-PMI model.
(5) Image normalization
The image data obtained by the crawler vary in size, which makes them complicated to process, so they are normalized according to the selected algorithm and resized to 256 x 256 pixels.
(6) Training the CNN+LSTM model
The image data processed in (5), together with their annotations, are used to train the CNN+LSTM model.
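A compact Keras sketch of such a two-branch model follows, assuming the 256 x 256 images from step (5) and tokenized sentiment descriptions as inputs; the layer sizes, vocabulary size, and sequence length are illustrative assumptions, not parameters specified by the patent.

```python
# Sketch of a two-branch model: a CNN over the 256x256 images and an LSTM
# over the tokenized sentiment descriptions, fused for 3-way classification.
# Layer sizes, vocab size and sequence length are illustrative assumptions.
from tensorflow.keras import layers, models

VOCAB_SIZE, SEQ_LEN, NUM_CLASSES = 20000, 50, 3

# Image branch: convolutional layers followed by fully connected layers.
img_in = layers.Input(shape=(256, 256, 3))
x = layers.Conv2D(32, 3, activation="relu")(img_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
x = layers.Dense(128, activation="relu")(x)

# Text branch: embedding + LSTM over the image's sentiment description.
txt_in = layers.Input(shape=(SEQ_LEN,))
t = layers.Embedding(VOCAB_SIZE, 128)(txt_in)
t = layers.LSTM(128)(t)

# Fuse the two branches and classify into positive / neutral / negative.
merged = layers.concatenate([x, t])
out = layers.Dense(NUM_CLASSES, activation="softmax")(merged)

model = models.Model(inputs=[img_in, txt_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit([images, token_ids], labels, epochs=..., batch_size=...)
```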
(7) Multi-model fusion
The SO-PMI model and the CNN+LSTM model obtained from the training in (4) and (6) each produce a result for image-text input, and the two results are combined with weights to decide the final classification. Experiments with the multi-model fusion social media corpus sentiment analysis method verify its effectiveness and accuracy: compared with a single model and with text-only sentiment analysis, the accuracy improves significantly, which shows that the proposed method achieves higher accuracy for Weibo sentiment analysis.
Claims (2)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010030785.2A CN111259141A (en) | 2020-01-13 | 2020-01-13 | Social media corpus emotion analysis method based on multi-model fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010030785.2A CN111259141A (en) | 2020-01-13 | 2020-01-13 | Social media corpus emotion analysis method based on multi-model fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111259141A true CN111259141A (en) | 2020-06-09 |
Family
ID=70952992
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010030785.2A Pending CN111259141A (en) | 2020-01-13 | 2020-01-13 | Social media corpus emotion analysis method based on multi-model fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111259141A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111881667A (en) * | 2020-07-24 | 2020-11-03 | 南京烽火星空通信发展有限公司 | Sensitive text auditing method |
CN112070093A (en) * | 2020-09-22 | 2020-12-11 | 网易(杭州)网络有限公司 | Method for generating image classification model, image classification method, device and equipment |
CN112133406A (en) * | 2020-08-25 | 2020-12-25 | 合肥工业大学 | Multimodal emotion guidance method and system based on emotion map, storage medium |
CN112214603A (en) * | 2020-10-26 | 2021-01-12 | Oppo广东移动通信有限公司 | Image-text resource classification method, device, terminal and storage medium |
CN112231535A (en) * | 2020-10-23 | 2021-01-15 | 山东科技大学 | A method, processing device and storage medium for making a multimodal data set in the field of agricultural pests and diseases |
CN112241627A (en) * | 2020-10-09 | 2021-01-19 | 中国科学技术大学 | An information analysis system for negative reports of listed companies based on python text analysis |
CN112396091A (en) * | 2020-10-23 | 2021-02-23 | 西安电子科技大学 | Social media image popularity prediction method, system, storage medium and application |
CN112651448A (en) * | 2020-12-29 | 2021-04-13 | 中山大学 | Multi-modal emotion analysis method for social platform expression package |
CN112669936A (en) * | 2021-01-04 | 2021-04-16 | 上海海事大学 | Social network depression detection method based on texts and images |
CN113157998A (en) * | 2021-02-28 | 2021-07-23 | 江苏匠算天诚信息科技有限公司 | Method, system, device and medium for polling website and judging website type through IP |
CN113222772A (en) * | 2021-04-08 | 2021-08-06 | 合肥工业大学 | Native personality dictionary construction method, system, storage medium and electronic device |
CN114169450A (en) * | 2021-12-10 | 2022-03-11 | 同济大学 | Social media data multi-modal attitude analysis method |
CN115827880A (en) * | 2023-02-10 | 2023-03-21 | 之江实验室 | A business execution method and device based on emotion classification |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105320960A (en) * | 2015-10-14 | 2016-02-10 | 北京航空航天大学 | Voting based classification method for cross-language subjective and objective sentiments |
CN106886580A (en) * | 2017-01-23 | 2017-06-23 | 北京工业大学 | A kind of picture feeling polarities analysis method based on deep learning |
CN107818084A (en) * | 2017-10-11 | 2018-03-20 | 北京众荟信息技术股份有限公司 | A kind of sentiment analysis method for merging comment figure |
CN108388544A (en) * | 2018-02-10 | 2018-08-10 | 桂林电子科技大学 | A kind of picture and text fusion microblog emotional analysis method based on deep learning |
CN108764268A (en) * | 2018-04-02 | 2018-11-06 | 华南理工大学 | A kind of multi-modal emotion identification method of picture and text based on deep learning |
Non-Patent Citations (3)
Title |
---|
Dionysis Goularas et al., "Evaluation of Deep Learning Techniques in Sentiment Analysis from Twitter Data", 2019 International Conference on Deep Learning and Machine Learning in Emerging Applications (Deep-ML) * |
Fen Yang et al., "A Multi-model Fusion Framework based on Deep Learning for Sentiment Classification", 2018 IEEE 22nd International Conference on Computer Supported Cooperative Work in Design (CSCWD) * |
Nan Chen et al., "Advanced Combined LSTM-CNN Model for Twitter Sentiment Analysis", 2018 5th IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS) * |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111881667B (en) * | 2020-07-24 | 2023-09-29 | 上海烽烁科技有限公司 | Sensitive text auditing method |
CN111881667A (en) * | 2020-07-24 | 2020-11-03 | 南京烽火星空通信发展有限公司 | Sensitive text auditing method |
CN112133406B (en) * | 2020-08-25 | 2022-11-04 | 合肥工业大学 | Multi-mode emotion guidance method and system based on emotion maps and storage medium |
CN112133406A (en) * | 2020-08-25 | 2020-12-25 | 合肥工业大学 | Multimodal emotion guidance method and system based on emotion map, storage medium |
CN112070093A (en) * | 2020-09-22 | 2020-12-11 | 网易(杭州)网络有限公司 | Method for generating image classification model, image classification method, device and equipment |
CN112241627A (en) * | 2020-10-09 | 2021-01-19 | 中国科学技术大学 | An information analysis system for negative reports of listed companies based on python text analysis |
CN112231535A (en) * | 2020-10-23 | 2021-01-15 | 山东科技大学 | A method, processing device and storage medium for making a multimodal data set in the field of agricultural pests and diseases |
CN112396091A (en) * | 2020-10-23 | 2021-02-23 | 西安电子科技大学 | Social media image popularity prediction method, system, storage medium and application |
CN112231535B (en) * | 2020-10-23 | 2022-11-15 | 山东科技大学 | Method for making multi-modal data set in field of agricultural diseases and insect pests, processing device and storage medium |
CN112396091B (en) * | 2020-10-23 | 2024-02-09 | 西安电子科技大学 | Social media image popularity prediction method, system, storage medium and application |
CN112214603A (en) * | 2020-10-26 | 2021-01-12 | Oppo广东移动通信有限公司 | Image-text resource classification method, device, terminal and storage medium |
CN112651448A (en) * | 2020-12-29 | 2021-04-13 | 中山大学 | Multi-modal emotion analysis method for social platform expression package |
CN112651448B (en) * | 2020-12-29 | 2023-09-15 | 中山大学 | Multi-mode emotion analysis method for social platform expression package |
CN112669936A (en) * | 2021-01-04 | 2021-04-16 | 上海海事大学 | Social network depression detection method based on texts and images |
CN113157998A (en) * | 2021-02-28 | 2021-07-23 | 江苏匠算天诚信息科技有限公司 | Method, system, device and medium for polling website and judging website type through IP |
CN113222772A (en) * | 2021-04-08 | 2021-08-06 | 合肥工业大学 | Native personality dictionary construction method, system, storage medium and electronic device |
CN113222772B (en) * | 2021-04-08 | 2023-10-31 | 合肥工业大学 | Native personality dictionary construction method, native personality dictionary construction system, storage medium and electronic equipment |
CN114169450A (en) * | 2021-12-10 | 2022-03-11 | 同济大学 | Social media data multi-modal attitude analysis method |
CN115827880A (en) * | 2023-02-10 | 2023-03-21 | 之江实验室 | A business execution method and device based on emotion classification |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111259141A (en) | Social media corpus emotion analysis method based on multi-model fusion | |
CN109933664B (en) | An Improved Method for Fine-Grained Sentiment Analysis Based on Sentiment Word Embedding | |
Kumar et al. | Sentiment analysis of multimodal twitter data | |
CN107729309B (en) | A method and device for Chinese semantic analysis based on deep learning | |
JP4148522B2 (en) | Expression detection system, expression detection method, and program | |
CN112860888B (en) | A Bimodal Sentiment Analysis Method Based on Attention Mechanism | |
CN111797898B (en) | Online comment automatic reply method based on deep semantic matching | |
Baroni | Grounding distributional semantics in the visual world | |
CN108009293A (en) | Video tab generation method, device, computer equipment and storage medium | |
US20200134398A1 (en) | Determining intent from multimodal content embedded in a common geometric space | |
CN107862087A (en) | Sentiment analysis method, apparatus and storage medium based on big data and deep learning | |
CN102929860B (en) | Chinese clause emotion polarity distinguishing method based on context | |
CN105740224A (en) | User psychological early warning method and device based on text analysis | |
CN113761377B (en) | False information detection method and device based on attention mechanism multi-feature fusion, electronic equipment and storage medium | |
CN107885883A (en) | A kind of macroeconomy field sentiment analysis method and system based on Social Media | |
CN106547875A (en) | A kind of online incident detection method of the microblogging based on sentiment analysis and label | |
CN112800184B (en) | A sentiment analysis method for short text reviews based on Target-Aspect-Opinion joint extraction | |
Wagle et al. | Explainable AI for multimodal credibility analysis: Case study of online beauty health (mis)-information | |
CN108733652B (en) | Test method for film evaluation emotion tendency analysis based on machine learning | |
CN113254814A (en) | Network course video labeling method and device, electronic equipment and medium | |
CN117591752B (en) | Multi-mode false information detection method, system and storage medium | |
CN113486143A (en) | User portrait generation method based on multi-level text representation and model fusion | |
CN108804416B (en) | Training method for film evaluation emotion tendency analysis based on machine learning | |
Soykan et al. | A comprehensive gold standard and benchmark for comics text detection and recognition | |
Wojatzki et al. | Agree or disagree: Predicting judgments on nuanced assertions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200609 |