CN110175325A

CN110175325A - Comment analysis method and visual interactive interface based on word vectors and syntactic features

Info

Publication number: CN110175325A
Application number: CN201910343337.5A
Authority: CN
Inventors: 吕奇; 沈楠楠; 胡新春; 陈可佳
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2019-04-26
Filing date: 2019-04-26
Publication date: 2019-08-27
Anticipated expiration: 2039-04-26
Also published as: CN110175325B

Abstract

The present invention proposes a comment analysis method based on word vectors and syntactic features, in the field of data analysis. The method includes: obtaining review data from product pages of an e-commerce website; preprocessing the acquired target data set; extracting the commendatory and derogatory word sets provided by HowNet and NTU to form a basic sentiment dictionary; training word vectors on the preprocessed data set with the Word2Vec tool; building a probability transition matrix from the semantic similarity matrix; processing the obtained product review text with core sentence rules; preprocessing the resulting redundancy-free text; extracting <product attribute, negation word, degree word, sentiment word> evaluation collocations from the resulting dependency pairs by part of speech; and combining the evaluation collocations with the sentiment dictionary to compute a sentiment score for each evaluation object and rank strengths and weaknesses. The results are presented through a visual interactive interface, achieving accurate, real-time, automatic, and convenient processing and analysis of product review data. The method can be used on e-commerce platforms.

Description

Comment analysis method and visual interactive interface based on word vector and syntactic features

Technical field

The invention belongs to the technical field of data analysis, and in particular relates to a sentiment dictionary for product reviews built from word vectors trained with a neural network model, an attribute recognition algorithm, and a comment analysis system based on word vectors and syntactic features.

Background

With the spread of the Internet and the growth of e-commerce, sites such as JD.com and Taobao have developed rapidly, and more and more consumers choose to shop online. These sites offer a huge number of products and have a large user base, which produces an enormous volume of review data. A review usually carries the consumer's subjective impression of the purchase, including how much they like the product and how satisfied they are with the merchant's service. For consumers, review text gives a more objective picture of a product or service and supports a better choice. For merchants, user feedback about the product or service experience helps them improve service or product quality in a targeted way and so gain more customers and profit. However, as the data volume grows explosively, the cost of finding useful information in massive review data keeps rising. How to process and analyze review text quickly and effectively, and extract valuable information from it, therefore has significant practical and research value.

At present, large amounts of review data are not fully used, and consumers struggle to obtain valuable information from them. We therefore studied a comment analysis system based on word vectors and syntactic features. From the analysis results it derives users' satisfaction with each attribute of a product, summarizes the product's strengths and weaknesses, and visualizes the results.

Summary of the invention

The technical problem to be solved by the present invention is how to process and analyze product review data accurately, in real time, automatically, and conveniently. To overcome the deficiencies of the prior art, the invention provides a comment analysis method based on word vectors and syntactic features.

The method comprises the following steps:

1) Obtain review data from product pages of an e-commerce website;

2) Preprocess the acquired target data set and build a candidate sentiment word set;

3) Extract the commendatory and derogatory word sets provided by HowNet and NTU to form a basic sentiment dictionary;

4) Train word vectors on the preprocessed data set with the Word2Vec tool, and generate a semantic similarity matrix from the word vectors;

5) Build a probability transition matrix from the semantic similarity matrix, run the LPA label propagation algorithm with a seed word set, and check the result against the basic sentiment dictionary to produce the final sentiment dictionary;

6) Process the obtained product review text with core sentence rules to obtain redundancy-free review text;

7) Preprocess the redundancy-free text, build a dependency tree for the segmented data based on dependency relations and syntactic features, and generate SBV, VOB, ATT, CMP, and COO dependency pairs;

8) Extract <product attribute, negation word, degree word, sentiment word> evaluation collocations from the dependency pairs by part of speech;

9) Combine the evaluation collocations with the sentiment dictionary, compute a sentiment (praise or criticism) score for each evaluation object, rank strengths and weaknesses, and present the result through a visual interactive interface.

As a further limitation of the present invention, step 2) specifically includes:

2-1) Remove illegal characters with a character matching algorithm;

2-2) Segment and part-of-speech tag the original data set with LTP;

2-3) Extract the words with the required part of speech and deduplicate them to form candidate sentiment word set 1;

2-4) Segment and part-of-speech tag the original data set with NLPIR;

2-5) Extract the words with the required part of speech and deduplicate them to form candidate sentiment word set 2;

2-6) Merge candidate sentiment word sets 1 and 2 and deduplicate to obtain the candidate sentiment word set.

As a further limitation of the present invention, step 3) specifically includes: extracting the commendatory and derogatory words from the HowNet sentiment dictionary and the NTU evaluation word dictionary, merging them, and deduplicating to form the basic sentiment dictionary.

As a further limitation of the present invention, step 4) specifically includes:

4-1) Train on the data set with Word2Vec to obtain a word vector for each word;

4-2) For the words in the candidate sentiment word set, compute the semantic similarity between words with the following formula;

4-3) For two n-dimensional word vectors $a = (x_{11}, x_{12}, \ldots, x_{1n})$ and $b = (x_{21}, x_{22}, \ldots, x_{2n})$, the semantic similarity is:

$$\cos\theta = \frac{\sum_{k=1}^{n} x_{1k}\,x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^{2}}\;\sqrt{\sum_{k=1}^{n} x_{2k}^{2}}}$$

where $\cos\theta$ is the semantic similarity value, $x_{1k}$ is the value of the k-th dimension of word vector a, and $x_{2k}$ is the value of the k-th dimension of word vector b;

4-4) Build the semantic similarity matrix from the computed similarities.

As a further limitation of the present invention, step 5) specifically includes:

5-1) Treat each word as a node of a graph; the weight of the edge between two nodes is the semantic similarity between the words they represent;

5-2) Build the probability transition matrix P according to the following formula:

$$P[i][j] = \frac{SIM(w_i, w_j)}{\sum_{k=1}^{m} SIM(w_i, w_k)}$$

where P[i][j] is the similarity transition probability from word i to word j, SIM(w_i, w_j) is the similarity between words i and j, and m is the number of words with the highest semantic similarity to word i, over which the sum in the denominator runs;

5-3) Count the frequency in the original review data of every sentiment word in the candidate sentiment word set and select the N most frequent words as seed word set 1; using the sentiment vocabulary ontology, select the words whose ontology intensity is greater than m and which appear in the candidate sentiment word set as seed word set 2; merge seed word sets 1 and 2, deduplicate to form the seed word set, and label its sentiment manually;

5-4) Use the small number of manually labeled seed words to build an L×C label matrix Y_L, where L is the number of seed words and C is the number of classes; there are three classes: commendatory, derogatory, and neutral;

5-5) Likewise build a U×C label matrix Y_U from the unlabeled sample words, where U is the number of unlabeled sample words and C is the number of classes (the same three classes);

5-6) Finally, label the sentiment polarity of the sample words with the LPA label propagation algorithm and, after checking against the basic sentiment dictionary, form the final sentiment dictionary.

As a further limitation of the present invention, step 6) specifically includes:

A core sentence is obtained by deleting redundancy and keeping the main components relevant to evaluation collocations; if the original sentence matches no rule, it is left unchanged. The method uses core sentences to improve the accuracy of syntactic dependency parsing of review text. The rules are as follows:

Rule 1: Delete sentence-initial adverbial components, such as the sequences "…的优点" (the advantage of …), "…的缺点" (the shortcoming of …), "…的不足" (the deficiency of …), "…的优势" (the strength of …), and "…的好处" (the benefit of …);

Rule 2: Delete sentences with a hypothetical tendency, such as "假如…" (if …), "希望…" (hope …), "如果…" (if …), "但愿…" (if only …), "建议…" (suggest …);

Rule 3: Delete the sentence-initial sequences "就是" (it's just that), "居然是" (surprisingly), "特别是" (especially), "还有就是" (and also), "尤其是" (in particular);

Rule 4: Delete the opinion-holder words "感觉" (feel) and "认为" (think);

Rule 5: In a run of consecutive punctuation marks, delete all but the first, and delete abnormal characters such as emoji, kaomoji, and brackets.
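The five rules can be approximated with regular expressions applied clause by clause. The following is a minimal sketch, not the patent's implementation: the patterns, the clause-splitting approach, and the name `core_sentence` are assumptions. Emoji and kaomoji removal is omitted, and Rule 2 strips only the marker word at the start of a clause, which is one possible reading of the rule.

```python
import re

# Assumed patterns; the patent states the rules only in prose.
CLAUSE_SPLIT = re.compile(r"([，。！？；,!?;]+)")            # runs of punctuation
BRACKETS = re.compile(r"[（）()\[\]【】]")                   # Rule 5: bracket characters
RULE1 = re.compile(r"^.*?的(?:优点|缺点|不足|优势|好处)")    # clause-initial "...的不足" etc.
RULE2 = re.compile(r"^(?:假如|希望|如果|但愿|建议)")         # hypothetical markers
RULE3 = re.compile(r"^(?:还有就是|居然是|特别是|尤其是|就是)")
RULE4 = re.compile(r"感觉|认为")                             # opinion-holder words


def core_sentence(text):
    """Return the text with redundancy removed, clause by clause."""
    out = []
    # re.split with a capturing group puts clauses at even indices
    # and punctuation runs at odd indices.
    for i, part in enumerate(CLAUSE_SPLIT.split(text)):
        if i % 2 == 1:
            out.append(part[0])  # Rule 5: keep only the first mark of a run
            continue
        clause = BRACKETS.sub("", part)
        for rule in (RULE1, RULE2, RULE3, RULE4):
            clause = rule.sub("", clause)
        out.append(clause)
    return "".join(out)
```

A clause that matches no pattern passes through unchanged, which mirrors the requirement that a sentence matching no rule is left as it is.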

As a further limitation of the present invention, step 7) specifically includes:

The five axioms of dependency syntax:

(1) A sentence has one and only one independent component;

(2) Every other component of the sentence must depend on some component;

(3) No component may depend on two or more components at the same time;

(4) If component a depends directly on component b and component c lies between a and b in the sentence, then c depends on a, on b, or on some other component between a and b;

(5) Components on the left and right sides of the central component have no dependency relation with each other;

The characteristics of a dependency tree are:

(1) The nodes of the tree are the components of the sentence;

(2) The root of the tree is the central component of the whole sentence;

(3) The edges between nodes are directed, reflecting the asymmetric dependency between components;

(4) The tree satisfies the five axioms of dependency syntax;
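The axioms translate into simple checks on a parse. The sketch below assumes a hypothetical representation in which `heads[i]` is the index of the word that token i depends on and -1 marks the independent (root) component. With that representation each token has at most one head by construction (axiom 3), so the function checks the single root, the absence of cycles, and the "no crossing" condition of axioms 4 and 5.

```python
def is_valid_dependency_tree(heads):
    """Check a head-index list against the dependency axioms.

    heads[i] is the index of the head of token i, or -1 for the root.
    This encoding is an assumption made for illustration.
    """
    n = len(heads)
    # Axiom 1: exactly one independent component.
    if sum(1 for h in heads if h == -1) != 1:
        return False
    # Axiom 2: every other component depends on an existing component.
    if any(h != -1 and not 0 <= h < n for h in heads):
        return False
    # Tree property: following heads from any token must reach the root.
    for i in range(n):
        seen = set()
        j = i
        while j != -1:
            if j in seen:
                return False  # cycle
            seen.add(j)
            j = heads[j]
    # Axioms 4 and 5: a token lying between a dependent and its head
    # must itself depend on something inside that span.
    for a, b in enumerate(heads):
        if b == -1:
            continue
        lo, hi = min(a, b), max(a, b)
        for c in range(lo + 1, hi):
            if not lo <= heads[c] <= hi:
                return False
    return True
```

For the three tokens 快递 / 很 / 给力 with 给力 as root, the head list `[2, 2, -1]` passes all checks.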

Most dependency relations in review sentences fall into five types: subject-verb (SBV), verb-object (VOB/FOB), attributive (ATT), verb-complement (CMP), and coordination (COO). Dependency parsing can be performed with the LTP dependency parser, and dependency pairs are extracted in combination with a COO algorithm that identifies coordinated evaluation objects and coordinated evaluation words. The COO algorithm specifically includes:

Traverse all words between the two nodes of each SBV, VOB, ATT, and CMP dependency pair obtained from the dependency relations and syntactic features, together with all words in the dependency tree related to them on the left and right;

Determine whether any of the traversed words participates in a COO relation;

Expand the pairs with the coordinated evaluation objects and evaluation words of the COO relation.

As a further limitation of the present invention, step 8) specifically includes:

8-1) In Chinese, evaluation objects are mostly nouns or verbs, and evaluation words are mostly adjectives or verbs;

8-2) Extract evaluation objects and evaluation words, that is, product attributes and sentiment words, by part of speech;

8-3) Following the dependency tree, traverse the words between the evaluation object and the evaluation word and check for negation words. Each negation word found increments the negation count; when the traversal ends, the parity of the count is checked. If it is odd, the corresponding value privative is set to -1; if it is even, privative is set to +1;

8-4) Following the dependency tree, traverse the words between the evaluation object and the evaluation word and check for degree words; the number found is accumulated to give the degree-word count of this collocation;

8-5) Finally form the <product attribute, negation word, degree word, sentiment word> evaluation collocation.

As a further limitation of the present invention, step 9) specifically includes:

For a product attribute a that appears n times, its sentiment score is calculated as follows:

where a.score is the sentiment score of product attribute a, i indexes the i-th occurrence of the attribute, privative is the value (-1 or +1) of the negation words corresponding to the i-th occurrence, and degree is the number of degree adverbs corresponding to the i-th occurrence. The attribute's sentiment score is computed in this way, accumulating over occurrences of the same evaluation object;

All extracted evaluation objects are divided into two classes, commendatory and derogatory, and sorted with bubble sort to produce the final result.
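The final ranking step can be sketched as follows. The scores passed in are hypothetical inputs, since this sketch does not compute a.score itself; splitting by the sign of the score and dropping attributes with a score of 0 are assumptions.

```python
def bubble_sort(items, key, reverse=False):
    """Plain bubble sort on a copy of items, ascending unless reverse is True."""
    items = list(items)
    for end in range(len(items) - 1, 0, -1):
        swapped = False
        for i in range(end):
            a, b = key(items[i]), key(items[i + 1])
            out_of_order = (a < b) if reverse else (a > b)
            if out_of_order:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:
            break  # already sorted
    return items


def rank_attributes(scores):
    """Split attributes into praised and criticized, each sorted strongest first."""
    praised = [(a, s) for a, s in scores.items() if s > 0]
    criticized = [(a, s) for a, s in scores.items() if s < 0]
    return (
        bubble_sort(praised, key=lambda p: p[1], reverse=True),
        bubble_sort(criticized, key=lambda p: p[1]),
    )
```

The two returned lists correspond to the strengths and weaknesses that the interface would display.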

A visual interactive interface that can carry out all of the steps described above. Besides displaying the sentiment scores clearly as a bar chart, it adds a number of user-friendly interactive functions, including loading, login, logout, password change, and display of the user's login status.

Compared with the prior art, the present invention, by adopting the above technical scheme, has the following technical effects:

The invention obtains review data from product pages of an e-commerce website, preprocesses it, and builds a basic sentiment dictionary. It then trains word vectors on the preprocessed data set with the Word2Vec tool, generates a semantic similarity matrix, builds a probability transition matrix from it, and, with a seed word set, produces the final sentiment dictionary through the LPA label propagation algorithm. The obtained product review text is processed with core sentence rules to obtain redundancy-free review text. That text is preprocessed, a dependency tree is built for the segmented data based on dependency relations and syntactic features, SBV, VOB, ATT, CMP, and COO dependency pairs are generated, and <product attribute, negation word, degree word, sentiment word> evaluation collocations are extracted. Combined with the sentiment dictionary, these are used to compute sentiment scores for product attributes and rank strengths and weaknesses, and the result is presented through a visual interactive interface. Review data can thus be analyzed accurately, in real time, automatically, and conveniently.

Description of drawings

Fig. 1 is a flow chart of the present invention.

Detailed description

The technical scheme of the present invention is described in further detail below with reference to the accompanying drawing:

The technical scheme uses word vectors trained with a neural network model, combined with the LPA label propagation algorithm, to build a sentiment dictionary suited to product reviews. It designs a product attribute recognition and extraction algorithm based on core sentence rules, dependency relations, and syntactic features. On this basis it builds a comment analysis system based on word vectors and syntactic features. From the analysis results the system derives users' satisfaction with each product attribute, summarizes the product's strengths and weaknesses, and visualizes the results.

Referring to Fig. 1, the present invention implements a comment analysis method based on word vectors and syntactic features. The specific implementation steps are as follows:

Step S101: Obtain review data from product pages of an e-commerce website.

In a specific implementation, a review crawling algorithm is designed to obtain review data for various products on the e-commerce website and generate the original review data set.

Step S102: Preprocess the acquired target data set and build the candidate sentiment word set.

In a specific implementation, illegal characters are removed from the original data set with a character matching algorithm. The data is first segmented and part-of-speech tagged with LTP, and words tagged "a" (adjective) are extracted and deduplicated to form candidate sentiment word set 1. It is then segmented and part-of-speech tagged with NLPIR, and words tagged "a" (adjective) are extracted and deduplicated to form candidate sentiment word set 2. Candidate sentiment word sets 1 and 2 are merged and deduplicated to form the final candidate sentiment word set.
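The merging of the two candidate sets can be sketched as below. The sketch assumes the output of each tool has already been converted to a list of `(word, tag)` pairs; it does not call LTP or NLPIR, and the function names are hypothetical.

```python
def adjectives(tagged_tokens, wanted_tag="a"):
    """Collect the distinct words carrying the wanted part-of-speech tag."""
    return {word for word, tag in tagged_tokens if tag == wanted_tag}


def candidate_sentiment_words(ltp_tagged, nlpir_tagged):
    """Union of the adjective sets found by the two taggers.

    Using Python sets makes each deduplication step implicit.
    """
    set_1 = adjectives(ltp_tagged)    # candidate sentiment word set 1
    set_2 = adjectives(nlpir_tagged)  # candidate sentiment word set 2
    return set_1 | set_2
```

Running two segmenters and taking the union means an adjective missed or mis-tagged by one tool can still enter the candidate set through the other.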

Step S103: Extract the commendatory and derogatory word sets provided by HowNet and NTU to form the basic sentiment dictionary.

In a specific implementation, the commendatory and derogatory words are extracted from the HowNet sentiment dictionary and the NTU evaluation word dictionary, merged, and deduplicated to form the basic sentiment dictionary.

Step S104: Train word vectors on the preprocessed data set with the Word2Vec tool, and generate a semantic similarity matrix from the word vectors.

In a specific implementation, Word2Vec is trained on the data set with the parameters size=100, window=5, sg=0, min_count=0 to obtain a word vector for each word.

For the words in the candidate sentiment word set, the semantic similarity between words is computed with the following formula.

For two n-dimensional word vectors $a = (x_{11}, x_{12}, \ldots, x_{1n})$ and $b = (x_{21}, x_{22}, \ldots, x_{2n})$, the semantic similarity is:

$$\cos\theta = \frac{\sum_{k=1}^{n} x_{1k}\,x_{2k}}{\sqrt{\sum_{k=1}^{n} x_{1k}^{2}}\;\sqrt{\sum_{k=1}^{n} x_{2k}^{2}}}$$

The sentiment words in the candidate set are traversed in order: each is fixed in turn and its similarity to every other sentiment word is computed. With m candidate sentiment words, m×m computations yield an m×m semantic similarity matrix.

For convenience in the following operations, the similarity of a sentiment word with itself is defined to be 0.

The semantic similarity matrix is built from the computed similarities.
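The matrix construction can be sketched in plain Python. The sketch assumes the similarity measure is the usual cosine similarity between word vectors and that the trained vectors are available as a dictionary from word to list of floats; both the input format and the function names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(words, vectors):
    """Build the m x m matrix; the diagonal is fixed to 0 as stipulated."""
    m = len(words)
    sim = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i != j:
                sim[i][j] = cosine(vectors[words[i]], vectors[words[j]])
    return sim
```

Because cosine similarity is symmetric, only half of the off-diagonal entries strictly need to be computed; the full double loop is kept here to match the m×m computations described in the text.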

Step S105: Build a probability transition matrix from the semantic similarity matrix, run the LPA label propagation algorithm with a seed word set, and check the result against the basic sentiment dictionary to produce the final sentiment dictionary.

In a specific implementation, each word is treated as a node of a graph, and the weight of the edge between two nodes is the semantic similarity between the words they represent.

The probability transition matrix P is built according to the following formula:

$$P[i][j] = \frac{SIM(w_i, w_j)}{\sum_{k=1}^{m} SIM(w_i, w_k)}$$

where P[i][j] is the similarity transition probability from word i to word j, SIM(w_i, w_j) is the similarity between words i and j, and m is the number of words with the highest semantic similarity to word i (set manually), over which the sum in the denominator runs.
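One way to realize the transition matrix is sketched below. It rests on assumptions that should be checked against the original formula: each row is normalized by the summed similarity of the m words most similar to word i, entries outside those m neighbours are set to 0 so that each row sums to 1, and similarities are taken to be non-negative.

```python
def transition_matrix(sim, m):
    """Row-normalize a similarity matrix over each word's m nearest neighbours.

    sim is a square list-of-lists with a zero diagonal; m is set manually.
    """
    n = len(sim)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Indices of the m words most similar to word i.
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sim[i][j],
            reverse=True,
        )[:m]
        total = sum(sim[i][j] for j in neighbours)
        if total <= 0:
            continue  # isolated word: leave the row at zero
        for j in neighbours:
            P[i][j] = sim[i][j] / total
    return P
```

Restricting each row to its nearest neighbours keeps weakly related words from diluting the propagated labels, at the cost of one more parameter to tune.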

The frequency in the original review data of every sentiment word in the candidate sentiment word set is counted, and the 100 most frequent words form seed word set 1. Using the Affective Lexicon Ontology of Dalian University of Technology, the words whose ontology intensity is greater than 7 and which appear in the candidate sentiment word set form seed word set 2. Seed word sets 1 and 2 are merged and deduplicated to form the seed word set, which is then labeled manually.

The small number of manually labeled seed words is then used to build an L×C label matrix Y_L, where L is the number of seed words and C is the number of classes, generally 3 (commendatory, derogatory, neutral). Likewise, the unlabeled sample words are used to build a U×C label matrix Y_U, where U is the number of unlabeled sample words and C is the same number of classes. The two label matrices are stacked into an N×C soft label matrix F=[Y_L;Y_U], where N = L + U.

The label propagation algorithm is then executed: 1) propagate: F=PF; 2) reset the labels of the labeled samples in F: F_L=Y_L; 3) repeat steps 1) and 2) until F converges.

Step 1) propagates the label (sentiment polarity) of each node (sentiment word) to other nodes with the probabilities given by the transition matrix; the more similar two nodes are, the higher the propagation probability. Step 2) resets the labels of the labeled seed words to their annotated values so that step 1) does not change them. In step 3), convergence is determined by computing the matrix similarity between the latest F and the F₀ from the previous iteration; F is considered converged when the similarity no longer changes.

Each row of the final matrix F holds three values, the propagated scores of the corresponding sentiment word for the three classes. The class with the largest value is taken as the polarity of that word.
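The propagate, clamp, repeat loop can be sketched as follows. Several details are assumptions: seeds are given as a dictionary from row index to class index (0 commendatory, 1 derogatory, 2 neutral), unlabeled rows start at zero, and convergence is tested with the largest absolute change in F rather than the matrix similarity measure mentioned in the text.

```python
def label_propagation(P, seed_labels, n_classes=3, tol=1e-6, max_iter=1000):
    """Return the predicted class index for every word.

    P is the N x N transition matrix; seed_labels maps row index -> class.
    """
    n = len(P)
    F = [[0.0] * n_classes for _ in range(n)]
    for i, c in seed_labels.items():
        F[i][c] = 1.0
    for _ in range(max_iter):
        # 1) Propagate: F = P F
        new_F = [
            [sum(P[i][k] * F[k][c] for k in range(n)) for c in range(n_classes)]
            for i in range(n)
        ]
        # 2) Clamp the seed rows back to their annotated labels.
        for i, c in seed_labels.items():
            new_F[i] = [1.0 if k == c else 0.0 for k in range(n_classes)]
        # 3) Stop when F no longer changes (assumed criterion).
        delta = max(
            abs(new_F[i][c] - F[i][c]) for i in range(n) for c in range(n_classes)
        )
        F = new_F
        if delta < tol:
            break
    # The class with the largest propagated value wins.
    return [max(range(n_classes), key=lambda c: row[c]) for row in F]
```

In the small test case used to check this sketch, word 1 is mostly connected to a commendatory seed and word 2 to a derogatory seed, and each ends up with the label of its nearer seed.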

The sentiment words with determined polarity are exported to form sentiment dictionary 1. Every word in sentiment dictionary 1 is then checked: if the basic sentiment dictionary from step S103 contains the word with a conflicting polarity, the polarity is changed to the one in the basic sentiment dictionary; otherwise it is left unchanged.

After these steps, the revised sentiment dictionary 1 is the final sentiment dictionary.
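The check against the basic sentiment dictionary amounts to a dictionary override. A minimal sketch, assuming both dictionaries map a word to a polarity string; the function name and the polarity labels are hypothetical.

```python
def reconcile(propagated, base):
    """Return the final dictionary: base polarity wins on any conflict."""
    final = dict(propagated)
    for word, polarity in propagated.items():
        if word in base and base[word] != polarity:
            final[word] = base[word]
    return final
```

Words that appear only in the basic dictionary are not added, because the final dictionary is described as the revised sentiment dictionary 1.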

Step S106: Process the obtained product review text with core sentence rules to obtain redundancy-free review text.

In a specific implementation, a product URL is entered in the system's web interface, and a web crawler in the back end crawls the review data of that product from the e-commerce platform. The system is set to crawl the first 1000 high-quality reviews of the product.

The crawled review data is processed with the core sentence rules to remove redundancy and keep the main components relevant to evaluation collocations. For example, the review "手机收到了，挺不错的，像素和音质都很好，尤其是快递很给力（次日达），唯一的不足就是包装不是很好，希望店家可以改进一下。。。" (roughly: "Received the phone, quite good, the camera resolution and sound quality are both very good, especially the delivery was great (next-day delivery), the only shortcoming is that the packaging is not very good, I hope the seller can improve it...") is processed as follows:

(1) Rule 1 matches "…的不足"; the sentence becomes "手机收到了，挺不错的，像素和音质都很好，尤其是快递很给力（次日达），就是包装不是很好，希望店家可以改进一下。。。";

(2) Rule 2 matches "希望"; the sentence becomes "手机收到了，挺不错的，像素和音质都很好，尤其是快递很给力（次日达），就是包装不是很好，店家可以改进一下。。。";

(3) Rule 3 matches "就是" and "尤其是"; the sentence becomes "手机收到了，挺不错的，像素和音质都很好，快递很给力（次日达），包装不是很好，店家可以改进一下。。。";

(4) Rule 5 deletes the consecutive punctuation marks and the brackets; the final core sentence is "手机收到了，挺不错的，像素和音质都很好，快递很给力次日达，包装不是很好，店家可以改进一下。". This example sentence is referred to below as Sentences.

Step S107: Preprocess the redundancy-removed text, build a dependency tree from the resulting word-segmentation data set based on dependency relations and syntactic features, and generate SBV, VOB, ATT, CMP and COO dependency pairs.

In a specific implementation, the redundancy-removed text obtained in step S106 is preprocessed by splitting it at punctuation marks, which yields 6 clauses. Each clause is segmented and part-of-speech tagged with the LTP tool, and a dependency tree is built based on dependency relations and syntactic features. The resulting dependency pairs are SBV<手机 (phone), 收到 (received)>, SBV<像素 (pixels), 好 (good)>, COO<音质 (sound quality), 像素 (pixels)>, SBV<快递 (delivery), 给力 (impressive)>, SBV<包装 (packaging), 好 (good)>, and SBV<店家 (seller), 改进 (improve)>.
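The clause-splitting part of this step can be sketched in Python as follows. This is a minimal illustration only: the function name `split_clauses` and the exact set of punctuation marks are assumptions, and the segmentation, tagging, and parsing done by LTP are not reproduced here.

```python
import re

def split_clauses(core_sentence: str) -> list[str]:
    """Split a core sentence into clauses at Chinese or ASCII punctuation."""
    parts = re.split(r"[，。！？；,.!?;]", core_sentence)
    return [p for p in parts if p]  # drop the empty piece after the final mark

core = "手机收到了，挺不错的，像素和音质都很好，快递很给力次日达，包装不是很好，店家可以改进一下。"
clauses = split_clauses(core)
print(len(clauses))  # 6
```

Each element of `clauses` would then be passed to the segmenter and dependency parser.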

For example, after the clause “像素和音质都很好” ("the pixels and sound quality are both very good") has gone through the above steps, the dependency pairs are extracted again with the COO algorithm for identifying coordinated evaluation objects and coordinated evaluation words, which gives <像素 (pixels), 好 (good)> and <音质 (sound quality), 好 (good)>.
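The expansion of coordinated objects and words can be sketched as below. This is a simplified illustration, not the patented algorithm itself: the function name `expand_coo` and the tuple layout (dependent first, head second for COO pairs) are assumptions, and the input pairs are typed in by hand rather than produced by a parser.

```python
def expand_coo(pairs, coo_pairs):
    """Expand (evaluation object, evaluation word) pairs using COO relations.

    pairs:     list of (object, word) tuples from SBV/VOB/ATT/CMP relations
    coo_pairs: list of (word, coordinated_with) tuples from COO relations
    """
    expanded = list(pairs)
    for obj, word in pairs:
        for dependent, head in coo_pairs:
            if head == obj and (dependent, word) not in expanded:
                expanded.append((dependent, word))   # coordinated evaluation object
            if head == word and (obj, dependent) not in expanded:
                expanded.append((obj, dependent))    # coordinated evaluation word
    return expanded

sbv = [("像素", "好")]
coo = [("音质", "像素")]
print(expand_coo(sbv, coo))  # [('像素', '好'), ('音质', '好')]
```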

Step S108: Extract <commodity attribute, negative word, degree word, emotional word> evaluation collocation pairs from the obtained dependency pairs by part of speech.

In a specific implementation, for each extracted pair, the words between the evaluation object and the evaluation word are traversed to check for negative words, and the negative words are counted. The parity of the count determines the sign: an odd number of negative words gives privative = -1, and an even number gives privative = +1. The words between the evaluation object and the evaluation word are then traversed again to count the degree words. This yields the evaluation collocation pair <commodity attribute, privative, degree, emotional word>. In the example sentence Sentences from step S106, one negative word “不” ("not") is found within the pair <包装 (packaging), 好 (good)>, so the corresponding privative value is -1; traversing between “包装” and “好” for degree adverbs finds “很” ("very"), so the corresponding degree value is 1. The evaluation collocation pair extracted from this clause is therefore <包装, -1, 1, 好>.
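A minimal Python sketch of this parity and counting logic follows. The word lists `NEGATION_WORDS` and `DEGREE_WORDS` are small hypothetical samples, the function name `collocation` is illustrative, and the tokenization of the clause is assumed rather than produced by LTP.

```python
NEGATION_WORDS = {"不", "没", "没有"}   # hypothetical sample, not the method's full list
DEGREE_WORDS = {"很", "非常", "太"}     # hypothetical sample, not the method's full list

def collocation(attribute, emotional_word, tokens_between):
    """Build <attribute, privative, degree, emotional word> from the tokens in between."""
    negations = sum(1 for t in tokens_between if t in NEGATION_WORDS)
    privative = -1 if negations % 2 == 1 else 1   # odd count flips polarity
    degree = sum(1 for t in tokens_between if t in DEGREE_WORDS)
    return (attribute, privative, degree, emotional_word)

# Clause "包装不是很好", assumed to be segmented as 包装 / 不 / 是 / 很 / 好
print(collocation("包装", "好", ["不", "是", "很"]))  # ('包装', -1, 1, '好')
```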

Step S109: Combine the obtained evaluation collocation pairs with the sentiment dictionary, calculate the emotional (praise/criticism) value of each evaluation object, rank the objects, and present the result through the visual interactive interface.

In a specific implementation, the commendatory or derogatory polarity of the emotional word in each extracted evaluation collocation pair is looked up in the sentiment dictionary. The emotional value of the commodity attribute is then calculated according to the following formula: [formula not reproduced in the source text]

For the evaluation collocation pair <包装, -1, 1, 好> obtained in step S108, the emotional value of the commodity attribute “包装” (packaging) is calculated in this way [value not reproduced in the source text].

All review data obtained for the product is traversed and processed with the above steps, and the values for identical evaluation objects are accumulated, which finally yields all commodity attributes of the product. The attributes are then divided into commendatory and derogatory groups, and each group is ordered with bubble sort to give the final result. Finally, the result is presented on a web page through a visual interactive interface built with a front end and a back end.
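The accumulation and ranking can be sketched as follows. The per-occurrence scores are taken as given inputs here, and the sample data, the function names, and the choice to drop attributes whose total is exactly zero are all assumptions made for illustration.

```python
def bubble_sort(items, key, descending=True):
    """Plain bubble sort with early exit; returns a new list."""
    items = list(items)
    n = len(items)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            a, b = key(items[j]), key(items[j + 1])
            out_of_order = a < b if descending else a > b
            if out_of_order:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:
            break
    return items

def rank_attributes(scored_occurrences):
    """Accumulate scores per attribute, split by sign, and sort each group."""
    totals = {}
    for attribute, score in scored_occurrences:
        totals[attribute] = totals.get(attribute, 0) + score
    positive = [(a, s) for a, s in totals.items() if s > 0]
    negative = [(a, s) for a, s in totals.items() if s < 0]
    return (bubble_sort(positive, key=lambda p: p[1]),
            bubble_sort(negative, key=lambda p: p[1], descending=False))

# Hypothetical per-occurrence scores for one product
sample = [("pixels", 2), ("packaging", -2), ("delivery", 1),
          ("pixels", 2), ("packaging", -1), ("sound quality", 2)]
praised, criticized = rank_attributes(sample)
print(praised)     # [('pixels', 4), ('sound quality', 2), ('delivery', 1)]
print(criticized)  # [('packaging', -3)]
```

The two lists correspond to the commendatory and derogatory groups that the interface would display.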

The above is only a specific embodiment of the present invention, and the scope of protection of the invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of the invention. The scope of protection of the invention shall therefore be determined by the claims.

Claims

1. A comment analysis method based on word vectors and syntactic features, characterized in that it comprises the following steps:

1) Obtain the comment data on the product page of an e-commerce website;

2) Preprocess the acquired target data set and construct a candidate emotional word set;

3) Extract the commendatory and derogatory words provided by HowNet and NTU to form the basic sentiment dictionary;

4) Train word vectors on the preprocessed data set with the Word2Vec tool, obtain the word vectors, and generate a semantic similarity matrix;

5) Use the semantic similarity matrix to establish a probability transition matrix and, combined with the seed word set, generate the final sentiment dictionary through the LPA label propagation algorithm and a check against the basic sentiment dictionary;

6) Process the obtained product review text according to the core sentence rules to obtain redundancy-removed review text;

7) Preprocess the redundancy-removed text, build a dependency tree from the resulting word-segmentation data set based on dependency relations and syntactic features, and generate SBV, VOB, ATT, CMP and COO dependency pairs;

8) Extract <commodity attribute, negative word, degree word, emotional word> evaluation collocation pairs from the resulting dependency pairs by part of speech;

9) Combine the obtained evaluation collocation pairs with the sentiment dictionary, calculate the emotional value of each evaluation object, rank the objects, and present the result through a visual interactive interface.

2. The comment analysis method based on word vectors and syntactic features according to claim 1, wherein step 2) specifically includes:

2-1) Use a character matching algorithm to remove illegal characters;

2-2) Use LTP for word segmentation and part-of-speech tagging on the original data set;

2-3) Extract the words with the required parts of speech, and form candidate emotional word set 1 after deduplication;

2-4) Use NLPIR for word segmentation and part-of-speech tagging on the original data set;

2-5) Extract the words with the required parts of speech, and form candidate emotional word set 2 after deduplication;

2-6) Merge candidate emotional word set 1 and candidate emotional word set 2, and obtain the candidate emotional word set after deduplication.
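The filter-and-merge logic of steps 2-3) to 2-6) can be sketched as below. The tagged tokens are hand-written stand-ins for the output of LTP and NLPIR (no call to either tool is made), and the tag set `TARGET_POS` is an assumption, since the claim does not list which parts of speech qualify.

```python
TARGET_POS = {"a", "v"}  # assumed: adjectives and verbs; the claim does not specify the tags

def candidate_words(tagged_tokens, target_pos=TARGET_POS):
    """Keep words whose tag is in target_pos; a set deduplicates automatically."""
    return {word for word, pos in tagged_tokens if pos in target_pos}

# Hypothetical (word, tag) output of two segmenters on the same data
ltp_output = [("包装", "n"), ("不", "d"), ("好", "a"), ("给力", "a")]
nlpir_output = [("包装", "n"), ("好", "a"), ("很", "d"), ("不错", "a")]

set_1 = candidate_words(ltp_output)
set_2 = candidate_words(nlpir_output)
candidates = set_1 | set_2   # merge and deduplicate
print(sorted(candidates))
```

Using two segmenters and taking the union lets a word missed by one tool still enter the candidate set.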

3. The comment analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 3) specifically includes: using the HowNet sentiment dictionary and the NTU evaluation word dictionary, extracting the commendatory and derogatory words from each, and merging and deduplicating them to form the basic sentiment dictionary.

4. The comment analysis method based on word vectors and syntactic features according to claim 1, wherein step 4) specifically includes:

4-1) Train Word2Vec on the data set to obtain the word vector of each word;

4-2) Combined with the candidate emotional word set, the following formula is used to calculate the semantic similarity between words;

4-3) For example, for two n-dimensional word vectors a = (x₁₁, x₁₂, …, x₁ₙ) and b = (x₂₁, x₂₂, …, x₂ₙ), the formula for calculating the semantic similarity is as follows: [formula not reproduced in the source text]

where the result is the semantic similarity value, x₁ₖ is the value of the k-th dimension of word vector a, and x₂ₖ is the value of the k-th dimension of word vector b;

4-4) Construct a semantic similarity matrix based on the calculated semantic similarity.
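A minimal Python sketch of steps 4-2) to 4-4) is given below. It assumes that the similarity measure is cosine similarity, which is the usual choice for Word2Vec vectors but is not confirmed by the text available here; the toy two-dimensional vectors and the function names are likewise illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_matrix(vectors):
    """Return the word order and the pairwise similarity matrix."""
    words = list(vectors)
    matrix = [[cosine_similarity(vectors[u], vectors[v]) for v in words]
              for u in words]
    return words, matrix

# Toy vectors standing in for trained Word2Vec embeddings
vectors = {"好": [1.0, 0.0], "不错": [1.0, 1.0], "差": [0.0, 1.0]}
words, sim = similarity_matrix(vectors)
for word, row in zip(words, sim):
    print(word, [round(v, 3) for v in row])
```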

5. The comment analysis method based on word vectors and syntactic features according to claim 1, wherein step 5) specifically includes:

5-1) Each word is regarded as a node of a graph, and the weight of the edge between two nodes is the semantic similarity between the words they represent;

5-2) Establish the probability transition matrix P according to the following formula: [formula not reproduced in the source text]

where P[i][j] is the similarity transition probability between words i and j, SIM(wᵢ, wⱼ) is the similarity between words i and j, and m is the number of words with the highest semantic similarity to word i;

5-3) Count the frequency, in the original comment data, of all emotional words in the candidate emotional word set, and select the N words with the highest frequency to form seed word set 1; use the emotional vocabulary ontology database to select the words whose ontology intensity is greater than m and that appear in the candidate emotional word set, forming seed word set 2; merge seed word set 1 and seed word set 2 and deduplicate to form the seed word set, which is then manually labeled with sentiment;

5-4) Use the small number of manually labeled seed words to establish an L×C label matrix Y_L, where L is the number of seed words and C is the number of classes, of which there are 3: commendatory, derogatory, and neutral;

5-5) At the same time, use the unlabeled sample words to establish a U×C label matrix Y_U, where U is the number of unlabeled sample words and C is the number of classes, of which there are 3: commendatory, derogatory, and neutral;

5-6) Finally, label the sample words using the LPA label propagation algorithm, and form the final sentiment dictionary after checking against the basic sentiment dictionary.
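The propagation in steps 5-2) to 5-6) can be sketched in Python as below. Several points are assumptions rather than details confirmed by the text: P is built by normalizing each word's similarities over its m most similar neighbours, seed labels are clamped after every iteration (the standard label propagation scheme), and a fixed iteration count stands in for a convergence test. The check against the basic sentiment dictionary is not modelled.

```python
def transition_matrix(sim, m):
    """Row-normalise similarities over each word's m most similar neighbours (assumed form)."""
    n = len(sim)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: sim[i][j], reverse=True)[:m]
        total = sum(sim[i][j] for j in neighbours)
        for j in neighbours:
            P[i][j] = sim[i][j] / total if total > 0 else 0.0
    return P

def propagate(P, seed_labels, n_classes=3, iterations=50):
    """seed_labels maps word index -> class id (0 commendatory, 1 derogatory, 2 neutral)."""
    n = len(P)
    Y = [[0.0] * n_classes for _ in range(n)]
    for i, c in seed_labels.items():
        Y[i][c] = 1.0
    for _ in range(iterations):
        Y = [[sum(P[i][j] * Y[j][c] for j in range(n)) for c in range(n_classes)]
             for i in range(n)]
        for i, c in seed_labels.items():          # clamp the manually labelled seeds
            Y[i] = [1.0 if k == c else 0.0 for k in range(n_classes)]
    return [max(range(n_classes), key=lambda c: Y[i][c]) for i in range(n)]

# Toy similarity matrix for four words; words 0 and 3 are seeds
sim = [[1.0, 0.9, 0.1, 0.05],
       [0.9, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.8],
       [0.05, 0.1, 0.8, 1.0]]
P = transition_matrix(sim, m=2)
labels = propagate(P, seed_labels={0: 0, 3: 1})
print(labels)  # [0, 0, 1, 1]
```

In this toy run the unlabeled word 1 takes the commendatory label of its nearest seed, and word 2 takes the derogatory label.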

6. The comment analysis method based on word vectors and syntactic features according to claim 1, characterized in that, step 6) specifically includes:

The core sentence is obtained by deleting redundancy and retaining the main components related to the evaluation collocation; if the original sentence matches none of the rules, it remains unchanged. The method uses the core sentence to improve the accuracy of syntactic dependency analysis of the evaluation text. The rules include the following:

Rule 1: Delete sentence-initial adverbial components such as "advantages of...", "disadvantages of...", "deficiencies of..." (“…的不足”), "advantages of...", "benefits of...";

Rule 2: Delete words expressing a hypothetical or wishful tendency, such as "if...", "hope..." (“希望”), "if...", "hope...", "suggest...";

Rule 3: Delete sequences such as "is" (“就是”), "is actually", "especially" (“尤其是”), "there is", "especially" at the beginning of the sentence;

Rule 4: Delete the words "feeling" and "thinking";

Rule 5: Delete consecutive punctuation marks except the first one, as well as abnormal characters such as emoticons and brackets.
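Rules 1, 2, 3 and 5 can be sketched with simple string operations as below. The trigger lists contain only the Chinese strings that appear in the worked example of the description (“的不足”, “希望”, “就是”, “尤其是”); the full lists, the handling of Rule 4, and emoticon removal are left out, and the function name `core_sentence` is illustrative.

```python
import re

PUNCT = "，。！？；,.!?;"
# Only the trigger strings attested in the worked example; the full lists are not assumed.
RULE1_TAILS = ("的不足",)
RULE2_WORDS = ("希望",)
RULE3_PREFIXES = ("尤其是", "就是")

def core_sentence(review: str) -> str:
    text = re.sub(r"[（）()]", "", review)                 # rule 5: drop brackets, keep content
    text = re.sub(f"([{PUNCT}])[{PUNCT}]+", r"\1", text)   # rule 5: keep first mark of a run
    pieces = re.split(f"([{PUNCT}])", text)                # even index: clause, odd: mark
    for i in range(0, len(pieces), 2):
        clause = pieces[i]
        for tail in RULE1_TAILS:                           # rule 1: cut through the trigger
            pos = clause.find(tail)
            if pos != -1:
                clause = clause[pos + len(tail):]
        for word in RULE2_WORDS:                           # rule 2: drop leading wish word
            if clause.startswith(word):
                clause = clause[len(word):]
        for prefix in RULE3_PREFIXES:                      # rule 3: drop leading emphasis
            if clause.startswith(prefix):
                clause = clause[len(prefix):]
        pieces[i] = clause
    return "".join(pieces)

review = ("手机收到了，挺不错的，像素和音质都很好，尤其是快递很给力（次日达），"
          "唯一的不足就是包装不是很好，希望店家可以改进一下。。。")
print(core_sentence(review))
```

On the example review this prints the same core sentence as the description: “手机收到了，挺不错的，像素和音质都很好，快递很给力次日达，包装不是很好，店家可以改进一下。”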

7. The comment analysis method based on word vectors and syntactic features according to claim 1, characterized in that, step 7) specifically includes:

The five axioms of dependency syntax:

(1) A sentence has one and only one independent component;

(2) Every other component in the sentence must depend on some component;

(3) No component in a sentence can depend on two or more components at the same time;

(4) In a sentence, if component a directly depends on component b, and component c is located between components a and b, then component c depends on a, on b, or on another component between a and b;

(5) The components on the left and right sides of the central component do not depend on each other;

The characteristics of the dependency tree are:

(1) The nodes of the tree are the components of the sentence;

(2) The root node of the tree is the central component of the entire sentence;

(3) The edges formed between the nodes in the tree are directional, which reflects the asymmetric dependency between components;

(4) Satisfy the five axioms of dependency syntax;
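Part of these constraints can be checked mechanically. The sketch below assumes a head-array encoding (one head index per word, 0 for the root), which satisfies axiom 3 by construction; it verifies axiom 1 (exactly one root) and that every word reaches the root without a cycle. Axioms 4 and 5 are not checked, and the function name is illustrative.

```python
def is_valid_dependency_tree(heads):
    """heads[i] is the 1-based index of the head of word i+1; 0 marks the root."""
    n = len(heads)
    if sum(1 for h in heads if h == 0) != 1:     # axiom 1: exactly one independent component
        return False
    if any(h < 0 or h > n for h in heads):       # every head must be a word or the root
        return False
    for start in range(1, n + 1):                # axiom 2: each word leads to the root
        seen = set()
        node = start
        while node != 0:
            if node in seen:                     # a cycle never reaches the root
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

print(is_valid_dependency_tree([2, 0, 2]))   # True: word 2 is the root
print(is_valid_dependency_tree([2, 1, 0]))   # False: words 1 and 2 depend on each other
print(is_valid_dependency_tree([0, 0, 1]))   # False: two roots
```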

Most dependency relations in review sentences fall into five types: subject-verb (SBV), verb-object (VOB/FOB), attributive (ATT), verb-complement (CMP), and coordinate (COO). Dependency parsing is performed with the LTP dependency parser, and the dependency relations are extracted in combination with the COO algorithm for identifying coordinated evaluation objects and coordinated evaluation words. The COO algorithm specifically includes:

Traversing all words between the two nodes of each SBV, VOB, ATT, and CMP dependency pair obtained from the dependency relations and syntactic features, together with all words related to them in the dependency syntax tree;

Determining whether any of the traversed words is in a COO relation;

Expanding the coordinated evaluation objects and evaluation words of the COO relation.

8. The comment analysis method based on word vectors and syntactic features according to claim 1, wherein step 8) specifically includes:

8-1) According to the characteristics of the Chinese language, evaluation objects are mostly nouns or verbs, and evaluation words are mostly adjectives or verbs;

8-2) Extract the evaluation objects and evaluation words, that is, the commodity attributes and emotional words, according to part of speech;

8-3) Following the dependency syntax tree, traverse the words between the evaluation object and the evaluation word and check for negative words; each negative word found increases the count by 1, and the count accumulates until the traversal ends. The parity of the number of negative words is then determined;

If the number is odd, privative is assigned -1; if it is even, privative is assigned +1;

8-4) Following the dependency syntax tree, traverse the words between the evaluation object and the evaluation word and check for degree words; the degree words found are counted to give the number of degree words for this collocation pair;

8-5) Finally, form the <commodity attribute, negative word, degree word, emotional word> evaluation collocation pairs.

9. The comment analysis method based on word vectors and syntactic features according to claim 1, wherein step 9) specifically includes:

For a commodity attribute a that appears n times, its emotional value is calculated as follows: [formula not reproduced in the source text]

where a.score is the emotional value of commodity attribute a; Xᵢ is the i-th occurrence of the commodity attribute; privative is the value (-1 or +1) derived from the negative words for the i-th occurrence; and degree is the number of degree adverbs for the i-th occurrence. The emotional value of each commodity attribute is calculated in this way, and the values for identical evaluation objects are accumulated;

All extracted evaluation objects are divided into commendatory and derogatory groups, and each group is ordered by bubble sort to give the final result.
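Because the scoring formula is not reproduced in the text, the sketch below only illustrates how the named quantities could be combined. The combination polarity × privative × (1 + degree) is purely an assumption made for this example and should not be read as the claimed formula; the dictionary excerpt `POLARITY` and the function names are hypothetical as well.

```python
POLARITY = {"好": 1, "给力": 1, "差": -1}   # hypothetical sentiment dictionary excerpt

def occurrence_score(pair, polarity=POLARITY):
    """Score one <attribute, privative, degree, emotional word> pair.

    ASSUMPTION: polarity * privative * (1 + degree); the source formula is not available.
    """
    _attribute, privative, degree, word = pair
    return polarity.get(word, 0) * privative * (1 + degree)

def attribute_score(pairs):
    """Accumulate the scores of all occurrences of one attribute."""
    return sum(occurrence_score(p) for p in pairs)

pairs = [("包装", -1, 1, "好"), ("包装", 1, 0, "差")]
print(attribute_score(pairs))  # -3 under the assumed combination
```

Under this assumed combination, negation flips the sign of the dictionary polarity and each degree adverb increases the magnitude by one.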

10. A visual interactive interface, characterized in that it can perform all the steps of claims 1 to 9, displays the emotional values in the form of a histogram, and provides user-friendly interactive functions, including loading, login, logout, password change, and user login status.