CN110175325A - Review analysis method and visual interactive interface based on word vectors and syntactic features - Google Patents

Info

Publication number
CN110175325A
Authority
CN
China
Prior art keywords
word
words
evaluation
emotional
dependency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910343337.5A
Other languages
Chinese (zh)
Other versions
CN110175325B (en)
Inventor
吕奇
沈楠楠
胡新春
陈可佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University filed Critical Nanjing Post and Telecommunication University
Priority to CN201910343337.5A priority Critical patent/CN110175325B/en
Publication of CN110175325A publication Critical patent/CN110175325A/en
Application granted granted Critical
Publication of CN110175325B publication Critical patent/CN110175325B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9532Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Mathematical Physics (AREA)
  • Marketing (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a review analysis method based on word vectors and syntactic features, in the field of data analysis. The method comprises: acquiring review data from product pages of an e-commerce website; preprocessing the acquired target data set; extracting the positive and negative word sets provided by HowNet and NTU to form a basic sentiment dictionary; training word vectors on the preprocessed data set with the Word2Vec tool; building a probability transition matrix from the semantic similarity matrix; processing the acquired product review text according to core-sentence rules; preprocessing the resulting redundancy-free text; extracting <product attribute, negation word, degree word, sentiment word> evaluation collocations from the resulting dependency pairs by part of speech; and combining the evaluation collocations with the sentiment dictionary to compute polarity scores for the evaluation objects and rank their strengths and weaknesses, with the results presented through a visual interactive interface. The method enables accurate, real-time, automatic, and convenient processing and analysis of product review data and can be used on e-commerce platforms.

Description

Review analysis method and visual interactive interface based on word vectors and syntactic features

Technical field

The invention belongs to the technical field of data analysis. In particular, it relates to a sentiment dictionary suited to product reviews that is built from word vectors trained with a neural network model, an attribute recognition algorithm, and a review analysis system based on word vectors and syntactic features.

Background

With the spread of the Internet and the growth of e-commerce, e-commerce websites such as JD.com and Taobao have developed rapidly, and more and more consumers choose to shop online. These websites offer a massive range of products and serve a large user base, which generates an enormous volume of review data. Reviews usually convey the user's subjective impression of the purchase, including how much they like the product and how satisfied they are with the merchant's service. For consumers, review text gives a more objective picture of a product or service and so supports a better-informed choice. For merchants, the experience reported by users helps them improve service or product quality in a targeted way and thereby win more customers and profit. However, as the data volume grows explosively, the cost for a user to obtain useful information from massive review data keeps rising. How to process and analyze user review text quickly and effectively, and extract valuable information from it, therefore has significant practical and research value.

At present, large amounts of review data are not fully used, and consumers find it hard to obtain valuable information from them. We therefore developed a review analysis system based on word vectors and syntactic features. From the analysis results it derives user satisfaction with each product attribute, summarizes the product's strengths and weaknesses, and then visualizes the results.

Summary of the invention

The technical problem addressed by the invention is how to process and analyze product review data accurately, automatically, conveniently, and in real time. To overcome the deficiencies of the prior art, the invention provides a review analysis method based on word vectors and syntactic features.

The invention provides a review analysis method based on word vectors and syntactic features, comprising the following steps:

1) Acquire review data from product pages of an e-commerce website;

2) Preprocess the acquired target data set and build a candidate sentiment word set;

3) Extract the positive and negative word sets provided by HowNet and NTU to form a basic sentiment dictionary;

4) Train word vectors on the preprocessed data set with the Word2Vec tool, obtain the word vectors, and generate a semantic similarity matrix;

5) Build a probability transition matrix from the semantic similarity matrix, then combine it with a seed word set, run the LPA label propagation algorithm, and check the result against the basic sentiment dictionary to produce the final sentiment dictionary;

6) Process the acquired product review text according to core-sentence rules to obtain review text with redundancy removed;

7) Preprocess the redundancy-free text, build a dependency tree from the resulting word-segmented data set based on dependency relations and syntactic features, and generate SBV, VOB, ATT, CMP, and COO dependency pairs;

8) Extract <product attribute, negation word, degree word, sentiment word> evaluation collocations from the dependency pairs by part of speech;

9) Combine the evaluation collocations with the sentiment dictionary, compute polarity scores for the evaluation objects, rank their strengths and weaknesses, and present the results through a visual interactive interface.

As a further refinement of the invention, step 2) specifically comprises:

2-1) Remove illegal characters with a character matching algorithm;

2-2) Segment the original data set into words and tag parts of speech with LTP;

2-3) Extract the words with the required part of speech and deduplicate them to form candidate sentiment word set 1;

2-4) Segment the original data set into words and tag parts of speech with NLPIR;

2-5) Extract the words with the required part of speech and deduplicate them to form candidate sentiment word set 2;

2-6) Merge candidate sentiment word sets 1 and 2 and deduplicate to obtain the candidate sentiment word set.

As a further refinement of the invention, step 3) specifically comprises: extracting the positive and negative words from the HowNet sentiment dictionary and the NTU evaluation word dictionary, merging them, and deduplicating the result to form the basic sentiment dictionary.

As a further refinement of the invention, step 4) specifically comprises:

4-1) Train Word2Vec on the data set to obtain a word vector for each word;

4-2) Using the candidate sentiment word set, compute the semantic similarity between words with the following formula:

4-3) For example, for two n-dimensional word vectors a = (x_11, x_12, …, x_1n) and b = (x_21, x_22, …, x_2n), the semantic similarity is computed as follows:

Here the result is the semantic similarity value, x_1k is the value of the k-th dimension of word vector a, and x_2k is the value of the k-th dimension of word vector b;

4-4) Build the semantic similarity matrix from the computed similarities.
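The formula itself does not survive in this text. As a minimal sketch, assuming the similarity of step 4-3) is the cosine of the angle between the two word vectors (a common choice for Word2Vec vectors, but an assumption here), the matrix of step 4-4) could be built as follows. The function names are illustrative, and the zero diagonal follows the convention stated in the detailed description for the similarity of a word with itself.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)

def similarity_matrix(vectors):
    """Pairwise similarity matrix; the diagonal is fixed at 0."""
    n = len(vectors)
    return [
        [0.0 if i == j else cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
        for i in range(n)
    ]

# Toy 2-dimensional "word vectors" for illustration only
sim = similarity_matrix([[1.0, 2.0], [2.0, 4.0], [2.0, -1.0]])
```

In this toy input the first two vectors are parallel, so their similarity is close to 1, while the third is orthogonal to both.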

As a further refinement of the invention, step 5) specifically comprises:

5-1) Treat each word as a node of a graph, with the weight of the edge between two nodes given by the semantic similarity between the words they represent;

5-2) Build the probability transition matrix P according to the following formula:

where P[i][j] is the similarity transition probability from word i to word j, SIM(w_i, w_j) is the similarity between words i and j, and m is the number of words with the highest semantic similarity to word i;

5-3) Count the frequency in the original review data of every sentiment word in the candidate sentiment word set, and select the N most frequent words to form seed word set 1; using a sentiment vocabulary ontology, select the words whose ontology intensity is greater than m and that also appear in the candidate sentiment word set to form seed word set 2; merge seed word sets 1 and 2, deduplicate to form the seed word set, and label its sentiment manually;

5-4) Use the small number of manually labeled seed words to build an L×C label matrix Y_L, where L is the number of seed words and C is the number of classes, here 3: positive, negative, and neutral;

5-5) At the same time, use the unlabeled sample words to build a U×C label matrix Y_U, where U is the number of unlabeled sample words and C is the number of classes, here 3: positive, negative, and neutral;

5-6) Finally, label the sample words with the LPA label propagation algorithm and check the result against the basic sentiment dictionary to form the final sentiment dictionary.

As a further refinement of the invention, step 6) specifically comprises:

A core sentence is obtained by deleting redundancy and keeping the main components relevant to evaluation collocations; a sentence that matches none of the rules is left unchanged. The method uses core sentences to improve the accuracy of syntactic dependency parsing on review text. The rules are:

Rule 1: Delete sentence-initial adverbial components, such as the sequences “…的优点” (“the advantages of …”), “…的缺点” (“the disadvantages of …”), “…的不足” (“the shortcomings of …”), “…的优势” (“the strengths of …”), and “…的好处” (“the benefits of …”);

Rule 2: Delete sentences with a hypothetical tendency, such as “假如…” (“suppose …”), “希望…” (“hope …”), “如果…” (“if …”), “但愿…” (“if only …”), and “建议…” (“suggest …”);

Rule 3: Delete the sentence-initial sequences “就是” (“it's just that”), “居然是” (“surprisingly”), “特别是” (“especially”), “还有就是” (“and another thing is”), and “尤其是” (“in particular”);

Rule 4: Delete the opinion-holder words “感觉” (“feel”) and “认为” (“think”);

Rule 5: In a run of consecutive punctuation marks, delete all but the first, and delete abnormal characters such as emoticons, kaomoji, and brackets.
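The five rules can be sketched with regular expressions. This is a minimal illustration, not the patent's implementation: the function name `core_sentence`, the exact patterns, the punctuation set, and the choice to treat any occurrence of a hypothetical marker as grounds for dropping the sentence are all assumptions.

```python
import re

# Rule 2 markers; matching them anywhere in the sentence is an assumption.
HYPOTHETICAL = ("假如", "希望", "如果", "但愿", "建议")

def core_sentence(sentence: str) -> str:
    s = sentence.strip()
    # Rule 2: drop the whole sentence if it carries a hypothetical marker.
    if any(marker in s for marker in HYPOTHETICAL):
        return ""
    # Rule 1: strip a sentence-initial "...的优点" style prefix.
    s = re.sub(r"^[^，。！？,.!?]*?的(?:优点|缺点|不足|优势|好处)", "", s)
    # Rule 3: strip sentence-initial filler sequences.
    s = re.sub(r"^(?:还有就是|就是|居然是|特别是|尤其是)", "", s)
    # Rule 4: remove opinion-holder words.
    s = re.sub(r"感觉|认为", "", s)
    # Rule 5a: keep only the first mark of a run of punctuation.
    s = re.sub(r"([，。！？、；：,.!?;:~…])[，。！？、；：,.!?;:~…]+", r"\1", s)
    # Rule 5b: remove brackets and emoji (a hypothetical, partial character set).
    s = re.sub(r"[()（）\[\]【】\U0001F300-\U0001FAFF]", "", s)
    return s

examples = [
    "这款手机的优点屏幕很清晰",
    "如果能便宜点就好了",
    "就是屏幕很清晰！！！",
    "我感觉电池不耐用。。。",
    "物流很快",
]
cleaned = [core_sentence(e) for e in examples]
```

A sentence that matches no rule, such as the last example, passes through unchanged, as the text requires.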

As a further refinement of the invention, step 7) specifically comprises:

The five axioms of dependency syntax:

(1) A sentence has one and only one independent component;

(2) Every other component of the sentence must depend on some component;

(3) No component of the sentence may depend on two or more components at the same time;

(4) If component a depends directly on component b, and component c lies between a and b, then c depends on a, on b, or on some other component between a and b;

(5) Components on the left and right sides of the central component have no dependency relation with each other;

A dependency tree has the following characteristics:

(1) The nodes of the tree are the components of the sentence;

(2) The root of the tree is the central component of the whole sentence;

(3) The edges between nodes are directed, reflecting the asymmetric dependency relation between components;

(4) The tree satisfies the five axioms of dependency syntax;

The great majority of dependency relations in review sentences fall into five types: subject-verb (SBV), verb-object (VOB/FOB), attribute (ATT), verb-complement (CMP), and coordination (COO). Dependency parsing can be performed with the LTP dependency parser, and dependency pairs are extracted in combination with a COO algorithm that identifies coordinated evaluation objects and coordinated evaluation words. The COO algorithm specifically comprises:

Traversing all words between the two nodes of each SBV, VOB, ATT, or CMP dependency pair obtained from the dependency relations and syntactic features, together with all words related to them on the left and right in the dependency syntax tree;

Determining whether any of the traversed words carries a COO relation;

Expanding the pairs with the coordinated evaluation objects and evaluation words given by the COO relation.
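One way to read the COO expansion is sketched below. It assumes the parser output has already been reduced to (evaluation object, evaluation word) pairs plus a list of coordinated word pairs; that data layout and the name `expand_coo` are hypothetical, and no real LTP API is called.

```python
def expand_coo(pairs, coo_arcs):
    """Add a collocation for every word coordinated with an object or evaluation word.

    pairs:    list of (evaluation_object, evaluation_word) tuples
    coo_arcs: list of (word_a, word_b) tuples linked by a COO relation
    """
    # Coordination is treated as symmetric: each word maps to its partners.
    partners = {}
    for a, b in coo_arcs:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)

    expanded = list(pairs)
    seen = set(pairs)
    for target, opinion in pairs:
        candidates = [(t, opinion) for t in sorted(partners.get(target, ()))]
        candidates += [(target, o) for o in sorted(partners.get(opinion, ()))]
        for candidate in candidates:
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded

# "屏幕和电池都很好": the parser yields SBV(屏幕, 好) and COO(屏幕, 电池)
result = expand_coo([("屏幕", "好")], [("屏幕", "电池")])
```

Here the coordinated object 电池 inherits the evaluation word 好, so the expanded list also contains (电池, 好).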

As a further refinement of the invention, step 8) specifically comprises:

8-1) Given the characteristics of Chinese, evaluation objects are mostly nouns or verbs, and evaluation words are mostly adjectives or verbs;

8-2) Extract evaluation objects and evaluation words, that is, product attributes and sentiment words, by part of speech;

8-3) Following the dependency syntax tree, check whether any negation words lie between the evaluation object and the evaluation word. Each negation word found increments the count by 1, accumulating until the traversal ends, after which the parity of the count is checked: if it is odd, the privative value for the collocation is set to -1; if it is even, privative is set to +1;

8-4) Following the dependency syntax tree, check whether any degree words lie between the evaluation object and the evaluation word; if several are found, count them to obtain the number of degree words for the collocation;

8-5) Finally, form the <product attribute, negation word, degree word, sentiment word> evaluation collocation.
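The parity rule of step 8-3) and the count of step 8-4) can be sketched as follows. The word lists `NEGATIONS` and `DEGREES` and the function name `build_collocation` are illustrative assumptions; the words passed in as `between` stand for whatever the dependency tree traversal returns.

```python
# Hypothetical word lists; a real system would load fuller lexicons.
NEGATIONS = {"不", "没", "没有", "无", "非"}
DEGREES = {"很", "非常", "太", "特别", "十分"}

def build_collocation(attribute, sentiment_word, between):
    """Return (attribute, privative, degree, sentiment_word).

    between: words found between the evaluation object and the evaluation word.
    """
    negation_count = sum(1 for w in between if w in NEGATIONS)
    # An odd number of negations flips polarity; an even number cancels out.
    privative = -1 if negation_count % 2 == 1 else 1
    degree = sum(1 for w in between if w in DEGREES)
    return (attribute, privative, degree, sentiment_word)

single = build_collocation("电池", "耐用", ["不", "太"])
double = build_collocation("屏幕", "清晰", ["不", "是", "不"])
```

A double negation yields privative = +1, which matches the parity rule in the text.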

As a further refinement of the invention, step 9) specifically comprises:

For a product attribute a that occurs n times, the polarity score is computed as follows:

where a.score is the sentiment value of product attribute a; i indexes the i-th occurrence of the product attribute; privative is the value (-1 or +1) obtained from the negation words associated with the i-th occurrence; and degree is the number of degree adverbs associated with the i-th occurrence. The sentiment value of each product attribute is computed in this way, accumulating over occurrences of the same evaluation object;

All extracted evaluation objects are divided into positive and negative classes, and each class is ordered with bubble sort to produce the final result.
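The final ranking step can be sketched as below, taking already computed attribute scores as input. Splitting on the sign of the score, listing the most negative attribute first in the negative class, and leaving out attributes that score exactly 0 are assumptions; the names are illustrative.

```python
def bubble_sort(items, key):
    """Bubble sort in descending order of key(item); returns a new list."""
    items = list(items)
    for end in range(len(items) - 1, 0, -1):
        swapped = False
        for j in range(end):
            if key(items[j]) < key(items[j + 1]):
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:
            break  # already ordered; stop early
    return items

def rank_attributes(scores):
    """Split attributes by sign of score and order each class by strength."""
    positive = [(a, s) for a, s in scores.items() if s > 0]
    negative = [(a, s) for a, s in scores.items() if s < 0]
    return (
        bubble_sort(positive, key=lambda pair: pair[1]),
        bubble_sort(negative, key=lambda pair: -pair[1]),
    )

# Hypothetical scores for four attributes plus one neutral attribute
strengths, weaknesses = rank_attributes(
    {"价格": 1, "屏幕": 3, "电池": -2, "物流": -5, "包装": 0}
)
```

Bubble sort is used only because the text names it; any stable sort gives the same ordering.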

A visual interactive interface that can carry out all of the steps above. Besides displaying the sentiment values clearly as bar charts, it adds a number of user-friendly interactive functions, including loading, login, logout, password change, and display of the user's login status.

Compared with the prior art, the invention, by adopting the above technical solution, has the following technical effects:

The invention acquires review data from product pages of an e-commerce website, preprocesses it, and builds a basic sentiment dictionary. It then trains word vectors on the preprocessed data set with the Word2Vec tool, generates a semantic similarity matrix, builds a probability transition matrix from it, and combines the matrix with a seed word set to produce the final sentiment dictionary through the LPA label propagation algorithm. The acquired product review text is processed according to core-sentence rules to remove redundancy; the resulting text is preprocessed, and a dependency tree is built from the word-segmented data set based on dependency relations and syntactic features. From the tree, SBV, VOB, ATT, CMP, and COO dependency pairs are generated and <product attribute, negation word, degree word, sentiment word> evaluation collocations are extracted. Combined with the sentiment dictionary, these are used to compute polarity scores for product attributes and rank their strengths and weaknesses, and the results are presented through a visual interactive interface. The analysis of review data is thus accurate, real-time, automatic, and convenient at the same time.

Description of drawings

Fig. 1 is a flowchart of the invention.

Detailed description

The technical solution of the invention is described in further detail below with reference to the drawing:

The technical solution uses word vectors trained with a neural network model, combined with the LPA label propagation algorithm, to build a sentiment dictionary suited to product reviews. It designs a product attribute recognition and extraction algorithm based on core-sentence rules, dependency relations, and syntactic features. Combining these, it builds a review analysis system based on word vectors and syntactic features, which derives user satisfaction with each product attribute from the analysis results, summarizes the product's strengths and weaknesses, and then visualizes the results.

Referring to Fig. 1, the invention implements a review analysis method based on word vectors and syntactic features, with the following implementation steps:

Step S101: Acquire review data from product pages of an e-commerce website.

In a specific implementation, a review crawling algorithm is designed to collect review data for various products on an e-commerce website and produce the original review data set.

Step S102: Preprocess the acquired target data set and build the candidate sentiment word set.

In a specific implementation, illegal characters are removed from the original data set with a character matching algorithm. LTP is first used for word segmentation and part-of-speech tagging, and the words tagged “a” (adjective) are extracted and deduplicated to form candidate sentiment word set 1. NLPIR is then used for word segmentation and part-of-speech tagging, and the words tagged “a” (adjective) are extracted and deduplicated to form candidate sentiment word set 2. Candidate sentiment word sets 1 and 2 are merged and deduplicated to form the final candidate sentiment word set.
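The filtering and merging in step S102 can be sketched as follows. The sketch assumes each segmenter's output has already been converted to a list of (word, tag) pairs; that layout, the character whitelist, and the function names are hypothetical, and neither LTP nor NLPIR is called here.

```python
import re

# Hypothetical whitelist: CJK characters, ASCII letters and digits, common punctuation.
ILLEGAL = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9，。！？、；：,.!?;: ]")

def remove_illegal(text):
    """Drop every character outside the whitelist."""
    return ILLEGAL.sub("", text)

def candidate_set(tagged_tokens, pos="a"):
    """Keep words with the given part-of-speech tag, deduplicated in first-seen order."""
    return list(dict.fromkeys(word for word, tag in tagged_tokens if tag == pos))

def merge_candidates(set1, set2):
    """Merge two candidate sets and deduplicate, preserving order."""
    return list(dict.fromkeys(list(set1) + list(set2)))

# Toy (word, tag) output standing in for the two segmenters
from_ltp = candidate_set([("屏幕", "n"), ("好", "a"), ("清晰", "a"), ("好", "a")])
from_nlpir = candidate_set([("清晰", "a"), ("差", "a"), ("电池", "n")])
candidates = merge_candidates(from_ltp, from_nlpir)
```

Running two segmenters and taking the union recovers adjectives that either tool alone would miss or tag differently.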

Step S103: Extract the positive and negative word sets provided by HowNet and NTU to form the basic sentiment dictionary.

In a specific implementation, the positive and negative words are extracted from the HowNet sentiment dictionary and the NTU evaluation word dictionary, merged, and deduplicated to form the basic sentiment dictionary.

Step S104: Train word vectors on the preprocessed data set with the Word2Vec tool, obtain the word vectors, and generate the semantic similarity matrix.

In a specific implementation, Word2Vec is trained on the data set with the parameters size=100, window=5, sg=0, and min_count=0, yielding a vector for each word.

Using the candidate sentiment word set, the semantic similarity between words is computed with the following formula.

For example, for two n-dimensional word vectors a = (x_11, x_12, …, x_1n) and b = (x_21, x_22, …, x_2n), the semantic similarity is computed as follows:

Here the result is the semantic similarity value, x_1k is the value of the k-th dimension of word vector a, and x_2k is the value of the k-th dimension of word vector b;

The sentiment words in the candidate set are traversed in order: each is fixed in turn and its similarity to every other sentiment word is computed. With m candidate sentiment words, m*m computations yield an m*m semantic similarity matrix.

To simplify the operations below, the similarity of a sentiment word with itself is defined as 0.

The semantic similarity matrix is built from the computed similarities.

Step S105: Build a probability transition matrix from the semantic similarity matrix, then combine it with the seed word set, run the LPA label propagation algorithm, and check the result against the basic sentiment dictionary to produce the final sentiment dictionary.

In a specific implementation, each word is treated as a node of a graph, with the weight of the edge between two nodes given by the semantic similarity between the words they represent.

The probability transition matrix P is built according to the following formula:

where P[i][j] is the similarity transition probability from word i to word j, SIM(w_i, w_j) is the similarity between words i and j, and m is the number of words with the highest semantic similarity to word i (set manually). The probability transition matrix P is built from this formula.
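The formula for P does not survive in this text. The sketch below is one plausible reading and should be treated as an assumption: for each word i, only its m most similar words receive probability mass, with each entry equal to SIM(w_i, w_j) divided by the sum of those m similarities, and every other entry set to 0. It also assumes the similarities are non-negative.

```python
def transition_matrix(sim, m):
    """Row-normalize each word's top-m similarities into transition probabilities."""
    n = len(sim)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Indices of the m words most similar to word i
        top = sorted(range(n), key=lambda j: sim[i][j], reverse=True)[:m]
        total = sum(sim[i][j] for j in top)
        if total > 0:
            for j in top:
                P[i][j] = sim[i][j] / total
    return P

# Toy similarity matrix with a zero diagonal
sim = [
    [0.0, 0.6, 0.2],
    [0.6, 0.0, 0.3],
    [0.2, 0.3, 0.0],
]
P = transition_matrix(sim, m=2)
```

With the zero diagonal, a word never transfers probability to itself unless m exceeds the number of other words, and each non-empty row of P sums to 1.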

The frequency in the original review data of every sentiment word in the candidate sentiment word set is counted, and the 100 most frequent words are selected to form seed word set 1. Using the sentiment vocabulary ontology of Dalian University of Technology, the words whose ontology intensity is greater than 7 and that also appear in the candidate sentiment word set are selected to form seed word set 2. Seed word sets 1 and 2 are merged and deduplicated to form the seed word set, which is then labeled manually.

The small number of manually labeled seed words is then used to build an L×C label matrix Y_L, where L is the number of seed words and C is the number of classes, typically 3 (positive, negative, neutral). At the same time, the unlabeled sample words are used to build a U×C label matrix Y_U, where U is the number of unlabeled sample words and C is the number of classes, typically 3 (positive, negative, neutral). The two label matrices are combined into an N×C soft label matrix F = [Y_L; Y_U], where N = L + U.

The label propagation algorithm is then run as follows: 1) propagate: F = PF; 2) reset the labels of the labeled samples in F: F_L = Y_L; 3) repeat steps 1) and 2) until F converges.

Step 1 propagates the label (sentiment polarity) of each node (sentiment word) to the other nodes with the probabilities given by the probability transition matrix; the more similar two nodes are, the higher the propagation probability. Step 2 resets the labels of the annotated seed words to their annotated values so that the computation in step 1 does not alter them. In step 3, convergence is tested by computing the matrix similarity between the latest F and F_0, the F produced by the previous iteration; F is considered to have converged once this similarity no longer changes.

Each row of the final matrix F holds three values, the propagated scores of the corresponding sentiment word for the three classes. The class with the largest value is taken as the polarity of that sentiment word.
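The propagation loop and the final row-wise maximum can be sketched as follows, under stated assumptions: labeled words occupy the first rows of F, unlabeled rows start at zero, and convergence is tested with the largest absolute change between iterations rather than the matrix similarity mentioned above. The names `propagate_labels` and `classify` and the toy matrix are hypothetical.

```python
from typing import List

LABELS = ["positive", "negative", "neutral"]


def propagate_labels(
    P: List[List[float]],
    Y_L: List[List[float]],
    n_unlabeled: int,
    tol: float = 1e-9,
    max_iter: int = 1000,
) -> List[List[float]]:
    n_labeled, n_classes = len(Y_L), len(Y_L[0])
    # F = [Y_L; Y_U]; unlabeled rows start at zero (an assumption).
    F = [row[:] for row in Y_L] + [[0.0] * n_classes for _ in range(n_unlabeled)]
    n = len(F)
    for _ in range(max_iter):
        # Step 1: propagate, F = P F.
        new_F = [
            [sum(P[i][k] * F[k][c] for k in range(n)) for c in range(n_classes)]
            for i in range(n)
        ]
        # Step 2: reset the labeled rows, F_L = Y_L.
        for i in range(n_labeled):
            new_F[i] = Y_L[i][:]
        # Step 3: stop once F no longer changes (max absolute difference).
        delta = max(
            abs(new_F[i][c] - F[i][c]) for i in range(n) for c in range(n_classes)
        )
        F = new_F
        if delta < tol:
            break
    return F


def classify(F: List[List[float]]) -> List[str]:
    # Pick the class with the largest propagated value in each row.
    return [LABELS[max(range(len(row)), key=row.__getitem__)] for row in F]


# Words 0 and 1 are seeds (positive, negative); words 2 and 3 are unlabeled.
P = [
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.7, 0.1, 0.0, 0.2],
    [0.1, 0.7, 0.2, 0.0],
]
Y_L = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
F = propagate_labels(P, Y_L, n_unlabeled=2)
polarities = classify(F)
```

In this toy graph word 2 is closest to the positive seed and word 3 to the negative seed, so they inherit those polarities.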

The sentiment words with their assigned polarities are exported as sentiment dictionary 1. Every word in sentiment dictionary 1 is then checked: if the basic sentiment dictionary described in step S103 contains the word and its polarity there contradicts the assigned one, the polarity is changed to that of the basic sentiment dictionary; otherwise the polarity is left unchanged.

After these steps, the revised sentiment dictionary 1 is the final sentiment dictionary.
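The check against the basic sentiment dictionary amounts to a simple override, sketched below. Both dictionaries are represented as plain word-to-polarity mappings, and the function name and sample entries are illustrative assumptions.

```python
from typing import Dict


def reconcile_with_base(
    propagated: Dict[str, str], base: Dict[str, str]
) -> Dict[str, str]:
    """Return sentiment dictionary 1 with conflicts resolved in favor of the base."""
    final = dict(propagated)
    for word, polarity in propagated.items():
        # Only words present in the basic dictionary can be overridden.
        if word in base and base[word] != polarity:
            final[word] = base[word]
    return final


dictionary_1 = {"great": "positive", "shoddy": "positive", "average": "neutral"}
base_dictionary = {"shoddy": "negative", "great": "positive"}
final_dictionary = reconcile_with_base(dictionary_1, base_dictionary)
```

Words absent from the basic dictionary, such as "average" here, keep the polarity assigned by propagation.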

Step S106: Process the obtained product review text according to the core sentence rules to obtain review text with the redundancy removed.

In this embodiment, a product URL is entered in the interactive interface of the system's web page, and the web crawler running in the backend fetches the review data for that product from the e-commerce platform. The system is configured to fetch the first 1000 high-quality reviews of the product.

The crawled product reviews are then processed with the core sentence rules to remove redundancy, keeping only the main components relevant to the evaluation collocations. Take the review "手机收到了,挺不错的,像素和音质都很好,尤其是快递很给力(次日达),唯一的不足就是包装不是很好,希望店家可以改进一下。。。" ("Phone received, quite good, the pixels and sound quality are both very good, especially the delivery is great (next-day delivery), the only shortcoming is just that the packaging is not very good, hope the seller can improve this..."). It is processed as follows:

(1) Rule 1 matches "…的不足" ("the shortcoming of …"); the sentence becomes "Phone received, quite good, the pixels and sound quality are both very good, especially the delivery is great (next-day delivery), just that the packaging is not very good, hope the seller can improve this...";

(2) Rule 2 matches "希望" ("hope"); the sentence becomes "Phone received, quite good, the pixels and sound quality are both very good, especially the delivery is great (next-day delivery), just that the packaging is not very good, the seller can improve this...";

(3) Rule 3 matches "就是" ("just that") and "尤其是" ("especially"); the sentence becomes "Phone received, quite good, the pixels and sound quality are both very good, the delivery is great (next-day delivery), the packaging is not very good, the seller can improve this...";

(4) Rule 5 deletes the consecutive punctuation marks and the brackets; the final core sentence is "手机收到了,挺不错的,像素和音质都很好,快递很给力次日达,包装不是很好,店家可以改进一下。" ("Phone received, quite good, the pixels and sound quality are both very good, the delivery is great next-day delivery, the packaging is not very good, the seller can improve this."). This example is referred to below as the example sentence Sentences.
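The worked example can be reproduced with a few regular expressions, as a rough sketch only. The patterns below are assumptions derived from the trigger phrases listed for rules 1 to 5; following the worked example, rule 2 removes just the marker word rather than the whole clause. The name `core_sentence` is hypothetical.

```python
import re

# A clause starts at the beginning of the text or right after a clause delimiter.
CLAUSE_START = r"(^|[,。!?,.!?])"
PUNCT = ",。!?、;,.!?;"


def core_sentence(text: str) -> str:
    # Rule 1: drop a clause-initial "…的优点/缺点/不足/优势/好处" phrase.
    text = re.sub(
        CLAUSE_START + r"[^,。!?,.!?]*?的(?:优点|缺点|不足|优势|好处)", r"\1", text
    )
    # Rule 2: drop clause-initial hypothetical markers (marker word only).
    text = re.sub(CLAUSE_START + r"(?:假如|希望|如果|但愿|建议)", r"\1", text)
    # Rule 3: drop clause-initial emphasis sequences.
    text = re.sub(
        CLAUSE_START + r"(?:还有就是|居然是|特别是|尤其是|就是)", r"\1", text
    )
    # Rule 4: drop opinion-claim words.
    text = re.sub(r"感觉|认为", "", text)
    # Rule 5: drop brackets, then keep only the first of consecutive punctuation marks.
    text = re.sub(r"[()()]", "", text)
    text = re.sub(rf"([{PUNCT}])[{PUNCT}]+", r"\1", text)
    return text


review = (
    "手机收到了,挺不错的,像素和音质都很好,尤其是快递很给力(次日达),"
    "唯一的不足就是包装不是很好,希望店家可以改进一下。。。"
)
core = core_sentence(review)
```

The rules are applied in a fixed order because rule 3 depends on rule 1: "就是" only becomes clause-initial once "唯一的不足" has been removed.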

Step S107: Preprocess the redundancy-free text, form a dependency tree from the resulting word segmentation data set on the basis of dependency relations and syntactic features, and generate SBV, VOB, ATT, CMP and COO dependency pairs.

In this embodiment, the redundancy-free text obtained in step S106 is preprocessed by splitting it at punctuation marks, which yields 6 clauses. Each clause is segmented and part-of-speech tagged with the LTP tool, and a dependency tree is formed on the basis of dependency relations and syntactic features. The resulting dependency pairs are SBV<phone, received>, SBV<pixels, good>, COO<sound quality, pixels>, SBV<delivery, great>, SBV<packaging, good> and SBV<seller, improve>.

For example, after the clause "像素和音质都很好" ("the pixels and sound quality are both very good") has gone through the steps above, the dependency pairs are extracted again with the COO algorithm, which identifies coordinated evaluation objects and coordinated evaluation words. The resulting dependency pairs are <pixels, good> and <sound quality, good>.
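A simplified sketch of this expansion is given below. It works on lists of already extracted pairs instead of traversing the dependency tree, and it assumes each COO pair is written as <new word, word already in a pair>, as in COO<sound quality, pixels>. The function name `expand_coo` is hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def expand_coo(pairs: List[Pair], coo_pairs: List[Pair]) -> List[Pair]:
    """Let each coordinated word inherit the pairs of the word it is coordinated with."""
    expanded = list(pairs)
    for new_word, known_word in coo_pairs:
        for target, opinion in pairs:
            # Coordinated evaluation object: the new word shares the opinion word.
            if target == known_word and (new_word, opinion) not in expanded:
                expanded.append((new_word, opinion))
            # Coordinated evaluation word: the new word shares the target.
            if opinion == known_word and (target, new_word) not in expanded:
                expanded.append((target, new_word))
    return expanded


result = expand_coo([("pixels", "good")], [("sound quality", "pixels")])
```

Here `result` contains both <pixels, good> and <sound quality, good>, matching the example in the text.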

Step S108: From the dependency pairs, extract by part of speech the evaluation collocations of the form <product attribute, negation word, degree word, sentiment word>.

In this embodiment, for each extracted dependency pair, the words between the evaluation object and the evaluation word are scanned for negation words, which are counted. The parity of that count determines the sign: an odd number of negation words gives privative = -1, and an even number gives privative = +1. The words between the evaluation object and the evaluation word are then scanned for degree words, which are also counted. The result is an evaluation collocation of the form <product attribute, privative, degree, sentiment word>. In the example sentence Sentences from step S106, one negation word "不" ("not") is found between the members of the pair <packaging, good>, so the corresponding privative value is -1. Scanning for degree adverbs between "packaging" and "good" finds "很" ("very"), so the corresponding degree value is 1. The evaluation collocation extracted from this clause is therefore <packaging, -1, 1, good>.
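The parity and counting logic can be sketched as follows. For simplicity the sketch scans the tokens lying between the two words in linear order. The word lists `NEGATIONS` and `DEGREES`, the segmentation of the clause, and the function name `collocation` are illustrative assumptions.

```python
from typing import List, Set, Tuple

NEGATIONS: Set[str] = {"不", "没", "没有"}   # illustrative negation words
DEGREES: Set[str] = {"很", "非常", "太"}      # illustrative degree adverbs


def collocation(
    words: List[str], target: str, opinion: str
) -> Tuple[str, int, int, str]:
    start, end = words.index(target), words.index(opinion)
    between = words[min(start, end) + 1 : max(start, end)]
    # Odd number of negation words flips the polarity; even leaves it unchanged.
    neg_count = sum(1 for w in between if w in NEGATIONS)
    privative = -1 if neg_count % 2 == 1 else 1
    # degree is the number of degree words between the two members of the pair.
    degree = sum(1 for w in between if w in DEGREES)
    return (target, privative, degree, opinion)


# Assumed segmentation of the clause "包装不是很好".
clause = ["包装", "不", "是", "很", "好"]
result = collocation(clause, "包装", "好")
```

For this clause `result` is the tuple ("包装", -1, 1, "好"), the same collocation as <packaging, -1, 1, good> in the text.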

Step S109: Combine the evaluation collocations with the sentiment dictionary, compute a polarity score for each evaluation object, rank the objects from best to worst, and present the result through the visual interactive interface.

In this embodiment, the polarity of the sentiment word in each extracted evaluation collocation is looked up in the sentiment dictionary. The polarity score of each product attribute is then computed according to the following formula:

Applying this calculation to the product attribute "packaging" in the evaluation collocation <packaging, -1, 1, good> obtained in step S108 gives its sentiment value.

All review data obtained for the product is traversed and processed with the steps above, and the scores of identical evaluation objects are accumulated. This yields all product attributes of the product, which are then divided into a positive and a negative group and ordered with bubble sort to give the final result. Finally, the result is presented on the web page through the front end and back end as a visual interactive interface.
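The accumulation and ranking can be sketched as below. The per-occurrence scores are taken as given and are made-up numbers, since the scoring formula is not reproduced here. Dropping attributes whose total is exactly zero is an assumption, and the function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Scored = Tuple[str, float]


def bubble_sort_desc(items: List[Scored], key: Callable[[Scored], float]) -> List[Scored]:
    """Classic bubble sort, largest key first, with early exit when already sorted."""
    items = list(items)
    n = len(items)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            if key(items[j]) < key(items[j + 1]):
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:
            break
    return items


def rank_attributes(occurrence_scores: List[Scored]) -> Tuple[List[Scored], List[Scored]]:
    # Accumulate the scores of identical evaluation objects.
    totals: Dict[str, float] = {}
    for attribute, score in occurrence_scores:
        totals[attribute] = totals.get(attribute, 0.0) + score
    # Split into praised and criticized attributes (zero totals are dropped).
    positive = [(a, s) for a, s in totals.items() if s > 0]
    negative = [(a, s) for a, s in totals.items() if s < 0]
    return (
        bubble_sort_desc(positive, key=lambda p: p[1]),    # most praised first
        bubble_sort_desc(negative, key=lambda p: -p[1]),   # most criticized first
    )


scores = [
    ("pixels", 2.0), ("packaging", -2.0), ("pixels", 1.0),
    ("delivery", 2.0), ("battery", -1.0),
]
praised, criticized = rank_attributes(scores)
```

The two ranked lists correspond to the two groups that the interface would plot.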

The above describes only specific embodiments of the present invention, and the scope of protection of the invention is not limited to them. Any variation or substitution that a person familiar with the art could readily conceive within the technical scope disclosed by the invention falls within the scope of the invention. The scope of protection of the invention is therefore defined by the claims.

Claims (10)

1. A review analysis method based on word vectors and syntactic features, characterized by comprising the following steps:
1) obtaining review data from a product page of an e-commerce website;
2) preprocessing the obtained target data set and building a candidate sentiment word set;
3) extracting the positive and negative word sets provided by Hownet and NTU to form a basic sentiment dictionary;
4) training word vectors on the preprocessed data set with the Word2Vec tool, obtaining the word vectors and generating a semantic similarity matrix;
5) building a probability transition matrix from the semantic similarity matrix and, together with a seed word set, generating the final sentiment dictionary by means of the LPA label propagation algorithm followed by a check against the basic sentiment dictionary;
6) processing the obtained product review text according to core sentence rules to obtain review text with the redundancy removed;
7) preprocessing the redundancy-free text, forming a dependency tree from the resulting word segmentation data set on the basis of dependency relations and syntactic features, and generating SBV, VOB, ATT, CMP and COO dependency pairs;
8) extracting from the dependency pairs, by part of speech, evaluation collocations of the form <product attribute, negation word, degree word, sentiment word>;
9) combining the evaluation collocations with the sentiment dictionary, computing a polarity score for each evaluation object, ranking the objects from best to worst, and presenting the result through a visual interactive interface.

2. The review analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 2) specifically comprises:
2-1) removing illegal characters with a character matching algorithm;
2-2) segmenting and part-of-speech tagging the original data set with LTP;
2-3) extracting the words with the required parts of speech and deduplicating them to form candidate sentiment word set 1;
2-4) segmenting and part-of-speech tagging the original data set with NLPIR;
2-5) extracting the words with the required parts of speech and deduplicating them to form candidate sentiment word set 2;
2-6) combining candidate sentiment word set 1 and candidate sentiment word set 2 and deduplicating them to obtain the candidate sentiment word set.

3. The review analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 3) specifically comprises: extracting the positive and negative words from the Hownet sentiment dictionary and from the NTU evaluation word dictionary, merging them, and deduplicating them to form the basic sentiment dictionary.

4. The review analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 4) specifically comprises:
4-1) training on the data set with Word2Vec to obtain the word vector of each word;
4-2) computing, for the words of the candidate sentiment word set, the semantic similarity between words with the following formula:
4-3) for example, for two n-dimensional word vectors a (x11, x12, … , x1n) and b (x21, x22, … , x2n), the semantic similarity is computed as follows:
where the value of the formula is the semantic similarity, x1k is the value of the k-th dimension of word vector a, and x2k is the value of the k-th dimension of word vector b;
4-4) building the semantic similarity matrix from the computed semantic similarities.

5. The review analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 5) specifically comprises:
5-1) treating each word as a node of a graph, the weight of the edge between two nodes being the semantic similarity between the words those nodes represent;
5-2) building the probability transition matrix P according to the following formula:
where P[i][j] is the similarity-based transition probability from word i to word j, SIM(w_i, w_j) is the similarity between words i and j, and m is the number of words with the highest semantic similarity to word i;
5-3) counting the frequency in the original review data of every sentiment word in the candidate sentiment word set and selecting the N most frequent words as seed word set 1; using a sentiment vocabulary ontology, selecting the words whose ontology intensity is greater than m and which appear in the candidate sentiment word set as seed word set 2; merging seed word set 1 and seed word set 2 and deduplicating them to form the seed word set, which is labeled with sentiment polarity by hand;
5-4) using the small set of manually labeled seed words to build an L×C label matrix Y_L, where L is the number of seed words and C is the number of classes, the 3 classes being positive, negative and neutral;
5-5) likewise using the unlabeled sample words to build a U×C label matrix Y_U, where U is the number of unlabeled sample words and C is the number of classes, the 3 classes being positive, negative and neutral;
5-6) finally labeling the sample words with the LPA label propagation algorithm and, after a check against the basic sentiment dictionary, forming the final sentiment dictionary.

6. The review analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 6) specifically comprises:
A core sentence is obtained by deleting redundancy and keeping the main components relevant to the evaluation collocations; if the original sentence matches none of the rules, it is left unchanged. The method uses core sentences to improve the accuracy of syntactic dependency analysis of review text. The rules are as follows:
Rule 1: delete the sentence-initial adverbial component, such as the sequences "…的优点" ("the merit of …"), "…的缺点" ("the drawback of …"), "…的不足" ("the shortcoming of …"), "…的优势" ("the advantage of …"), "…的好处" ("the benefit of …");
Rule 2: delete sentences with a hypothetical tendency, such as "假如…" ("suppose …"), "希望…" ("hope …"), "如果…" ("if …"), "但愿…" ("if only …"), "建议…" ("suggest …");
Rule 3: delete the sentence-initial sequences "就是" ("just that"), "居然是" ("actually"), "特别是" ("in particular"), "还有就是" ("and also"), "尤其是" ("especially");
Rule 4: delete the opinion-claim words "感觉" ("feel") and "认为" ("think");
Rule 5: delete consecutive punctuation marks other than the first one, as well as abnormal characters such as emoticons, kaomoji and brackets.

7. The review analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 7) specifically comprises:
The five axioms of dependency syntax:
(1) a sentence has one and only one independent component;
(2) every other component of the sentence must depend on some component;
(3) no component of the sentence may depend on two or more components at the same time;
(4) if component a depends directly on component b and component c lies between a and b, then c depends on a, on b, or on some other component between a and b;
(5) the components to the left and to the right of the central component have no dependency relation with each other.
The characteristics of the dependency tree:
(1) the nodes of the tree are the components of the sentence;
(2) the root node of the tree is the central component of the whole sentence;
(3) the edges between nodes are directed, reflecting the asymmetric dependency relation between components;
(4) the tree satisfies the five axioms of dependency syntax.
The great majority of the dependency relations in reviews fall into five types: subject-verb (SBV), verb-object (VOB/FOB), attribute (ATT), verb-complement (CMP) and coordinate (COO). Dependency parsing can be performed with the LTP dependency parser, and the dependency pairs are extracted in combination with the COO algorithm, which identifies coordinated evaluation objects and coordinated evaluation words. The COO algorithm specifically comprises:
traversing, for each SBV, VOB, ATT or CMP dependency pair obtained on the basis of dependency relations and syntactic features, all words between its two nodes and all words related to them on the left and right in the dependency tree;
determining whether any of the traversed words carries a COO relation;
extending the pairs with the coordinated evaluation objects and evaluation words of the COO relation.

8. The review analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 8) specifically comprises:
8-1) given the characteristics of the Chinese language, evaluation objects are mostly nouns or verbs, and evaluation words are mostly adjectives or verbs;
8-2) extracting the evaluation objects and evaluation words, that is, the product attributes and sentiment words, by part of speech;
8-3) traversing the dependency tree to check whether there are negation words between the evaluation object and the evaluation word; each negation word found increases the negation count by 1, the count accumulating over multiple negation words until the traversal ends, after which the parity of the count is determined: if it is odd, privative is assigned -1, and if it is even, privative is assigned +1;
8-4) traversing the dependency tree to check whether there are degree words between the evaluation object and the evaluation word, accumulating the count if several are found, to obtain the number of degree words of the collocation;
8-5) finally forming the evaluation collocation <product attribute, negation word, degree word, sentiment word>.

9. The review analysis method based on word vectors and syntactic features according to claim 1, characterized in that step 9) specifically comprises:
For a product attribute a that occurs n times, the polarity score is computed with the following formula:
where a.score is the sentiment value of product attribute a, X_i is the i-th occurrence of the product attribute, privative is the value (-1 or +1) obtained from the negation words of the i-th occurrence, and degree is the number of degree adverbs of the i-th occurrence; the sentiment value of the product attribute is computed in this way, with identical evaluation objects accumulated;
all extracted evaluation objects are divided into a positive and a negative group and ordered with bubble sort to give the final result.

10. A visual interactive interface, characterized in that it can carry out all the steps of claims 1 to 9 and that, in addition to presenting the sentiment values clearly as a bar chart, it provides a number of user-friendly interactive functions, including loading, login, logout, password change, and display of the user's login status.
CN201910343337.5A 2019-04-26 2019-04-26 Comment analysis method based on word vector and syntactic characteristics and visual interaction interface Active CN110175325B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910343337.5A CN110175325B (en) 2019-04-26 2019-04-26 Comment analysis method based on word vector and syntactic characteristics and visual interaction interface

Publications (2)

Publication Number Publication Date
CN110175325A true CN110175325A (en) 2019-08-27
CN110175325B CN110175325B (en) 2023-07-11

Family

ID=67690209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910343337.5A Active CN110175325B (en) 2019-04-26 2019-04-26 Comment analysis method based on word vector and syntactic characteristics and visual interaction interface

Country Status (1)

Country Link
CN (1) CN110175325B (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110659828A (en) * 2019-09-23 2020-01-07 上海海事大学 A software feature evaluation method based on review data
CN110705266A (en) * 2019-09-09 2020-01-17 创新奇智(南京)科技有限公司 Emotion analysis method and device
CN110706028A (en) * 2019-09-26 2020-01-17 四川长虹电器股份有限公司 Commodity evaluation emotion analysis system based on attribute characteristics
CN110717654A (en) * 2019-09-17 2020-01-21 合肥工业大学 Product quality evaluation method and system based on user comments
CN110750646A (en) * 2019-10-16 2020-02-04 乐山师范学院 Attribute description extracting method for hotel comment text
CN111259661A (en) * 2020-02-11 2020-06-09 安徽理工大学 A new sentiment word extraction method based on product reviews
CN111414753A (en) * 2020-03-09 2020-07-14 中国美术学院 Method and system for vocabulary extraction of product perceptual imagery
CN111523300A (en) * 2020-04-14 2020-08-11 北京精准沟通传媒科技股份有限公司 Vehicle comprehensive evaluation method and device and electronic equipment
CN111639159A (en) * 2020-05-28 2020-09-08 深圳壹账通智能科技有限公司 Real-time generation method and device for phrase dictionary, electronic equipment and storage medium
CN111898928A (en) * 2020-08-18 2020-11-06 哈尔滨工业大学 Multi-party service value-quality-capability index alignment method facing space-time boundary
CN111930941A (en) * 2020-07-31 2020-11-13 腾讯音乐娱乐科技(深圳)有限公司 Method and device for identifying abuse content and server
CN112069312A (en) * 2020-08-12 2020-12-11 中国科学院信息工程研究所 Text classification method based on entity recognition and electronic device
CN112115700A (en) * 2020-08-19 2020-12-22 北京交通大学 An Aspect-Level Sentiment Analysis Method Based on Dependency Syntax Tree and Deep Learning
CN112579776A (en) * 2020-12-21 2021-03-30 北京智齿博创科技有限公司 Automatic labeling method of quality problem scene labels based on categories
CN113327140A (en) * 2021-08-02 2021-08-31 深圳小蝉文化传媒股份有限公司 Video advertisement putting effect intelligent analysis management system based on big data analysis
CN113535901A (en) * 2021-07-08 2021-10-22 北京航空航天大学 E-commerce comment-based user-side commodity knowledge graph construction method
CN114493760A (en) * 2021-12-30 2022-05-13 杭州盟码科技有限公司 E-commerce cloud data analysis method and system
CN114881039A (en) * 2022-05-05 2022-08-09 重庆锐云科技有限公司 Owner portrait method, device and equipment based on customer evaluation and storage medium
CN117436446A (en) * 2023-12-21 2024-01-23 江西农业大学 Agricultural socialized sales service user evaluation data analysis method based on weak supervision
WO2024037483A1 (en) * 2022-08-16 2024-02-22 中国第一汽车股份有限公司 Text processing method and apparatus, and electronic device and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133282A (en) * 2017-04-17 2017-09-05 华南理工大学 A kind of improved evaluation object recognition methods based on two-way propagation
CN107247702A (en) * 2017-05-05 2017-10-13 桂林电子科技大学 A kind of text emotion analysis and processing method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
邓淑卿 等: "基于句法依赖规则和词性特征的情感词识别研究", 《情报理论与实践》 *
陆峰: "基于word2vec扩充情感词典的商品评论倾向分析", 《电脑知识与技术》 *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705266A (en) * 2019-09-09 2020-01-17 创新奇智(南京)科技有限公司 Emotion analysis method and device
CN110717654A (en) * 2019-09-17 2020-01-21 合肥工业大学 Product quality evaluation method and system based on user comments
CN110659828B (en) * 2019-09-23 2022-03-08 上海海事大学 Software feature evaluation method based on comment data
CN110659828A (en) * 2019-09-23 2020-01-07 上海海事大学 A software feature evaluation method based on review data
CN110706028A (en) * 2019-09-26 2020-01-17 四川长虹电器股份有限公司 Commodity evaluation emotion analysis system based on attribute characteristics
CN110750646A (en) * 2019-10-16 2020-02-04 乐山师范学院 Attribute description extracting method for hotel comment text
CN110750646B (en) * 2019-10-16 2022-12-06 乐山师范学院 Attribute description extracting method for hotel comment text
CN111259661A (en) * 2020-02-11 2020-06-09 安徽理工大学 A new sentiment word extraction method based on product reviews
CN111259661B (en) * 2020-02-11 2023-07-25 安徽理工大学 A New Sentiment Word Extraction Method Based on Commodity Reviews
CN111414753A (en) * 2020-03-09 2020-07-14 中国美术学院 Method and system for vocabulary extraction of product perceptual imagery
CN111523300A (en) * 2020-04-14 2020-08-11 北京精准沟通传媒科技股份有限公司 Vehicle comprehensive evaluation method and device and electronic equipment
CN111523300B (en) * 2020-04-14 2021-03-05 北京精准沟通传媒科技股份有限公司 Vehicle comprehensive evaluation method and device and electronic equipment
CN111639159A (en) * 2020-05-28 2020-09-08 深圳壹账通智能科技有限公司 Real-time generation method and device for phrase dictionary, electronic equipment and storage medium
CN111930941A (en) * 2020-07-31 2020-11-13 腾讯音乐娱乐科技(深圳)有限公司 Method and device for identifying abuse content and server
CN112069312A (en) * 2020-08-12 2020-12-11 中国科学院信息工程研究所 Text classification method based on entity recognition and electronic device
CN112069312B (en) * 2020-08-12 2023-06-20 中国科学院信息工程研究所 A text classification method and electronic device based on entity recognition
CN111898928B (en) * 2020-08-18 2021-08-31 哈尔滨工业大学 The Alignment Method of Multi-Party Service Value-Quality-Capability Index Oriented to the Space-Time Boundary
CN111898928A (en) * 2020-08-18 2020-11-06 哈尔滨工业大学 Multi-party service value-quality-capability index alignment method facing space-time boundary
CN112115700A (en) * 2020-08-19 2020-12-22 北京交通大学 An Aspect-Level Sentiment Analysis Method Based on Dependency Syntax Tree and Deep Learning
CN112115700B (en) * 2020-08-19 2024-03-12 北京交通大学 Aspect-level emotion analysis method based on dependency syntax tree and deep learning
CN112579776A (en) * 2020-12-21 2021-03-30 北京智齿博创科技有限公司 Automatic labeling method of quality problem scene labels based on categories
CN113535901A (en) * 2021-07-08 2021-10-22 北京航空航天大学 E-commerce comment-based user-side commodity knowledge graph construction method
CN113535901B (en) * 2021-07-08 2023-08-18 北京航空航天大学 Method for constructing user side commodity knowledge graph based on e-commerce comments
CN113327140B (en) * 2021-08-02 2021-10-29 深圳小蝉文化传媒股份有限公司 Video advertisement putting effect intelligent analysis management system based on big data analysis
CN113327140A (en) * 2021-08-02 2021-08-31 深圳小蝉文化传媒股份有限公司 Video advertisement putting effect intelligent analysis management system based on big data analysis
CN114493760A (en) * 2021-12-30 2022-05-13 杭州盟码科技有限公司 E-commerce cloud data analysis method and system
CN114881039A (en) * 2022-05-05 2022-08-09 重庆锐云科技有限公司 Owner portrait method, device and equipment based on customer evaluation and storage medium
WO2024037483A1 (en) * 2022-08-16 2024-02-22 中国第一汽车股份有限公司 Text processing method and apparatus, and electronic device and medium
CN117436446A (en) * 2023-12-21 2024-01-23 江西农业大学 Agricultural socialized sales service user evaluation data analysis method based on weak supervision
CN117436446B (en) * 2023-12-21 2024-03-22 江西农业大学 Agricultural socialized sales service user evaluation data analysis method based on weak supervision

Also Published As

Publication number Publication date
CN110175325B (en) 2023-07-11

Similar Documents

Publication Publication Date Title
CN110175325B (en) Comment analysis method based on word vector and syntactic characteristics and visual interaction interface
US10748164B2 (en) Analyzing sentiment in product reviews
CN111260437B (en) A Product Recommendation Method Based on Product Aspect-Level Sentiment Mining and Fuzzy Decision-Making
CN104008091B (en) A network text sentiment analysis method based on emotion value
WO2021077973A1 (en) Personalised product description generating method based on multi-source crowd intelligence data
CN103207914B (en) Method and system for generating preference vectors based on user feedback evaluation
CN105930368B (en) A sentiment classification method and system
CN105550269A (en) Product comment analyzing method and system with learning supervising function
CN108763321A (en) A related entity recommendation method based on a large-scale related entity network
CN106096004A (en) A method for building a large-scale cross-domain text sentiment orientation analysis framework
CN108154395A (en) A customer network behavior portrait method based on big data
CN109255027B (en) E-commerce comment sentiment analysis noise reduction method and device
CN103729359A (en) Method and system for recommending search terms
CN110489553B (en) A sentiment classification method based on multi-source information fusion
CN108388660A (en) An improved e-commerce product pain point analysis method
CN104915443B (en) A method for extracting evaluation objects from Chinese microblogs
KR102325022B1 (en) On-line image and review integrated analysis method and system using deep learning-based hybrid analysis method
CN112990973B (en) Online shop portrait construction method and system
CN108596637B (en) Automatic E-commerce service problem discovery system
CN107818173B (en) A Chinese fake comment filtering method based on vector space model
CN106096609A (en) An OCR-based method for automatically generating merchandise query keywords
CN110096587A (en) Fine-grained sentiment classification model using LSTM-CNN word embeddings based on an attention mechanism
CN110706028A (en) Commodity evaluation emotion analysis system based on attribute characteristics
CN113807092A (en) Cigarette brand online comment analysis method based on LDA topic model
CN106021413B (en) A Bootstrap Feature Selection Method and System Based on Topic Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant