CN110175325A - Comment analysis method based on word vectors and syntactic features, and visual interactive interface - Google Patents

Publication number
CN110175325A
Authority
CN
China
Prior art keywords
word
words
evaluation
emotion
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910343337.5A
Other languages
Chinese (zh)
Other versions
CN110175325B (en)
Inventor
吕奇
沈楠楠
胡新春
陈可佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201910343337.5A
Publication of CN110175325A
Application granted
Publication of CN110175325B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Mathematical Physics (AREA)
  • Marketing (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention proposes a comment analysis method based on word vectors and syntactic features, in the field of data analysis, comprising: obtaining commodity page comment data from an e-commerce website; preprocessing the acquired target data set; extracting the commendatory and derogatory word sets provided by HowNet and NTU to form a basic emotion dictionary; training word vectors on the preprocessed data set with the Word2Vec tool; establishing a probability transition matrix from the semantic similarity matrix; processing the acquired commodity comment text based on core sentence rules; preprocessing the resulting redundancy-free text; extracting <commodity attribute, negative word, degree word, emotion word> evaluation matching pairs from the resulting dependency relation pairs by part of speech; and combining the evaluation matching pairs with the emotion dictionary to calculate commendatory and derogatory values for the evaluation objects and rank them by quality, with the results finally presented through a visual interactive interface. The method processes and analyzes commodity comment data accurately, in real time, automatically and conveniently, and can be used on e-commerce platforms.

Description

Comment analysis method based on word vector and syntactic characteristics and visual interactive interface
Technical Field
The invention belongs to the technical field of data analysis, and particularly relates to an emotion dictionary and attribute recognition algorithm which are constructed by using word vectors trained by a neural network model and are suitable for commodity comments, and a comment analysis system based on the word vectors and syntactic characteristics.
Background
With the popularization of the internet and the development of electronic commerce, e-commerce websites such as Jingdong and Taobao have developed rapidly, and more and more consumers choose to shop online; these e-commerce websites have a huge number of commodities and a large user base, and thus generate huge amounts of comment data. The comments given by consumers often carry their subjective feelings about the purchase, including how much they like the purchased commodities, how satisfied they are with the merchant's services, and the like. For consumers, the comment texts help them understand the related goods or services more objectively, so that they can make more suitable choices; for merchants, the experience information about goods or services fed back by users helps them improve the quality of their services or goods in a targeted manner, so as to gain more customers and profits. However, with the explosive growth of the data volume, the cost for users to obtain useful information from massive comment data keeps increasing, so how to quickly and effectively process and analyze users' comment texts and extract the valuable information has important application value and research significance.
At present, a large amount of comment data cannot be fully utilized, and a consumer cannot acquire valuable information from a large amount of comment data. Therefore, a comment analysis system based on word vectors and syntactic characteristics is researched, the satisfaction degree of a user on each attribute of a commodity is obtained according to an analysis result, the advantages and disadvantages of the commodity are summarized, and then data visualization is performed on the analysis result.
Disclosure of Invention
The invention aims to solve the technical problem of how to realize accurate, real-time, automatic and convenient processing and analysis of commodity comment data, and provides a comment analysis method based on word vectors and syntactic characteristics to overcome the defects of the prior art.
The invention provides a comment analysis method based on word vectors and syntactic characteristics, which comprises the following steps:
1) acquiring commodity page comment data of an e-commerce website;
2) preprocessing the acquired target data set and constructing a candidate emotion word set;
3) extracting the commendatory and derogatory word sets provided by HowNet and NTU to form a basic emotion dictionary;
4) carrying out Word vector training on the obtained preprocessed data set through a Word2Vec tool to obtain Word vectors and generate a semantic similarity matrix;
5) establishing a probability transfer matrix by using a semantic similarity matrix, and generating a final emotion dictionary by combining a seed word set through an LPA label propagation algorithm and through basic emotion dictionary inspection;
6) processing the obtained commodity comment text based on the core sentence rule to obtain a comment text with redundancy removed;
7) preprocessing the obtained text without redundancy, forming a dependency relationship tree for the obtained word segmentation data set based on dependency relationship and syntactic characteristics, and generating an SBV, VOB, ATT, CMP and COO dependency relationship pair;
8) for the obtained dependency relationship pair, the evaluation matching pair of the commodity attribute, the negative word, the degree word and the emotional word is extracted through the part of speech;
9) combining the obtained evaluation matching pairs with the emotion dictionary, calculating commendatory and derogatory values for the evaluation objects and ranking them by quality, and finally presenting the results through a visual interactive interface.
As a further limitation of the present invention, step 2) specifically comprises:
2-1) removing illegal characters by using a character matching algorithm;
2-2) carrying out word segmentation and part-of-speech tagging on the original data set by using LTP;
2-3) extracting words with the required parts of speech, and forming candidate emotion word set 1 through deduplication;
2-4) carrying out word segmentation and part-of-speech tagging on the original data set by using NLPIR;
2-5) extracting words with the required parts of speech, and forming candidate emotion word set 2 through deduplication;
2-6) combining the candidate emotion word set 1 and the candidate emotion word set 2, and removing duplication to obtain a candidate emotion word set.
As a further limitation of the present invention, step 3) specifically comprises: using the HowNet emotion dictionary and the NTU evaluation word dictionary, respectively extracting the commendatory and derogatory words in them, combining these words, and removing duplicates to form a basic emotion dictionary.
As a further limitation of the present invention, step 4) specifically comprises:
4-1) utilizing a Word2Vec training data set to obtain Word vectors of words;
4-2) combining the candidate emotion word sets, and calculating the semantic similarity between the words by adopting the following formula:
4-3) for example, for two n-dimensional word vectors $a = (x_{11}, x_{12}, \ldots, x_{1n})$ and $b = (x_{21}, x_{22}, \ldots, x_{2n})$, the semantic similarity calculation formula is as follows:
wherein the result of the formula represents the semantic similarity value; $x_{1k}$ represents the k-th dimension value of the word vector a; $x_{2k}$ represents the k-th dimension value of the word vector b;
4-4) constructing a semantic similarity matrix according to the calculated semantic similarity.
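As a sketch only: assuming the similarity measure of step 4-3) is the cosine of the angle between the two word vectors (the usual choice for Word2Vec embeddings, and an assumption here rather than something the surviving text confirms), it could be written as follows, where $\operatorname{sim}(a, b)$ is a symbol introduced purely for illustration:

```latex
\[
\operatorname{sim}(a, b) =
\frac{\sum_{k=1}^{n} x_{1k}\, x_{2k}}
     {\sqrt{\sum_{k=1}^{n} x_{1k}^{2}}\;\sqrt{\sum_{k=1}^{n} x_{2k}^{2}}}
\]
```

Under this assumption the value lies in [-1, 1], and larger values indicate more similar words.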
As a further limitation of the present invention, step 5) specifically comprises:
5-1) taking each word as a node of the graph, wherein the weight of an edge between two nodes is represented by the semantic similarity between the represented words;
5-2) establishing a probability transition matrix P according to the following formula:
wherein P[i][j] represents the similarity transition probability from word i to word j, $SIM(w_i, w_j)$ represents the similarity of words i and j, and m represents the number of words with the highest semantic similarity to word i;
5-3) counting the word frequency of all the emotion words of the candidate emotion word set in the original comment data, and selecting the N words with the highest word frequency to form seed word set 1; using the emotion vocabulary ontology library, selecting the words in the candidate emotion word set whose emotion intensity is greater than m to form seed word set 2; combining seed word set 1 and seed word set 2, removing duplicates to form the seed word set, and labeling its emotions manually;
5-4) building an L×C label matrix $Y_L$ from the small number of manually labeled seed words, wherein: L represents the number of seed words; C represents the number of classes, of which there are 3: commendatory, derogatory and neutral;
5-5) simultaneously building a U×C label matrix $Y_U$ from the unlabeled sample words, wherein: U represents the number of unlabeled sample words; C represents the number of classes, of which there are 3: commendatory, derogatory and neutral;
5-6) finally, labeling the emotion polarity of the sample words by adopting the LPA label propagation algorithm, and forming the final emotion dictionary after checking against the basic emotion dictionary.
As a further limitation of the present invention, step 6) specifically comprises:
a core sentence is obtained by deleting redundancy and keeping the trunk components related to evaluation matching; if the original sentence does not match any rule, it is kept unchanged. The purpose of using core sentences is to improve the accuracy of syntactic dependency analysis of the evaluation text. The rules are as follows:
rule 1: delete sentence-heading components such as "advantage of …", "disadvantage of …", "deficiency of …", "benefit of …";
rule 2: delete sentence components with a hypothetical tendency, such as "say …", "wish …", "if …", "hope …", "suggest …";
rule 3: delete sequences such as "exactly", "naturally", "particularly", "also exactly", "especially";
rule 4: delete opinion-claim words such as "feel" and "think";
rule 5: delete consecutive punctuation marks other than the first one, as well as abnormal characters such as emoticons, symbols and brackets.
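The five rules above can be sketched as a chain of regular-expression substitutions. This is a minimal illustration, not the patented implementation: the English marker phrases in `RULES`, the function name `core_sentence` and the example sentence are all hypothetical stand-ins, and rule 2 is assumed to delete only the marker phrase rather than the whole sentence.

```python
import re

# Hypothetical English stand-ins for the marker phrases of rules 1-4.
RULES = [
    r"\b(?:the only |the )?(?:disadvantage|advantage|deficiency|benefit)s? (?:is|are)(?: that)? ",  # rule 1
    r"\b(?:hope|wish|suggest)(?: that)? ",                # rule 2 (assumed: drop the marker only)
    r"\b(?:exactly|naturally|particularly|especially) ",  # rule 3
    r"\bI (?:feel|think)(?: that)? ",                     # rule 4
]

def core_sentence(text):
    """Apply rules 1-5 in order; text matching no rule is returned unchanged."""
    for pattern in RULES:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    text = re.sub(r"\s*\([^)]*\)", "", text)          # rule 5: bracketed asides
    text = re.sub(r"([,.!?;])[,.!?;]+", r"\1", text)  # rule 5: keep only the first of repeated punctuation
    return text.strip()

example = "The only deficiency is the package is not good, hope the shop can improve it!!!"
print(core_sentence(example))
```

A real system would need marker lists in the language of the comments and a fuller inventory of abnormal characters; the ordering of the substitutions mirrors the rule numbering only for readability.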
As a further limitation of the present invention, step 7) specifically comprises:
five axioms of dependency syntax:
(1) a sentence has one and only one independent component;
(2) every other component in the sentence must depend on some component;
(3) any component in the sentence cannot depend on two or more components at the same time;
(4) if the component a directly depends on the component b and the component c is positioned between the components a and b in the sentence, the component c depends on the component a or the component b or other components between the components a and b;
(5) the components on the left side and the right side of the central component do not have dependency relationship with each other;
the dependency tree is characterized in that:
(1) the nodes of the tree are the components of the sentence;
(2) the root node of the tree is the central component of the whole sentence;
(3) edges formed between nodes in the tree have directionality, and asymmetric dependency relationships among the components are reflected;
(4) five axioms of dependency syntax are satisfied;
most dependency relations in comment sentences fall into five types, namely the subject-verb relation (SBV), the verb-object relation (VOB/FOB), the attributive relation (ATT), the verb-complement relation (CMP) and the coordinate relation (COO); dependency syntax analysis can be carried out with the LTP dependency parser, and dependency relation pairs are extracted in combination with a COO algorithm that identifies parallel evaluation objects and parallel evaluation words; the COO algorithm specifically comprises the following steps:
traversing all words that have a dependency relation with either node of each SBV, VOB, ATT or CMP dependency pair in the dependency syntax tree obtained based on dependency relations and syntactic characteristics;
judging whether any of the traversed words has a COO relation;
and expanding the pairs with the parallel evaluation objects and evaluation words given by the COO relations.
As a further limitation of the present invention, step 8) specifically comprises:
8-1) according to the characteristics of Chinese language, the evaluation objects are mostly nouns or verbs, and the evaluation words are mostly adjectives or verbs;
8-2) extracting an evaluation object and an evaluation word according to the part of speech, namely commodity attribute and emotion word;
8-3) traversing the dependency syntax tree between each obtained evaluation object and its evaluation word to judge whether negative words exist between them; each negative word found adds 1 to the negative-word count, accumulating until the traversal is finished, after which the parity of the count is judged: if the number of negative words is odd, the negative-word value is -1, and if it is even, the negative-word value is +1;
8-4) traversing the dependency syntax tree between each obtained evaluation object and its evaluation word to judge whether degree words exist between them; if several degree words exist, they are accumulated to obtain the number of degree words of the matching pair;
8-5) finally forming the evaluation matching pairs of the commodity attributes, the negative words, the degree words and the emotional words.
As a further limitation of the present invention, step 9) specifically comprises:
for a commodity attribute a appearing n times, the commendatory and derogatory value calculation formula is as follows:
wherein score is the emotion value of the commodity attribute a; for the i-th occurrence of the commodity attribute, private is the value (-1 or +1) of the corresponding negative words, and degree is the number of corresponding degree adverbs; the emotion value of the commodity attribute is calculated in this way, accumulating over all occurrences of the same evaluation object;
and judging whether each extracted evaluation object is commendatory or derogatory, and sorting the final results by using bubble sort.
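The scoring formula is not legible in this text, so the following is only a sketch under a stated assumption: each occurrence is assumed to contribute `private * (1 + degree) * polarity`, where `polarity` is +1 for a commendatory and -1 for a derogatory emotion word. That weighting, the helper names and the sample data are hypothetical; only the accumulation per attribute and the use of bubble sort come from the description.

```python
def attribute_scores(matching_pairs, polarity):
    """Accumulate an emotion value per commodity attribute.

    ASSUMPTION: each occurrence contributes private * (1 + degree) * polarity;
    the actual formula of the patent is not reproduced here.
    """
    scores = {}
    for attribute, private, degree, word in matching_pairs:
        value = private * (1 + degree) * polarity.get(word, 0)
        scores[attribute] = scores.get(attribute, 0) + value
    return scores

def bubble_sort_desc(items):
    """Bubble sort (attribute, score) pairs from best to worst score."""
    items = list(items)
    n = len(items)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            if items[j][1] < items[j + 1][1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:  # already ordered, stop early
            break
    return items

# Hypothetical <commodity attribute, private, degree, emotion word> pairs.
pairs = [("package", -1, 1, "good"), ("pixel", 1, 0, "good"), ("pixel", 1, 1, "good")]
polarity = {"good": 1}  # hypothetical dictionary lookup: +1 commendatory, -1 derogatory
scores = attribute_scores(pairs, polarity)
ranking = bubble_sort_desc(scores.items())
print(ranking)
```

Attributes with a positive total would be reported as commendatory and those with a negative total as derogatory.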
A visual interactive interface, which can execute all the steps of the claims, displays the emotion values clearly in the form of a bar chart and adds several user-friendly interactive functions, including: loading, logging in, logging out, changing the password, showing the user's login status and the like.
Compared with the prior art, the invention adopting the technical scheme has the following technical effects:
according to the method, commodity page comment data of an e-commerce website is obtained and preprocessed, and a basic emotion dictionary is constructed; word vectors are trained on the preprocessed data set with the Word2Vec tool and a semantic similarity matrix is generated, from which a probability transition matrix is established, and the final emotion dictionary is generated through the LPA (label propagation algorithm) in combination with the seed word set; the obtained commodity comment text is processed based on the core sentence rules to obtain comment text with redundancy removed; the redundancy-free text is preprocessed, a dependency relationship tree is formed for the resulting word segmentation data set based on dependency relations and syntactic characteristics, SBV, VOB, ATT, CMP and COO dependency relation pairs are generated, and <commodity attribute, negative word, degree word, emotion word> evaluation matching pairs are extracted; in combination with the emotion dictionary, commendatory and derogatory values are calculated for the commodity attributes and ranked by quality, and the results are finally presented through a visual interactive interface; comment data analysis is thereby made accurate, real-time, automatic and convenient at the same time.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The technical scheme of the invention is further explained in detail by combining the attached drawings:
the technical scheme of the invention constructs an emotion dictionary suitable for commodity comments by using word vectors trained by a neural network model in combination with the LPA label propagation algorithm; designs a commodity attribute identification and extraction algorithm based on the core sentence rules, dependency relations and syntactic characteristics; and, combining these, constructs a comment analysis system based on word vectors and syntactic characteristics, which obtains the user's satisfaction with each attribute of the commodity from the analysis result, summarizes the advantages and disadvantages of the commodity, and then visualizes the analysis result.
Referring to fig. 1, the comment analyzing method based on word vectors and syntactic characteristics according to the present invention includes the following specific steps:
Step S101: obtaining commodity page comment data of the e-commerce website.
In specific implementation, a comment data crawling algorithm is designed, comment data of various commodities of an e-commerce website is obtained, and an original comment data set is generated.
Step S102: preprocessing the acquired target data set and constructing a candidate emotion word set.
In specific implementation, a character matching algorithm is used for an original data set to remove illegal characters; firstly, performing word segmentation and part-of-speech tagging by using LTP, extracting words with part-of-speech labels of 'a' (adj), and removing duplication to form a candidate emotion word set 1; then, performing word segmentation and part-of-speech tagging by using NLPIR, extracting words with part-of-speech identifiers of 'a' (adj), and performing duplication removal to form a candidate emotion word set 2; and combining the candidate emotion word set 1 and the candidate emotion word set 2, and removing the duplication to form a final candidate emotion word set.
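The merge-and-deduplicate step can be illustrated with a short sketch. The word lists below are hypothetical stand-ins for the adjectives returned by the LTP and NLPIR passes; no segmenter is called here.

```python
def merge_candidates(set1, set2):
    """Union two candidate word lists, dropping duplicates but keeping first-seen order."""
    return list(dict.fromkeys(list(set1) + list(set2)))

# Hypothetical adjective lists from the two segmentation passes.
ltp_adjectives = ["good", "fast", "good", "cheap"]
nlpir_adjectives = ["fast", "poor", "cheap"]

candidates = merge_candidates(ltp_adjectives, nlpir_adjectives)
print(candidates)
```

Running two segmenters and taking the union trades some noise for recall: a word tagged as an adjective by either tool ends up in the candidate set.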
Step S103: extracting the commendatory and derogatory word sets provided by HowNet and NTU to form a basic emotion dictionary.
In specific implementation, the HowNet emotion dictionary and the NTU evaluation word dictionary are used: the commendatory and derogatory words in each are extracted, combined, and deduplicated to form the basic emotion dictionary.
Step S104: carrying out word vector training on the obtained preprocessed data set through the Word2Vec tool to obtain word vectors and generate a semantic similarity matrix.
In specific implementation, Word2Vec is trained on the data set with the training parameters size=100, window=5, sg=0 and min_count=0, and the word vector of each word is obtained through training.
Combining the candidate emotion word set, the semantic similarity between words is calculated by adopting the following formula.
For example, for two n-dimensional word vectors $a = (x_{11}, x_{12}, \ldots, x_{1n})$ and $b = (x_{21}, x_{22}, \ldots, x_{2n})$, the semantic similarity calculation formula is as follows:
wherein the result of the formula represents the semantic similarity value; $x_{1k}$ represents the k-th dimension value of the word vector a; $x_{2k}$ represents the k-th dimension value of the word vector b;
traversing all the emotion words in the candidate emotion word set in sequence, fixing one emotion word at a time, and calculating the similarity between the fixed emotion word and all the other emotion words; supposing that there are m candidate emotion words, an m × m semantic similarity matrix is obtained after m rounds of calculation.
For the convenience of the following operation, it is specified that the similarity between the same emotional words is 0.
And constructing a semantic similarity matrix according to the calculated semantic similarity.
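A minimal sketch of building the similarity matrix follows. It assumes cosine similarity as the measure (an assumption, since the formula is not legible here) and uses toy two-dimensional vectors in place of trained Word2Vec vectors; the zero diagonal follows the convention stated above.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_matrix(vectors):
    """Build the m x m matrix; a word's similarity with itself is fixed at 0."""
    m = len(vectors)
    sim = [[0.0] * m for _ in range(m)]
    for i in range(m):          # fix one word ...
        for j in range(m):      # ... and compare it with every other word
            if i != j:
                sim[i][j] = cosine(vectors[i], vectors[j])
    return sim

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical word vectors
sim = similarity_matrix(toy_vectors)
print(sim)
```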
Step S105: establishing a probability transition matrix by using the semantic similarity matrix, and generating the final emotion dictionary through the LPA label propagation algorithm in combination with the seed word set, followed by a check against the basic emotion dictionary.
In a specific implementation, each word is considered as a node of the graph, and the weight of an edge between two nodes is represented by the semantic similarity between the words represented by the edge.
The probability transition matrix P is established according to the following formula:
wherein P[i][j] represents the similarity transition probability from word i to word j, $SIM(w_i, w_j)$ represents the similarity of words i and j, and m represents the number of words with the highest semantic similarity to word i (set manually).
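One plausible reading of the transition matrix, given the symbols defined above, is a row normalization over the m most similar words. The sketch below assumes P[i][j] = SIM(w_i, w_j) divided by the sum of SIM(w_i, w_k) over the m nearest neighbours k of word i, with zero probability for all other words; that reading, the function name and the toy matrix are assumptions.

```python
def transition_matrix(sim, m):
    """Row-normalize similarities over each word's m most similar neighbours (assumed form)."""
    n = len(sim)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # indices of the m words most similar to word i, excluding i itself
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: sim[i][j], reverse=True)[:m]
        total = sum(sim[i][j] for j in neighbours)
        if total > 0:
            for j in neighbours:
                P[i][j] = sim[i][j] / total
    return P

# Hypothetical symmetric similarity matrix with a zero diagonal.
toy_sim = [
    [0.0, 0.8, 0.2, 0.1],
    [0.8, 0.0, 0.4, 0.3],
    [0.2, 0.4, 0.0, 0.6],
    [0.1, 0.3, 0.6, 0.0],
]
P = transition_matrix(toy_sim, m=2)
print(P[0])
```

Each row of P then sums to 1, so it can be used directly as the propagation weights of one word.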
Counting the word frequency of all emotion words of the candidate emotion word set in the original comment data, and selecting the 100 words with the highest word frequency to form seed word set 1; using the emotion vocabulary ontology library of Dalian University of Technology, selecting the words in the candidate emotion word set whose emotion intensity is greater than 7 to form seed word set 2; combining seed word set 1 and seed word set 2, removing duplicates to form the seed word set, and labeling its emotions manually.
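The seed selection can be sketched as follows. The token list, candidate list and intensity table are hypothetical; in practice the intensities would be read from the ontology library, and the manual labeling step is not shown.

```python
from collections import Counter

def select_seeds(tokens, candidates, intensity, top_n=100, min_intensity=7):
    """Union of the top_n most frequent candidates and the candidates above an intensity threshold."""
    candidate_set = set(candidates)
    counts = Counter(t for t in tokens if t in candidate_set)
    seed_set_1 = [w for w, _ in counts.most_common(top_n)]                       # by word frequency
    seed_set_2 = [w for w in candidates if intensity.get(w, 0) > min_intensity]  # by intensity
    return list(dict.fromkeys(seed_set_1 + seed_set_2))                          # merge and deduplicate

# Hypothetical segmented comment corpus and lexicon intensities.
tokens = ["good", "good", "poor", "fast", "good", "poor"]
candidates = ["good", "poor", "fast", "awful"]
intensity = {"awful": 9, "good": 5}

seeds = select_seeds(tokens, candidates, intensity, top_n=2)
print(seeds)
```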
Then an L×C label matrix $Y_L$ is built from the small number of manually labeled seed words, wherein: L represents the number of seed words; C represents the number of classes, generally 3 (commendatory, derogatory, neutral); meanwhile, a U×C label matrix $Y_U$ is built from the unlabeled sample words, wherein: U represents the number of unlabeled sample words; the two label matrices are combined to obtain an N×C soft label matrix $F = [Y_L; Y_U]$, where N = L + U.
The label propagation algorithm is then executed as follows: 1) propagate: $F = PF$; 2) reset the labels of the labeled samples in F: $F_L = Y_L$; 3) repeat steps 1) and 2) until F converges.
Step 1 propagates the label (emotion attribute) of each node (emotion word) to the other nodes with the probabilities given by the probability transition matrix; the higher the similarity of two nodes, the higher the propagation probability. Step 2 resets the labels of the labeled seed words to their labeled values, so that they are not changed by step 1. In step 3, convergence of F is determined by calculating, after each iteration, the similarity between the latest F and the previous one, $F_0$; once this similarity no longer changes, F is considered to have converged.
And finally, the three numerical values in a single row in the matrix F represent the attribute propagation values of the corresponding emotional words, the maximum numerical value is selected, the corresponding attribute is judged, and the attribute of the emotional words is determined.
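The propagate, clamp and repeat loop can be sketched in a few lines. Several details are assumptions made for illustration: unlabeled rows start at zero, convergence is tested with the largest absolute change between iterations rather than a similarity measure, and the toy transition matrix and seed labels are invented.

```python
CLASSES = ["commendatory", "derogatory", "neutral"]

def label_propagation(P, Y_L, num_unlabeled, tol=1e-9, max_iter=1000):
    """Iterate F <- P F, clamping the seed rows to Y_L, until F stops changing."""
    L, C = len(Y_L), len(CLASSES)
    F = [row[:] for row in Y_L] + [[0.0] * C for _ in range(num_unlabeled)]  # assumed zero init
    n = len(F)
    for _ in range(max_iter):
        # 1) propagate labels along the transition probabilities
        new_F = [[sum(P[i][k] * F[k][c] for k in range(n)) for c in range(C)]
                 for i in range(n)]
        # 2) reset the rows of the labeled seed words
        for i in range(L):
            new_F[i] = Y_L[i][:]
        # 3) convergence check (assumed criterion: largest absolute change)
        delta = max(abs(new_F[i][c] - F[i][c]) for i in range(n) for c in range(C))
        F = new_F
        if delta < tol:
            break
    return F

# Toy graph: words 0 and 1 are seeds, word 2 is unlabeled.
P = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.75, 0.25, 0.0]]
Y_L = [[1.0, 0.0, 0.0],   # word 0 labeled commendatory
       [0.0, 1.0, 0.0]]   # word 1 labeled derogatory
F = label_propagation(P, Y_L, num_unlabeled=1)
label = CLASSES[max(range(len(CLASSES)), key=lambda c: F[2][c])]
print(F[2], label)
```

The final line picks the class with the largest value in the word's row, which is the selection rule described above.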
The emotion words with confirmed attributes are exported to form emotion dictionary 1. All emotion words in emotion dictionary 1 are then traversed: if an emotion word is also contained in the basic emotion dictionary of step S103 and its attribute contradicts the one recorded there, the attribute is changed to the one in the basic emotion dictionary; otherwise the attribute is unchanged.
After the above steps are finished, the modified emotion dictionary 1 is the final emotion dictionary.
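The check against the basic emotion dictionary amounts to letting the basic dictionary win on conflicts. A minimal sketch, with hypothetical dictionary contents and a hypothetical function name:

```python
def check_against_base(propagated, base):
    """Return the propagated dictionary with conflicting labels replaced by the base labels."""
    final = dict(propagated)
    for word, label in propagated.items():
        if word in base and base[word] != label:
            final[word] = base[word]   # the basic emotion dictionary takes precedence
    return final

# Hypothetical labels from propagation and from the basic emotion dictionary.
emotion_dictionary_1 = {"good": "commendatory", "cheap": "derogatory", "laggy": "derogatory"}
basic_dictionary = {"good": "commendatory", "cheap": "commendatory"}

final_dictionary = check_against_base(emotion_dictionary_1, basic_dictionary)
print(final_dictionary)
```

Words absent from the basic dictionary, such as domain-specific terms, keep the label assigned by propagation.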
Step S106: processing the obtained commodity comment text based on the core sentence rules to obtain comment text with redundancy removed.
In specific implementation, the URL of a commodity is entered on the interactive web interface of the system, the comment data of that commodity on the e-commerce platform is crawled by a web crawler designed in the background, and the system is set to crawl the top 1000 high-quality comments of the commodity.
Carrying out redundancy removal processing on the obtained commodity comment data based on the core sentence rule, and reserving trunk components related to evaluation matching; for example: the mobile phone is good in receiving, stiffness and pixel and tone quality, and particularly gives force for express delivery (next day), the only defect is that the package is not good, and a shop can be improved. . . "the treatment is as follows:
(1) matching rule 1, the example sentence is matched to be insufficient of …, the processed example sentence is changed into' the mobile phone is received, the mobile phone is very stiff, the pixel and the tone quality are good, particularly, express delivery is very strong (next day), namely, the package is not good, and a shop can improve the rule. . . ";
(2) and 2, matching rule 2, namely, the hope is matched in the example sentence, the result is changed into' the mobile phone receives the hope after processing, the mobile phone is very stiff, the pixel and the tone quality are good, particularly, the express delivery is very good (next day), namely, the package is not very good, and the shop can improve the result. . . ";
(3) and 3, matching rule 3, namely 'the example sentence is matched with' the example sentence 'and' the example sentence is especially 'after processing, the example sentence is changed into' the example sentence is received by a mobile phone, the example sentence is very stiff, the pixel and the tone quality are good, the express delivery is very powerful (the next day), the package is not very good, and the shop can improve the example sentence once. . . ";
(4) and matching rule 5, deleting continuous punctuation marks from example sentences, and finally processing to obtain a core sentence, wherein the core sentence is' received by the mobile phone, good in shape, good in pixel and tone quality, good in express delivery, poor in package, and capable of being improved by a shop. ", this embodiment is denoted as the example sentence sequences.
Step S107: Preprocess the redundancy-free text, build a dependency tree from the resulting word segmentation data set based on dependency relations and syntactic features, and generate SBV, VOB, ATT, CMP and COO dependency pairs.
In a specific implementation, the redundancy-free text obtained in step S106 is first split into clauses at punctuation marks, yielding 6 clauses. Each clause is segmented and part-of-speech tagged with the LTP tool, and a dependency tree is built based on dependency relations and syntactic features. The dependency pairs obtained are SBV <mobile phone, receive>, SBV <pixel, good>, COO <tone quality, pixel>, SBV <express delivery, excellent>, SBV <package, good> and SBV <shop, improve>.
For example, for the clause "the pixels and tone quality are both good", after the above steps the dependency pairs are extracted again in combination with the COO algorithm for identifying parallel evaluation objects and parallel evaluation words, giving the pairs <pixel, good> and <tone quality, good>.
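The COO expansion in this example can be sketched as follows. The sketch assumes the dependency pairs are already available as `(relation, first, second)` tuples in the order the text writes them, and it only handles SBV pairs; `expand_coo` and the tuple layout are illustrative names, not part of LTP or of the patent.

```python
from typing import Dict, List, Tuple

# (relation, first, second), written the way the text writes REL <first, second>.
Arc = Tuple[str, str, str]


def expand_coo(arcs: List[Arc]) -> List[Tuple[str, str]]:
    """Return <evaluation object, evaluation word> pairs, copied to COO partners."""
    # Map each word to the words that are coordinated with it.
    coordinated: Dict[str, List[str]] = {}
    for rel, first, second in arcs:
        if rel == "COO":
            coordinated.setdefault(second, []).append(first)

    pairs: List[Tuple[str, str]] = []
    for rel, obj, word in arcs:
        if rel != "SBV":  # only SBV pairs are handled in this sketch
            continue
        objects = [obj] + coordinated.get(obj, [])
        words = [word] + coordinated.get(word, [])
        # Every parallel object is paired with every parallel word.
        for o in objects:
            for w in words:
                pairs.append((o, w))
    return pairs
```

Called on `[("SBV", "pixel", "good"), ("COO", "tone quality", "pixel")]`, it yields the two pairs given in the text.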
Step S108: For the obtained dependency pairs, extract evaluation collocation pairs of the form <commodity attribute, negative word, degree word, emotion word> according to part of speech.
In a specific implementation, for each extracted dependency pair, the words between the evaluation object and the evaluation word are traversed to check for negative words, and the number of negative words is counted. The parity of that count determines the negation value: if the count is odd, the negation value is -1; if it is even, the negation value is +1. The words between the evaluation object and the evaluation word are then traversed for degree words, and the number of degree words is counted. Finally, the evaluation collocation pair <commodity attribute, privative, degree, emotion word> is formed, where privative is the negation value and degree is the number of degree words. In the example sentence from step S106, the negative word "not" is identified within the dependency pair <package, good>, so the corresponding privative value is -1; traversing the degree adverbs between "package" and "good" identifies "very", so the corresponding degree value is 1. The evaluation collocation pair extracted from this clause is <package, -1, 1, good>.
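The parity rule and the degree count described above translate directly into code. In this minimal sketch the word lists are hypothetical English stand-ins for the negative-word and degree-word lexicons, and `build_collocation` is an illustrative name.

```python
from typing import Iterable, Tuple

# Hypothetical English stand-ins for the negative-word and degree-word lists.
NEGATION_WORDS = {"not", "no", "never"}
DEGREE_WORDS = {"very", "quite", "extremely"}


def build_collocation(attribute: str, emotion_word: str,
                      between: Iterable[str]) -> Tuple[str, int, int, str]:
    """Build <attribute, privative, degree, emotion word> from the words in between."""
    tokens = list(between)
    negations = sum(1 for t in tokens if t in NEGATION_WORDS)
    # Odd number of negative words -> -1, even number (including zero) -> +1.
    privative = -1 if negations % 2 == 1 else 1
    degree = sum(1 for t in tokens if t in DEGREE_WORDS)
    return (attribute, privative, degree, emotion_word)
```

For the clause "the package is not very good", the words between "package" and "good" are "is", "not" and "very", which gives <package, -1, 1, good>; a double negation yields +1.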
Step S109: Combine the obtained evaluation collocation pairs with the emotion dictionary, perform the commendatory/derogatory calculation and ranking for the evaluation objects, and finally present the results through a visual interactive interface.
In a specific implementation, the commendatory or derogatory polarity of the emotion word in each extracted evaluation collocation pair is looked up in the emotion dictionary. The commendatory/derogatory value of each commodity attribute is then calculated according to the following formula:
For the evaluation collocation pair <package, -1, 1, good> obtained in step S108, the commendatory/derogatory calculation is carried out for the commodity attribute "package" to obtain its emotion value.
All obtained comment data for the commodity are traversed and processed with the steps above, and the values for identical evaluation objects are accumulated. All commodity attributes of the commodity are thus extracted, divided into commendatory and derogatory classes, and ordered using bubble sort to give the final result. Finally, the result is presented on a web page through a visual interactive interface implemented with a front end and a back end.
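The accumulation and ranking can be illustrated with a short sketch. It takes per-occurrence emotion values as given input, because the scoring formula itself is not reproduced in this text; the function names and the choice to leave out attributes whose total is exactly zero are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple


def accumulate(scored: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum the per-occurrence emotion values of identical evaluation objects."""
    totals: Dict[str, float] = {}
    for attribute, value in scored:
        totals[attribute] = totals.get(attribute, 0) + value
    return totals


def bubble_sort_desc(items: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Bubble sort on the emotion value, largest first; returns a new list."""
    items = list(items)
    n = len(items)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            if items[j][1] < items[j + 1][1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:  # already ordered, stop early
            break
    return items


def rank(scored: Iterable[Tuple[str, float]]):
    """Split attributes into commendatory and derogatory lists, each sorted."""
    totals = accumulate(scored)
    commendatory = bubble_sort_desc([kv for kv in totals.items() if kv[1] > 0])
    derogatory = bubble_sort_desc([kv for kv in totals.items() if kv[1] < 0])
    return commendatory, derogatory
```

The two returned lists correspond to the commendatory and derogatory groups that the interface would display.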
The above is only one embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any modification or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention falls within the scope of protection of the present invention, which shall therefore be defined by the claims.

Claims (10)

1. A comment analysis method based on word vectors and syntactic characteristics, characterized by comprising the following steps:
1) acquiring commodity page comment data from an e-commerce website;
2) preprocessing the acquired target data set and constructing a candidate emotion word set;
3) extracting the commendatory and derogatory word sets provided by HowNet and NTU to form a basic emotion dictionary;
4) performing word vector training on the preprocessed data set with the Word2Vec tool to obtain word vectors and generate a semantic similarity matrix;
5) establishing a probability transition matrix from the semantic similarity matrix, and generating the final emotion dictionary by applying the LPA label propagation algorithm to the seed word set and checking the result against the basic emotion dictionary;
6) processing the obtained commodity comment text based on the core-sentence rules to obtain comment text with redundancy removed;
7) preprocessing the redundancy-free text, building a dependency tree from the resulting word segmentation data set based on dependency relations and syntactic features, and generating SBV, VOB, ATT, CMP and COO dependency pairs;
8) for the obtained dependency pairs, extracting evaluation collocation pairs of the form <commodity attribute, negative word, degree word, emotion word> according to part of speech;
9) combining the obtained evaluation collocation pairs with the emotion dictionary, performing the commendatory/derogatory calculation and ranking for the evaluation objects, and finally presenting the results through a visual interactive interface.
2. The method for analyzing comments based on word vectors and syntactic characteristics according to claim 1, wherein the step 2) specifically comprises:
2-1) removing illegal characters by using a character matching algorithm;
2-2) carrying out word segmentation and part-of-speech tagging on the original data set by using LTP;
2-3) extracting the words with the required parts of speech, and forming candidate emotion word set 1 after de-duplication;
2-4) carrying out word segmentation and part-of-speech tagging on the original data set by using NLPIR;
2-5) extracting the words with the required parts of speech, and forming candidate emotion word set 2 after de-duplication;
2-6) combining the candidate emotion word set 1 and the candidate emotion word set 2, and removing duplication to obtain a candidate emotion word set.
3. The method for analyzing comments based on word vectors and syntactic characteristics according to claim 1, wherein step 3) specifically comprises: using the HowNet emotion dictionary and the NTU evaluation word dictionary, respectively extracting the commendatory and derogatory words from them, merging these words, and removing duplicates to form the basic emotion dictionary.
4. The method for analyzing comments based on word vectors and syntactic characteristics according to claim 1, wherein the step 4) specifically comprises:
4-1) training on the data set with Word2Vec to obtain the word vectors of the words;
4-2) for the words in the candidate emotion word set, calculating the semantic similarity between words by adopting the following formula:
4-3) for example, for two n-dimensional word vectors $a = (x_{11}, x_{12}, \dots, x_{1n})$ and $b = (x_{21}, x_{22}, \dots, x_{2n})$, the semantic similarity calculation formula is as follows:
wherein $SIM(a, b)$ represents the semantic similarity value; $x_{1k}$ represents the k-th dimension value of the word vector a; $x_{2k}$ represents the k-th dimension value of the word vector b;
4-4) constructing a semantic similarity matrix according to the calculated semantic similarity.
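As an illustration of steps 4-2) to 4-4), the sketch below builds a similarity matrix from word vectors. The similarity formula is not reproduced in this text, so cosine similarity is used here as an assumption; it is a common choice for Word2Vec vectors and is consistent with the per-dimension symbols defined above, but it may differ from the claimed formula. The function names are illustrative.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors: Dict[str, Sequence[float]]) -> Tuple[List[str], List[List[float]]]:
    """Return the word order and the pairwise similarity matrix in that order."""
    words = list(vectors)
    matrix = [[cosine_similarity(vectors[u], vectors[v]) for v in words]
              for u in words]
    return words, matrix
```

The resulting matrix is symmetric, and its diagonal is 1 for every non-zero vector.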
5. The method for analyzing comments based on word vectors and syntactic characteristics according to claim 1, wherein the step 5) specifically comprises:
5-1) taking each word as a node of the graph, wherein the weight of an edge between two nodes is represented by the semantic similarity between the represented words;
5-2) establishing a probability transition matrix P according to the following formula:
wherein $P[i][j]$ represents the probability of a similarity transition from word i to word j, $SIM(w_i, w_j)$ represents the similarity of words i and j, and m represents the number of words with the highest semantic similarity to word i;
5-3) counting, in the original comment data, the word frequency of every emotion word in the candidate emotion word set, and selecting the N words with the highest word frequency to form seed word set 1; using the emotion vocabulary ontology library, selecting the words that are in the candidate emotion word set and whose emotion intensity in the ontology is greater than m to form seed word set 2; merging seed word set 1 and seed word set 2, removing duplicates to form the seed word set, and labeling the emotion of the seed words manually;
5-4) building an L×C label matrix $Y_L$ from the small number of manually labeled seed words, wherein: L represents the number of seed words; C represents the number of classes, of which there are 3: commendatory, derogatory and neutral;
5-5) at the same time building a U×C label matrix $Y_U$ from the unlabeled sample words, wherein: U represents the number of unlabeled sample words; C represents the number of classes, of which there are 3: commendatory, derogatory and neutral;
5-6) finally, labeling the sample words by adopting the LPA label propagation algorithm, and forming the final emotion dictionary after the result is checked against the basic emotion dictionary.
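Steps 5-2) to 5-6) can be sketched as a small label propagation routine. The transition formula is not reproduced in this text, so the sketch assumes each row keeps the m most similar other words and is normalized to sum to 1. The propagation loop is the standard clamped form of label propagation, with the seed and unlabeled rows held in a single matrix instead of separate $Y_L$ and $Y_U$; the check against the basic emotion dictionary is left out. Class indices 0, 1 and 2 stand for commendatory, derogatory and neutral, and all names are illustrative.

```python
from typing import Dict, List


def transition_matrix(sim: List[List[float]], m: int) -> List[List[float]]:
    """Row-normalize similarities over each word's m most similar neighbours (assumed form)."""
    n = len(sim)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: sim[i][j], reverse=True)[:m]
        total = sum(sim[i][j] for j in neighbours)
        if total > 0:
            for j in neighbours:
                P[i][j] = sim[i][j] / total
    return P


def propagate_labels(P: List[List[float]], seeds: Dict[int, int],
                     num_classes: int = 3, max_iter: int = 100,
                     tol: float = 1e-6) -> List[int]:
    """Spread seed labels along P, clamping the seeds after every update."""
    n = len(P)
    Y = [[0.0] * num_classes for _ in range(n)]
    for i, c in seeds.items():
        Y[i][c] = 1.0
    for _ in range(max_iter):
        new_Y = [[sum(P[i][k] * Y[k][c] for k in range(n))
                  for c in range(num_classes)] for i in range(n)]
        for i, c in seeds.items():  # seed words keep their manual label
            new_Y[i] = [1.0 if d == c else 0.0 for d in range(num_classes)]
        change = max(abs(new_Y[i][c] - Y[i][c])
                     for i in range(n) for c in range(num_classes))
        Y = new_Y
        if change < tol:
            break
    # Each word takes its highest-scoring class; ties go to the lowest index.
    return [max(range(num_classes), key=lambda c: Y[i][c]) for i in range(n)]
```

In a four-word graph where word 0 is seeded as commendatory and word 3 as derogatory, the unlabeled words 1 and 2 take the label of the seed they are most similar to.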
6. The method for analyzing comments based on word vectors and syntactic characteristics according to claim 1, wherein step 6) specifically comprises:
the core sentence is obtained by deleting redundancy and retaining the main components relevant to evaluation collocation; if the original sentence matches none of the rules, it is kept unchanged; the purpose of using the core sentence is to improve the accuracy of the syntactic dependency analysis of the evaluation text; the rules are as follows:
rule 1: deleting sentence-heading components, such as the sequences "advantage of …", "disadvantage of …", "deficiency of …", "advantage of …", "benefit of …";
rule 2: deleting expressions with a hypothetical tendency, such as "say …", "wish …", "if …", "wish …", "suggest …";
rule 3: deleting sequences such as "exactly", "naturally", "particularly", "also exactly", "especially";
rule 4: deleting assertion words such as "feel" and "think";
rule 5: deleting consecutive punctuation marks except the first one, and abnormal characters such as emoticons, special characters and brackets.
7. The method for analyzing comments based on word vectors and syntactic characteristics according to claim 1, wherein step 7) specifically comprises:
the five axioms of dependency syntax:
(1) a sentence has one and only one independent component;
(2) every other component in the sentence must depend on some component;
(3) no component in the sentence may depend on two or more components at the same time;
(4) if component a directly depends on component b, and component c lies between a and b in the sentence, then c depends on a, on b, or on some other component between a and b;
(5) the components on the left side and the right side of the central component have no dependency relationship with each other;
the dependency tree has the following characteristics:
(1) the nodes of the tree are the components of the sentence;
(2) the root node of the tree is the central component of the whole sentence;
(3) the edges between nodes in the tree are directed, reflecting the asymmetric dependency relationships among the components;
(4) the five axioms of dependency syntax are satisfied;
most dependency relations in comment sentences fall into five types, namely the subject-verb relation (SBV), the verb-object relation (VOB/FOB), the attribute relation (ATT), the verb-complement relation (CMP) and the coordinate relation (COO); dependency syntax analysis can be carried out with the LTP dependency parser, and dependency pairs are extracted in combination with the COO algorithm for identifying parallel evaluation objects and parallel evaluation words; the COO algorithm for identifying parallel evaluation objects and parallel evaluation words specifically comprises the following steps:
traversing, in the dependency syntax tree obtained based on dependency relations and syntactic features, all words that have a dependency relation with either of the two nodes of each SBV, VOB, ATT or CMP dependency pair;
judging whether any of the traversed words is in a COO relation;
and expanding the evaluation objects and evaluation words with the parallel items found through the COO relation.
8. The method for analyzing comments based on word vectors and syntactic characteristics according to claim 1, wherein step 8) specifically comprises:
8-1) according to the characteristics of the Chinese language, evaluation objects are mostly nouns or verbs, and evaluation words are mostly adjectives or verbs;
8-2) extracting the evaluation object and the evaluation word, namely the commodity attribute and the emotion word, according to part of speech;
8-3) traversing the dependency syntax tree to determine whether negative words exist between the obtained evaluation object and evaluation word; each time a negative word is found, the negative-word count is increased by 1, and the count keeps accumulating until the traversal is finished, after which the parity of the number of negative words is judged;
if the number of negative words is odd, the corresponding negation value is -1, and if the number of negative words is even, the corresponding negation value is +1;
8-4) traversing the dependency syntax tree to determine whether degree words exist between the obtained evaluation object and evaluation word, and, if several degree words exist, accumulating them to obtain the number of degree words of the collocation pair;
8-5) finally forming the evaluation collocation pair <commodity attribute, negative word, degree word, emotion word>.
9. The method for analyzing comments based on word vectors and syntactic characteristics according to claim 1, wherein step 9) specifically comprises:
for a commodity attribute a that appears n times, the commendatory/derogatory calculation formula is as follows:
wherein score is the emotion value of the commodity attribute a, $X_i$ corresponds to the i-th occurrence of the commodity attribute, privative is the negation value (-1 or +1) corresponding to the i-th occurrence of the commodity attribute, and degree is the number of degree adverbs corresponding to the i-th occurrence of the commodity attribute; the emotion value of the commodity attribute is calculated in this way, and the values of identical evaluation objects are accumulated;
and classifying all the extracted evaluation objects as commendatory or derogatory, and ordering the final results using bubble sort.
10. A visual interactive interface, characterized in that all the steps of claims 1 to 9 can be executed; in addition to displaying the emotion values clearly in the form of a bar chart, a number of user-friendly interactive functions are added, including: loading, logging in, logging out, modifying the password, recording the usage state of the user, and the like.
CN201910343337.5A 2019-04-26 2019-04-26 Comment analysis method based on word vector and syntactic characteristics and visual interaction interface Active CN110175325B (en)

Publications (2)

Publication Number Publication Date
CN110175325A true CN110175325A (en) 2019-08-27
CN110175325B CN110175325B (en) 2023-07-11

Family

ID=67690209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910343337.5A Active CN110175325B (en) 2019-04-26 2019-04-26 Comment analysis method based on word vector and syntactic characteristics and visual interaction interface

Country Status (1)

Country Link
CN (1) CN110175325B (en)

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705266A (en) * 2019-09-09 2020-01-17 创新奇智(南京)科技有限公司 Emotion analysis method and device
CN110717654A (en) * 2019-09-17 2020-01-21 合肥工业大学 Product quality evaluation method and system based on user comments
CN110659828B (en) * 2019-09-23 2022-03-08 上海海事大学 Software feature evaluation method based on comment data
CN110659828A (en) * 2019-09-23 2020-01-07 上海海事大学 Software feature evaluation method based on comment data
CN110706028A (en) * 2019-09-26 2020-01-17 四川长虹电器股份有限公司 Commodity evaluation emotion analysis system based on attribute characteristics
CN110750646A (en) * 2019-10-16 2020-02-04 乐山师范学院 Attribute description extracting method for hotel comment text
CN110750646B (en) * 2019-10-16 2022-12-06 乐山师范学院 Attribute description extracting method for hotel comment text
CN111259661A (en) * 2020-02-11 2020-06-09 安徽理工大学 New emotion word extraction method based on commodity comments
CN111259661B (en) * 2020-02-11 2023-07-25 安徽理工大学 New emotion word extraction method based on commodity comments
CN111414753A (en) * 2020-03-09 2020-07-14 中国美术学院 Method and system for extracting perceptual image vocabulary of product
CN111523300A (en) * 2020-04-14 2020-08-11 北京精准沟通传媒科技股份有限公司 Vehicle comprehensive evaluation method and device and electronic equipment
CN111523300B (en) * 2020-04-14 2021-03-05 北京精准沟通传媒科技股份有限公司 Vehicle comprehensive evaluation method and device and electronic equipment
CN111930941A (en) * 2020-07-31 2020-11-13 腾讯音乐娱乐科技(深圳)有限公司 Method and device for identifying abuse content and server
CN112069312A (en) * 2020-08-12 2020-12-11 中国科学院信息工程研究所 Text classification method based on entity recognition and electronic device
CN112069312B (en) * 2020-08-12 2023-06-20 中国科学院信息工程研究所 Text classification method based on entity recognition and electronic device
CN111898928A (en) * 2020-08-18 2020-11-06 哈尔滨工业大学 Multi-party service value-quality-capability index alignment method facing space-time boundary
CN111898928B (en) * 2020-08-18 2021-08-31 哈尔滨工业大学 Multi-party service value-quality-capability index alignment method facing space-time boundary
CN112115700B (en) * 2020-08-19 2024-03-12 北京交通大学 Aspect-level emotion analysis method based on dependency syntax tree and deep learning
CN112115700A (en) * 2020-08-19 2020-12-22 北京交通大学 Dependency syntax tree and deep learning based aspect level emotion analysis method
CN112579776A (en) * 2020-12-21 2021-03-30 北京智齿博创科技有限公司 Automatic labeling method of quality problem scene labels based on categories
CN113535901B (en) * 2021-07-08 2023-08-18 北京航空航天大学 Method for constructing user side commodity knowledge graph based on e-commerce comments
CN113535901A (en) * 2021-07-08 2021-10-22 北京航空航天大学 E-commerce comment-based user-side commodity knowledge graph construction method
CN113327140B (en) * 2021-08-02 2021-10-29 深圳小蝉文化传媒股份有限公司 Video advertisement putting effect intelligent analysis management system based on big data analysis
CN113327140A (en) * 2021-08-02 2021-08-31 深圳小蝉文化传媒股份有限公司 Video advertisement putting effect intelligent analysis management system based on big data analysis
CN114493760A (en) * 2021-12-30 2022-05-13 杭州盟码科技有限公司 E-commerce cloud data analysis method and system
CN114881039A (en) * 2022-05-05 2022-08-09 重庆锐云科技有限公司 Owner portrait method, device and equipment based on customer evaluation and storage medium
WO2024037483A1 (en) * 2022-08-16 2024-02-22 中国第一汽车股份有限公司 Text processing method and apparatus, and electronic device and medium
CN117436446A (en) * 2023-12-21 2024-01-23 江西农业大学 Weak supervision-based agricultural social sales service user evaluation data analysis method
CN117436446B (en) * 2023-12-21 2024-03-22 江西农业大学 Weak supervision-based agricultural social sales service user evaluation data analysis method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant