CN110837740B - Comment aspect opinion level mining method based on dictionary improvement LDA model - Google Patents


Publication number
CN110837740B
CN110837740B (application CN201911058218.1A)
Authority
CN
China
Prior art keywords
word
words
comment
sentence
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201911058218.1A
Other languages
Chinese (zh)
Other versions
CN110837740A (en)
Inventor
袁凌
冯晋田
李金珊
魏明
杨雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Wuhan Fiberhome Technical Services Co Ltd
Original Assignee
Huazhong University of Science and Technology
Wuhan Fiberhome Technical Services Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology, Wuhan Fiberhome Technical Services Co Ltd filed Critical Huazhong University of Science and Technology
Priority to CN201911058218.1A priority Critical patent/CN110837740B/en
Publication of CN110837740A publication Critical patent/CN110837740A/en
Application granted granted Critical
Publication of CN110837740B publication Critical patent/CN110837740B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/319Inverted lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification


Abstract

The invention discloses a comment aspect-level opinion mining method based on a dictionary-improved LDA model, and belongs to the field of network comment text mining. The method comprises the following steps: constructing an inverted index list from the original network comment library; performing stop-word removal on each sentence of the original network comment library to obtain a preprocessed network comment library; inputting the preprocessed network comment library into an improved LDA model based on SentiWordNet and WordNet, and performing Gibbs sampling to obtain a sampling result; sorting the sampling results, selecting the top m words by probability for each evaluation category, and locating the specific sentences through the inverted index of those words. The invention directly sets the aspects of the network comment library as seed words, so no manual annotation is required. It separates the evaluation object words from the comment opinions, and biases the LDA model parameters by computing the similarity between words and seed words, thereby improving the effect of the model. Based on the inverted index, the clustering result is associated with the seed words and the original text, improving the readability of the result.

Description

Comment aspect-level opinion mining method based on a dictionary-improved LDA model
Technical Field
The invention belongs to the field of network comment text mining, and particularly relates to a comment aspect-level opinion mining method based on a dictionary-improved LDA model.
Background
The rapid development of the mobile internet and the popularization of smart phones make it easy for people to publish comments and opinions anytime and anywhere. People evaluate commodities in different fields on social platforms such as Twitter and Weibo and on online shopping platforms such as Taobao, Amazon and JD.com. Effective analysis of these evaluations can assist manufacturers' decisions on sales and future development, and can help consumers screen for products that meet their expectations. However, merely judging the emotion polarity of comment sentences cannot provide effective information; the objects described by the emotion words must also be determined. Unlike news reports, blogs, etc., web reviews are typically short. Because service contents differ, the domains of the comment objects of network comments also differ, and the comment objects contain many attributes; only by mining aspect-level opinions can the effective information in the comments be grasped.
Aspect-level opinion mining of comments extracts aspect-level comment objects and comment categories from comments, and has important research significance and value. The aspect-level comment object (Opinion Target Expression) is the entity itself, or an attribute of it, modified by an emotional opinion word. In comment mining, merely judging the emotional polarity of a comment sentence is of little use to readers of the comment; people care more about how good or bad the commodity is in a specific aspect. Determining the aspect-level comment objects of a comment is therefore highly significant. Take the commodity comment "the appearance of the mobile phone is average, the battery lasts, and the signal is strong" as an example: if the emotion polarity of the whole sentence is judged directly, a user who does not read the original text only learns that one comment says the mobile phone is good, which is obviously of little value to the user. Therefore, when performing comment mining, the aspect-level comment object words in the comment sentence are extracted first; for the sentence above, the words to be extracted are "appearance", "battery" and "signal". Aspect Category Identification is associated with the aspect-level comment object: besides judging that a word belongs to a certain comment category, a sentence can also be tagged with a comment category.
However, massive comments involve a wide variety of goods, the data annotation required by aspect-level opinion mining is tedious, and establishing a normative annotated corpus for comments in every field would consume a large amount of resources. Supervised methods that rely on annotated datasets are difficult to apply to comment fields lacking annotated corpora. How to improve the effect of the model under weakly supervised and unsupervised conditions, and give it domain adaptability (across both domains and languages), is a topic well worth researching. The prior art includes the MaxEnt-LDA model, which introduces two distributions to indicate the classification of comment object words versus emotion words and the classification of positive versus negative emotion words. However, it has the following drawback: the classifier indicating the classification of comment object words and emotion words uses a maximum entropy model and requires a large amount of labeled data.
Disclosure of Invention
Aiming at the problem that the prior-art MaxEnt-LDA-based aspect-level opinion mining method for comments requires a large amount of labeling of the dataset, the invention provides a comment aspect-level opinion mining method based on a dictionary-improved LDA model, and aims to solve aspect-level opinion mining of network comments using as little labeled data as possible.
To achieve the above object, according to a first aspect of the present invention, there is provided a comment aspect-level opinion mining method based on a dictionary-improved LDA model, the method comprising the steps of:
S1, constructing an inverted index list from the original network comment library;
S2, performing stop-word removal on each sentence of the original network comment library to obtain a preprocessed network comment library;
S3, inputting the preprocessed network comment library into an improved LDA model based on SentiWordNet and WordNet, and performing Gibbs sampling to obtain a sampling result;
S4, sorting the sampling results, selecting the top m words by probability for each evaluation category, and locating the specific sentences through the inverted index of those words.
Specifically, step S1 includes the following sub-steps:
S11, numbering the words of each sentence in the original network comment library as a two-tuple ⟨a, b⟩, wherein a is the number of the sentence in which the word is located, and b is the number of the word within the sentence;
S12, removing repeated words in the original network comment library and recording the numbers of the remaining words;
S13, generating an inverted index list based on the deduplicated word numbers.
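As an illustration of sub-steps S11-S13, the following sketch (a minimal example under assumptions: toy sentences, whitespace tokenization, and invented names, not the patented implementation) builds such an inverted index of ⟨sentence number, word position⟩ pairs:

```python
from collections import defaultdict

def build_inverted_index(sentences):
    """Map each distinct word to the list of (sentence_no, position) pairs
    where it occurs (1-based, as in the two-tuple numbering of S11)."""
    index = defaultdict(list)
    for a, sentence in enumerate(sentences, start=1):       # a: sentence number
        for b, word in enumerate(sentence.lower().split(), start=1):  # b: position
            index[word].append((a, b))
    return dict(index)

reviews = [
    "the staff was friendly and the service was excellent",
    "good service but the location is not convenient",
]
index = build_inverted_index(reviews)
print(index["service"])  # [(1, 7), (2, 2)]
```

Keeping the position alongside the sentence number is what later lets step S4 jump from a clustered word back to the sentences that contain it.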
Specifically, step S3 includes the following sub-steps:
S31, directly setting the aspects of the network comment library as seed words;
S32, dividing the comment texts in the network comment library into sentences to form a comment text sentence set;
S33, setting a different parameter α_d for each sentence based on the similarity between its words and the seed words; based on the semantic similarity between words and the seed words, setting for each topic t the parameters β_{t,A}, β_{t,P} and β_{t,N} for the aspect-level object words, the positive opinion words and the negative opinion words respectively;
S34, performing parameter estimation and inference on the improved LDA model based on SentiWordNet and WordNet by Gibbs-sampling the comment text sentence set.
Specifically,

α_{d,t} = α_base · (1/N_d) · Σ_{i=1}^{N_d} sim(w_{d,i}, t),  t = 1, …, T
β_{t,A} = sim(w, A) · β_base
β_{t,P} = sim(w, P) · β_base
β_{t,N} = sim(w, N) · β_base

wherein N_d is the number of words in the current sentence, T is the number of topics, w_{d,i} is the i-th word in the current sentence, t is the seed word, sim(w, t) is the semantic similarity between w and the seed word t, and α_base is the fixed parameter of the Dirichlet distribution obeyed by topics in the standard LDA model; sim(w, A) is the probability that w is an object word, sim(w, P) the probability that w is a positive word, sim(w, N) the probability that w is a negative word, and β_base is the fixed parameter of the Dirichlet distribution obeyed by words in the standard LDA model.
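To make the parameter biasing concrete, here is a small self-contained sketch (the similarity values are invented, the averaging form for α_{d,t} is only one plausible reading of the source, and in the patent sim(·,·) would come from WordNet):

```python
def biased_alpha(sims_to_seed, alpha_base):
    """alpha_{d,t}: scale the base Dirichlet parameter by the mean
    similarity between the sentence's N_d words and topic t's seed word
    (an assumed form; the source renders the formula only as an image)."""
    return alpha_base * sum(sims_to_seed) / len(sims_to_seed)

def biased_beta(sim, beta_base):
    """beta_{t,A} = sim(w, A) * beta_base, and likewise for P and N."""
    return sim * beta_base

# hypothetical similarities of a 3-word sentence to the seed word of topic t
alpha_dt = biased_alpha([0.9, 0.2, 0.4], alpha_base=0.1)
beta_tA = biased_beta(0.8, beta_base=0.01)
print(round(alpha_dt, 3), round(beta_tA, 3))  # 0.05 0.008
```

A sentence whose words are close to a topic's seed word thus gets a larger prior weight for that topic, which is the biasing effect described above.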
Specifically, step S34 includes the steps of:
(1) randomly assign a topic number to each word of each sentence in the corpus, and randomly set the values of the indicator variables y and v for all words in the sentence, wherein y = A means the current word is an aspect-level object of the comment, y = O means it is a comment opinion, v = P means it is a positive emotion, and v = N means it is a negative emotion;
(2) rescan the corpus, resample and update the topic number of each word according to formula (1), update the word counts of the corpus, and resample and update the indicator variables y and v according to formulas (2) and (3):

p(z_{d,n} = t | z_{¬i}, y, v, w) ∝ (n_{d,t}^{¬i} + α_{d,t}) · (N_{t,c,w_{d,n}}^{¬i} + β_{t,c,w_{d,n}}) / Σ_{v=1}^{V} (N_{t,c,v}^{¬i} + β_{t,c,v})   (1)
p(y_{d,n} = q | ·) ∝ π_{d,n}(q) · (N_{t,q,w_{d,n}}^{¬i} + β_{t,q,w_{d,n}}) / Σ_{v=1}^{V} (N_{t,q,v}^{¬i} + β_{t,q,v})   (2)
p(v_{d,n} = u | ·) ∝ Ω_{d,n}(u) · (N_{t,u,w_{d,n}}^{¬i} + β_{t,u,w_{d,n}}) / Σ_{v=1}^{V} (N_{t,u,v}^{¬i} + β_{t,u,v})   (3)

wherein z_{d,n} is the topic of the n-th word of the d-th comment sentence, t is the topic number, y_{d,n} indicates whether the n-th word of the d-th sentence is an aspect-level object word or an emotional opinion word, and v_{d,n} indicates whether the n-th word of the d-th sentence is a positive or a negative emotion word; c is the word-distribution category determined by the current values of y_{d,n} and v_{d,n}; N_{t,q,v}^{¬i} is the number of times word v is assigned topic t and category q, and β_{t,q,v} is the Dirichlet distribution parameter of word v with topic t and category q; N_{t,u,v}^{¬i} and β_{t,u,v} are the corresponding count and Dirichlet parameter for category u; V is the number of words in the corpus; w_{d,n} is the n-th word of the d-th comment sentence; n_{d,t} is the number of words of the d-th sentence whose topic is t; α_{d,t} is the Dirichlet distribution parameter for topic t of the d-th comment sentence; and ¬i denotes exclusion of the current word i;
(3) repeating the resampling of the corpus until Gibbs sampling converges;
(4) count the topic of each word of each sentence in the corpus to obtain the sentence-topic probability distribution θ_d, and count the distribution of words under each topic to obtain the topic-word probability distributions φ_{t,A}, φ_{t,P} and φ_{t,N}.
Specifically, the probability distribution of topic t for sentence d is calculated as:

θ_{d,t} = (n_{d,t} + α_{d,t}) / (n_d + Σ_{t'=1}^{T} α_{d,t'})

With t as the topic, the probability that word w_{d,n} is an aspect-level object word of the evaluation is:

φ_{t,A,w_{d,n}} = (N_{t,A,w_{d,n}} + β_{t,A,w_{d,n}}) / Σ_{v=1}^{V} (N_{t,A,v} + β_{t,A,v})

the probability that it is a positive opinion word of the evaluation is:

φ_{t,P,w_{d,n}} = (N_{t,P,w_{d,n}} + β_{t,P,w_{d,n}}) / Σ_{v=1}^{V} (N_{t,P,v} + β_{t,P,v})

and the probability that it is a negative opinion word of the evaluation is:

φ_{t,N,w_{d,n}} = (N_{t,N,w_{d,n}} + β_{t,N,w_{d,n}}) / Σ_{v=1}^{V} (N_{t,N,v} + β_{t,N,v})

wherein n_d is the number of words in the d-th sentence.
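The posterior estimates above reduce to simple count arithmetic; the sketch below (made-up counts and invented function names, not the patented code) mirrors the θ and φ formulas:

```python
def sentence_topic_dist(n_dt, alpha_d):
    """theta_{d,t} = (n_{d,t} + alpha_{d,t}) / (n_d + sum_t' alpha_{d,t'})."""
    denom = sum(n_dt) + sum(alpha_d)
    return [(n + a) / denom for n, a in zip(n_dt, alpha_d)]

def topic_word_dist(counts, betas):
    """phi_{t,c,v} = (N_{t,c,v} + beta_{t,c,v}) / sum_v (N_{t,c,v} + beta_{t,c,v})."""
    denom = sum(c + b for c, b in zip(counts, betas))
    return [(c + b) / denom for c, b in zip(counts, betas)]

# one sentence over 2 topics; a 3-word vocabulary for category A of topic t
theta_d = sentence_topic_dist([3, 1], alpha_d=[0.1, 0.1])
phi_tA = topic_word_dist([5, 0, 2], betas=[0.01, 0.01, 0.01])
```

Both results are proper probability distributions (they sum to 1), smoothed by the biased Dirichlet parameters.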
Specifically, the document generation process of the improved LDA model based on SentiWordNet and WordNet is as follows:
(1) sample from the Dirichlet distributions β_{t,A}, β_{t,P}, β_{t,N} to generate the aspect-level object distribution of comments φ_{t,A} ~ Dir(β_{t,A}), the positive opinion word distribution of comments φ_{t,P} ~ Dir(β_{t,P}) and the negative opinion word distribution of comments φ_{t,N} ~ Dir(β_{t,N});
(2) for each sentence, sample from the Dirichlet distribution α_d to generate the topic distribution θ_d ~ Dir(α_d);
(3) sample from the topic multinomial distribution θ_d to generate the topic z_{d,n} ~ Multi(θ_d) of word w_{d,n} in sentence d;
(4) calculate from WordNet and SentiWordNet the parameter π_{d,n} of a Bernoulli distribution on {0,1} and the parameter Ω_{d,n} of a Bernoulli distribution on {0,1};
(5) draw from the Bernoulli distribution with parameter π_{d,n} the indicator y_{d,n} telling whether word w_{d,n} is a comment aspect-level object word or a comment opinion word; draw from the Bernoulli distribution with parameter Ω_{d,n} the indicator v_{d,n} telling whether it is a positive or a negative opinion word;
(6) generate the word w_{d,n} according to:

w_{d,n} ~ Multi(φ_{t,A}) if y_{d,n} = A; w_{d,n} ~ Multi(φ_{t,P}) if y_{d,n} = O and v_{d,n} = P; w_{d,n} ~ Multi(φ_{t,N}) if y_{d,n} = O and v_{d,n} = N.
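The generation process (1)-(6) above can be sketched end to end with the standard library (illustrative only: the vocabulary, priors and Bernoulli parameters are invented, and in the model π_{d,n} and Ω_{d,n} would come from WordNet/SentiWordNet):

```python
import random

rng = random.Random(0)

def dirichlet(params):
    """Draw from Dir(params) by normalizing independent Gamma variates."""
    xs = [rng.gammavariate(p, 1.0) for p in params]
    s = sum(xs)
    return [x / s for x in xs]

vocab = ["staff", "friendly", "rude", "food", "tasty", "bland"]
T = 2
# (1) per-topic word distributions for object / positive / negative words
phi_A = [dirichlet([0.1] * len(vocab)) for _ in range(T)]
phi_P = [dirichlet([0.1] * len(vocab)) for _ in range(T)]
phi_N = [dirichlet([0.1] * len(vocab)) for _ in range(T)]
# (2) topic distribution of one sentence, (3) topic of one word
theta_d = dirichlet([0.5] * T)
z = rng.choices(range(T), weights=theta_d)[0]
# (4)-(5) indicator draws; the parameters here are stand-ins
pi_dn, omega_dn = 0.6, 0.7
y = "A" if rng.random() < pi_dn else "O"
v = "P" if rng.random() < omega_dn else "N"
# (6) emit the word from the distribution selected by (y, v)
dist = phi_A[z] if y == "A" else (phi_P[z] if v == "P" else phi_N[z])
w = rng.choices(vocab, weights=dist)[0]
print(z, y, v, w)
```

The key design point visible here is that the indicators (y, v) pick which of the three topic-word distributions emits the word, which is how object words and positive/negative opinion words are separated.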
Specifically, step (4) includes the steps of:
(4.1) query in WordNet the interpretations s_{d,n,k} of the current word w_{d,n}; compute the similarity Sim(s_{d,n,k}, s_{t,k0}) between each semantic s_{d,n,k} of w_{d,n} and the semantic s_{t,k0} of each seed word w_t; take the maximum over all the computed similarities, and the k′ at which the maximum is attained determines the semantic s_{d,n,k′} of the current word w_{d,n} in the sentence;
(4.2) query in SentiWordNet the emotion scores of s_{d,n,k′}: the objective score score_{d,n}^O, the positive score score_{d,n}^P and the negative score score_{d,n}^N;
(4.3) calculate the parameters π_{d,n} and Ω_{d,n} from the emotion scores.
Specifically,

π_{d,n} = score_{d,n}^O
Ω_{d,n} = score_{d,n}^P / (score_{d,n}^P + score_{d,n}^N)
to achieve the above object, according to a second aspect of the present invention, there is provided a computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program, which when executed by a processor, implements a review aspect opinion-level mining method based on a dictionary-improved LDA model as described in the first aspect.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) In the standard LDA model, the topic words in the finally obtained topic-document and topic-word probabilities are not determined in advance and must be screened and identified manually. The invention directly sets the aspects of the network comment library as seed words, so no manual annotation is required.
(2) The invention adds an opinion word and evaluation object classification layer on top of the LDA model to separate opinions from aspects. With user-configured seed words and the two tools WordNet and SentiWordNet, it solves aspect-level opinion mining with little supervision: the similarity between each word in the text and the seed words is calculated with WordNet's similarity tools and reflected in the LDA model parameters. Meanwhile, SentiWordNet's lexical emotion calculation is used to separate the evaluation object words from the comment opinions and to classify the comment opinions into positive and negative polarity. Biasing the LDA model parameters by the similarity between corpus words and seed words improves the effect of the model.
(3) Aiming at the problem that the final result of the standard LDA model consists only of topic-document and topic-word probabilities and lacks the relation between words and documents, the invention uses the inverted index to look up the sentences a word belongs to, establishes the relation between the clustering result, the seed words and the original text, and improves the readability of the result.
Drawings
FIG. 1 is a flowchart of a comment aspect-level opinion mining method based on a dictionary-improved LDA model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the document generation process of the SWLDA model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
First, terms and variables involved in the invention are explained as follows:
Comment: a person's opinion of something, together with the explanation of that opinion.
Aspect: the attributes and other specific facets of the comment object discussed in the comment text. When a commentator comments on an object, he first determines the aspect to be discussed, selects a corresponding word to represent that aspect, and then, according to his view of that aspect, selects an opinion word with a specific emotional tendency to evaluate it. For example, when a reviewer evaluates a hotel, words such as "price" and "value" represent the "price" aspect to be discussed, and words such as "high" and "cheap" are then selected to express the viewpoint.
Aspect-level opinion mining of comments: using text analysis and mining techniques, find the comment aspects involved in the comment text, analyze the commentator's emotional tendency toward each aspect of the comment object, and present the analysis result in some form. In aspect-level opinion mining, the aspect, the aspect-level comment object and the opinion of the comment need to be extracted from the text information. For example, in restaurant reviews, "food", "service" and "environment" are comment aspects, "steak" is an aspect-level comment object under the "food" aspect, and "good" applied to the "steak" is a comment opinion.
Topic model: a model that establishes a bridge of "topics" between comments and words.
TABLE 1 Legend (the notation table is rendered as images in the original document)
Given that the aspects of the comments in the network comment library are known, the invention extracts the aspect-level comment object words and emotion opinion words of each aspect, and at the same time determines the comment aspect to which each sentence of the corpus belongs. To perform comment-oriented emotion analysis, the relation between the emotion opinion words, the aspect-level comment object words and the sentences they belong to needs to be studied further.
As shown in FIG. 1, the invention provides a comment aspect-level opinion mining method based on a dictionary-improved LDA model, which comprises the following steps:
s1, constructing an inverted index list based on an original network comment library.
The network comment information contains a large amount of viewpoint information, and aspect-level viewpoint mining is needed to analyze comments of commentators on various aspects of the comment object so as to obtain complete knowledge of the comment object. Unlike text information such as documents, blogs, news, etc., web reviews are often short and often appear in sentences.
In this embodiment, an English restaurant review set is selected, and two sentences are used as running examples for the subsequent processing: sentence 1, a review praising the place, the staff, the cooking, the service and the decor, whose comment aspects are {place, staff, cook, service, decor}; and sentence 2, "This restaurant is quite famous and has a good service attitude, but the location of the restaurant is not convenient", whose comment aspects are {restaurant, service, location}.
S11, numbering the words of each sentence in the original network comment library as a two-tuple ⟨a, b⟩, wherein a is the number of the sentence in which the word is located, and b is the number of the word within the sentence.
As in sentence 1, ⟨1,16⟩ denotes staff, ⟨1,36⟩ denotes service, ⟨1,40⟩ denotes decor. In sentence 2, ⟨2,10⟩ denotes service, ⟨2,14⟩ denotes location.
And S12, removing repeated words in the original network comment library, and recording the numbers of the remaining words.
And S13, generating an inverted index list based on the word number after the duplication removal.
Such as ⟨staff: 1,16⟩, ⟨decor: 1,40⟩, ⟨service: 1,36; 2,10⟩, ⟨location: 2,14⟩. The inverted index list keeps, for each word, the numbers of the sentences containing it and its position within each sentence, which makes searching with context information convenient.
S2, performing stop-word removal on each sentence of the original network comment library to obtain the preprocessed network comment library.
The original data set formats are xml and csv, and comment sentences need to be extracted according to corresponding labels and fields.
Sentences in the network comment library contain many useless stop words, such as "the" in sentence 1. These stop words are removed before further processing to avoid disturbing the results.
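A minimal stop-word removal sketch (the stop list here is a toy one; a real system would use a fuller list such as NLTK's):

```python
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "we", "of", "at", "on"}

def remove_stop_words(sentence):
    """Drop stop words before the sentence enters the model (step S2)."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The staff was friendly and the service was excellent"))
# ['staff', 'friendly', 'service', 'excellent']
```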
S3, inputting the preprocessed network comment library into an improved LDA model based on SentiWordNet and WordNet, and performing Gibbs sampling to obtain a sampling result.
In the SWLDA (SentiWordNet WordNet-Latent Dirichlet Allocation) model, an aspect-level comment and emotion opinion word separation layer is introduced on top of the LDA topic model, assisted by semantic similarity calculation in WordNet and emotion factor calculation in SentiWordNet. First the sentence-topic distribution is generated and a topic is defined for each sentence by means of a multinomial distribution; then WordNet and SentiWordNet are used to determine the two influence factors y_d and v_d that indicate the category of each word; finally a topic-word distribution is selected and the final word is determined.
As shown in fig. 2, the document generation process of the SWLDA model is as follows:
(1) Sample from the Dirichlet distributions β_{t,A}, β_{t,P}, β_{t,N} to generate the aspect-level object distribution of comments φ_{t,A} ~ Dir(β_{t,A}), the positive opinion word distribution of comments φ_{t,P} ~ Dir(β_{t,P}) and the negative opinion word distribution of comments φ_{t,N} ~ Dir(β_{t,N}).
(2) For each sentence, sample from the Dirichlet distribution α_d to generate the topic distribution θ_d ~ Dir(α_d).
(3) Sample from the topic multinomial distribution θ_d to generate the topic z_{d,n} ~ Multi(θ_d) of word w_{d,n} in sentence d.
(4) Calculate from WordNet and SentiWordNet the parameter π_{d,n} of a Bernoulli distribution on {0,1} and the parameter Ω_{d,n} of a Bernoulli distribution on {0,1}.
(4.1) Query in WordNet the interpretations s_{d,n,k} of the current word w_{d,n}; compute the similarity Sim(s_{d,n,k}, s_{t,k0}) between each semantic s_{d,n,k} of w_{d,n} and the semantic s_{t,k0} of each seed word w_t; take the maximum over all the computed similarities, and the k′ at which the maximum is attained determines the semantic s_{d,n,k′} of the current word w_{d,n} in the sentence.
(4.2) Query in SentiWordNet the emotion scores of s_{d,n,k′}: score_{d,n}^O (the semantic is an objective word), score_{d,n}^P (the semantic is a positive emotion) and score_{d,n}^N (the semantic is a negative emotion).
SentiWordNet is an emotion dictionary based on WordNet; for each interpretation of each word in WordNet it gives a positive, a negative and a neutral emotion score, each in the range 0-1, with the three scores summing to 1. The invention uses the cosine distance of word vectors to calculate the similarity between words.
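That cosine computation can be sketched as follows (the three-dimensional "embeddings" are invented for illustration; the patent does not specify which word vectors are used):

```python
import math

def cosine_similarity(u, v):
    """sim(w1, w2) = u.v / (|u| |v|): the cosine of the angle between
    the two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy word vectors
vec_service = [0.9, 0.1, 0.3]
vec_staff = [0.8, 0.2, 0.4]
vec_bland = [0.1, 0.9, 0.0]
print(cosine_similarity(vec_service, vec_staff) >
      cosine_similarity(vec_service, vec_bland))  # True
```

Words used in similar contexts get vectors with a small angle between them, so related words such as "service" and "staff" score higher than unrelated pairs.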
(4.3) Calculate the parameters π_{d,n} and Ω_{d,n} from the emotion scores score_{d,n}^O (objective), score_{d,n}^P (positive emotion) and score_{d,n}^N (negative emotion).
The Bernoulli distribution with parameter π_{d,n} (used to separate aspect-level object words from emotion opinion words) and the Bernoulli distribution with parameter Ω_{d,n} (used to separate positive from negative emotion words) are both derived from WordNet and SentiWordNet; π_{d,n} depends on the seed words w_t:

π_{d,n} = score_{d,n}^O
Ω_{d,n} = score_{d,n}^P / (score_{d,n}^P + score_{d,n}^N)
(5) Draw from the Bernoulli distribution with parameter π_{d,n} the indicator y_{d,n} telling whether word w_{d,n} is a comment aspect-level object word or a comment opinion word; draw from the Bernoulli distribution with parameter Ω_{d,n} the indicator v_{d,n} telling whether it is a positive or a negative opinion word.
To represent the separation of the aspect-level object of the comment from the comment opinion, the variable y_d ∈ {A, O} is introduced. When y_d = A, the current word is an aspect-level object of the comment; when y_d = O, it is a comment opinion. When v_d = P, the current word is a positive emotion; when v_d = N, a negative emotion. Both y_d and v_d are determined by correlation algorithms based on WordNet and SentiWordNet.
(6) Generate the word w_{d,n} according to:

w_{d,n} ~ Multi(φ_{t,A}) if y_{d,n} = A; w_{d,n} ~ Multi(φ_{t,P}) if y_{d,n} = O and v_{d,n} = P; w_{d,n} ~ Multi(φ_{t,N}) if y_{d,n} = O and v_{d,n} = N.
Step S3 includes the following substeps:
and S31, directly setting the aspect of the network comment library as a seed word.
The aspects of the network comment library are directly set as seed words, such as {place, staff, cook, service, decor} in sentence 1, denoted w_t, t ∈ {1, …, T}. When setting a seed word, its semantic in WordNet must be indicated clearly, i.e. the semantic interpretation s_{t,k0} of the seed word w_t in WordNet is determined. Once the semantics of the seed words are determined, the seed words can serve as topics.
And S32, dividing the comment texts in the network comment library by taking sentences as units to form a comment text sentence set.
S33, setting a different parameter α_d for each sentence based on the similarity between its words and the seed words; based on the semantic similarity between words and the seed words, setting for each topic t the parameters β_{t,A}, β_{t,P} and β_{t,N} for the aspect-level object words, the positive opinion words and the negative opinion words respectively:

α_{d,t} = α_base · (1/N_d) · Σ_{i=1}^{N_d} sim(w_{d,i}, t),  t = 1, …, T
β_{t,A} = sim(w, A) · β_base
β_{t,P} = sim(w, P) · β_base
β_{t,N} = sim(w, N) · β_base

wherein N_d is the number of words in the current sentence, T is the number of topics, w_{d,i} is the i-th word in the current sentence, t is the seed word, sim(w, t) is the semantic similarity between w and the seed word t, and α_base is the fixed parameter of the Dirichlet distribution obeyed by topics in the standard LDA model; sim(w, A) is the probability that w is an object word, sim(w, P) the probability that w is a positive word, sim(w, N) the probability that w is a negative word, and β_base is the fixed parameter of the Dirichlet distribution obeyed by words in the standard LDA model.
S34, performing parameter estimation and inference on the improved LDA model based on SentiWordNet and WordNet by Gibbs sampling.
The training process is as follows:
(1) Randomly assign a topic number to each word of each sentence in the corpus, and randomly set the values of the indicator variables y and v for all words in the sentences.
y ∈ {A, O} and v ∈ {P, N}. Here y and v are encoded numerically: A and P correspond to 0, and O and N correspond to 1, i.e., y ∈ {0, 1} and v ∈ {0, 1}.
(2) Rescan the corpus: resample and update the topic number of each word according to formula (1), update the word counts in the corpus, and resample and update the indicator variables y and v according to formulas (2) and (3).
The Gibbs sampling formula of the SWLDA model is as follows:
Figure BDA0002255541680000141
The right-hand side is p(topic | doc) × p(word | topic), i.e., the path probability of doc → topic → word.
Figure BDA0002255541680000142
Figure BDA0002255541680000143
wherein,
Figure BDA0002255541680000144
denotes that position i is excluded (counts computed without the i-th word).
(3) The above resampling of the corpus is repeated until Gibbs sampling converges.
(4) Count the topic of each word in each sentence of the corpus to obtain the sentence-topic probability distribution
Figure BDA0002255541680000145
Count the distribution of words under each topic in the corpus to obtain the topic-word probability distribution
Figure BDA0002255541680000146
Figure BDA0002255541680000147
The Dirichlet distribution expectation calculation formula for sentence d and topic t is as follows:
Figure BDA0002255541680000148
When
Figure BDA0002255541680000149
holds, the topic of sentence d is considered to be t, and the comment topic-sentence information is obtained.
Figure BDA00022555416800001410
is the probability that the word w_{d,n} has topic t and category y, where y = 0 represents a comment object word and y = 1 represents an opinion word. The specific calculation method is as follows:
With t as the topic, the Dirichlet distribution expectation formula for the word w_{d,n} being an evaluated aspect-level object word is as follows:
Figure BDA0002255541680000151
With t as the topic, the Dirichlet distribution expectation formula for the word w_{d,n} being an evaluated positive opinion word is as follows:
Figure BDA0002255541680000152
With t as the topic, the Dirichlet distribution expectation formula for the word w_{d,n} being an evaluated negative opinion word is as follows:
Figure BDA0002255541680000153
When
Figure BDA0002255541680000154
holds, the topic of the word is considered to be t and its category to be y, and the comment topic-word information and comment topic-opinion-word information are obtained.
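The training procedure above survives mostly as formula images, but steps (1)-(3) describe a standard collapsed Gibbs loop, and the expectations in step (4) are presumably the usual Dirichlet posterior means. The sketch below is a minimal plain-LDA version, without the y/v indicator extension of the SWLDA model; it uses the p(topic | doc) × p(word | topic) path probability noted after formula (1), and all names are illustrative.

```python
import random

def gibbs_lda(docs, vocab_size, n_topics, alpha, beta, iters=100, seed=0):
    """Minimal collapsed Gibbs sampler for plain LDA (training steps (1)-(3))."""
    rng = random.Random(seed)
    # step (1): random topic initialization
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    n_dt = [[0] * n_topics for _ in docs]               # topic counts per sentence
    n_tw = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    n_t = [0] * n_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dt[d][t] += 1; n_tw[t][w] += 1; n_t[t] += 1
    # steps (2)-(3): rescan and resample (assumed converged after `iters` passes)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # exclude the current position ("not i")
                n_dt[d][t] -= 1; n_tw[t][w] -= 1; n_t[t] -= 1
                # p(topic | doc) * p(word | topic)
                weights = [(n_dt[d][k] + alpha) * (n_tw[k][w] + beta)
                           / (n_t[k] + vocab_size * beta) for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights)[0]
                z[d][i] = t
                n_dt[d][t] += 1; n_tw[t][w] += 1; n_t[t] += 1
    return z, n_dt, n_tw

def sentence_topic_expectation(n_d, alpha_d):
    """Step (4): assumed Dirichlet posterior mean
    theta_{d,t} = (n_{d,t} + alpha_{d,t}) / sum_k (n_{d,k} + alpha_{d,k})."""
    total = sum(n + a for n, a in zip(n_d, alpha_d))
    return [(n + a) / total for n, a in zip(n_d, alpha_d)]
```

The per-sentence counts n_dt feed directly into sentence_topic_expectation together with the per-sentence priors α_d of step S33; the sentence's topic is then taken as the t whose expectation exceeds the threshold.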
S4, sorting the sampling results, selecting the top m words by probability within each evaluation category, and finding the specific sentences according to the inverted indexes of the words.
The final result is expressed as a record <result, word, word_type, sentences, prob>. result is the topic information, i.e., the configured seed word, which also serves as the comment category word. word holds the original word. word_type is the category of the current word (aspect-level comment object word, positive opinion word, or negative opinion word). sentences is the set of sentences to which the word belongs. prob is the probability that the word is a category word under the comment category. The top m results are generated for each category under all topics. Based on the inverted index, the sentences to which a word belongs can be queried, and <topic, comment object, opinion, original sentence> information can additionally be obtained.
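The inverted index of step S1 (numbering each word occurrence by the pair <a, b>) and the top-m selection of step S4 can be sketched together as follows; the function and field names are illustrative, not the patent's own.

```python
from collections import defaultdict

def build_inverted_index(sentences):
    """Steps S11-S13: number each word by <a, b> (a = sentence number,
    b = position in the sentence) and map each distinct word to all its
    occurrences."""
    index = defaultdict(list)
    for a, sent in enumerate(sentences):
        for b, word in enumerate(sent):
            index[word].append((a, b))
    return index

def top_m_with_sentences(word_probs, index, sentences, m):
    """Step S4: sort the words of one evaluation category by probability,
    keep the top m, and look up their sentences via the inverted index."""
    top = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)[:m]
    return [(w, p, [sentences[a] for a, _ in index[w]]) for w, p in top]
```

Each returned triple corresponds to one <word, prob, sentences> portion of the result record described above.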
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A comment aspect opinion level mining method based on a dictionary improvement LDA model, wherein an aspect is an attribute detail of the reviewed object in the comment text, the method comprising the steps of:
s1, constructing an inverted index list based on an original network comment library;
s2, carrying out word stop removing processing on each sentence of the original network comment library to obtain a preprocessed network comment library;
s3, inputting the preprocessed network comment library into an improved LDA model based on dictionaries SentiWordNet and WordNet, and sampling by adopting Gibbs to obtain a sampling result;
s4, sorting the sampling results, selecting the top m words by probability within each evaluation category, and finding the specific sentences according to the inverted indexes of the words;
step S3 includes the following substeps:
s31, directly setting the aspect of the network comment library as a seed word;
s32, dividing comment texts in the network comment library by taking sentences as units to form a comment text sentence set;
s33, setting a different parameter α_d for each sentence based on the similarity between its words and the seed words; and setting, based on the semantic similarity between words and seed words, parameters β_{t,A}, β_{t,P}, β_{t,N} for the aspect-level object words, positive opinion words and negative opinion words respectively, for each topic;
S34, performing parameter estimation and inference on the improved LDA model based on the dictionaries SentiWordNet and WordNet by applying Gibbs sampling to the comment text sentence set.
2. The method of claim 1, wherein step S1 includes the sub-steps of:
s11, numbering words of each sentence in an original network comment library in a binary group < a, b >, wherein a represents the number of the sentence in which the word is located, and b represents the number of the word in the sentence;
s12, removing repeated words in the original network comment library, and recording the numbers of the remaining words;
and S13, generating an inverted index list based on the word number after the duplication removal.
3. The method of claim 1,
Figure FDA0002957923410000021
β_{t,A} = sim(w, A) * β_base
β_{t,P} = sim(w, P) * β_base
β_{t,N} = sim(w, N) * β_base
wherein N_d is the number of all words in the current sentence, T is the number of topics, w_{d,i} is the i-th word in the current sentence, t is a seed word, sim(w, t) represents the semantic similarity between w and the seed word t, and α_base represents the constant parameter α of the Dirichlet distribution over topics in the standard LDA model; sim(w, A) represents the probability that w belongs to an object word, sim(w, P) the probability that w belongs to a positive word, sim(w, N) the probability that w belongs to a negative word, and β_base is the constant parameter β of the Dirichlet distribution over words in the standard LDA model.
4. The method of claim 1, wherein the step S34 includes the steps of:
(1) randomly assigning a topic number to each word of each sentence in the corpus, and randomly setting the values of the indicator variables y and v for all words in the sentence, wherein when y = A, the current word is an aspect-level object of the comment; when y = O, the current word is a comment opinion; when v = P, the current word is a positive sentiment; and when v = N, the current word is a negative sentiment;
(2) rescanning the corpus: resampling and updating the topic number of each word according to formula (1), updating the word counts in the corpus, and resampling and updating the indicator variables y and v according to formulas (2) and (3);
Figure FDA0002957923410000031
Figure FDA0002957923410000032
Figure FDA0002957923410000033
wherein z_{d,n} denotes the topic assigned to the n-th word of the d-th comment sentence, t denotes the topic number, y_{d,n} indicates whether the n-th word of the d-th sentence is an aspect-level object word or an opinion word, and v_{d,n} indicates whether the n-th word of the d-th sentence is a positive sentiment word or a negative sentiment word,
Figure FDA0002957923410000034
representing the number of words v with topic t and category q,
Figure FDA0002957923410000035
a dirichlet distribution parameter representing a word v with topic t and class q,
Figure FDA0002957923410000036
representing the number of words v with topic t and category u,
Figure FDA0002957923410000037
a Dirichlet distribution parameter of the word v with topic t and category u; V represents the number of words in the corpus, and w_{d,n} represents the n-th word in the d-th comment sentence,
Figure FDA0002957923410000038
the number of words w_{d,n} with topic t and category u,
Figure FDA0002957923410000039
the Dirichlet distribution parameter of the word w_{d,n} with topic t and category u,
Figure FDA00029579234100000310
the number of words w_{d,n} with topic t and category q,
Figure FDA00029579234100000311
the Dirichlet distribution parameter of the word w_{d,n} with topic t and category q; n_{d,t} represents the number of words in the d-th sentence with topic t, and α_{d,t} represents the Dirichlet distribution parameter for topic t of the d-th comment sentence,
Figure FDA00029579234100000312
represents that position i is excluded;
(3) repeating the resampling of the corpus until Gibbs sampling converges;
(4) counting the topic of each word in each sentence of the corpus to obtain the sentence-topic probability distribution
Figure FDA0002957923410000041
The distribution of words under each topic in the corpus is counted to obtain the topic-word probability distribution
Figure FDA0002957923410000042
5. The method of claim 4,
the probability distribution calculation formula of the sentence d and the topic t is as follows:
Figure FDA0002957923410000043
with t as the topic, the probability distribution calculation formula for the word w_{d,n} being an evaluated aspect-level object word is as follows:
Figure FDA0002957923410000044
with t as the topic, the probability distribution calculation formula for the word w_{d,n} being an evaluated positive opinion word is as follows:
Figure FDA0002957923410000045
with t as the topic, the probability distribution calculation formula for the word w_{d,n} being an evaluated negative opinion word is as follows:
Figure FDA0002957923410000046
wherein n_d represents the number of words in the d-th sentence.
6. The method of claim 4, wherein the document generation process of the improved LDA model based on the dictionaries SentiWordNet and WordNet is as follows:
(1) sampling from the Dirichlet distributions β_{t,A}, β_{t,P}, β_{t,N} to generate the comment aspect-level object distribution
Figure FDA0002957923410000047
Positive opinion word distribution for reviews
Figure FDA0002957923410000048
Negative opinion word distribution of reviews
Figure FDA0002957923410000049
Figure FDA0002957923410000051
Figure FDA0002957923410000052
Figure FDA0002957923410000053
(2) for each sentence, sampling from the Dirichlet distribution with parameter α_d to generate the topic distribution θ_d ~ Dir(α_d);
(3) sampling from the topic multinomial distribution θ_d to generate the topic z_{d,n} ~ Multi(θ_d) of the word w_{d,n} in sentence d;
(4) calculating, from the dictionaries WordNet and SentiWordNet, a Bernoulli distribution on {0, 1} with parameter π_{d,n} and a Bernoulli distribution on {0, 1} with parameter Ω_{d,n};
(5) sampling from the Bernoulli distribution with parameter π_{d,n} to obtain y_{d,n}, which indicates whether the word w_{d,n} is a comment aspect-level object word or a comment opinion word; and sampling from the Bernoulli distribution with parameter Ω_{d,n} to obtain v_{d,n}, which indicates whether the word w_{d,n} is a positive opinion word or a negative opinion word;
(6) generating the word w_{d,n} according to the following formula:
Figure FDA0002957923410000054
7. The method of claim 6, wherein step (4) comprises the steps of:
(4.1) querying in WordNet the semantic interpretations s_{d,n,k} of the current word w_{d,n}; computing the similarity Sim(s_{d,n,k}, s_{t,k0}) between each sense s_{d,n,k} of w_{d,n} and the sense s_{t,k0} of each seed word w_t; and taking the maximum over all computed similarities, the index k' at which the maximum is attained determining the sense s_{d,n,k'} of the current word w_{d,n} in the sentence;
(4.2) querying in SentiWordNet the sentiment scores of s_{d,n,k'}
Figure FDA0002957923410000061
Figure FDA0002957923410000062
(4.3) based on the sentiment scores
Figure FDA0002957923410000063
calculating the parameters π_{d,n} and Ω_{d,n}.
8. The method of claim 7,
Figure FDA0002957923410000064
Figure FDA0002957923410000065
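The formulas of claims 7-8 for π_{d,n} and Ω_{d,n} survive only as images. The sketch below is therefore one plausible reading, not the patent's formula: π taken as the objectivity of the chosen sense (how likely the word is an object word rather than an opinion word), and Ω splitting the subjective mass between positive and negative. SentiWordNet assigns each sense positive, negative and objective scores summing to 1.

```python
def bernoulli_params(pos_score, neg_score):
    """Hypothetical reading of claims 7-8: derive the two Bernoulli
    parameters from the SentiWordNet scores of the chosen sense."""
    obj_score = 1.0 - pos_score - neg_score  # SentiWordNet: pos + neg + obj = 1
    pi = obj_score                           # chance the word is an object word
    total = pos_score + neg_score
    omega = pos_score / total if total > 0 else 0.5  # split of subjective mass
    return pi, omega
```

A fully objective sense (pos = neg = 0) would then always be drawn as an aspect-level object word, with the positive/negative split left uninformative.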
9. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the comment aspect opinion level mining method based on the dictionary improvement LDA model of any one of claims 1 to 8.
CN201911058218.1A 2019-10-31 2019-10-31 Comment aspect opinion level mining method based on dictionary improvement LDA model Expired - Fee Related CN110837740B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911058218.1A CN110837740B (en) 2019-10-31 2019-10-31 Comment aspect opinion level mining method based on dictionary improvement LDA model


Publications (2)

Publication Number Publication Date
CN110837740A CN110837740A (en) 2020-02-25
CN110837740B true CN110837740B (en) 2021-04-20

Family

ID=69575829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911058218.1A Expired - Fee Related CN110837740B (en) 2019-10-31 2019-10-31 Comment aspect opinion level mining method based on dictionary improvement LDA model

Country Status (1)

Country Link
CN (1) CN110837740B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9536269B2 (en) * 2011-01-19 2017-01-03 24/7 Customer, Inc. Method and apparatus for analyzing and applying data related to customer interactions with social media
CN103020851B (en) * 2013-01-10 2015-10-14 山大地纬软件股份有限公司 A kind of metric calculation method supporting comment on commodity data multidimensional to analyze
CN103778207B (en) * 2014-01-15 2017-03-01 杭州电子科技大学 The topic method for digging of the news analysiss based on LDA
CN103823893A (en) * 2014-03-11 2014-05-28 北京大学 User comment-based product search method and system
CN105573985A (en) * 2016-03-04 2016-05-11 北京理工大学 Sentence expression method based on Chinese sentence meaning structural model and topic model
CN106156272A (en) * 2016-06-21 2016-11-23 北京工业大学 A kind of information retrieval method based on multi-source semantic analysis
CN108920454A (en) * 2018-06-13 2018-11-30 北京信息科技大学 A kind of theme phrase extraction method
CN109977413B (en) * 2019-03-29 2023-06-06 南京邮电大学 Emotion analysis method based on improved CNN-LDA


Similar Documents

Publication Publication Date Title
Kumar et al. Sentiment analysis of multimodal twitter data
Mostafa Clustering halal food consumers: A Twitter sentiment analysis
Fiarni et al. Sentiment analysis system for Indonesia online retail shop review using hierarchy Naive Bayes technique
CN112991017A (en) Accurate recommendation method for label system based on user comment analysis
Avasthi et al. Techniques, applications, and issues in mining large-scale text databases
Ahlgren Research on sentiment analysis: the first decade
Zahid et al. Roman urdu reviews dataset for aspect based opinion mining
CN115017303A (en) Method, computing device and medium for enterprise risk assessment based on news text
JP2015075993A (en) Information processing device and information processing program
Rani et al. Study and comparision of vectorization techniques used in text classification
de Zarate et al. Measuring controversy in social networks through nlp
Suresh et al. Mining of customer review feedback using sentiment analysis for smart phone product
KR102185733B1 (en) Server and method for automatically generating profile
CN112989053A (en) Periodical recommendation method and device
Jeevanandam Jotheeswaran Sentiment analysis: A survey of current research and techniques
Hussain et al. A technique for perceiving abusive bangla comments
Jayasekara et al. Opinion mining of customer reviews: feature and smiley based approach
CN110837740B (en) Comment aspect opinion level mining method based on dictionary improvement LDA model
CN115659961A (en) Method, apparatus and computer storage medium for extracting text viewpoints
Dandannavar et al. A proposed framework for evaluating the performance of government initiatives through sentiment analysis
Jayawickrama et al. Seeking sinhala sentiment: Predicting facebook reactions of sinhala posts
Karim et al. Classification of Google Play Store Application Reviews Using Machine Learning
Alorini et al. Machine learning enabled sentiment index estimation using social media big data
Yuan et al. Big data aspect-based opinion mining using the SLDA and HME-LDA models
Merayo-Alba et al. Use of natural language processing to identify inappropriate content in text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210420
