CN110837740B - Comment aspect opinion level mining method based on dictionary improvement LDA model - Google Patents
- Publication number
- CN110837740B CN110837740B CN201911058218.1A CN201911058218A CN110837740B CN 110837740 B CN110837740 B CN 110837740B CN 201911058218 A CN201911058218 A CN 201911058218A CN 110837740 B CN110837740 B CN 110837740B
- Authority
- CN
- China
- Prior art keywords
- word
- words
- comment
- sentence
- topic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/319—Inverted lists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
The invention discloses a comment aspect-level opinion mining method based on a dictionary-improved LDA model, belonging to the field of network comment text mining. The method comprises the following steps: construct an inverted index list from the original network comment library; perform stop-word removal on each sentence of the original library to obtain a preprocessed comment library; input the preprocessed library into an improved LDA model based on SentiWordNet and WordNet and run Gibbs sampling to obtain a sampling result; sort the sampling results, select the top m words by probability for each evaluation category, and locate the specific sentences through the words' inverted index. The invention directly sets the aspects of the network comment library as seed words, requiring no manual labeling. It separates evaluation object words from comment viewpoints and offsets the LDA model parameters by computing the similarity between words and seed words, improving the model's effect. Through the inverted index, the clustering result is linked to the seed words and the original text, improving the readability of the result.
Description
Technical Field
The invention belongs to the field of network comment text mining, and particularly relates to a comment aspect-level opinion mining method based on a dictionary-improved LDA model.
Background
The rapid development of the mobile internet and the popularization of smartphones allow people to publish comments and opinions anytime, anywhere. People evaluate commodities in different fields on social platforms such as Twitter and Weibo and on online shopping platforms such as Taobao, Amazon and Jingdong. Effective analysis of these evaluations can support manufacturers' decisions on sales and future development, and help consumers screen for the products they want. However, judging only the sentiment polarity of comment sentences provides little useful information; the objects described by the sentiment words must also be determined. Unlike news reports or blogs, web reviews are typically short. Because service contents differ, the domains of the commented objects also differ, and the objects have many attributes; only by mining aspect-level viewpoints can the effective information in the comments be grasped.
Aspect-level opinion mining extracts aspect-level comment objects and comment categories from reviews, and has important research significance and value. An aspect-level comment object (Opinion Target Expression) is the entity itself or the attribute modified by a sentiment opinion word. When mining comment information, merely judging the sentiment polarity of a sentence is of little use to readers, who care more about how good the commodity is in specific respects. Determining the aspect-level comment object of a review is therefore of great significance. Take the commodity review "the appearance of the mobile phone is average, the battery is sufficient, and the signal is strong". If the sentiment polarity of the whole sentence is judged directly, a user who has not read the original text only learns that one comment says the phone is good, which is clearly of little value. Therefore, in comment mining, the aspect-level comment target words in the sentence are extracted first; for the sentence above, the words to extract are "appearance", "battery" and "signal". Comment category identification (Aspect Category Identification) is associated with the aspect-level comment object: besides judging that a word belongs to a comment category, a sentence can also be tagged with a comment category.
However, massive reviews cover a wide variety of goods, the data annotation required by aspect-level opinion mining is tedious, and building normative annotated corpora for reviews in every field consumes substantial resources. Supervised methods that rely on annotated datasets are difficult to apply to review domains lacking annotated corpora. How to improve model performance under weakly supervised or unsupervised conditions, with domain adaptability (across domains and languages), is a topic well worth researching. A prior-art approach is the MaxEnt-LDA model, which introduces two distributions to indicate the classification of comment object words versus sentiment words, and of positive versus negative sentiment words. However, its classifier for separating comment object words from sentiment words uses a maximum entropy model and requires a large amount of labeled data.
Disclosure of Invention
Aiming at the problem that the prior-art aspect-level opinion mining method based on the MaxEnt-LDA model requires a large amount of labeled data, the invention provides a comment aspect-level opinion mining method based on a dictionary-improved LDA model, which aims to solve aspect-level opinion mining of network comments with as little labeled data as possible.
To achieve the above object, according to a first aspect of the present invention, there is provided a method for mining review aspect opinion level of an LDA model based on dictionary improvement, the method comprising the steps of:
s1, constructing an inverted index list based on an original network comment library;
s2, performing stop-word removal on each sentence of the original network comment library to obtain a preprocessed network comment library;
s3, inputting the preprocessed network comment library into an improved LDA model based on SentiWordNet and WordNet, and performing Gibbs sampling to obtain a sampling result;
and S4, sorting the sampling results, selecting the top m words by probability for each evaluation category, and finding the specific sentences through the words' inverted index.
Specifically, step S1 includes the following sub-steps:
s11, numbering the words of each sentence in the original network comment library as a two-tuple <a, b>, where a is the number of the sentence containing the word and b is the number of the word within the sentence;
s12, removing repeated words in the original network comment library, and recording the numbers of the remaining words;
and S13, generating an inverted index list based on the word number after the duplication removal.
Specifically, step S3 includes the following sub-steps:
s31, directly setting the aspect of the network comment library as a seed word;
s32, dividing comment texts in the network comment library by taking sentences as units to form a comment text sentence set;
s33, setting a different parameter α_d for each sentence based on the similarity between its words and the seed words; based on the semantic similarity between words and seed words, setting parameters β_(t,A), β_(t,P), β_(t,N) for each topic t, for the aspect-level object words, positive opinion words and negative opinion words respectively;
S34, using Gibbs sampling over the comment text sentence set to perform parameter estimation and inference for the improved LDA model based on SentiWordNet and WordNet.
Specifically,
α_(d,t) = (Σ_{i=1..N_d} sim(w_(d,i), t) / N_d) · α_base
β_(t,A) = sim(w, A) · β_base
β_(t,P) = sim(w, P) · β_base
β_(t,N) = sim(w, N) · β_base
where N_d is the number of words in the current sentence, T is the number of topics, w_(d,i) is the i-th word of the current sentence, t is a seed word, sim(w, t) denotes the semantic similarity between w and seed word t, and α_base is the fixed parameter of the Dirichlet prior on topics in the standard LDA model; sim(w, A) denotes the probability that w is an object word, sim(w, P) the probability that w is a positive word, sim(w, N) the probability that w is a negative word, and β_base is the fixed parameter of the Dirichlet prior on words in the standard LDA model.
Specifically, step S34 includes the steps of:
(1) randomly assigning a topic number to each word of each sentence in the corpus, and randomly setting the values of indicator variables y and v for all words in the sentence, where y = A means the current word is an aspect-level object of the comment, y = O means it is a comment viewpoint word, v = P means it carries positive sentiment, and v = N means it carries negative sentiment;
(2) rescanning the corpus; for each word, resampling and updating its topic number according to formula (1) and updating the word counts in the corpus, then resampling and updating the indicator variables y and v according to formulas (2) and (3);
where z_(d,n) denotes the topic of the n-th word of the d-th comment sentence, t denotes a topic number, y_(d,n) indicates whether the n-th word of the d-th sentence is an aspect-level object word or a sentiment viewpoint word, and v_(d,n) indicates whether it is a positive or negative sentiment word; n_(t,q)^v denotes the number of occurrences of word v with topic t and category q, and β_(t,q)^v its Dirichlet parameter; n_(t,u)^v denotes the number of occurrences of word v with topic t and category u, and β_(t,u)^v its Dirichlet parameter; V denotes the number of distinct words in the corpus; w_(d,n) denotes the n-th word of the d-th comment sentence; n_(t,u)^(w_(d,n)) and β_(t,u)^(w_(d,n)) denote the count and Dirichlet parameter of word w_(d,n) with topic t and category u; n_(t,q)^(w_(d,n)) and β_(t,q)^(w_(d,n)) denote the count and Dirichlet parameter of word w_(d,n) with topic t and category q; n_(d,t) denotes the number of words of the d-th sentence with topic t, α_(d,t) the Dirichlet parameter for topic t of the d-th comment sentence; and ¬i denotes exclusion of position i;
(3) repeating the resampling of the corpus until Gibbs sampling converges;
(4) counting the topic of each word of each sentence in the corpus to obtain the sentence-topic probability distribution θ_d; counting the distribution of words under each topic to obtain the topic-word probability distribution φ_t.
Specifically, the probability that sentence d has topic t is calculated as:
θ_(d,t) = (n_(d,t) + α_(d,t)) / Σ_{t'=1..T} (n_(d,t') + α_(d,t'))
With t as the topic, the probability that word w_(d,n) is an evaluated aspect-level object word is calculated as:
φ_(t,A)^(w_(d,n)) = (n_(t,A)^(w_(d,n)) + β_(t,A)^(w_(d,n))) / Σ_{v=1..V} (n_(t,A)^v + β_(t,A)^v)
With t as the topic, the probability that word w_(d,n) is an evaluated positive opinion word is calculated as:
φ_(t,P)^(w_(d,n)) = (n_(t,P)^(w_(d,n)) + β_(t,P)^(w_(d,n))) / Σ_{v=1..V} (n_(t,P)^v + β_(t,P)^v)
With t as the topic, the probability that word w_(d,n) is an evaluated negative opinion word is calculated as:
φ_(t,N)^(w_(d,n)) = (n_(t,N)^(w_(d,n)) + β_(t,N)^(w_(d,n))) / Σ_{v=1..V} (n_(t,N)^v + β_(t,N)^v)
where n_d = Σ_t n_(d,t) denotes the number of words in the d-th sentence.
Specifically, the document generation process of the improved LDA model based on SentiWordNet and WordNet is as follows:
(1) sample from the Dirichlet distributions β_(t,A), β_(t,P), β_(t,N) to generate the aspect-level object word distribution φ_(t,A) ~ Dir(β_(t,A)), the positive opinion word distribution φ_(t,P) ~ Dir(β_(t,P)) and the negative opinion word distribution φ_(t,N) ~ Dir(β_(t,N));
(2) for each sentence d, sample from the Dirichlet distribution with parameter α_d to generate the topic distribution θ_d ~ Dir(α_d);
(3) sample from the multinomial topic distribution θ_d to generate the topic z_(d,n) ~ Multi(θ_d) of word w_(d,n) in sentence d;
(4) compute from WordNet and SentiWordNet a Bernoulli distribution on {0,1} with parameter π_(d,n) and a Bernoulli distribution on {0,1} with parameter Ω_(d,n);
(5) draw from the Bernoulli distribution with parameter π_(d,n) the indicator y_(d,n), which marks word w_(d,n) as a comment aspect-level object word or a comment opinion word; draw from the Bernoulli distribution with parameter Ω_(d,n) the indicator v_(d,n), which marks word w_(d,n) as a positive or negative opinion word;
(6) generate the word w_(d,n): if y_(d,n) = A, draw w_(d,n) ~ Multi(φ_(t,A)); if y_(d,n) = O and v_(d,n) = P, draw w_(d,n) ~ Multi(φ_(t,P)); if y_(d,n) = O and v_(d,n) = N, draw w_(d,n) ~ Multi(φ_(t,N)).
Specifically, step (4) includes the following steps:
(4.1) look up in WordNet the senses s_(d,n,k) of the current word w_(d,n); compute the similarity Sim(s_(d,n,k), s_(t,k0)) between each sense s_(d,n,k) of w_(d,n) and the sense s_(t,k0) of each seed word w_t; take the maximum over all results, and take the sense s_(d,n,k') achieving the maximum as the sense of the current word w_(d,n) in the sentence;
Specifically, the parameters π_(d,n) and Ω_(d,n) are computed from the SentiWordNet sentiment scores of the selected sense.
To achieve the above object, according to a second aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the comment aspect-level opinion mining method based on a dictionary-improved LDA model described in the first aspect.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) In the standard LDA model, the topic words in the final topic-document and topic-word probabilities are not determined and must be screened manually. The invention directly sets the aspects of the network comment library as seed words, so no manual labeling is needed.
(2) The invention adds a layer classifying opinion words versus evaluation objects on top of the LDA model, realizing the separation of viewpoint and aspect. It addresses weakly supervised aspect-level opinion mining by means of user-configured seed words and the two tools WordNet and SentiWordNet: with the seed words set, the similarity between each word in the text and the seed words is computed with WordNet's similarity calculation and reflected in the LDA model parameters. Meanwhile, SentiWordNet's lexical sentiment calculation separates evaluation object words from comment viewpoints and classifies comment viewpoints into positive and negative polarity. Biasing the LDA model parameters by the similarity between corpus words and seed words improves the model's effect.
(3) In the standard LDA model, the final results are only topic-document and topic-word probabilities, lacking the relation between words and documents. The invention queries the sentences a word belongs to via the inverted index, establishing the relation between the clustering result, the seed words and the original text, and improving the readability of the result.
Drawings
FIG. 1 is a flowchart of a review aspect opinion level mining method based on a dictionary improvement LDA model according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a document generation process of the SWLDA model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
First, terms and variables involved in the invention are explained as follows:
comment on: the idea of a person holding something, and an explanation of the held idea.
The method comprises the following steps: the attributes of the related comment objects and other detailed aspects discussed in the comment text. When a commentator comments on a commentary object, the commentator firstly determines an aspect to be discussed, selects a corresponding word to represent the aspect, and then selects a viewpoint word with a specific emotional tendency according to the viewpoint of the aspect to evaluate the viewpoint word. For example, when a reviewer evaluates a hotel, the word price, value, etc. is used to represent the aspect of "price" to be discussed, and then the word high, cheap, etc. is selected to represent the point of view.
Aspect opinion mining of comments: and finding out a plurality of aspects of the comments related to the comments in the comment text by utilizing a text analysis and mining technology, analyzing the emotional tendency degree of the comments of the commentator on all aspects of the comment object, and displaying the analysis result in a certain form. In aspect-level opinion mining of comments, it is necessary to extract an aspect, an aspect-level comment object, and an opinion of the comment from text information. For example, in restaurant reviews, "food", "service", "environment" are review aspects, and "steak (steak)" is an aspect-level review object in the "food" aspect, and "good" for the "steak" is a review viewpoint.
The topic model is as follows: a bridge of "subject" is established between comments and words.
TABLE 1 legends
Given that the aspects of the comments in the network comment library are known, the invention extracts the aspect-level comment object words and sentiment viewpoint words of each aspect, and simultaneously determines the comment aspect to which each sentence of the corpus belongs. For comment-oriented sentiment analysis, the relations between the sentiment viewpoint words, the aspect-level comment object words and their sentences would need further study.
As shown in FIG. 1, the invention provides a comment aspect-level opinion mining method based on a dictionary-improved LDA model, comprising the following steps:
s1, constructing an inverted index list based on an original network comment library.
The network comment information contains a large amount of viewpoint information, and aspect-level viewpoint mining is needed to analyze comments of commentators on various aspects of the comment object so as to obtain complete knowledge of the comment object. Unlike text information such as documents, blogs, news, etc., web reviews are often short and often appear in sentences.
In this embodiment, a restaurant English review set is selected. Take sentence 1 {"We, there were four of us, arrived at noon - the place was empty - and the staff acted like we were imposing on them and they were very rude", "Everything is always cooked to perfection, the service is excellent, the decor cool and understated"} as an example, whose comment aspects are {place, staff, cook, service, decor}; and sentence 2 {"This restaurant is quite comfortable and has a good service attitude, but the location of the restaurant is not convenient"} as an example, whose comment aspects are {restaurant, service, location}. The subsequent processing is described on these examples.
S11, numbering the words of each sentence in the original network comment base as a two-tuple <a, b>, where a is the number of the sentence containing the word and b is the number of the word within the sentence.
As in sentence 1, <1,16> denotes staff, <1,36> denotes service, <1,40> denotes decor. In sentence 2, <2,10> indicates service, <2,14> indicates location.
And S12, removing repeated words in the original network comment library, and recording the numbers of the remaining words.
And S13, generating an inverted index list based on the word number after the duplication removal.
For example <staff: 1,16>, <decor: 1,40>, <service: 1,36; 2,10>, <location: 2,14>. The inverted index list keeps, for each word, the number of each sentence containing it and its position within that sentence, making it convenient to search using context information.
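The construction of steps S11-S13 can be sketched in a few lines of Python (a minimal illustration; the whitespace tokenizer and the sample reviews are stand-ins, not the embodiment's code):

```python
from collections import defaultdict

def build_inverted_index(sentences):
    """Map each distinct word to its list of <a, b> = (sentence_no, position) pairs."""
    index = defaultdict(list)
    for a, sentence in enumerate(sentences, start=1):         # a: sentence number
        for b, word in enumerate(sentence.split(), start=1):  # b: position in sentence
            index[word.lower()].append((a, b))
    return dict(index)

reviews = [
    "the staff acted like we were imposing on them",
    "the service is excellent and the staff friendly",
]
index = build_inverted_index(reviews)
print(index["staff"])   # every <sentence, position> occurrence of "staff"
```

Because each posting keeps both the sentence number and the in-sentence position, the original sentence and the word's context can later be recovered from a clustered word alone.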
S2, performing stop-word removal on each sentence of the original network comment library to obtain the preprocessed network comment library.
The original data set formats are xml and csv, and comment sentences need to be extracted according to corresponding labels and fields.
Sentences in the network comment library contain many useless stop words, such as "there" and "the" in sentence 1. These stop words are removed before further processing to avoid disturbing the results.
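The stop-word removal of step S2 can be sketched as a simple filter (the stop-word set here is a toy stand-in for whatever stop-word lexicon is used):

```python
# toy stop-word set standing in for a full stop-word lexicon
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "we", "there", "of", "us", "on"}

def remove_stop_words(sentence):
    """Keep only content words, as preprocessing ahead of topic modelling."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The service is excellent"))  # ['service', 'excellent']
```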
S3, inputting the preprocessed network comment library into the improved LDA model based on SentiWordNet and WordNet, and performing Gibbs sampling to obtain a sampling result.
In the SWLDA (SentiWordNet WordNet Latent Dirichlet Allocation) model, an aspect-level comment word / sentiment viewpoint word separation layer is introduced on top of the LDA topic model, assisted by semantic similarity calculation in WordNet and sentiment factor calculation in SentiWordNet. First the sentence-topic distribution is generated and a topic is assigned to each sentence via a multinomial distribution; then WordNet and SentiWordNet are used to determine the two influence factors y_d and v_d that indicate the category of each word; finally a topic-word distribution is selected and the final word is determined.
As shown in fig. 2, the document generation process of the SWLDA model is as follows:
(1) Sample from the Dirichlet distributions β_(t,A), β_(t,P), β_(t,N) to generate the aspect-level object word distribution φ_(t,A) ~ Dir(β_(t,A)), the positive opinion word distribution φ_(t,P) ~ Dir(β_(t,P)) and the negative opinion word distribution φ_(t,N) ~ Dir(β_(t,N)).
(2) For each sentence d, sample from the Dirichlet distribution with parameter α_d to generate the topic distribution θ_d ~ Dir(α_d).
(3) Sample from the multinomial topic distribution θ_d to generate the topic z_(d,n) ~ Multi(θ_d) of word w_(d,n) in sentence d.
(4) Compute from WordNet and SentiWordNet a Bernoulli distribution on {0,1} with parameter π_(d,n) and a Bernoulli distribution on {0,1} with parameter Ω_(d,n).
(4.1) Look up in WordNet the senses s_(d,n,k) of the current word w_(d,n); compute the similarity Sim(s_(d,n,k), s_(t,k0)) between each sense s_(d,n,k) of w_(d,n) and the sense s_(t,k0) of each seed word w_t; take the maximum over all results, and take the sense s_(d,n,k') achieving the maximum as the sense of the current word w_(d,n) in the sentence.
(4.2) Query in SentiWordNet the sentiment scores of s_(d,n,k'): its objective score, its positive score and its negative score.
SentiWordNet is a sentiment dictionary based on WordNet: for each sense of each word in WordNet it gives a positive, a negative and a neutral (objective) sentiment score, each in the range 0-1, with the three scores summing to 1. The invention uses the cosine distance of word vectors to calculate the similarity between words.
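The word-vector cosine similarity mentioned above can be sketched as follows (the three-dimensional "embeddings" are toy values; real word vectors would come from a trained embedding model):

```python
import math

def cosine_similarity(u, v):
    """sim(w, t) as the cosine of the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy 3-dimensional "embeddings" for two related words
vec_price = [0.9, 0.1, 0.0]
vec_cost = [0.8, 0.2, 0.1]
similarity = cosine_similarity(vec_price, vec_cost)  # close to 1 for similar words
```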
(4.3) Based on the sentiment scores of the selected sense (its objective, positive and negative scores), calculate the parameters π_(d,n) and Ω_(d,n).
The Bernoulli distribution with parameter π_(d,n) (used to separate aspect-level object words from sentiment viewpoint words) and the Bernoulli distribution with parameter Ω_(d,n) (used to separate positive from negative sentiment words) are both derived from WordNet and SentiWordNet. π_(d,n) depends on the seed words w_t.
(5) Draw from the Bernoulli distribution with parameter π_(d,n) the indicator y_(d,n), which marks word w_(d,n) as a comment aspect-level object word or a comment opinion word; draw from the Bernoulli distribution with parameter Ω_(d,n) the indicator v_(d,n), which marks word w_(d,n) as a positive or negative opinion word.
To represent the separation of the aspect-level object and the comment viewpoint, the variable y_d ∈ {A, O} is introduced. When y_d = A, the current word is an aspect-level object of the comment; when y_d = O, it is a comment viewpoint word. When v_d = P, the current word carries positive sentiment; when v_d = N, negative sentiment. Both y_d and v_d are determined by algorithms based on WordNet and SentiWordNet.
(6) Generate the word w_(d,n): if y_(d,n) = A, draw w_(d,n) ~ Multi(φ_(t,A)); if y_(d,n) = O and v_(d,n) = P, draw w_(d,n) ~ Multi(φ_(t,P)); if y_(d,n) = O and v_(d,n) = N, draw w_(d,n) ~ Multi(φ_(t,N)).
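The indicator draws of step (5) can be sketched as Bernoulli samples (the parameter values below are arbitrary stand-ins for the π_(d,n) and Ω_(d,n) computed in step (4)):

```python
import random

def draw_indicators(pi_dn, omega_dn, rng=None):
    """Draw y (0 = aspect-level object word A, 1 = opinion word O)
    and v (0 = positive P, 1 = negative N) from two Bernoulli distributions."""
    rng = rng or random.Random(0)
    y = 0 if rng.random() < pi_dn else 1
    v = 0 if rng.random() < omega_dn else 1
    return y, v

# a word whose sense scored as highly objective in SentiWordNet would get a
# pi close to 1, so it is almost always labelled an aspect-level object word
y, v = draw_indicators(pi_dn=0.95, omega_dn=0.5)
```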
Step S3 includes the following substeps:
and S31, directly setting the aspect of the network comment library as a seed word.
The aspects of the network comment library are directly set as seed words, e.g. {place, staff, cook, service, decor} for sentence 1, and the seed words are denoted w_t, t ∈ {1, ..., T}. When setting a seed word, its sense in WordNet must be indicated explicitly, i.e. the seed word w_t's semantic interpretation s_(t,k0) in WordNet must be determined. Once the senses of the seed words are determined, the seed words can serve as topics.
S32, dividing the comment texts in the network comment library into sentences to form a comment text sentence set.
S33, setting a different parameter α_d for each sentence based on the similarity between its words and the seed words; based on the semantic similarity between words and seed words, setting parameters β_(t,A), β_(t,P), β_(t,N) for each topic t, for the aspect-level object words, positive opinion words and negative opinion words respectively.
α_(d,t) = (Σ_{i=1..N_d} sim(w_(d,i), t) / N_d) · α_base
β_(t,A) = sim(w, A) · β_base
β_(t,P) = sim(w, P) · β_base
β_(t,N) = sim(w, N) · β_base
where N_d is the number of words in the current sentence, T is the number of topics, w_(d,i) is the i-th word of the current sentence, t is a seed word, sim(w, t) denotes the semantic similarity between w and seed word t, and α_base is the fixed parameter of the Dirichlet prior on topics in the standard LDA model; sim(w, A) denotes the probability that w is an object word, sim(w, P) the probability that w is a positive word, sim(w, N) the probability that w is a negative word, and β_base is the fixed parameter of the Dirichlet prior on words in the standard LDA model.
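The parameter offsetting of S33 can be sketched as follows (the similarity table is a toy stand-in for the WordNet-based sim(·,·); the average-similarity form for α_(d,t) is an assumed reading of the offset, and alpha_base/beta_base mirror the fixed priors of standard LDA):

```python
def biased_alpha(sentence_words, seed_word, sim, alpha_base=0.1):
    """alpha_{d,t}: average similarity of the sentence's words to seed word t,
    scaled by the standard-LDA prior (assumed form of the patent's offset)."""
    n_d = len(sentence_words)
    return sum(sim(w, seed_word) for w in sentence_words) / n_d * alpha_base

def biased_beta(word, category, sim, beta_base=0.01):
    """beta_{t,A/P/N}: similarity of word w to the category, scaled by beta_base."""
    return sim(word, category) * beta_base

# toy similarity table standing in for WordNet similarity scores
SIM = {("price", "cost"): 0.9, ("cheap", "cost"): 0.6}
sim = lambda w, t: SIM.get((w, t), 0.1)

alpha = biased_alpha(["price", "cheap"], "cost", sim)  # (0.9 + 0.6) / 2 * 0.1
```

A sentence whose words are similar to a seed word thus receives a larger prior weight for that topic, biasing Gibbs sampling toward assigning the sentence to it.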
S34, using Gibbs sampling to perform parameter estimation and inference for the improved LDA model based on SentiWordNet and WordNet.
The training process is as follows:
(1) Randomly assign a topic number to each word of each sentence in the corpus, and randomly set the values of indicator variables y and v for all words in the sentences.
y ∈ {A, O} and v ∈ {P, N}. Here y and v are digitized: A and P correspond to 0, O and N correspond to 1, i.e. y ∈ {0,1} and v ∈ {0,1}.
(2) Rescan the corpus; for each word, resample and update its topic number according to formula (1) and update the word counts in the corpus, then resample and update the indicator variables y and v according to formulas (2) and (3).
The Gibbs sampling formula of the SWLDA model is as follows:
p(z_(d,n) = t | z_¬i, w) ∝ (n_(d,t),¬i + α_(d,t)) · (n_(t,q),¬i^(w_(d,n)) + β_(t,q)^(w_(d,n))) / Σ_{v=1..V} (n_(t,q),¬i^v + β_(t,q)^v)   (1)
The right-hand side is p(topic | doc) × p(word | topic), i.e. the path probability of doc → topic → word.
(3) The resampling of the corpus above is repeated until the gibbs sampling converges.
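The training loop above can be sketched as a plain collapsed Gibbs sampler. This is a deliberate simplification of the patent's SWLDA sampler: the y/v indicator variables are omitted, and symmetric α and β replace the similarity-based priors, so only the topic-resampling step (2) and the convergence loop (3) are shown; all data and hyperparameters are illustrative:

```python
import random

# Minimal collapsed Gibbs sampler for plain LDA over sentences (a
# simplification of the SWLDA sampler described above).
random.seed(0)

docs = [["screen", "bright", "screen"], ["battery", "poor", "battery"]]
T, ALPHA, BETA = 2, 0.5, 0.01
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

# count tables
n_dt = [[0] * T for _ in docs]                     # sentence-topic counts
n_tw = [{w: 0 for w in vocab} for _ in range(T)]   # topic-word counts
n_t = [0] * T                                      # words per topic
z = [[random.randrange(T) for _ in d] for d in docs]  # random init (step 1)
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        n_dt[d][t] += 1; n_tw[t][w] += 1; n_t[t] += 1

def resample(d, i, w):
    """Resample the topic of word i in sentence d (step 2)."""
    t_old = z[d][i]
    n_dt[d][t_old] -= 1; n_tw[t_old][w] -= 1; n_t[t_old] -= 1
    # p(t | rest) ∝ (n_dt + alpha) * (n_tw + beta) / (n_t + V*beta),
    # i.e. p(topic | doc) * p(word | topic)
    weights = [(n_dt[d][t] + ALPHA) * (n_tw[t][w] + BETA) / (n_t[t] + V * BETA)
               for t in range(T)]
    t_new = random.choices(range(T), weights=weights)[0]
    z[d][i] = t_new
    n_dt[d][t_new] += 1; n_tw[t_new][w] += 1; n_t[t_new] += 1

for _ in range(50):                 # repeat until convergence (step 3)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            resample(d, i, w)

print(n_dt)
```

The full SWLDA sampler would additionally resample y and v for each word from their Bernoulli conditionals and keep separate count tables for aspect, positive and negative words.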
(4) Counting the topic of each word of each sentence in the corpus to obtain the sentence-topic probability distribution, and counting the distribution of words under each topic in the corpus to obtain the topic-word probability distribution.
The expectation of the Dirichlet distribution for sentence d and topic t is calculated as follows:
When this expectation is maximal for topic t, the topic of sentence d is considered to be t, and the comment topic-sentence information is obtained.
The probability that word w_{d,n} has topic t and category y, where y = 0 represents a comment object word and y = 1 represents a viewpoint word, is calculated as follows:
With t as the topic, the expectation of the Dirichlet distribution for word w_{d,n} being an evaluated aspect-level object word is calculated as follows:
With t as the topic, the expectation of the Dirichlet distribution for word w_{d,n} being an evaluated positive opinion word is calculated as follows:
With t as the topic, the expectation of the Dirichlet distribution for word w_{d,n} being an evaluated negative opinion word is calculated as follows:
When these expectations are maximal, the topic of the word is considered to be t and its category to be y, and the comment topic-word information and the comment topic-opinion word information are obtained.
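A sketch of the sentence-topic estimate in step (4), assuming the standard Dirichlet posterior mean E[θ_{d,t}] = (n_{d,t} + α_{d,t}) / (n_d + Σ_t α_{d,t}); this formula is an assumption, since the patent's formula images are not reproduced in the text:

```python
# Posterior-mean estimate of the sentence-topic distribution, assuming the
# standard Dirichlet expectation (the patent's own formulas are not shown).
def topic_expectation(n_dt, alpha):
    """n_dt: per-topic word counts of one sentence; alpha: per-topic priors."""
    total = sum(n_dt) + sum(alpha)
    return [(n + a) / total for n, a in zip(n_dt, alpha)]

theta = topic_expectation([5, 1, 0], [0.5, 0.5, 0.5])
best = max(range(len(theta)), key=theta.__getitem__)  # sentence topic = argmax
print(theta, best)
```

The per-word category expectations would be computed analogously from the topic-and-category word counts and their β priors.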
S4, sorting the sampling results, selecting the top m words by probability within each evaluation category, and finding the specific sentences according to the inverted index of the words.
The final result is expressed as a Result structure. Topic is the subject information, i.e., the set seed word, which can also serve as the comment category word. Word holds the original word. WordType is the category of the current word (aspect-level comment object word, positive opinion word, or negative opinion word). Sentences is the set of sentences to which the word belongs. Prob is the probability that the word is a category word under the comment category. The top m results are generated for each category under every topic. Based on the inverted index, the sentences containing a word can be queried, and the <topic, comment object, opinion, original sentence> information can additionally be obtained.
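A sketch of the S4 result assembly. The field names follow the Result structure described above; the sample scores and the inverted index contents are illustrative assumptions:

```python
from collections import namedtuple

# Result structure per the description: topic, word, word category, sentences
# containing the word, and the word's probability under the category.
Result = namedtuple("Result", "topic word word_type sentences prob")

# Toy inverted index: word -> list of <sentence number, word number> pairs.
inverted_index = {"battery": [(0, 1)], "great": [(0, 2)], "short": [(1, 0)]}
scored = [("battery", "aspect", 0.41), ("great", "positive", 0.33),
          ("short", "negative", 0.21)]

def top_m(topic, scored, m):
    """Sort by probability, keep the top m, and attach sentence ids
    looked up through the inverted index."""
    out = []
    for word, word_type, prob in sorted(scored, key=lambda x: -x[2])[:m]:
        sent_ids = [a for a, b in inverted_index.get(word, [])]
        out.append(Result(topic, word, word_type, sent_ids, prob))
    return out

for r in top_m("battery life", scored, 2):
    print(r)
```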
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (9)
1. A comment aspect opinion level mining method based on a dictionary-improved LDA model, wherein an aspect is an attribute detail of the reviewed object in a comment text, the method comprising the steps of:
S1, constructing an inverted index list based on an original network comment library;
S2, performing stop-word removal on each sentence of the original network comment library to obtain a preprocessed network comment library;
S3, inputting the preprocessed network comment library into the improved LDA model based on the dictionaries SentiWordNet and WordNet, and obtaining a sampling result by Gibbs sampling;
S4, sorting the sampling results, selecting the top m words by probability within each evaluation category, and finding the specific sentences according to the inverted index of the words;
step S3 includes the following substeps:
S31, directly setting the aspects of the network comment library as seed words;
S32, dividing the comment texts in the network comment library into sentences to form a comment text sentence set;
S33, setting a different parameter α_d for each sentence based on the similarity between its words and the seed words; based on the semantic similarity between words and the seed words, setting parameters β_{t,A}, β_{t,P} and β_{t,N} for each topic t for the aspect-level object words, the positive comment words and the negative comment words respectively;
S34, performing parameter estimation and inference on the improved LDA model based on the dictionaries SentiWordNet and WordNet by Gibbs sampling over the comment text sentence set.
2. The method of claim 1, wherein step S1 includes the sub-steps of:
S11, numbering the words of each sentence in the original network comment library with a two-tuple <a, b>, wherein a represents the number of the sentence in which the word is located and b represents the number of the word within the sentence;
S12, removing repeated words in the original network comment library, and recording the numbers of the remaining words;
S13, generating the inverted index list based on the de-duplicated word numbers.
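Steps S11-S13 can be sketched as follows, with a toy comment library; each occurrence of a word is recorded as the two-tuple <a, b> described in S11, and the dictionary keys provide the de-duplication of S12:

```python
# Sketch of steps S11-S13: number each word occurrence as <sentence number a,
# word number b>, de-duplicate words via the dictionary keys, and collect all
# positions per unique word to form the inverted index (toy corpus assumed).
def build_inverted_index(sentences):
    index = {}
    for a, sent in enumerate(sentences):              # S11: <a, b> numbering
        for b, word in enumerate(sent.split()):
            index.setdefault(word, []).append((a, b))  # S12/S13: one entry per unique word
    return index

idx = build_inverted_index(["the screen is great", "the battery is poor"])
print(idx["the"])
```

Given a high-probability word from the sampling results, its entry in the index directly yields the sentences to retrieve in S4.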
3. The method of claim 1,
β_{t,A} = sim(w, A) * β_base
β_{t,P} = sim(w, P) * β_base
β_{t,N} = sim(w, N) * β_base
wherein N_d is the number of words in the current sentence, T is the number of topics, w_{d,i} is the i-th word in the current sentence, t is a seed word, sim(w, t) denotes the semantic similarity between w and the seed word t, and α_base is the constant parameter α of the Dirichlet distribution that topics obey in the standard LDA model; sim(w, A) denotes the probability that w belongs to the object words, sim(w, P) the probability that w belongs to the positive words, sim(w, N) the probability that w belongs to the negative words, and β_base is the constant parameter β of the Dirichlet distribution that words obey in the standard LDA model.
4. The method of claim 1, wherein the step S34 includes the steps of:
(1) randomly assigning a topic number to each word of each sentence in the corpus, and randomly setting the values of the indicator variables y and v for all words in the sentence, wherein when y = A the current word is an aspect-level object of the comment, when y = O the current word is a comment opinion word, when v = P the current word carries a positive emotion, and when v = N the current word carries a negative emotion;
(2) rescanning the corpus, resampling and updating the topic number of each word according to formula (1), updating the word counts in the corpus, and resampling and updating the indicator variables y and v according to formulas (2) and (3);
wherein z_{d,n} denotes the topic of the d-th comment sentence, t denotes the topic number, y_{d,n} indicates whether the n-th word of the d-th sentence is an aspect-level object word or an emotional opinion word, and v_{d,n} indicates whether the n-th word of the d-th sentence is a positive or a negative emotion word; n_{t,v}^{q} denotes the number of words v with topic t and category q, and β_{t,v}^{q} the corresponding Dirichlet distribution parameter; n_{t,v}^{u} denotes the number of words v with topic t and category u, and β_{t,v}^{u} the corresponding Dirichlet distribution parameter; V denotes the number of words in the corpus; w_{d,n} denotes the n-th word of the d-th comment sentence; n_{t,w_{d,n}}^{u} and β_{t,w_{d,n}}^{u} denote the number and Dirichlet distribution parameter of word w_{d,n} with topic t and category u, and n_{t,w_{d,n}}^{q} and β_{t,w_{d,n}}^{q} those of word w_{d,n} with topic t and category q; n_{d,t} denotes the number of words in the d-th sentence whose topic is t, α_{d,t} the Dirichlet distribution parameter for topic t of the d-th comment sentence, and ¬i denotes that the i-th item is excluded;
(3) repeating the resampling of the corpus until Gibbs sampling converges;
5. The method of claim 4,
the probability distribution of sentence d over topic t is calculated as follows:
with t as the topic, the probability distribution of word w_{d,n} being an evaluated aspect-level object word is calculated as follows:
with t as the topic, the probability distribution of word w_{d,n} being an evaluated positive opinion word is calculated as follows:
with t as the topic, the probability distribution of word w_{d,n} being an evaluated negative opinion word is calculated as follows:
wherein n_d denotes the number of words in the d-th sentence.
6. The method of claim 4, wherein the document generation process of the improved LDA model based on the dictionaries SentiWordNet and WordNet is as follows:
(1) sampling from the Dirichlet distributions β_{t,A}, β_{t,P}, β_{t,N} to generate the aspect-level object word distribution, the positive opinion word distribution and the negative opinion word distribution of the comments;
(2) for each sentence, sampling from the Dirichlet distribution α_d to generate the topic distribution θ_d ~ Dir(α_d);
(3) sampling from the topic multinomial distribution θ_d to generate the topic z_{d,n} ~ Multi(θ_d) of word w_{d,n} in sentence d;
(4) computing from the dictionaries WordNet and SentiWordNet a Bernoulli distribution on {0,1} with parameter π_{d,n} and a Bernoulli distribution on {0,1} with parameter Ω_{d,n};
(5) drawing from the Bernoulli distribution with parameter π_{d,n} the indicator y_{d,n} of whether word w_{d,n} is a comment aspect-level object word or a comment opinion word, and drawing from the Bernoulli distribution with parameter Ω_{d,n} the indicator v_{d,n} of whether it is a positive or a negative opinion word;
(6) generating the word w_{d,n} according to the following formula:
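The six generation steps can be sketched as follows for a single sentence. Symmetric toy hyperparameters stand in for the similarity-derived β_{t,A/P/N}, and fixed coin-flip probabilities stand in for the WordNet/SentiWordNet-derived π_{d,n} and Ω_{d,n}; all sizes and values are illustrative assumptions:

```python
import random

# Illustrative sketch of the generative story in claim 6 for one sentence.
random.seed(0)
T, V = 3, 6  # number of topics, vocabulary size (toy values)

def dirichlet(alphas):
    """Sample from a Dirichlet via normalized Gamma draws."""
    g = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

def categorical(p):
    """Draw an index from a discrete distribution."""
    return random.choices(range(len(p)), weights=p)[0]

# (1) per-topic word distributions for aspect / positive / negative words
phi_A = [dirichlet([0.1] * V) for _ in range(T)]
phi_P = [dirichlet([0.1] * V) for _ in range(T)]
phi_N = [dirichlet([0.1] * V) for _ in range(T)]

theta_d = dirichlet([0.5] * T)        # (2) sentence topic distribution theta_d
words = []
for n in range(4):
    z = categorical(theta_d)          # (3) topic z_{d,n} ~ Multi(theta_d)
    y = categorical([0.5, 0.5])       # (4)-(5) y_{d,n}: aspect (0) or opinion (1)
    v = categorical([0.5, 0.5])       #          v_{d,n}: positive (0) or negative (1)
    phi = phi_A[z] if y == 0 else (phi_P[z] if v == 0 else phi_N[z])
    words.append(categorical(phi))    # (6) draw the word w_{d,n}
print(words)
```

In the model itself the coin-flip parameters π_{d,n} and Ω_{d,n} are not fixed but computed per word from WordNet and SentiWordNet, which is what couples the dictionaries to the generative process.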
7. The method of claim 6, wherein step (4) comprises the steps of:
(4.1) querying in WordNet the semantic interpretations s_{d,n,k} of the current word w_{d,n}, calculating the similarity Sim(s_{d,n,k}, s_{t,k0}) between each sense s_{d,n,k} of w_{d,n} and each sense s_{t,k0} of each seed word w_t, and taking the maximum over all calculated similarities; the k′ at which the maximum is attained determines the semantic s_{d,n,k′} of the current word w_{d,n} in the sentence;
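Step (4.1) can be sketched with a toy sense inventory and similarity table standing in for WordNet synsets and the Sim(·,·) measure; every name and value here is an illustrative assumption:

```python
# Sketch of step (4.1): choose the sense of the current word whose similarity
# to any seed-word sense is maximal. SENSES and SIM are toy stand-ins for
# WordNet synsets and a WordNet-based similarity function.
SENSES = {"sharp": ["sharp#edge", "sharp#smart"], "screen": ["screen#display"]}
SIM = {("sharp#edge", "screen#display"): 0.2,
       ("sharp#smart", "screen#display"): 0.05}

def disambiguate(word, seed_words):
    best_sense, best_sim = None, -1.0
    for s in SENSES[word]:                  # each sense s_{d,n,k} of the word
        for t in seed_words:
            for st in SENSES[t]:            # each sense s_{t,k0} of a seed word
                sim = SIM.get((s, st), SIM.get((st, s), 0.0))
                if sim > best_sim:
                    best_sense, best_sim = s, sim
    return best_sense                       # s_{d,n,k'} at the maximum

print(disambiguate("sharp", ["screen"]))
```

With real data this would iterate over `wordnet.synsets(word)` and a synset similarity such as path similarity, but the selection logic is the same.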
9. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the comment aspect opinion level mining method based on a dictionary-improved LDA model of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911058218.1A CN110837740B (en) | 2019-10-31 | 2019-10-31 | Comment aspect opinion level mining method based on dictionary improvement LDA model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110837740A CN110837740A (en) | 2020-02-25 |
CN110837740B true CN110837740B (en) | 2021-04-20 |
Family
ID=69575829
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210420 |