CN108062304A

CN108062304A - A machine-learning-based sentiment analysis method for product review data

Info

Publication number: CN108062304A
Application number: CN201711376954.2A
Authority: CN
Inventors: 沈琦; 程翔
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2017-12-19
Filing date: 2017-12-19
Publication date: 2018-05-22

Abstract

The invention discloses a machine-learning-based sentiment analysis method for product review data, comprising: collecting and extracting product review data; preprocessing the data, where the preprocessing includes text deduplication, mechanical compression (removal of repeated words), and short-sentence deletion; segmenting the preprocessed data into words with the jieba segmentation method; and building a sentiment analysis model by training word vectors with the neural network language model NNLM, building a semantic network, and mining semantics with an LDA topic model to generate topics without supervision. The invention thus provides an unsupervised sentiment analysis method, and the results show that this approach can effectively analyze the sentiment of user reviews.

Description

A machine-learning-based sentiment analysis method for product review data

Technical field

The present invention relates to the technical field of sentiment analysis of review data, and in particular to a machine-learning-based sentiment analysis method for product review data.

Background technology

With the development of e-commerce, the volume of user data grows day and night. Although storing and maintaining these data imposes considerable cost and technical difficulty on enterprises, the commercial value they contain is immense. Among e-commerce data, product reviews most directly reflect users' views of products and of the e-commerce platform. These data not only reflect users' opinions of a product but also allow users' sentiment to be extracted, which gives other users and e-commerce platform providers a useful commercial reference and provides a basis for product recommendation, product improvement, and comparison between similar products.

Current methods for sentiment analysis of review data mainly comprise four stages. The first stage is the collection and extraction of product review data: a crawler tool suited to the target e-commerce site collects users' product reviews and stores them in a predefined format. The second stage is data exploration and preprocessing: the collected data undergo text deduplication, mechanical compression, and short-sentence deletion, which turns them into a usable data set and filters out a large amount of junk information before the subsequent work. The third stage is word segmentation of the review text. At present there are mainly four approaches to Chinese word segmentation:

String-matching algorithms, which segment text by matching it against the words in a dictionary. This approach is fast and simple to implement, but it handles ambiguous words and words missing from the dictionary poorly; for example, the same string can be segmented as "Changchun City / Changchun / pharmacy" or as "Changchun / mayor / aphrodisiac / shop";

Understanding-based algorithms, which segment text by simulating how people understand sentences. This approach is relatively complex and needs a large amount of linguistic knowledge as support;

Machine-learning-based algorithms, which train a model on text that has already been segmented. The drawback is that a large amount of manually annotated data is needed to train the statistical model, which is slow and labor-intensive;

Statistics-based methods, which assume that the more often adjacent characters occur together, the more likely they are to form a word, and segment on that basis. No dictionary or cluster training is needed; only the frequencies of character groups in the corpus have to be counted.

Sound word segmentation strongly affects the subsequent data modeling. Because the boundary between words and phrases in Chinese is fuzzy, the segmentation stage is often the crux of text sentiment analysis and topic extraction, so choosing a segmentation approach that suits the characteristics of the data set is particularly important. The fourth stage is building the sentiment model: the problem is converted into a machine learning problem, and the data are used to train a sentiment orientation model. Then, to understand in depth why users are satisfied or dissatisfied, latent Dirichlet allocation (LDA) topics are built on the semantically analyzed data to find the latent positive and negative topics, so that the product can be improved in the corresponding respects or the e-commerce platform refined.

Today, sentiment analysis of Chinese short text is mostly carried out on top of Chinese word segmentation. However, Chinese text may use rhetorical devices such as rhetorical questions and double negation, for example "it is not that it cannot" or "why do so many people think so", as well as semantically complex sentences whose first half is negative and whose second half is positive, for example: "The quality is poor and the appearance is plain, but overall it is still very economical." For such complicated Chinese sentences, the most widely used sentiment-richness methods often produce neutral or even opposite results, which introduces a considerable bias into the resulting sentiment model. In Chinese, the meaning of the whole sentence often matters more than the individual words.

Therefore, simple Chinese word segmentation followed by a neural network built on the resulting words can only analyze the literal meaning of short review text. The semantics of the review as a whole is lost, and the result may even be the opposite of what the sentence intends.

Summary of the invention

In view of the shortcomings described above, the present invention provides a machine-learning-based sentiment analysis method for product review data.

To achieve the above object, the present invention provides a machine-learning-based sentiment analysis method for product review data, comprising:

Step 1: collecting and extracting product review data;

Step 2: preprocessing the data, where the preprocessing includes text deduplication, mechanical compression (removal of repeated words), and short-sentence deletion;

Step 3: segmenting the preprocessed data into words with the jieba segmentation method;

Step 4: building a sentiment analysis model:

Step 41: training word vectors with the neural network language model NNLM;

Step 42: building a semantic network;

Step 43: mining semantics with an LDA topic model and generating topics without supervision.

As a further improvement of the present invention, in step 1 the product review data are collected with the Octopus collector.

As a further improvement of the present invention, in step 1 the extracted review data are stored as a table and saved in UTF-8 format.

As a further improvement of the present invention, in step 2 the text deduplication uses edit-distance deduplication with a threshold of 2.

As a further improvement of the present invention, in step 2 the mechanical compression uses a two-stack method.

As a further improvement of the present invention, in step 2 the short-sentence deletion removes sentences whose string length is less than or equal to 3.

As a further improvement of the present invention, in step 3 the jieba segmentation method is the Python jieba segmenter, which supports three segmentation modes: precise mode, full mode, and search-engine mode.

As a further improvement of the present invention, in step 3 the Python jieba segmenter segments the preprocessed data in precise mode.

As a further improvement of the present invention, the following step is further included between step 41 and step 42:

Step 44: dividing the words into two result sets, positive reviews and negative reviews.

Compared with the prior art, the beneficial effects of the present invention are as follows:

The present invention uses the neural network language model NNLM to capture context and performs sentiment analysis of short-text sentences with word vectors that carry contextual semantics. This makes the model more accurate and lets each segmented word carry more information. The sentence structure that segmentation has broken up is reintegrated by building a semantic network, through which the strengths and weaknesses of a specific product and e-commerce platform can be analyzed intuitively. Words judged to be highly similar are mined semantically with the LDA topic model, which generates topics without supervision and finds the latent topics in positive and negative reviews, providing a reliable basis for improving products and e-commerce platforms.

Description of the drawings

Fig. 1 is a flowchart of the machine-learning-based sentiment analysis method for product review data disclosed in an embodiment of the present invention;

Fig. 2 is a matrix representation of the probability formula disclosed in an embodiment of the present invention;

Fig. 3 is a comparison chart of the positive and negative reviews of the iPhone X disclosed in an embodiment of the present invention.

Specific embodiment

To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

In the description of the present invention, it should be noted that the orientations or positional relationships indicated by terms such as "center", "up", "down", "left", "right", "vertical", "horizontal", "inside", and "outside" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the present invention; they do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and they are therefore not to be understood as limiting the invention. In addition, the terms "first", "second", and "third" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance.

In the description of the present invention, it should also be noted that, unless otherwise expressly specified and limited, the terms "mounted", "connected", and "connection" are to be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediary, or internal between two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances.

The present invention is described in further detail below in conjunction with the accompanying drawings：

As shown in Fig. 1, the present invention provides a machine-learning-based sentiment analysis method for product review data, comprising:

S1, collection and extraction of product review data:

Before sentiment analysis can be performed on the review data of the relevant products, the review data must first be collected from the e-commerce platform. The collection should be simple and easy to carry out and should not take up too much of the time of the analysis method. After comparing several common crawler tools, the present invention uses the Octopus collector to capture the data set. Its graphical interface makes it straightforward to crawl data from mainstream e-commerce websites: entering the URL (uniform resource locator) of the review set to be captured and the items to be captured is enough to crawl the data quickly and conveniently.

After the crawler tool has captured a large amount of data, the relevant data must be extracted. The example used by the present invention is sentiment analysis of the reviews of mobile phones on a certain e-commerce platform, so the data to be extracted are the reviews of a particular phone model. The analysis targets that model, and the reviews of several groups of phones are analyzed and compared. The review data are extracted for mobile phones of multiple brands, stored as a table, and saved in UTF-8 format.
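The storage step described above can be sketched as follows. This is a minimal illustration that assumes the table is a CSV file; the column names `brand`, `model`, and `review` and the helper names `save_reviews` and `load_reviews` are hypothetical, since the text only states that the data are stored as a table in UTF-8 format.

```python
import csv

# Hypothetical column names; the source only says "stored as a table, UTF-8".
FIELDS = ["brand", "model", "review"]


def save_reviews(rows, path):
    """Write a list of review dicts to a UTF-8 encoded CSV table."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def load_reviews(path):
    """Read the UTF-8 CSV table back into a list of dicts."""
    with open(path, "r", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))
```

Fixing the encoding explicitly at both write and read time avoids platform-dependent defaults, which matters for Chinese review text.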

S2, data preprocessing:

After the crawler tool has captured the data, basic cleaning and preprocessing are needed to make the data more valuable: review entries that have no influence on the result or that deviate strongly are filtered out. Data preprocessing is vital to the final result. The preprocessing used by the present invention consists of three parts carried out in sequence: text deduplication, mechanical compression (removal of repeated words), and short-sentence deletion.

Text deduplication

Many reviews on e-commerce platforms are duplicates. These duplicate reviews come from three main sources:

1) Reviews set automatically by the e-commerce platform. For users' convenience, when a user has not posted a review for a long time after the purchase, or has only given a score without a review, the platform often runs a program that posts an automated review.

2) Identical reviews from the same user. The same user may buy several phone models or other similar products and, for convenience, reuse the same or a similar review for several products. Even if such reviews are valuable, only one should be kept, or all should be deleted.

3) Identical reviews from different users. Normally, reviews of the same product by different users should not be exactly the same. If reviews from different people are exact duplicates, the causes may vary, but for the result set only one of them is useful and needs to be kept.

The mainstream techniques for judging text similarity include SimHash deduplication, edit-distance deduplication, and k-shingling-based deduplication. After weighing the strengths and weaknesses of each, the present invention uses edit-distance deduplication with a small threshold. The edit distance between two strings is the minimum number of edit operations needed to turn one into the other. If the threshold is set too large, many reviews are deleted by mistake; if it is set too small, the deduplication loses its effect. Considering that review data are short texts with many repetitions, the threshold used in the present invention is 2; that is, when the edit distance between two reviews is less than 2, one of them is deleted.
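The edit-distance deduplication described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes the Levenshtein distance (insertions, deletions, substitutions) and a pairwise comparison against the reviews already kept, and the function names `edit_distance` and `deduplicate` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def deduplicate(reviews, threshold=2):
    """Keep a review only if its distance to every kept review
    is at least the threshold (distance < threshold means duplicate)."""
    kept = []
    for review in reviews:
        if all(edit_distance(review, other) >= threshold for other in kept):
            kept.append(review)
    return kept
```

The pairwise comparison is quadratic in the number of reviews, which is acceptable for a sketch; a large data set would need blocking or hashing first.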

Mechanical compression (removal of repeated words)

Dirty data in e-commerce product reviews take many forms. Another common kind is the mechanically repeated review, in which the text contains consecutive repetitions. Most of these are meaningless repetitions that consumers type after a purchase to reach a required character count; such users usually have no real interest in reviewing and may simply be reviewing in the easiest way possible.

Mechanical compression compresses sentences that contain consecutive redundant repetitions. For the compression, the present invention uses a two-stack method. The first character is pushed onto the first stack. Each following character is compared with the bottom element of the first stack and pushed onto the first stack if it differs. If it is the same, it is pushed onto the second stack, characters are then read into the second stack until both stacks have the same length, and the contents of the two stacks are compared; if they are identical, the second stack is emptied. This check has a significant flaw with phrases such as "study hard", whose Chinese form begins with a legitimately doubled character, so the check is triggered only when the stack length is greater than or equal to 2. Reviews such as "really very handy" can still occur, so if a character is the same as the bottom element of the first stack while the second stack still holds elements, it is also necessary to judge whether a repetition is present. Once these cases are handled, mechanical compression is complete.
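The effect of mechanical compression can be illustrated with a simplified sketch. This is not the two-stack procedure itself: it swaps in a regular expression that collapses consecutive repetitions of a unit at least 2 characters long, which mirrors the rule above that the check is triggered only at length 2 or more. The function name `compress_repeats` and the sample strings are assumptions.

```python
import re

# A unit of at least 2 characters followed by one or more copies of itself.
_REPEATED_UNIT = re.compile(r"(.{2,}?)\1+")


def compress_repeats(text):
    """Collapse consecutive repeated units until nothing changes."""
    previous = None
    while previous != text:
        previous = text
        text = _REPEATED_UNIT.sub(r"\1", text)
    return text
```

Because the minimum unit length is 2, a legitimately doubled single character, as in 好好学习 ("study hard"), is left untouched.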

Short-sentence deletion

Reviews with too few characters rarely contain information that helps the result set, so they need to be deleted. Some sentences are only 1 or 2 characters long after the mechanical compression above. For this reason, the present invention filters out all short sentences whose string length is less than or equal to 3.
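The short-sentence filter described above is a one-line operation. The sketch below assumes that length is counted in characters; the function name `drop_short_sentences` is hypothetical.

```python
def drop_short_sentences(reviews, max_length=3):
    """Remove reviews whose character count is <= max_length."""
    return [review for review in reviews if len(review) > max_length]
```

For example, `drop_short_sentences(["好", "很好用", "物流太慢了"])` keeps only the last review, because the first two have 1 and 3 characters.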

S3, text segmentation:

Chinese word segmentation is a processing step specific to Chinese natural language processing. Sentences and characters are easy to identify in Chinese, but Chinese words have no explicit boundaries, and large numbers of Internet slang terms and new Chinese words appear all the time. Good Chinese word segmentation therefore has an important influence on the subsequent modeling.

To address the defects of existing segmentation methods, the present invention segments the review data with the Python jieba segmenter, which supports three segmentation modes:

1) Precise mode, which tries to cut the sentence as accurately as possible and is suitable for text analysis;

2) Full mode, which scans out every string in the sentence that can form a word; it is very fast but cannot resolve ambiguity;

3) Search-engine mode, which builds on precise mode and cuts long words again to improve recall; it is suitable for search-engine segmentation.

For example, take the review "Just got the 8P. Much better than Android; it is simply good, and I have to admire the Apple system. I am only reviewing it after more than a week of use" as the string text. The segmentation then proceeds as follows:

Precise mode:

jieba.cut(text, cut_all=False)

The segmentation result is:

"just / start with / 8P / . / stronger than Android / more / be exactly / good / have to / admire / Apple / system / . / use / more than one / week / come / evaluation".

Full mode:

jieba.cut(text, cut_all=True)

The segmentation result is:

"just / start with / 8P / than / An / Zhuo / strong / more / exactly / good / must not / have to / admire / Apple / system / use / one / more than one / more weeks / week / come / evaluation".

The cut() method takes the string to be segmented as its first parameter, and its cut_all parameter controls whether full mode is used. After comparing the three modes, the present invention uses precise mode, which segments short text better. jieba also supports segmentation of traditional Chinese and custom dictionaries, so the present invention can specify a dictionary containing words that are missing from the jieba dictionary. Although jieba can recognize new words, adding new words manually ensures higher accuracy.
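The three segmentation modes can be called as in the sketch below. It assumes the third-party `jieba` package is installed; the wrapper name `segment`, the mode labels, and the sample word passed to `jieba.add_word` are illustrative choices, not part of the source.

```python
def segment(text, mode="precise"):
    """Segment Chinese text with jieba in one of its three modes."""
    if mode not in ("precise", "full", "search"):
        raise ValueError("mode must be 'precise', 'full' or 'search'")
    import jieba  # third-party dependency, imported lazily

    if mode == "precise":
        return jieba.lcut(text, cut_all=False)
    if mode == "full":
        return jieba.lcut(text, cut_all=True)
    return jieba.lcut_for_search(text)


def add_custom_words(words):
    """Register words missing from the default dictionary."""
    import jieba

    for word in words:
        jieba.add_word(word)
```

A whole dictionary file can be loaded with `jieba.load_userdict(path)` instead of adding words one at a time. In precise mode the tokens concatenate back to the input text, which makes it the natural choice for the downstream analysis.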

S4, building the sentiment analysis model:

S41, training word vectors:

Chinese has many near-synonyms and words of similar meaning. To handle them, words need to be represented as vectors in a distributed representation. Different training methods or training data yield different word vectors; in the end, words with similar meanings have vectors that are close together, and words with little relation in meaning have vectors that are farther apart. The present invention trains on the text data set with Google's open-source project word2vec, which uses a neural network to find a representation of each word in a continuous vector space. Words are thus understood within their sentences, so that the words in the same sentence are no longer isolated.

An N-gram language model predicts the next word, the n-th word, from the preceding n-1 words. However, N-gram models depend too heavily on the corpus and cannot model the similarity between words. Suppose two words are similar: if one of them frequently appears after a certain passage, the other is probably also likely to appear after that passage. But if the first word has many such combinations in the training corpus and the second word is rare in the corpus, the estimated probability of the first word will be much larger.
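The corpus dependence of N-gram models can be seen in a small count-based sketch. It assumes the simplest case, a bigram model (n = 2) with maximum-likelihood estimates; the function name `bigram_probabilities` and the toy English token lists are illustrative only.

```python
from collections import Counter


def bigram_probabilities(sentences):
    """Estimate P(next | previous) from raw counts."""
    context_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        for previous, following in zip(tokens, tokens[1:]):
            context_counts[previous] += 1
            pair_counts[(previous, following)] += 1
    return {pair: count / context_counts[pair[0]]
            for pair, count in pair_counts.items()}


corpus = [
    ["screen", "is", "great"],
    ["screen", "is", "great"],
    ["screen", "is", "good"],
]
probabilities = bigram_probabilities(corpus)
```

In this toy corpus "great" receives twice the probability of "good" after "is" purely because of the counts, even though the two words are similar, and any pair that never occurs receives no probability at all.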

To solve this problem, the present invention builds the prediction probability model with the neural network language model NNLM:

NNLM is at its core a three-layer neural network model with an input layer, a hidden layer, and an output layer. The input layer consists of the n-1 word vectors of dimension m that are used for the prediction. The hidden-layer representation, that is, the parameters learned before the output layer, is the word-related vector to be obtained; this by-product of training is the word vector the method needs.
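The three-layer structure above can be sketched as a forward pass in plain Python. This is an untrained illustration under stated assumptions: the parameter names `C` (word-vector table), `H`, `d`, `U`, and `b` follow a common NNLM formulation and do not come from the source, the optional direct input-to-output connections are omitted, and the sizes are arbitrary.

```python
import math
import random


def init_nnlm(vocab_size, m, context_size, hidden_size, seed=0):
    """Create small random parameters for a toy NNLM."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]

    return {
        "C": matrix(vocab_size, m),                  # one m-dim vector per word
        "H": matrix(hidden_size, context_size * m),  # input -> hidden weights
        "d": [0.0] * hidden_size,                    # hidden bias
        "U": matrix(vocab_size, hidden_size),        # hidden -> output weights
        "b": [0.0] * vocab_size,                     # output bias
    }


def nnlm_forward(params, context_ids):
    """Return P(next word | n-1 context words) as a list over the vocabulary."""
    # Input layer: concatenate the n-1 word vectors.
    x = [value for idx in context_ids for value in params["C"][idx]]
    # Hidden layer: tanh of an affine map of the input.
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bias)
              for row, bias in zip(params["H"], params["d"])]
    # Output layer: affine map followed by a numerically stable softmax.
    scores = [sum(w * h for w, h in zip(row, hidden)) + bias
              for row, bias in zip(params["U"], params["b"])]
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]
```

Training would adjust all of these parameters to maximize the probability of the observed next word, and the learned vectors are then reused as word vectors.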

S42, building the semantic network:

Because segmentation breaks up the overall structure of a sentence, complex analysis of the product becomes impractical. Some method is therefore needed to reintegrate the scattered phrases and make complex analysis simple. The present invention uses a semantic network to make data analysis convenient, particularly when judging the strengths and weaknesses of a product and the shortcomings of an e-commerce platform.

Before the semantic network is built, the words need to be divided into two result sets, positive reviews and negative reviews, because positive and negative reviews focus on different points and reflect different information. A positive-review semantic network and a negative-review semantic network are therefore built separately.
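The text does not specify how the reviews are split or how the network is constructed, so the following is only one possible sketch. It assumes that each segmented review already carries a "positive" or "negative" label and that the network is a word co-occurrence graph whose edge weights count the reviews in which two words appear together; all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def split_by_polarity(labeled_reviews):
    """Split (tokens, label) pairs into positive and negative token lists."""
    positive = [tokens for tokens, label in labeled_reviews
                if label == "positive"]
    negative = [tokens for tokens, label in labeled_reviews
                if label == "negative"]
    return positive, negative


def cooccurrence_edges(token_lists):
    """Count, for each unordered word pair, the reviews containing both."""
    edges = Counter()
    for tokens in token_lists:
        for pair in combinations(sorted(set(tokens)), 2):
            edges[pair] += 1
    return edges


labeled = [
    (["screen", "clear"], "positive"),
    (["screen", "clear", "fast"], "positive"),
    (["delivery", "slow"], "negative"),
]
positive_reviews, negative_reviews = split_by_polarity(labeled)
positive_network = cooccurrence_edges(positive_reviews)
negative_network = cooccurrence_edges(negative_reviews)
```

Heavily weighted edges in each network then point to the aspects that positive or negative reviews mention together most often.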

S43, LDA topic model analysis:

LDA in effect clusters on the basis of sentences, that is, character strings, grouping different sentences into several topics. The traditional way to judge the similarity of two documents is to count the words that the two documents share, as in TF-IDF. This approach ignores the semantic associations behind the words: two documents may share few or even no words and still be similar.

For example, consider the following two sentences:

"Jobs has left us."

"Will Apple's prices drop?"

These two sentences share no words, yet they are similar. Judged by the traditional method, they would certainly be dissimilar, so the semantics of the text must be taken into account when judging text relevance. Topic models are a powerful tool for semantic mining, and LDA is one of the more effective ones.
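The limitation of word-overlap measures can be checked directly. The sketch below assumes a plain bag-of-words cosine similarity as a stand-in for the traditional method and uses hypothetical English token lists for the two example sentences.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(value * value for value in counts_a.values()))
    norm_b = math.sqrt(sum(value * value for value in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


sentence_1 = ["Jobs", "has", "left", "us"]
sentence_2 = ["will", "Apple", "prices", "drop"]
similarity = cosine_similarity(sentence_1, sentence_2)
```

Because the two token lists share no word, the similarity is exactly zero, even though both sentences concern the same company. This is the gap that a topic model is meant to close.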

In a topic model, a topic represents a concept or an aspect. It appears as a series of related words and is the conditional probability of those words. Figuratively, a topic is a bucket filled with words that have a high probability of occurring, and these words are strongly related to the topic.

How are topics generated? How should the topics of an article be analyzed? These are the questions a topic model must answer.

First, documents and topics can be viewed from the perspective of a generative model. A generative model assumes that each word of an article is obtained by a process of "selecting a topic with a certain probability, and then selecting a word from that topic with a certain probability". If a document is to be generated, the probability of each word appearing in it is:

$$p(\text{word} \mid \text{document}) = \sum_{\text{topic}} p(\text{word} \mid \text{topic}) \times p(\text{topic} \mid \text{document})$$

This probability formula can be expressed in matrix form, as shown in Fig. 2:

Here the "document-word" matrix gives the frequency of each word in each document, that is, its probability of occurrence; the "topic-word" matrix gives the probability of each word occurring in each topic; and the "document-topic" matrix gives the probability of each topic occurring in each document.

Given a collection of documents, segmenting the documents and computing the frequency of each word in each document yields the "document-word" matrix on the left-hand side. A topic model trains on this left-hand matrix to learn the two matrices on the right-hand side.
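The relationship between the three matrices can be verified with toy numbers. In the sketch below the two right-hand matrices are invented for illustration (2 documents, 2 topics, 3 words); an actual topic model would learn them from the observed left-hand matrix rather than the other way round.

```python
def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    return [[sum(left[i][k] * right[k][j] for k in range(len(right)))
             for j in range(len(right[0]))]
            for i in range(len(left))]


# Rows: documents, columns: topics. Each row sums to 1.
document_topic = [[0.9, 0.1],
                  [0.2, 0.8]]
# Rows: topics, columns: words. Each row sums to 1.
topic_word = [[0.5, 0.4, 0.1],
              [0.1, 0.2, 0.7]]

# "document-word" = "document-topic" x "topic-word"
document_word = matmul(document_topic, topic_word)
```

Each entry of `document_word` is the sum over topics of P(word | topic) times P(topic | document), and each row again sums to 1, so it is a valid word distribution for that document.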

In general, every review has a topic. If a latent topic appears in many reviews at once, it is a hot spot of common concern, and the high-frequency feature words in a latent topic are likely to be the words of hot topics. These keywords are often the key information, and they can give the e-commerce platform or the product ideas for improvement and an understanding of where its competitive advantage lies.

S5, experimental simulation:

The sentiment analysis method has been described above. A simulation experiment is now carried out with this method to obtain the desired sentiment topic analysis and topic keywords. The present invention performs sentiment analysis mainly on reviews of the iPhone X mobile phone on the JD.com platform, using the method to analyze comprehensively the strengths and weaknesses of the product and to assess improvement suggestions and competitive advantages for the e-commerce platform and the product. Fig. 3 compares the positive and negative reviews of the iPhone X, and Table 1 lists the latent topics of the iPhone X reviews.

Table 1

In summary, the high-frequency words in the topics show that the strengths of the iPhone X lie in its appearance, frame, and quality, while user complaints concern logistics delivery to the address, discomfort in use, thickness, and black-screen problems.

The present invention takes into account the synonyms and the hypernyms and hyponyms of the words in the text: the word frequencies of synonyms and of hypernyms and hyponyms are increased according to their similarity, which reduces the influence of multiple words with the same meaning on classification. Unlike conventional methods, which extract features from a single feature matrix with a single method, the present invention represents words as word vectors through NNLM, so that each word covers the contextual information of its sentence and is understood within the sentence. The sentiment topic model then performs cluster analysis on the topics of the sentences and generates topics without supervision. In this way the latent topics that concern users can be found in the review data, providing a reliable basis for improving products and e-commerce platforms.

The above are only preferred embodiments of the present invention and are not intended to limit it. For a person skilled in the art, the invention may be modified and varied in many ways. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A machine-learning-based sentiment analysis method for product review data, characterized by comprising:

Step 1: collecting and extracting product review data;

Step 2: preprocessing the data, where the preprocessing includes text deduplication, mechanical compression (removal of repeated words), and short-sentence deletion;

Step 3: segmenting the preprocessed data into words with the jieba segmentation method;

Step 4: building a sentiment analysis model:

Step 41: training word vectors with the neural network language model NNLM;

Step 42: building a semantic network;

Step 43: mining semantics with an LDA topic model and generating topics without supervision.
2. The machine-learning-based sentiment analysis method for product review data according to claim 1, characterized in that, in step 1, the product review data are collected with the Octopus collector.
3. The machine-learning-based sentiment analysis method for product review data according to claim 1, characterized in that, in step 1, the extracted review data are stored as a table and saved in UTF-8 format.
4. The machine-learning-based sentiment analysis method for product review data according to claim 1, characterized in that, in step 2, the text deduplication uses edit-distance deduplication, and the threshold of the edit-distance deduplication is 2.
5. The machine-learning-based sentiment analysis method for product review data according to claim 1, characterized in that, in step 2, the mechanical compression uses a two-stack method.
6. The machine-learning-based sentiment analysis method for product review data according to claim 1, characterized in that, in step 2, the short-sentence deletion removes short sentences whose string length is less than or equal to 3.
7. The machine-learning-based sentiment analysis method for product review data according to claim 1, characterized in that, in step 3, the jieba segmentation method is the Python jieba segmenter, which supports three segmentation modes: precise mode, full mode, and search-engine mode.
8. The machine-learning-based sentiment analysis method for product review data according to claim 7, characterized in that, in step 3, the Python jieba segmenter segments the preprocessed data in precise mode.
9. The machine-learning-based sentiment analysis method for product review data according to claim 1, characterized in that the following step is further included between step 41 and step 42:

Step 44: dividing the words into two result sets, positive reviews and negative reviews.