CN108062304A - Sentiment analysis method for product review data based on machine learning

Info

Publication number
CN108062304A
Authority
CN
China
Prior art keywords
comment
sentiment analysis
word
machine learning
commodity data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711376954.2A
Other languages
Chinese (zh)
Inventor
沈琦
程翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201711376954.2A
Publication of CN108062304A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q30/0203 Market surveys; Market polls

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Game Theory and Decision Science (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a sentiment analysis method for product review data based on machine learning, comprising: collecting and extracting product review data; preprocessing the data, where preprocessing includes text deduplication, mechanical compression (removal of mechanically repeated words), and deletion of short sentences; segmenting the preprocessed text into words with the jieba ("stutter") segmentation method; and building a sentiment analysis model by training word vectors with a neural network language model (NNLM), building a semantic network, and performing semantic mining with an LDA topic model to generate topics without supervision. The invention thereby provides an unsupervised sentiment analysis method, and the results indicate that this approach can effectively analyze the sentiment of user reviews.

Description

Sentiment analysis method for product review data based on machine learning
Technical field
The present invention relates to the technical field of sentiment analysis of review data, and in particular to a sentiment analysis method for product review data based on machine learning.
Background
As e-commerce develops, user data grows day and night. Storing and maintaining these data imposes considerable cost and technical difficulty on enterprises, but the commercial value they contain is inestimable. Among e-commerce data, product reviews most directly reflect users' views of products and of the e-commerce platform. They reveal users' opinions of a product, and users' sentiment can also be extracted from them. This gives other users and e-commerce platform providers a commercial reference, and offers a basis for product recommendation, product improvement, and comparison between similar products.
Current methods for sentiment analysis of review data mainly comprise four stages. The first stage is the collection and extraction of product review data: a crawler tool suited to the target e-commerce site collects users' product reviews and stores them in a predefined format. The second stage is data exploration and preprocessing: the collected data undergo text deduplication, mechanical compression, and short-sentence deletion, which turns them into a usable data set and filters out much junk information ahead of later work. The third stage is word segmentation of the review text. At present there are four main approaches to segmenting Chinese text:
String-matching algorithms, which match the text against the words in a dictionary. This approach is fast and simple to implement, but it handles ambiguous words and words absent from the dictionary poorly; for example, the same string can be segmented as "Changchun City / Changchun / pharmacy" or as "Changchun / mayor / aphrodisiac / shop".
Understanding-based algorithms, which simulate how people segment a sentence by understanding it. This approach is comparatively complex and needs a large amount of linguistic knowledge as support.
Machine-learning-based algorithms, which train a model on text that has already been segmented. The drawbacks are that a large amount of manually annotated data is needed to train the statistical model, training is slow, and annotation is labor-intensive.
Statistics-based methods, which assume that the more often adjacent characters occur together, the more likely they are to form a word, and segment on that basis. No dictionary or separate training is needed; only the frequencies of character groups in the corpus have to be counted.
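The dictionary-matching approach in the list above, and its weakness with words missing from the dictionary, can be sketched as follows. This is a minimal forward-maximum-matching sketch, not the method of the invention. The function name `forward_max_match`, the toy dictionaries, and the Chinese string (my reconstruction of the translated "Changchun" example) are all assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len=3):
    """Greedy dictionary segmentation: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical dictionaries: the second one lacks the word for "Changchun City".
full_dict = {"长春市", "长春", "市长", "春药", "药店"}
small_dict = {"长春", "市长", "春药", "药店"}

sentence = "长春市长春药店"
print(forward_max_match(sentence, full_dict))
print(forward_max_match(sentence, small_dict))
```

With the complete dictionary the string is cut into the intended three words; once one word is missing, the greedy match drifts into the unintended reading, which is the failure mode described in the first list item.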
Sound segmentation has a large effect on subsequent modeling. Because the boundary between Chinese words and phrases is rather blurred, the segmentation stage is often the focus of text sentiment analysis and topic extraction, so choosing a segmentation approach that suits the characteristics of the data set is particularly important. The fourth stage is building the sentiment model. This stage mainly converts the problem into a machine learning problem: a sentiment orientation model is trained on the data. Then, to understand in depth what users are satisfied or dissatisfied with, latent Dirichlet allocation (LDA) topics are built on the data after semantic analysis to find latent positive or negative topics, so that the product can be improved in the corresponding respects or the e-commerce platform can be refined.
Today, sentiment analysis of Chinese short text is mostly carried out on top of Chinese word segmentation. In use, however, Chinese contains rhetorical devices such as rhetorical questions and double negation, for example "it is not that it cannot" or "why do so many people feel ...", as well as semantically complex sentences whose first half is negative and whose second half is positive, such as "The quality is poor and the appearance is plain, but overall it is still very economical." For such complex sentence patterns, the most widely used sentiment analysis methods often produce a neutral or even opposite result, which can strongly bias the resulting sentiment model. In Chinese, the meaning of the sentence is often more important than the words themselves.
Therefore, relying only on simple Chinese word segmentation and building a neural network over the resulting words can analyze only the literal meaning of a short review. The overall meaning of the review text, and with it the information the text carries, is lost, and the result can even be the opposite of the attitude the sentence intends.
Summary of the invention
In view of the above shortcomings, the present invention provides a sentiment analysis method for product review data based on machine learning.
To achieve the above object, the present invention provides a sentiment analysis method for product review data based on machine learning, comprising:
Step 1: collecting and extracting product review data;
Step 2: data preprocessing, where the preprocessing includes text deduplication, mechanical compression (removal of mechanically repeated words), and short-sentence deletion;
Step 3: segmenting the preprocessed data into words based on the jieba ("stutter") segmentation method;
Step 4: building the sentiment analysis model:
Step 41: training word vectors based on the neural network language model NNLM;
Step 42: building a semantic network;
Step 43: performing semantic mining based on the LDA topic model and generating topics without supervision.
As a further improvement of the present invention, in step 1 the product review data are collected with the Octopus collector.
As a further improvement of the present invention, in step 1 the extracted review data are stored as a table, and the data are saved in UTF-8 format.
As a further improvement of the present invention, in step 2 the text deduplication uses edit-distance deduplication, and the threshold of the edit-distance deduplication is 2.
As a further improvement of the present invention, in step 2 the mechanical compression uses a method based on two stacks.
As a further improvement of the present invention, in step 2 the short-sentence deletion removes short sentences whose string length is less than or equal to 3.
As a further improvement of the present invention, in step 3 the jieba segmentation method is the Python jieba segmenter, which supports three segmentation modes: accurate mode, full mode, and search-engine mode.
As a further improvement of the present invention, in step 3 the Python jieba segmenter uses accurate mode to segment the preprocessed data.
As a further improvement of the present invention, the following step is further included between step 41 and step 42:
Step 44: dividing the words into two result sets, positive reviews and negative reviews.
Compared with the prior art, the beneficial effects of the present invention are:
The present invention uses the neural network language model NNLM to capture context, and analyzes the sentiment of short-text sentences with word vectors that carry contextual semantics. This makes the model more accurate and lets each segmented word carry more information. Sentence structure that becomes disordered after segmentation is reintegrated by building a semantic network, through which the strengths and weaknesses of a specific product and e-commerce platform can be analyzed intuitively. Words judged to be highly similar are mined semantically with the LDA topic model, which generates topics without supervision and finds the latent topics in positive and negative reviews, providing a reliable basis for improving the product and the e-commerce platform.
Brief description of the drawings
Fig. 1 is a flow chart of the sentiment analysis method for product review data based on machine learning disclosed in an embodiment of the present invention;
Fig. 2 is the matrix representation of the probability formula disclosed in an embodiment of the present invention;
Fig. 3 is a comparison of the positive and negative reviews of the iPhone X disclosed in an embodiment of the present invention.
Detailed description
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
In the description of the present invention, it should be noted that terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" indicate orientations or positional relationships based on those shown in the drawings. They are used only to facilitate and simplify the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore should not be construed as limiting the invention. In addition, the terms "first", "second", and "third" are used for descriptive purposes only and should not be understood as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise expressly specified and limited, the terms "mounted", "connected", and "coupled" should be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediary, or internal to two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific situation.
The present invention is described in further detail below with reference to the accompanying drawings:
As shown in Fig. 1, the present invention provides a sentiment analysis method for product review data based on machine learning, comprising:
S1. Collection and extraction of product review data:
Before sentiment analysis can be performed on the reviews of a product, the reviews must first be collected from the e-commerce platform. Collection should be simple and easy to operate and should not take up much of the time of the overall analysis. After comparing several common crawler tools, the present invention uses the Octopus collector to capture the data set. Mainstream e-commerce sites can be crawled through its graphical interface alone: entering the URL (uniform resource locator) of the review collection to be captured and the items to capture is enough to crawl the data quickly and easily.
After the crawler tool has captured a large amount of data, the relevant data must be extracted. The example adopted by the present invention performs review sentiment analysis on mobile phones on a certain e-commerce platform, so the data to be extracted are the reviews of a particular phone model. That model is analyzed, and the reviews of several phones are analyzed and compared. The review data are extracted for phones of multiple brands, stored as a table, and saved in UTF-8 format.
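The storage step above (a table saved as UTF-8) could look roughly like the sketch below. The source does not specify the table layout, so the CSV format, the column names `brand`, `model`, and `review`, and the helper names `save_reviews` and `load_reviews` are assumptions chosen only for illustration.

```python
import csv

# Hypothetical column layout; the source only says "a table in UTF-8".
FIELDS = ["brand", "model", "review"]


def save_reviews(path, rows):
    """Write review rows (dicts keyed by FIELDS) to a UTF-8 CSV file."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def load_reviews(path):
    """Read the rows back as a list of dicts."""
    with open(path, encoding="utf-8", newline="") as f:
        return [dict(row) for row in csv.DictReader(f)]
```

Passing `encoding="utf-8"` explicitly matters here because the platform default encoding may not be able to represent Chinese review text.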
S2. Data preprocessing:
After the crawler tool has captured the data, basic cleaning and preprocessing are needed to make the data more valuable. Review entries that have no influence on the result, or that deviate strongly, are filtered out. Data preprocessing is vital to obtaining the desired final result. The preprocessing used in the present invention mainly consists of three parts carried out in sequence: text deduplication, mechanical compression (removal of mechanically repeated words), and short-sentence deletion.
Text deduplication
Many reviews on an e-commerce platform are duplicates, and these duplicates come mainly from three sources:
1) Reviews set automatically by the platform. Some users do not post a review for a long time after purchase, or give only a rating without a comment, and the platform often runs a program that posts automated reviews in such cases.
2) Identical reviews from the same user. The same user may buy several phones or other similar products and, for convenience, reuse the same or a similar review for several products. Even if such reviews are valuable, only one should be kept, or all should be deleted.
3) Identical reviews from different users. Normally, different users' reviews of the same product should not be exactly the same. If they are, then whatever the cause, only one copy is useful for the result set and needs to be kept.
Mainstream techniques for judging text similarity include SimHash deduplication, edit-distance deduplication, and k-shingling-based deduplication. After weighing the strengths and weaknesses of each, the present invention uses edit-distance deduplication with a small threshold. The edit distance between two strings is the minimum number of edit operations needed to turn one into the other. If the threshold is set too large, many reviews are deleted by mistake; if it is set too small, duplicates are missed. Considering that reviews are short texts and that there are many duplicates, the threshold used in the present invention is 2: when the edit distance between two reviews is less than 2, one of them is deleted.
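The edit-distance deduplication described above can be sketched as follows. This is a minimal sketch assuming the Levenshtein distance (insert, delete, substitute) as the edit distance and a keep-the-first policy; the names `edit_distance` and `deduplicate` are hypothetical, and the pairwise comparison is quadratic in the number of reviews, so a real pipeline would likely need blocking or hashing first.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def deduplicate(reviews, threshold=2):
    """Keep a review only if its distance to every kept review is >= threshold."""
    kept = []
    for review in reviews:
        if all(edit_distance(review, other) >= threshold for other in kept):
            kept.append(review)
    return kept


print(deduplicate(["good phone", "good phone!", "fast delivery", "good phone"]))
```

With the threshold of 2 stated in the text, reviews that are identical or differ by a single character are treated as duplicates, and only the first one encountered is kept.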
Mechanical compression (removal of repeated words)
Dirty data in e-commerce product reviews take many forms. Another common kind is the mechanically repeated review, in which the text contains consecutive repetitions. Such reviews are mostly meaningless repetition written after purchase to reach a required number of characters; most of these users have no real interest in reviewing and may be commenting only perfunctorily.
Mechanical compression compresses sentences that contain consecutive redundant repetition. The present invention uses a two-stack method. The first character is pushed onto the first stack. For each following character, it is checked whether the character equals the bottom element of the first stack; if not, the character is pushed onto the first stack. If it is the same, the character is added to the second stack, further characters are read into the second stack until the two stacks have equal length, and the contents of the two stacks are compared; if they are identical, the second stack is cleared. Such a check, however, has a serious problem with words such as "study hard", whose Chinese form begins with a doubled character. The check is therefore triggered only when the stack length is greater than or equal to 2. Reviews such as "really very handy" can also occur, so if a character equals the bottom element of the first stack while the second stack still holds elements, it must again be judged whether a repetition is present. Once these cases are handled, mechanical compression is complete.
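The effect of mechanical compression can be illustrated with the sketch below. It is a simplified stand-in and not the two-stack procedure itself: it compares adjacent substrings directly and collapses any unit that repeats back to back. The name `compress_repeats` and the parameter `min_unit` are hypothetical; `min_unit=2` mirrors the rule above that the check is triggered only for lengths of at least 2, so a word that merely starts with a doubled character is left alone.

```python
def compress_repeats(text, min_unit=2):
    """Collapse back-to-back repetitions of any unit of at least min_unit chars."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        collapsed = False
        # Try every unit length that could repeat at least once from position i.
        for size in range(min_unit, (n - i) // 2 + 1):
            unit = text[i:i + size]
            j = i + size
            while text[j:j + size] == unit:
                j += size
            if j > i + size:       # the unit occurred more than once in a row
                out.append(unit)   # keep a single copy
                i = j
                collapsed = True
                break
        if not collapsed:
            out.append(text[i])
            i += 1
    return "".join(out)


print(compress_repeats("很好很好很好用"))  # repeated two-character unit
print(compress_repeats("好好学习"))        # doubled single character is kept
```

The direct substring comparison is easier to verify than explicit stacks but is slower on long strings, which is acceptable for short reviews.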
Short-sentence deletion
Reviews with very few characters rarely contain information that helps the result set, so they need to be deleted. In addition, after the mechanical compression above, some sentences are left with a length of only 1 or 2. For this reason, the present invention filters out all short sentences whose string length is less than or equal to 3.
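The length filter above amounts to a one-line predicate. A minimal sketch, assuming length is counted in characters of the already compressed string; the name `drop_short` is hypothetical.

```python
def drop_short(reviews, max_len=3):
    """Remove reviews whose length in characters is max_len or less."""
    return [review for review in reviews if len(review) > max_len]


print(drop_short(["好", "不错", "很好用", "手机很好用"]))
```

Running this filter after deduplication and compression, in the order given in the text, ensures that sentences shortened by compression are also removed.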
S3. Text segmentation:
Chinese word segmentation is a processing step specific to Chinese natural language processing. Sentences and characters are easy to identify in Chinese, but words are not separated by explicit boundaries, and large numbers of internet expressions and new words appear all the time. Good Chinese word segmentation therefore has an important influence on subsequent modeling.
In view of the shortcomings of existing segmentation methods, the present invention uses the Python jieba ("stutter") segmenter to segment the review data. It supports three segmentation modes:
1) Accurate mode, which tries to cut the sentence as precisely as possible and is suitable for text analysis;
2) Full mode, which scans out every string in the sentence that can form a word; it is very fast but cannot resolve ambiguity;
3) Search-engine mode, which, on the basis of accurate mode, cuts long words again to improve recall and is suitable for search-engine segmentation.
For example, segmenting the review "Just got the 8P. It is much better than Android; I have to admire the Apple system. I am only reviewing it after more than a week of use" gives:
Accurate mode:
jieba.cut("Just got the 8P. It is much better than Android; I have to admire the Apple system. I am only reviewing it after more than a week of use", cut_all=False)
Word segmentation result is:
" just/start with// 8P/./// stronger than Android/more be exactly it is/good/have to/admire/apple/system/./ use/ / more than mono-/week// come/evaluation/".
Full mode:
jieba.cut("Just got the 8P. It is much better than Android; I have to admire the Apple system. I am only reviewing it after more than a week of use", cut_all=True)
Word segmentation result is:
" just/start with// 8P//than/peace/Zhuo/strong/more// exactly/good/must not/have to/admire/apple/be System ///evaluated with// mono-/more than mono-/more week/week// come// "
The cut() method is used here with two parameters: the first is the string to be segmented, and cut_all controls whether full mode is used. After comparing the three modes, the present invention uses accurate mode, which segments short text better. jieba also supports segmenting traditional Chinese text and custom dictionaries, so the present invention can specify a dictionary that contains words missing from the jieba dictionary. Although jieba can recognize new words, adding new words manually ensures higher accuracy.
S4. Building the sentiment analysis model:
S41. Training word vectors:
Chinese has many synonyms and near-synonyms. To handle them, words are represented as word vectors in the form of a distributed representation. Different training methods or training corpora yield different word vectors, but in the end words with similar meanings have vectors that are close together, and words with little relation in meaning are far apart. In the present invention, the text data set is trained with Google's open-source project word2vec, which uses a neural network to find a representation of each word in a continuous vector space. In other words, each word is understood within its sentence, so the words in a sentence are no longer isolated.
An N-gram language model predicts the next word, i.e. the n-th word, from the preceding n-1 words. However, N-gram models depend heavily on the corpus and cannot model the similarity between words. Suppose two words are similar: if one frequently appears after a certain passage, the other is probably also likely to appear after it. But if the first word occurs in many combinations in the training corpus while the second is rare in the corpus, the estimated probability of the first will be much larger.
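The corpus dependence described above shows up even in the smallest N-gram model. The sketch below estimates bigram (n = 2) probabilities by maximum likelihood with no smoothing; the function name `bigram_model` and the toy English corpus are assumptions made for illustration.

```python
from collections import Counter


def bigram_model(sentences):
    """Return prob(prev, word) estimated as count(prev, word) / count(prev)."""
    context_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        context_counts.update(tokens[:-1])             # words that have a successor
        pair_counts.update(zip(tokens, tokens[1:]))    # adjacent word pairs

    def prob(prev, word):
        if context_counts[prev] == 0:
            return 0.0
        return pair_counts[(prev, word)] / context_counts[prev]

    return prob


# Toy corpus: "great" is frequent, "good" is rare, "excellent" never occurs.
corpus = [
    ["screen", "is", "great"],
    ["screen", "is", "great"],
    ["screen", "is", "good"],
    ["display", "is", "great"],
]
prob = bigram_model(corpus)
print(prob("is", "great"), prob("is", "good"), prob("is", "excellent"))
```

Although "great", "good", and "excellent" are near-synonyms, the estimates differ sharply and the unseen word gets probability zero, because the counts carry no notion of word similarity.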
To solve this problem, the present invention builds the prediction probability model with the neural network language model NNLM:
NNLM in its original form is a three-layer neural network with an input layer, a hidden layer, and an output layer. The input layer consists of the n-1 word vectors of dimension m that are used for prediction. The hidden layer yields the word-related vectors that are needed: they are parameters learned ahead of the output layer, and this by-product is the required word vector.
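For reference, the three-layer network described above is commonly written as follows. This is the standard feed-forward NNLM formulation, which the source is assumed to follow; the symbols C, H, U, W, b, and d do not appear in the source and are introduced here only for illustration.

```latex
% Input: concatenation of the n-1 context word vectors, each of dimension m
x = \bigl(C(w_{t-n+1}), \dots, C(w_{t-1})\bigr) \in \mathbb{R}^{(n-1)m}

% Hidden layer (tanh) and output scores, one per vocabulary word
y = b + W x + U \tanh(d + H x)

% Softmax over the vocabulary gives the prediction probability
P(w_t = i \mid w_{t-n+1}, \dots, w_{t-1}) = \frac{\exp(y_i)}{\sum_{j} \exp(y_j)}
```

Here C is the lookup table that maps each word to its m-dimensional vector; it is trained jointly with the other parameters, which is why usable word vectors fall out of training the language model.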
S42. Building the semantic network:
Segmentation leaves the overall sentence structure disordered, which makes detailed analysis of the product impractical. Some method is therefore needed to reintegrate the disordered phrases and simplify the analysis. The present invention uses a semantic network for this purpose, which makes data analysis convenient, particularly when judging the strengths and weaknesses of a product and the shortcomings of an e-commerce platform.
Before the semantic network is built, the words must be divided by some means into two result sets, positive reviews and negative reviews. Positive and negative reviews focus on different points and reflect different information, so a positive-review semantic network and a negative-review semantic network are built separately.
S43. LDA topic model analysis:
LDA in effect clusters sentences, i.e. strings, grouping different sentences into several topics. The traditional way to judge the similarity of two documents is to look at the number of words they share, as in TF-IDF. This approach ignores the semantic associations behind the words: two documents may share few words or none and still be similar.
For example, consider the following two sentences:
"Jobs has left us."
"Will Apple's prices drop?"
The two sentences have no words in common, yet they are related. Judged by the traditional method they are certainly dissimilar, so the semantics of the text must be taken into account when judging text relatedness. The tool for mining semantics is the topic model, and LDA is one of the more effective ones.
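The failure of shared-word similarity on the two example sentences can be made concrete with a small sketch. Jaccard overlap is used here as a simple stand-in for any measure based on common words; the function name `jaccard` and the English tokenization of the two sentences are assumptions.

```python
def jaccard(tokens_a, tokens_b):
    """Share of distinct words that the two token lists have in common."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical tokenization of the two example sentences.
sentence_1 = ["Jobs", "has", "left", "us"]
sentence_2 = ["will", "Apple", "prices", "drop"]
print(jaccard(sentence_1, sentence_2))
```

The score is zero because no token is shared, even though a reader connects both sentences to the same company, which is the gap a topic model is meant to close.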
In a topic model, a topic represents a concept or an aspect. It appears as a series of related words, or more precisely as the conditional probabilities of those words. Figuratively, a topic is a bucket filled with words that have a high probability of occurring, and these words are strongly related to the topic.
How are topics generated? How should the topics of an article be analyzed? These are the questions a topic model has to answer.
First, documents and topics can be viewed from the perspective of a generative model. A generative model assumes that each word of an article is obtained by a process of "choosing a topic with a certain probability, and then choosing a word from that topic with a certain probability". If a document is to be generated, the probability that each word appears in it is:
p(word | document) = Σ_topic p(word | topic) × p(topic | document)
This probability formula can be represented with matrices, as shown in Fig. 2:
In Fig. 2, the "document-word" matrix gives the frequency of each word in each document, i.e. its probability of occurrence; the "topic-word" matrix gives the probability of each word in each topic; and the "document-topic" matrix gives the probability of each topic in each document.
Given a collection of documents, segmenting them and counting the frequency of each word in each document yields the "document-word" matrix on the left. The topic model is trained on this left-hand matrix and learns the two matrices on the right.
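The matrix relationship above can be checked numerically. The sketch below only multiplies a made-up "document-topic" matrix by a made-up "topic-word" matrix to obtain the "document-word" probabilities; it does not train LDA, which works in the opposite direction. All numbers and the helper name `matmul` are assumptions.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Rows: 2 documents, columns: 2 topics. Each row sums to 1.
doc_topic = [[0.9, 0.1],
             [0.2, 0.8]]

# Rows: 2 topics, columns: 3 vocabulary words. Each row sums to 1.
topic_word = [[0.5, 0.4, 0.1],
              [0.1, 0.2, 0.7]]

# p(word | document) = sum over topics of p(word | topic) * p(topic | document)
doc_word = matmul(doc_topic, topic_word)
for row in doc_word:
    print([round(p, 2) for p in row])
```

Each row of the product is again a probability distribution over the vocabulary, which is what lets the observed word frequencies on the left constrain the two unknown matrices on the right.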
In general, every review has a topic. If a latent topic occurs in many reviews at once, it is a hot spot of common concern, and the high-frequency feature words of that latent topic are more likely to be the words people focus on. These keywords often carry the key information and can give the e-commerce platform or the product ideas for improvement and insight into where its competitive advantage lies.
S5. Experimental simulation:
The sentiment analysis method has been described above. Next, a simulation experiment is run with the method to obtain the desired sentiment topic analysis and topic keywords. The present invention analyzes the sentiment of reviews of the iPhone X on the JD.com (Jingdong Mall) platform, using the method to analyze the strengths and weaknesses of the product comprehensively and to assess improvement suggestions and competitive advantages for the e-commerce platform and the product. Fig. 3 compares the positive and negative reviews of the iPhone X, and Table 1 lists the latent topics of the iPhone X reviews.
Table 1
From the high-frequency words in the topics, the strengths of the iPhone X lie in its appearance, frame, and quality, while user complaints concern logistics and delivery, discomfort in use, thickness, and black-screen problems.
The present invention takes into account synonyms and hypernym-hyponym relations among the words in the text: the word frequencies of synonyms and of hypernyms and hyponyms are increased according to their similarity, which reduces the effect on classification of many words sharing one meaning. Unlike conventional methods, which extract features from a feature matrix with a single method, the present invention represents words as word vectors through NNLM so that each word carries the contextual information of its sentence, that is, the word is understood within the sentence. The sentence topics are then clustered with the sentiment topic model, which generates topics without supervision and finds the latent topics that users care about in the review data, providing a reliable basis for improving the product and the e-commerce platform.
The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the invention may be modified and varied in many ways. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (9)

  1. A sentiment analysis method for product comment data based on machine learning, characterized by comprising:
    Step 1, collecting and extracting product comment data;
    Step 2, data preprocessing, the preprocessing comprising text deduplication, mechanical compression to remove repeated words, and deletion of short sentences;
    Step 3, performing word segmentation on the preprocessed data based on the jieba segmentation method;
    Step 4, constructing a sentiment analysis model:
    Step 41, training and generating word vectors based on the neural network language model NNLM;
    Step 42, constructing a semantic network;
    Step 43, performing semantic mining based on the LDA topic model and generating topics without supervision.
  2. The sentiment analysis method for product comment data based on machine learning according to claim 1, characterized in that, in step 1, the product comment data are collected using the Octopus (Bazhuayu) collector.
  3. The sentiment analysis method for product comment data based on machine learning according to claim 1, characterized in that, in step 1, the extracted comment data are stored as a table, and the data are saved in UTF-8 format.
  4. The sentiment analysis method for product comment data based on machine learning according to claim 1, characterized in that, in step 2, the text deduplication uses edit-distance deduplication, and the threshold of the edit-distance deduplication is 2.
  5. The sentiment analysis method for product comment data based on machine learning according to claim 1, characterized in that, in step 2, the mechanical compression to remove repeated words uses a two-stack method.
  6. The sentiment analysis method for product comment data based on machine learning according to claim 1, characterized in that, in step 2, the short-sentence deletion deletes short sentences whose string length is less than or equal to 3.
  7. The sentiment analysis method for product comment data based on machine learning according to claim 1, characterized in that, in step 3, the jieba segmentation method is the jieba segmentation method for Python, which supports three segmentation modes: precise mode, full mode and search engine mode.
  8. The sentiment analysis method for product comment data based on machine learning according to claim 7, characterized in that, in step 3, the Python jieba segmentation method performs word segmentation on the preprocessed data using the precise mode.
  9. The sentiment analysis method for product comment data based on machine learning according to claim 1, characterized by further comprising, between step 41 and step 42:
    Step 44, dividing the words into two result sets, positive comments and negative comments.
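
The deduplication and short-sentence thresholds recited in claims 4 and 6 can be illustrated with a small sketch. It is not the patent's implementation: it assumes that a comment counts as a duplicate when its edit distance to an already kept comment is less than or equal to 2, and the names `edit_distance` and `preprocess` are hypothetical. The two-stack mechanical compression of claim 5 is not reproduced here.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def preprocess(comments, dedup_threshold=2, short_length=3):
    """Drop short comments and near-duplicate comments.

    dedup_threshold -- assumed meaning: distance <= threshold is a duplicate
    short_length    -- comments with length <= short_length are deleted
    """
    kept = []
    for text in comments:
        text = text.strip()
        if len(text) <= short_length:
            continue  # short-sentence deletion
        if any(edit_distance(text, k) <= dedup_threshold for k in kept):
            continue  # near-duplicate of a comment already kept
        kept.append(text)
    return kept


print(preprocess(["great phone", "great phone!", "ok", "delivery was slow"]))
```

The pairwise comparison is quadratic in the number of comments, so a large corpus would need blocking or hashing before this step.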

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711376954.2A CN108062304A (en) 2017-12-19 2017-12-19 A kind of sentiment analysis method of the comment on commodity data based on machine learning

Publications (1)

Publication Number Publication Date
CN108062304A true CN108062304A (en) 2018-05-22

Family

ID=62139617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711376954.2A Pending CN108062304A (en) 2017-12-19 2017-12-19 A kind of sentiment analysis method of the comment on commodity data based on machine learning

Country Status (1)

Country Link
CN (1) CN108062304A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104484437A (en) * 2014-12-24 2015-04-01 福建师范大学 Network brief comment sentiment mining method
CN104778209A (en) * 2015-03-13 2015-07-15 国家计算机网络与信息安全管理中心 Opinion mining method for ten-million-scale news comments
CN106339368A (en) * 2016-08-24 2017-01-18 乐视控股(北京)有限公司 Text emotional tendency acquiring method and device
CN107436942A (en) * 2017-07-28 2017-12-05 广州市香港科大霍英东研究院 Word embedding grammar, system, terminal device and storage medium based on social media

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
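WEIXIN_30553065: "Text preprocessing: compression-based word removal", 《HTTPS://BLOG.CSDN.NET/WEIXIN_30553065/ARTICLE/DETAILS/98111746》 *
冯淑慧: "Research on online reviews by mobile phone customers based on data mining", China Master's Theses Full-text Database, Information Science and Technology *
朱振涛: "Research on an information adoption model for the helpfulness of online reviews of smart wearable devices", Journal of Nanjing Institute of Technology (Social Science Edition) *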

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101478B (en) * 2018-06-04 2022-04-08 东南大学 Aspect-level emotion analysis method for E-commerce comment text
CN109101478A (en) * 2018-06-04 2018-12-28 东南大学 A kind of Aspect grade sentiment analysis method towards electric business comment text
CN109271623A (en) * 2018-08-16 2019-01-25 龙马智芯(珠海横琴)科技有限公司 Text emotion denoising method and system
CN109241387A (en) * 2018-08-28 2019-01-18 四川长虹电器股份有限公司 Grab the crawler analysis method of social media information
CN109635279A (en) * 2018-11-22 2019-04-16 桂林电子科技大学 A kind of Chinese name entity recognition method neural network based
CN109635279B (en) * 2018-11-22 2022-07-26 桂林电子科技大学 Chinese named entity recognition method based on neural network
CN110008807A (en) * 2018-12-20 2019-07-12 阿里巴巴集团控股有限公司 A kind of training method, device and the equipment of treaty content identification model
CN110008807B (en) * 2018-12-20 2023-08-18 创新先进技术有限公司 Training method, device and equipment for contract content recognition model
CN109977414A (en) * 2019-04-01 2019-07-05 中科天玑数据科技股份有限公司 A kind of internet financial platform user comment subject analysis system and method
CN109977414B (en) * 2019-04-01 2023-03-14 中科天玑数据科技股份有限公司 Internet financial platform user comment theme analysis system and method
CN112052306A (en) * 2019-06-06 2020-12-08 北京京东振世信息技术有限公司 Method and device for identifying data
CN112052306B (en) * 2019-06-06 2023-11-03 北京京东振世信息技术有限公司 Method and device for identifying data
CN110457472A (en) * 2019-07-16 2019-11-15 天津大学 The emotion association analysis method for electric business product review based on SOM clustering algorithm
CN110688832A (en) * 2019-10-10 2020-01-14 河北省讯飞人工智能研究院 Comment generation method, device, equipment and storage medium
CN110688832B (en) * 2019-10-10 2023-06-09 河北省讯飞人工智能研究院 Comment generation method, comment generation device, comment generation equipment and storage medium
CN111488432A (en) * 2020-04-14 2020-08-04 广东科徕尼智能科技有限公司 Sentiment analysis method, equipment and storage medium based on user comments
CN111815358A (en) * 2020-07-09 2020-10-23 湖南数客星球信息技术有限公司 Big data user mining method and system based on cross-border e-commerce platform
CN112148947A (en) * 2020-09-28 2020-12-29 微梦创科网络科技(中国)有限公司 Method and system for mining and reviewing users in batches
CN112148947B (en) * 2020-09-28 2024-03-22 微梦创科网络科技(中国)有限公司 Method and system for excavating and brushing users in batches
CN112380342A (en) * 2020-11-10 2021-02-19 福建亿榕信息技术有限公司 Electric power document theme extraction method and device
CN113627969A (en) * 2021-06-21 2021-11-09 杭州盟码科技有限公司 Product problem analysis method and system based on E-commerce platform user comments
CN114153952A (en) * 2021-12-22 2022-03-08 南京智浩软件科技有限公司 Interviewer management system and scoring quality monitoring and analyzing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180522
