CN108062304A - Sentiment analysis method for product review data based on machine learning - Google Patents
Sentiment analysis method for product review data based on machine learning
- Publication number
- CN108062304A (application number CN201711376954.2A)
- Authority
- CN
- China
- Prior art keywords
- comment
- sentiment analysis
- word
- machine learning
- commodity data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
- G06Q30/0203—Market surveys; Market polls
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Finance (AREA)
- Accounting & Taxation (AREA)
- Strategic Management (AREA)
- Development Economics (AREA)
- Theoretical Computer Science (AREA)
- Entrepreneurship & Innovation (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Game Theory and Decision Science (AREA)
- Economics (AREA)
- Marketing (AREA)
- General Business, Economics & Management (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a machine-learning-based sentiment analysis method for product review data, comprising: collecting and extracting product review data; preprocessing the data, where preprocessing includes text deduplication, mechanical compression to remove repeated words, and short-sentence deletion; segmenting the preprocessed text into words with the jieba segmentation method; and building a sentiment analysis model, in which word vectors are trained with the neural network language model NNLM, a semantic network is built, and an LDA topic model is used for semantic mining and unsupervised topic generation. The invention provides an unsupervised sentiment analysis method, and the results show that it can effectively analyze the sentiment expressed in user reviews.
Description
Technical field
The present invention relates to the technical field of sentiment analysis of review data, and in particular to a machine-learning-based sentiment analysis method for product review data.
Background
As e-commerce has developed, the volume of user data has grown day and night. Although storing and maintaining this data brings enterprises considerable cost and technical difficulty, the commercial value implicit in it is inestimable. Of all e-commerce data, product reviews most directly reflect users' views of products and of the e-commerce platform. Review data not only reflects users' opinions of a product but also allows their sentiment to be extracted. This gives other users and platform providers a commercial reference, and provides a basis for product recommendation, product improvement, and comparison among similar products.
Sentiment analysis of review data currently proceeds in four main stages. The first stage is the collection and extraction of product review data: a crawler tool suited to the target e-commerce site collects users' product reviews and stores them in a predefined format. The second stage is data exploration and preprocessing: the collected data undergoes text deduplication, mechanical compression, and short-sentence deletion, which filters out much junk information and turns the data into a usable data set for subsequent work. The third stage is word segmentation of the review text. Chinese text is currently segmented in four main ways:
String matching algorithms, which segment by matching the text against the words in a dictionary. This approach is fast and very simple to implement, but it handles ambiguous words and words missing from the dictionary poorly; for example, the same string can be segmented as "Changchun City/Changchun/pharmacy" or as "Changchun/mayor/aphrodisiac/shop";
Understanding-based algorithms, which segment by simulating how a person understands a sentence. This approach is relatively complex and requires a large amount of linguistic knowledge as support;
Machine-learning-based algorithms, which train a model on text that has already been segmented. The drawback is that a large amount of manually annotated data is needed to train the statistical model, which is slow and labor-intensive;
Statistics-based methods, which assume that the more often adjacent characters occur together, the more likely they are to form a word, and segment on that basis. No dictionary or clustering training is needed; only the frequencies of character groups in the corpus have to be counted.
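The string matching approach in the first item can be sketched as forward maximum matching, where the longest dictionary word is taken at each position. This is a minimal illustration only: the function name, the toy dictionary, and the Chinese test string (an assumed reconstruction of the glossed example above) are not taken from the source.

```python
def forward_max_match(text, dictionary, max_len=3):
    """Greedy dictionary segmentation: always take the longest match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy dictionary (assumption, not from the source).
toy_dict = {"长春", "长春市", "市长", "春药", "药店"}
segmented = "/".join(forward_max_match("长春市长春药店", toy_dict))
# segmented == "长春市/长春/药店"
```

Greedy matching commits to one reading and never considers the alternative segmentation, which is why ambiguous strings and out-of-dictionary words are handled poorly.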
Sound segmentation strongly affects the subsequent modeling, and the boundary between Chinese words and phrases is rather fuzzy, so the segmentation stage is often the crux of text sentiment analysis and topic extraction. Choosing a segmentation method that suits the characteristics of the data set is therefore particularly important. The fourth stage is building the sentiment model. The task is converted into a machine learning problem, and a sentiment orientation model is trained on the data. Then, to understand in depth why users are satisfied or dissatisfied, latent Dirichlet allocation (LDA) topics are built on the data after semantic analysis to find the latent topics of positive and negative reviews, so that the product can be improved in the corresponding aspects or the e-commerce platform can be refined.
Today, sentiment analysis of Chinese short texts is mostly built on Chinese word segmentation. In use, however, Chinese contains rhetorical devices such as rhetorical questions and double negatives, for example "it is not that it cannot" or "why do so many people think ...", as well as semantically complex sentences whose first half is negative and whose second half is positive, such as "The quality is poor and the appearance is plain, but overall it is still very economical." For such mixed sentences, methods that rely on the most sentiment-rich words often yield a neutral or even opposite result, which biases the resulting sentiment model considerably. In Chinese, the meaning of the whole sentence is often more important than the individual words.
Consequently, simple Chinese word segmentation followed by a neural network built on the resulting words can only capture the literal meaning of a short review. The overall meaning of the review is lost, and the result may even contradict the attitude the sentence intends.
Summary of the invention
In view of the above shortcomings, the present invention provides a machine-learning-based sentiment analysis method for product review data.
To achieve the above object, the present invention provides a machine-learning-based sentiment analysis method for product review data, comprising:
Step 1: collecting and extracting product review data;
Step 2: data preprocessing, which includes text deduplication, mechanical compression to remove repeated words, and short-sentence deletion;
Step 3: segmenting the preprocessed data into words with the jieba ("stutter") segmentation method;
Step 4: building a sentiment analysis model:
Step 41: training word vectors with the neural network language model NNLM;
Step 42: building a semantic network;
Step 43: performing semantic mining with an LDA topic model to generate topics without supervision.
As a further improvement of the present invention, in step 1 the product review data is collected with the Octopus collector.
As a further improvement of the present invention, in step 1 the extracted review data is stored as a table and saved in UTF-8 format.
As a further improvement of the present invention, in step 2 the text deduplication uses edit distance deduplication with a threshold of 2.
As a further improvement of the present invention, in step 2 the mechanical compression to remove repeated words uses a two-stack method.
As a further improvement of the present invention, in step 2 the short-sentence deletion deletes short sentences whose string length is less than or equal to 3.
As a further improvement of the present invention, in step 3 the jieba segmentation method is the Python jieba segmentation method, which supports three segmentation modes: accurate mode, full mode, and search engine mode.
As a further improvement of the present invention, in step 3 the Python jieba segmentation method segments the preprocessed data in accurate mode.
As a further improvement of the present invention, the method further includes, between step 41 and step 42:
Step 44: dividing the words into two result sets, positive reviews and negative reviews.
Compared with the prior art, the beneficial effects of the present invention are:
The present invention uses the neural network language model NNLM to capture context and analyzes the sentiment of short-text sentences with word vectors that carry contextual semantics. This makes the model more accurate and lets each segmented word carry more information. A semantic network reintegrates the sentence structure that segmentation leaves disordered, and through it the strengths and weaknesses of a specific product and e-commerce platform can be analyzed intuitively. Words judged to be highly similar are mined semantically with an LDA topic model; topics are generated without supervision, and the latent topics in positive and negative reviews are found, providing a reliable basis for improving products and e-commerce platforms.
Brief description of the drawings
Fig. 1 is a flow chart of the machine-learning-based sentiment analysis method for product review data disclosed in an embodiment of the present invention;
Fig. 2 is the matrix representation of the probability formula disclosed in an embodiment of the present invention;
Fig. 3 is a comparison of the positive and negative reviews of the iPhone X disclosed in an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
In the description of the present invention, it should be noted that the orientations or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" are based on those shown in the drawings. They are used only to facilitate and simplify the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they are therefore not to be construed as limiting the invention. In addition, the terms "first", "second", and "third" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise expressly specified and limited, the terms "mounted", "connected", and "coupled" are to be understood broadly: a connection may be fixed, detachable, or integral; mechanical or electrical; direct or indirect through an intermediary; or internal between two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances.
The present invention is described in further detail below with reference to the accompanying drawings:
As shown in Fig. 1, the present invention provides a machine-learning-based sentiment analysis method for product review data, comprising:
S1, collection and extraction of product review data:
Before sentiment analysis can be performed on the reviews of a product, the review data must first be collected from the e-commerce platform. Collection should be simple and easy to carry out and should not take up too much of the time of the analysis. After comparing several common crawler tools, the present invention uses the Octopus collector to capture the data set. Mainstream e-commerce site data can be captured through its graphical interface alone: entering the URL (uniform resource locator) of the review set to be captured and the fields to capture is enough to crawl the data quickly and conveniently.
After the crawler tool has captured a large amount of data, the relevant data must be extracted. The example used by the present invention is sentiment analysis of reviews of mobile phones on an e-commerce platform, so the data to extract is the review data for a given phone model, which is then analyzed, and the reviews of several phones are compared. The review data is extracted for mobile phones of multiple brands, stored as a table, and saved in UTF-8 format.
S2, data preprocessing:
After the crawler tool has captured the data, the data must undergo basic cleaning and preprocessing to make it more valuable: review entries that have no effect on the result, or that would skew it heavily, are filtered out. Preprocessing is vital to the final result. The preprocessing used by the present invention consists mainly of three parts performed in sequence: text deduplication, mechanical compression to remove repeated words, and short-sentence deletion.
Text deduplication
Many reviews on e-commerce platforms are duplicates. These duplicate reviews come from three main sources:
1) Default reviews set by the platform. Some users post no review for a long time after a purchase, or only give a rating without commenting, and the platform often runs a program that posts automated reviews in such cases.
2) Identical reviews by the same user. The same user may buy several phones or other similar products and, for convenience, use the same or a similar review for several products. Even if these reviews are valuable, only one should be kept, or all should be deleted.
3) Identical reviews by different users. Normally, reviews of the same product by different users should not be completely identical. If the reviews of different people coincide completely, then whatever the cause, only one of them is useful for the result set and needs to be kept.
The mainstream techniques for judging text similarity include Simhash deduplication, edit distance deduplication, and K-Shingling-based deduplication. After weighing the strengths and weaknesses of each, the present invention deduplicates by edit distance with a small threshold. The edit distance between two strings is the minimum number of edit operations needed to turn one into the other. If the threshold is set too large, many reviews are deleted by mistake; if it is set too small, duplicates are missed. Because review data is short text and duplicates are frequent, the threshold used in the present invention is 2: of two reviews whose edit distance is less than 2, one is deleted.
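Edit distance deduplication with a threshold of 2 can be sketched as follows. This is a minimal illustration assuming the Levenshtein distance (insertions, deletions, and substitutions each cost 1) and a greedy pass that keeps the first review of each near-duplicate group; the function names and the sample reviews are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]


def deduplicate(reviews, threshold=2):
    """Keep a review only if it is at least `threshold` edits from every kept one."""
    kept = []
    for review in reviews:
        if all(edit_distance(review, k) >= threshold for k in kept):
            kept.append(review)
    return kept


# Hypothetical sample reviews.
sample = ["great phone", "great phone!", "battery drains fast"]
unique = deduplicate(sample)
# unique == ["great phone", "battery drains fast"]
```

The pairwise comparison is quadratic in the number of reviews, which is acceptable for a sketch but would need blocking or hashing on a large review set.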
Mechanical compression to remove repeated words
Dirty data in e-commerce product reviews takes many forms, and another common kind is the mechanically repeated review. Such reviews contain consecutive repetitions, mostly meaningless ones typed by consumers after a purchase to reach a required review length. The users who write them mostly have no real interest in reviewing and may simply be commenting in the least troublesome way.
Mechanical compression compresses these redundant consecutive repetitions. The present invention uses a method based on two stacks. The first character is pushed onto the first stack. Each following character is compared with the bottom element of the first stack: if it differs, it is pushed onto the first stack; if it is the same, it is pushed onto the second stack, and further characters are read into the second stack until it is as long as the first. The contents of the two stacks are then compared, and if they are identical the second stack is emptied. This test has a serious problem with expressions such as "study hard", in which a single character is legitimately doubled, so the comparison should be triggered only when the stack length is at least 2. Reviews such as "really very handy" can still occur, so when a character matches the bottom element of the first stack while the second stack already holds elements, whether a repetition has occurred must be checked as well. Once these cases are handled, mechanical compression is complete.
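The effect of mechanical compression can be sketched as below. This is a simplified variant, not the exact two-stack procedure described above: a buffer of kept characters plays the role of the first stack, a candidate slice plays the role of the second, and a repeat is dropped only when the repeated unit has at least `min_unit` characters. The function name and `min_unit` are hypothetical.

```python
def compress_repeats(text, min_unit=2):
    """Drop immediately repeated units of at least `min_unit` characters."""
    kept = []  # characters kept so far (role of the first stack)
    i = 0
    while i < len(text):
        skipped = False
        # The repeated unit can be no longer than what is kept or what remains.
        max_unit = min(len(kept), len(text) - i)
        for n in range(max_unit, min_unit - 1, -1):
            candidate = list(text[i:i + n])  # role of the second stack
            if kept[-n:] == candidate:
                i += n  # identical to the tail already kept: discard it
                skipped = True
                break
        if not skipped:
            kept.append(text[i])
            i += 1
    return "".join(kept)


compressed = compress_repeats("很好很好很好")   # "很好"
untouched = compress_repeats("好好学习")        # "好好学习"
```

Requiring a unit of at least two characters is what leaves a legitimately doubled single character alone.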
Short-sentence deletion
Text with very few characters rarely contains information that helps the result set, so reviews that are too short must be deleted. Moreover, some sentences are only 1 or 2 characters long after the mechanical compression above. The present invention therefore filters out every short sentence whose string length is less than or equal to 3.
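The length filter is a one-line operation. The sketch below assumes that length is counted in characters, as Python's `len` does for strings; the function name and sample reviews are hypothetical.

```python
def drop_short_reviews(reviews, max_short_len=3):
    """Remove reviews whose character count is <= max_short_len."""
    return [review for review in reviews if len(review) > max_short_len]


# Hypothetical sample reviews.
remaining = drop_short_reviews(["ok", "好", "good", "很好用的手机"])
# remaining == ["good", "很好用的手机"]
```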
S3, text segmentation:
Word segmentation is a processing step specific to Chinese natural language processing: it identifies the sentences and words in Chinese text. Chinese words, however, have no explicit boundaries, and new Chinese words and large amounts of internet slang appear all the time, so good Chinese word segmentation has an important influence on subsequent modeling.
To address the defects of existing segmentation methods, the present invention segments the review data with the Python jieba ("stutter") segmentation method, which supports three segmentation modes:
1) Accurate mode, which tries to cut the sentence as precisely as possible and is suitable for text analysis;
2) Full mode, which scans out every word that can be formed in the sentence; it is very fast but cannot resolve ambiguity;
3) Search engine mode, which builds on accurate mode and cuts long words again to improve recall; it is suitable for search engine segmentation.
For example, segmenting the review "Just got the 8P. Much stronger than Android; you really have to admire the Apple system. Reviewing only after using it for more than a week." gives the following:
Accurate mode:
jieba.cut("Just got the 8P. Much stronger than Android; you really have to admire the Apple system. Reviewing only after using it for more than a week.", cut_all=False)
The segmentation result is:
"just/start with//8P/.///stronger than Android/more/be exactly/good/have to/admire/apple/system/./use//more than one/week//come/evaluation/"
Full mode:
jieba.cut("Just got the 8P. Much stronger than Android; you really have to admire the Apple system. Reviewing only after using it for more than a week.", cut_all=True)
The segmentation result is:
"just/start with//8P//than/peace/Zhuo/strong/more//exactly/good/must not/have to/admire/apple/be system///evaluated with//one/more than one/more week/week//come//"
The cut() method takes the string to be segmented as its first parameter, and the cut_all parameter controls whether full mode is used. After comparing the three modes, the present invention adopts accurate mode, which segments short text better. jieba also supports segmenting traditional Chinese and supports custom dictionaries, so the present invention can specify a dictionary containing words that the jieba dictionary lacks. Although jieba can recognize new words, adding them manually ensures higher accuracy.
S4, building the sentiment analysis model:
S41, training word vectors:
Chinese has many synonyms and words of similar meaning. To handle them, word vectors in the form of a distributed representation are needed. Different training methods or training data yield different word vectors; in the end, words with similar meanings have vectors that lie close together, and words with little relation lie far apart. In the present invention the text data set is trained with Google's open-source project word2vec, which uses a neural network to find a representation of each word in a continuous vector space. In other words, a word is understood in the context of its sentence, so the words in a sentence are no longer isolated.
An N-gram language model predicts the next word, that is, the n-th word, from the preceding n-1 words. However, the N-gram model depends too heavily on the corpus and cannot model the similarity between words. Suppose two words are similar to each other: if one of them often follows a certain passage, the other is probably also likely to follow it. But if the training corpus contains many such combinations for the first word and few for the second, the first word is assigned a much larger probability.
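The corpus dependence of N-gram models can be seen in a small bigram (n = 2) estimate. This is a minimal sketch using a maximum-likelihood estimate without smoothing; the toy corpus and function name are hypothetical.

```python
from collections import Counter


def bigram_probability(corpus, prev_word, word):
    """Maximum-likelihood estimate of P(word | prev_word) from token lists."""
    pair_counts = Counter()
    prev_counts = Counter()
    for sentence in corpus:
        for first, second in zip(sentence, sentence[1:]):
            pair_counts[(first, second)] += 1
            prev_counts[first] += 1
    if prev_counts[prev_word] == 0:
        return 0.0
    return pair_counts[(prev_word, word)] / prev_counts[prev_word]


# Hypothetical toy corpus of already segmented sentences.
corpus = [
    ["the", "phone", "is", "good"],
    ["the", "phone", "is", "good"],
    ["the", "phone", "is", "great"],
]
p_good = bigram_probability(corpus, "is", "good")    # 2 of 3 continuations
p_great = bigram_probability(corpus, "is", "great")  # 1 of 3 continuations
p_fine = bigram_probability(corpus, "is", "fine")    # never seen, so 0.0
```

"good", "great", and "fine" are near-synonyms, yet their estimates differ purely because of how often each happens to appear in the corpus.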
To solve this problem, the present invention builds the prediction probability model with the neural network language model NNLM:
NNLM is at its simplest a three-layer neural network with an input layer, a hidden layer, and an output layer. The input layer consists of the n-1 m-dimensional word vectors used for prediction. The hidden layer holds the word vectors to be obtained: they are parameters ahead of the output layer, and this by-product of training is the required word vector.
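One forward pass through such a three-layer network can be sketched as below. This is an illustration under simplifying assumptions: bias terms and training are omitted, the weights are random rather than learned, and all sizes and names are hypothetical.

```python
import math
import random


def nnlm_forward(context_ids, embeddings, w_hidden, w_out):
    """One forward pass: look up n-1 word vectors, apply tanh, then softmax."""
    # Input layer: concatenate the m-dimensional vectors of the context words.
    x = [value for idx in context_ids for value in embeddings[idx]]
    # Hidden layer: tanh of a linear map of the concatenated input.
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in w_hidden]
    # Output layer: one score per vocabulary word, normalized by softmax.
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
vocab_size, m, n_minus_1, hidden_size = 5, 4, 2, 3  # hypothetical sizes

embeddings = rand_matrix(vocab_size, m)  # one m-dimensional vector per word
w_hidden = rand_matrix(hidden_size, n_minus_1 * m)
w_out = rand_matrix(vocab_size, hidden_size)

# Probability of each vocabulary word following the context words 1 and 3.
probs = nnlm_forward([1, 3], embeddings, w_hidden, w_out)
```

During training the rows of `embeddings` would be updated together with the other weights, which is how the word vectors arise as a by-product of learning to predict the next word.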
S42, building the semantic network:
Segmentation leaves the overall structure of a sentence disordered, which makes complex analysis of the product impractical. Some method is therefore needed to reintegrate the disordered words and make the analysis simple. The present invention uses a semantic network to make data analysis convenient, which is especially helpful for judging the strengths and weaknesses of a product and the shortcomings of the e-commerce platform.
Before the semantic network is built, the words must be divided into two result sets, positive reviews and negative reviews. Positive and negative reviews focus on different points and reflect different information, so a positive-review semantic network and a negative-review semantic network are built separately.
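The source does not say how the semantic network is constructed. One common choice, assumed here purely for illustration, is a co-occurrence network in which two words are linked whenever they appear in the same review, built separately for the positive and negative sets. The reviews, labels, and function name below are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(segmented_reviews):
    """Count how often two distinct words appear in the same review."""
    edges = Counter()
    for words in segmented_reviews:
        # Sorting makes each undirected edge use one canonical key.
        for a, b in combinations(sorted(set(words)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical reviews, already segmented and labeled.
labeled = [
    (["screen", "clear", "fast"], "positive"),
    (["screen", "clear", "beautiful"], "positive"),
    (["delivery", "slow"], "negative"),
    (["delivery", "slow", "screen", "black"], "negative"),
]
positive_net = build_cooccurrence_network(
    words for words, label in labeled if label == "positive")
negative_net = build_cooccurrence_network(
    words for words, label in labeled if label == "negative")

strongest_positive = positive_net.most_common(1)[0]  # (("clear", "screen"), 2)
```

Keeping the two networks apart lets the heaviest edges in each one point to what users praise and what they complain about.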
S43, LDA topic model analysis:
LDA in effect clusters sentences, that is, strings, grouping different sentences into several topics. The traditional way to judge the similarity of two documents is to count the words they share, as in TF-IDF. This ignores the semantic associations behind the words: two documents may share few or no words and still be similar.
For example, take the following two sentences:
"Jobs has left us."
"Will the price of Apple products drop?"
The two sentences share no words, yet they are related. Judged by the traditional method they are certainly dissimilar, so the semantics of the text must be taken into account when judging text relatedness. Topic models are the tool of choice for semantic mining, and LDA is one of the more effective ones.
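The failure of word-overlap measures on this pair can be checked directly. The sketch below uses Jaccard overlap as a stand-in for shared-word similarity; the tokenization of the two example sentences is an assumption.

```python
def jaccard_overlap(tokens_a, tokens_b):
    """Share of distinct words that two token lists have in common."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Assumed tokenization of the two example sentences.
first = ["Jobs", "has", "left", "us"]
second = ["will", "Apple", "price", "drop"]
overlap = jaccard_overlap(first, second)  # 0.0: no shared words
```

A topic model can still relate the two sentences if "Jobs" and "Apple" both have high probability under the same topic.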
In a topic model, a topic represents a concept or an aspect. It appears as a series of related words and, formally, is the conditional probability of those words. Figuratively, a topic is a bucket filled with words that have a high probability of occurring, and these words are strongly related to the topic.
How are topics generated? How should the topics of an article be analyzed? These are the questions a topic model has to answer.
First, documents and topics can be viewed through a generative model. A generative model assumes that every word of an article is obtained by a process of "choosing a topic with a certain probability, and then choosing a word from that topic with a certain probability". If a document is generated this way, the probability of each word appearing in it is:
$$p(\text{word} \mid \text{document}) = \sum_{\text{topic}} p(\text{word} \mid \text{topic}) \, p(\text{topic} \mid \text{document})$$
This probability formula can be represented with matrices, as shown in Fig. 2:
Here the "document-word" matrix gives the frequency of each word in each document, that is, its probability of occurring; the "topic-word" matrix gives the probability of each word occurring in each topic; and the "document-topic" matrix gives the probability of each topic occurring in each document.
Given a collection of documents, segmenting them and counting the frequency of each word in each document yields the "document-word" matrix on the left. A topic model trains on this left-hand matrix to learn the two matrices on the right.
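The relation between the three matrices can be checked on a tiny example. The sketch below only shows the forward direction, multiplying a "document-topic" matrix by a "topic-word" matrix to obtain the "document-word" matrix; learning the two right-hand matrices from data is what LDA inference does and is not shown. All numbers are hypothetical.

```python
def matmul(left, right):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
            for row in left]


# Hypothetical numbers: 2 documents, 2 topics, 3 words.
doc_topic = [[0.9, 0.1],
             [0.2, 0.8]]
topic_word = [[0.5, 0.4, 0.1],
              [0.1, 0.2, 0.7]]

# Entry [d][w] is the sum over topics of p(topic | d) * p(w | topic).
doc_word = matmul(doc_topic, topic_word)
```

Each row of `doc_word` is again a probability distribution over words, since both factors have rows that sum to 1.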
In general, every review has a topic. A latent topic that appears in many reviews at once is a focus of common concern, and the high-frequency feature words of a latent topic are likely to be the keywords of hot topics. These keywords often carry the key information and can give the e-commerce platform or the product ideas for improvement and an understanding of where its competitive advantage lies.
S5, experimental simulation:
The sentiment analysis method has been described above. A simulation experiment is now carried out with it to obtain the desired sentiment topic analysis and topic keywords. The experiment performs sentiment analysis on reviews of the iPhone X on the Jingdong (JD.com) store platform, using the method to analyze the product's strengths and weaknesses comprehensively and to assess improvement suggestions and competitive advantages for the e-commerce platform and the product. Fig. 3 compares the positive and negative reviews of the iPhone X, and Table 1 lists the latent topics of the iPhone X reviews.
Table 1
Summarizing the high-frequency words in the topics shows that the advantages of the iPhone X lie in its appearance, frame, and quality, while user complaints center on logistics and delivery, discomfort in use, thickness, and black-screen problems.
The present invention takes the synonyms, hypernyms, and hyponyms of the words in the text into account: the word frequencies of synonyms and of hypernyms and hyponyms are increased according to their similarity, which reduces the effect of many words sharing one meaning on classification. Unlike conventional methods, which extract features from a single feature matrix with a single method, the present invention represents words as NNLM word vectors so that each word carries the contextual information of its sentence and is understood within the sentence. The sentiment topic model then clusters the topics of the sentences and generates topics without supervision. This reveals the latent topics that users care about in the review data and provides a reliable basis for improving products and e-commerce platforms.
The above are only preferred embodiments of the present invention and are not intended to limit it. For a person skilled in the art, the invention may be modified and varied in many ways. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (9)
- 1. A sentiment analysis method for product review data based on machine learning, characterized by comprising: step 1, acquiring and extracting product review data; step 2, data preprocessing, the preprocessing comprising text deduplication, mechanical compression to remove repeated words, and deletion of short sentences; step 3, segmenting the preprocessed data into words using the jieba ("stammer") word segmentation method; step 4, building a sentiment analysis model: step 41, generating word vectors by training a neural network language model (NNLM); step 42, building a semantic network; step 43, performing semantic mining based on an LDA topic model to generate topics without supervision.
- 2. The sentiment analysis method for product review data based on machine learning according to claim 1, characterized in that, in step 1, the product review data are collected using the Octopus collector.
- 3. The sentiment analysis method for product review data based on machine learning according to claim 1, characterized in that, in step 1, the extracted review data are stored as a table and the data are saved in UTF-8 format.
- 4. The sentiment analysis method for product review data based on machine learning according to claim 1, characterized in that, in step 2, the text deduplication uses edit-distance deduplication, and the threshold of the edit-distance deduplication is 2.
- 5. The sentiment analysis method for product review data based on machine learning according to claim 1, characterized in that, in step 2, the mechanical compression to remove repeated words uses a two-stack method.
- 6. The sentiment analysis method for product review data based on machine learning according to claim 1, characterized in that, in step 2, the short-sentence deletion deletes short sentences whose string length is less than or equal to 3.
- 7. The sentiment analysis method for product review data based on machine learning according to claim 1, characterized in that, in step 3, the jieba word segmentation method is the jieba word segmentation method of Python, which supports three segmentation modes: precise mode, full mode, and search engine mode.
- 8. The sentiment analysis method for product review data based on machine learning according to claim 7, characterized in that, in step 3, the jieba word segmentation method of Python segments the preprocessed data using precise mode.
- 9. The sentiment analysis method for product review data based on machine learning according to claim 1, characterized by further comprising, between step 41 and step 42: step 44, dividing the words into two result sets, positive reviews and negative reviews.
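The preprocessing of step 2 (claims 4 to 6) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: it assumes two comments count as duplicates when their edit distance is at most 2 (the claim gives only the threshold value), and it replaces the two-stack mechanical compression of claim 5 with a simpler regular-expression collapse of consecutively repeated substrings. All function names are hypothetical.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance, computed one table row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


def dedupe(comments, threshold=2):
    """Keep a comment only if it is farther than `threshold` edits
    from every comment kept so far (assumed reading of claim 4)."""
    kept = []
    for comment in comments:
        if all(edit_distance(comment, k) > threshold for k in kept):
            kept.append(comment)
    return kept


def compress_repeats(text):
    """Simplified stand-in for mechanical compression: collapse any
    substring that repeats back to back into a single copy.
    Note: this also collapses legitimate doubled characters."""
    return re.sub(r"(.+?)\1+", r"\1", text)


def drop_short(comments, max_len=3):
    """Delete comments whose string length is <= max_len (claim 6)."""
    return [c for c in comments if len(c) > max_len]


def preprocess(comments):
    """Deduplicate, compress repeats, then delete short sentences."""
    return drop_short([compress_repeats(c) for c in dedupe(comments)])


sample = ["很好很好很好很好", "很好很好很好很好!", "手机屏幕清晰,物流很快", "不错"]
cleaned = preprocess(sample)
```

The pairwise deduplication is quadratic in the number of comments, which is acceptable for a sketch but would need blocking or hashing on a large review corpus.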
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711376954.2A CN108062304A (en) | 2017-12-19 | 2017-12-19 | A kind of sentiment analysis method of the comment on commodity data based on machine learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108062304A (en) | 2018-05-22 |
Family
ID=62139617
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711376954.2A (Pending), published as CN108062304A (en) | 2017-12-19 | 2017-12-19 | A kind of sentiment analysis method of the comment on commodity data based on machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108062304A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180522 |