CN108038725A

CN108038725A - An e-commerce product customer satisfaction analysis method based on machine learning

Info

Publication number: CN108038725A
Application number: CN201711303030.XA
Authority: CN
Inventors: 徐新胜; 余建浙
Original assignee: China Jiliang University
Current assignee: China Jiliang University
Priority date: 2017-12-04
Filing date: 2017-12-04
Publication date: 2018-05-15

Abstract

The invention discloses a machine-learning-based method for analyzing customer satisfaction with e-commerce products. The method comprises: obtaining e-commerce product review text and preprocessing it (word segmentation, part-of-speech tagging and so on); selecting a set of Chinese chunk labels and manually annotating the segmentation results; training a model with the LibSVM tool and using it to extract noun chunks as candidate product features, which are then filtered by their TF-IDF values; building a sentiment dictionary and computing a sentiment score for each product feature; training a word-vector language model to obtain vector representations of the product features; and clustering the product features into customer satisfaction categories by word-vector similarity and computing an overall score. The method can be applied in product recommendation systems based on product review text. Through customer satisfaction evaluation, the product features are clustered into five aspects, which reduces the dimensionality and sparsity of the product features and makes the resulting recommendation system faster and more accurate.

Description

An e-commerce product customer satisfaction analysis method based on machine learning

Technical field

The present invention relates to the fields of natural language processing and data mining, and in particular to a product evaluation method based on review text.

Background art

With the rapid development and spread of Internet technology, the volume of online information has grown explosively. In this era of information explosion, the traditional in-store sales model can no longer meet consumer demand, and e-commerce has emerged in response. On the one hand, e-commerce broadens the range of goods consumers can choose from; on the other hand, consumers can publish their opinions and views on e-commerce products. Customer satisfaction, also called the customer satisfaction index, is short for the customer satisfaction survey system used in the service industry. It is an index obtained by comparing the effect a customer perceives from a product with the customer's expectation, and product reviews are one embodiment of customer satisfaction. By mining the review information of a target product, the user's individual feature preferences and the customer satisfaction of the target product can be obtained, which makes it possible to recommend products to users.

Many methods already exist for analyzing product customer satisfaction, such as customer satisfaction surveys, complaint and suggestion systems, mystery shoppers, and studies of lost customers. These methods yield a customer satisfaction index for the target product, but they are time-consuming and laborious, obtain information passively, and yield little information.

None of the above methods is suitable for measuring customer satisfaction with e-commerce products. Instead, the reviews of the target product are analyzed, and the sentiment expressed toward each product feature is mined and used as the customer satisfaction status. However, reviewers differ in education level, cultural background and local customs, and they express themselves differently, so one feature of the target product may be expressed in many ways. This not only risks a curse of dimensionality in the product features but also increases feature sparsity, which hinders analysis of users' sentiment toward product features.

The customer satisfaction clustering method uses the five aspects defined for customer satisfaction: reliability, assurance (professionalism), tangibles, empathy and responsiveness. According to what each aspect expresses, identical or similar features found in the actual reviews are clustered into the five aspects. This effectively addresses the product feature problems described above and also evaluates products more concisely and efficiently. At present, few effective methods have been proposed for evaluating e-commerce products on the basis of customer satisfaction.

Summary of the invention

The technical problem to be solved by the invention is to provide an e-commerce product customer satisfaction analysis method that uses noun chunks as product features and uses machine learning to cluster the product features in user reviews into the five aspects of customer satisfaction. This resolves the curse of dimensionality and the sparsity of product features, evaluates the target product more concisely and efficiently, and makes product recommendation results fast and accurate.

To this end, the machine-learning-based e-commerce product customer satisfaction analysis method proposed by the invention comprises the following steps:

Step S1: Design a crawler to crawl the review text of the target product from the e-commerce platform and persist it to a local database; segment the crawled review text into words and tag parts of speech with a word segmentation tool; count the segmentation results to obtain word frequencies; and filter the segmentation results against a dictionary of stop words and low-frequency words;

Step S2: Select a set of Chinese chunk labels and manually assign a chunk label to each word of the segmentation results according to its part of speech and context;

Step S3: Using the manually labelled Chinese chunks as the training set, train an automatic Chinese chunk extraction model with the LibSVM tool, apply the model to all reviews, automatically extract noun chunks from the labelled results as candidate product feature words, and filter each candidate feature word by TF-IDF against a given threshold;

Step S4: Collect sentiment dictionaries from the web, assign numerical values to the dictionary entries according to sentiment intensity, and compute a sentiment score for the product feature words that occur in each review;

Step S5: Train a word-vector model on the product feature words to obtain the set of vector representations of the feature words;

Step S6: Cluster the product features into customer satisfaction categories on the basis of the similarity of the feature-word vectors and, combined with the sentiment scores of the product features, output an overall evaluation score for the product.

Compared with the prior art, the beneficial effects of the invention are as follows. The invention proposes a machine-learning-based method for evaluating customer satisfaction with e-commerce products. Investigation shows that the RATER index can effectively measure customer service quality. RATER is an acronym of five English words: reliability, assurance (professionalism), tangibles, empathy and responsiveness. Based on this customer satisfaction index, the invention defines five aspects for evaluating e-commerce products. Clustering the product features in user review text into the five aspects of customer satisfaction effectively addresses the product feature problems and evaluates products more concisely and efficiently. To extract product features more accurately, the invention uses Chinese chunk labelling and trains the model with an SVM. To compare the sentiment toward different product features, the invention quantifies sentiment orientation and computes a sentiment score for each product feature. Several of the most salient product feature words are specified for each aspect of customer satisfaction, and a word-vector model is used; such word vectors have strong semantic representation ability, so semantically similar words lie close together in the mapped vector space. For each remaining product feature word, its similarity to the feature words of each aspect is computed, and the word is assigned to the aspect with the highest average similarity. Finally, the sentiment scores of the product features are combined to give the final customer satisfaction score of the product. In evaluating a product, Chinese chunks are used to regroup the product features and cluster them along the dimensions of customer satisfaction, which reduces the dimensionality of the product features; together with the feature sentiment scores, this allows a recommendation system built on the method to recommend more quickly and accurately.

Brief description of the drawings

Fig. 1 is a flow diagram of the machine-learning-based e-commerce product customer satisfaction analysis method in a specific embodiment of the invention.

Detailed description of the embodiments

To make the object, technical solution and advantages of the invention clear, the embodiments of the invention are described clearly and completely below.

Fig. 1 shows the flow chart of the machine-learning-based e-commerce product customer satisfaction analysis method of this embodiment.

The method comprises: step S1, designing a crawler to crawl the review text of the target product from the e-commerce platform and persisting it to a local database, segmenting the crawled review text and tagging parts of speech with a word segmentation tool, counting the segmentation results to obtain word frequencies, and filtering the segmentation results against a dictionary of stop words and low-frequency words; step S2, selecting a set of Chinese chunk labels and manually assigning a chunk label to each word of the segmentation results according to its part of speech and context; step S3, using the manually labelled Chinese chunks as the training set to train an automatic Chinese chunk extraction model with the LibSVM tool, applying the model to all reviews, automatically extracting noun chunks from the labelled results as product feature words to obtain the candidate set of product feature words, and filtering each candidate feature word by TF-IDF against a given threshold; step S4, collecting sentiment dictionaries from the web, assigning numerical values to the dictionary entries according to sentiment intensity, and computing a sentiment score for the product feature words that occur in each review; step S5, training a word-vector model on the product feature words to obtain the set of vector representations of the feature words; step S6, clustering the product features into customer satisfaction categories on the basis of the similarity of the feature-word vectors and, combined with the sentiment scores of the product features, outputting an overall evaluation score for the product.

In a specific embodiment, the method may be carried out as follows (in the description below, the customer satisfaction evaluation of a mobile phone on the Taobao website is used as the example, and a concrete example is given after each step):

Step S1: The review text of the target product is crawled with the Python Scrapy crawler framework and persisted to a MySQL database to obtain the user review corpus. The review text is then preprocessed, mainly by word segmentation, part-of-speech tagging and word frequency counting, and the segmentation results are filtered for stop words and low-frequency words. The substeps are as follows:

1) Word segmentation and part-of-speech tagging: As is well known, in written English words are separated by spaces as natural delimiters. In Chinese, characters, sentences and paragraphs can be delimited simply by obvious delimiters, but words have no formal delimiter. Although English also faces the problem of segmenting phrases, at the word level Chinese is much more complex and difficult than English. Chinese word segmentation (Chinese Word Segmentation) means cutting a sequence of Chinese characters into individual words. Part-of-speech tagging assigns a part of speech to each word of the segmentation result; the words of Modern Chinese fall into two classes and 14 parts of speech. Many Chinese word segmentation and part-of-speech tagging tools are available, for example: ICTCLAS, a Chinese lexical analysis system and one of the earliest open-source Chinese word segmentation projects, which ranked first in the domestic evaluation organized by the 973 expert group and took several first places in the first (2003) international Chinese processing evaluation organized by SIGHAN; LTP-Cloud (Language Technology Platform Cloud), a cloud natural language processing service platform developed by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology, whose back end relies on the Language Technology Platform and which offers rich and efficient natural language processing services including word segmentation, part-of-speech tagging, dependency parsing, named entity recognition and semantic role labelling; and jieba ("stutter") Chinese word segmentation, which aims to be the best Python Chinese word segmentation component. Considering segmentation accuracy, efficiency and ease of use, we chose the jieba Chinese word segmentation tool (tool web site: http://www.oschina.net/p/jieba).

2) Word frequency counting over the segmentation results: A dictionary container is created in which each word of the segmentation result is a key and its value is the number of times the word occurs. The container stores key-value pairs, and keys must be unique. The segmentation results are traversed and stored in the dictionary container to obtain the word frequencies of all segmentation results.

3) Low-frequency words and stop words: Low-frequency words are words that occur rarely in the word frequency statistics; in general, words occurring fewer than 3 times are filtered out. Stop words are words or characters that, in information retrieval, are automatically filtered out before or after processing natural language data (text) in order to save storage space and improve search efficiency, for example "I". These stop words are entered manually rather than generated automatically, and together they form a stop word list.

4) Filtering the segmentation results: The low-frequency words and stop words that occur in the segmentation results are filtered out.

We select the following reviews of a mobile phone on Taobao as examples (the reviews are Chinese; English glosses are given here):

1 "A very good phone, the workmanship and texture are excellent, the looks are off the charts."

2 "JD (Jingdong) logistics is superb; the phone is already in use and works normally; good quality at a low price, worth recommending."

3 "The phone is very good, it runs fast, and the call sound quality is quite good."

The official description of jieba Chinese word segmentation and part-of-speech tagging is as follows: jieba.posseg.POSTokenizer(tokenizer=None) creates a custom tokenizer, where the tokenizer parameter specifies the jieba.Tokenizer used internally; jieba.posseg.dt is the default part-of-speech tagging tokenizer. It tags the part of speech of each word after sentence segmentation, using a tagging scheme compatible with ictclas. It is used as follows:

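import jieba.posseg as pseg

# Sample review 1 (the original input is the Chinese text; its English gloss is shown here)
sentence = 'A very good phone, the workmanship and texture are excellent, the looks are off the charts.'

# str() of each (word, flag) pair yields 'word/flag'
result = [str(a) for a in pseg.cut(sentence)]

print(" ".join(result))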

Applying the above segmentation and part-of-speech tagging to sample text 1 yields output in which the words are separated by spaces and each word is followed by a forward slash and its part of speech. The result (with each Chinese token shown as its English gloss, and "de" standing for the particle 的) is:

"very/d good/a de/uj mobile phone/n ,/x workmanship/v texture/n excellent/d de/uj ,/x looks/n off the charts/v ./x", where v denotes a verb, n a noun, a an adjective, d an adverb, uj an auxiliary word, and x a non-morpheme token.

Word frequencies are then counted over the above segmentation result.

The result of the word frequency count is: {'very': 1, 'good': 1, 'de': 2, 'mobile phone': 1, 'workmanship': 1, 'texture': 1, 'excellent': 1, 'looks': 1, 'off the charts': 1}. Each word and its frequency are stored in the dictionary container as a key-value pair. Given a threshold, words whose frequency falls below it are treated as low-frequency words.
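The counting and filtering substeps can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's own code: the function names, the toy token list and the stop word set are assumptions, and the minimum count of 3 follows the rule of thumb stated in substep 3).

```python
from collections import Counter

def count_words(tokens):
    """Return a word -> frequency mapping (the 'dictionary container')."""
    return Counter(tokens)

def filter_tokens(tokens, stop_words, min_count=3):
    """Drop stop words and words that occur fewer than min_count times."""
    freq = count_words(tokens)
    return [t for t in tokens if t not in stop_words and freq[t] >= min_count]

# Toy token list (hypothetical); "de" stands in for a stop word.
tokens = ["phone", "de", "screen", "phone", "de", "phone", "screen", "screen", "rare"]
freq = count_words(tokens)
kept = filter_tokens(tokens, stop_words={"de"}, min_count=3)
```

A `Counter` is used here because it is a dictionary subclass, so keys are unique and each value is the occurrence count, which matches the container described above.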

Step S2: Chunk analysis is a form of syntactic analysis. It can serve as a subtask of syntactic analysis in a natural language processing system, or as a bridge from lexical analysis to syntactic analysis. Chinese chunk analysis operates on a preprocessed word sequence and produces two kinds of information. Chunk boundaries: word sequences belonging to the same constituent are grouped into one chunk, forming a continuous sequence of chunks. Chunk constituent labels: each Chinese chunk is assigned a constituent label. First the chunk labels must be fixed. Based on the characteristics of the invention and the importance of the Chinese chunk types, eight types are chosen as manual chunk labels: np (noun chunk), vp (verb chunk), ap (adjective chunk), mp (quantifier chunk), sp (space chunk), tp (time chunk), dp (adverb chunk) and pp (preposition chunk). For the constituent labels the invention uses the IOB2 tag set, which has three kinds of tags: B-X marks the first word of a chunk of type X, I-X marks a non-initial word of a chunk of type X, and O marks a word that is outside any chunk. Using the segmentation results from step S1 and the context of each word, each word is manually assigned a chunk label to form the training samples of the model. For sample text 1, after segmentation and removal of low-frequency words and stop words, the manual chunk labelling is as follows:

very/d B-ap

good/a I-ap

mobile phone/n B-np

,/x O

workmanship/v B-np

texture/n I-np

excellent/d B-dp

,/x O

looks/n B-np

off the charts/v B-ap

./x O
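To show how IOB2 labels translate back into chunks, the sketch below collects the noun chunks (np) from a labelled word sequence. It is an illustrative helper, not part of the patent: the function name is hypothetical, the words are English glosses of the sample, and joining glosses with a space is an assumption (Chinese words would normally be concatenated directly).

```python
def extract_chunks(tagged, chunk_type="np"):
    """Collect word spans labelled B-<type> followed by I-<type> tags (IOB2)."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag == f"B-{chunk_type}":
            if current:                 # a new chunk starts right after another
                chunks.append(current)
            current = [word]
        elif tag == f"I-{chunk_type}" and current:
            current.append(word)        # continue the open chunk
        else:
            if current:                 # any other tag closes the open chunk
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]

sample = [
    ("very", "B-ap"), ("good", "I-ap"), ("mobile phone", "B-np"), (",", "O"),
    ("workmanship", "B-np"), ("texture", "I-np"), ("excellent", "B-dp"), (",", "O"),
    ("looks", "B-np"), ("off the charts", "B-ap"), (".", "O"),
]
noun_chunks = extract_chunks(sample)
```

For the labelled sample this yields three candidate feature chunks: "mobile phone", "workmanship texture" and "looks".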

Step S3: A support vector machine is a supervised machine learning algorithm, so the user must supply a set of features as the basis for classification. The words (w), parts of speech (p) and chunk labels (c) that occur at different context positions in the review text are used as combined features to train the support vector machine model. The input x of the classification model can therefore be represented by 12 features:

x = (w_{i-2}, p_{i-2}, c_{i-2}, w_{i-1}, p_{i-1}, c_{i-1}, w_i, p_i, w_{i+1}, p_{i+1}, w_{i+2}, p_{i+2}) (1)

where w_i is the word at the current position and p_i is the part-of-speech tag of the current word; w_{i-n} is the n-th word before the current position, p_{i-n} is its part-of-speech tag and c_{i-n} is its chunk label; w_{i+n} is the n-th word after the current position and p_{i+n} is its part-of-speech tag; n takes the values 1 and 2.

A binary SVM classifier accepts only numerical feature values. To satisfy this restriction, an inverted index table InvTab of the features is built, in which each record is a pair (f, index), where index is the position of feature f in the feature list. For example, (w_{i-2}=color, 2451) means that the feature "w_{i-2}=color" is element 2451 of the feature list.
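The feature window of equation (1) and the index table can be sketched as follows. This is an assumption-laden illustration: the function names, the `__PAD__` placeholder for positions outside the sentence, and the binary {index: 1} encoding are not specified in the source; only the 12-feature layout and the (feature, index) table are.

```python
PAD = "__PAD__"  # hypothetical placeholder for positions outside the sentence

def window_features(words, pos_tags, chunk_tags, i):
    """Return the 12 string features of equation (1) for position i."""
    def at(seq, k):
        return seq[k] if 0 <= k < len(seq) else PAD

    feats = []
    for off in (-2, -1):  # preceding words: word, POS tag and chunk label
        feats.append(f"w_i{off}={at(words, i + off)}")
        feats.append(f"p_i{off}={at(pos_tags, i + off)}")
        feats.append(f"c_i{off}={at(chunk_tags, i + off)}")
    feats.append(f"w_i={words[i]}")
    feats.append(f"p_i={pos_tags[i]}")
    for off in (1, 2):    # following words: word and POS tag only
        feats.append(f"w_i+{off}={at(words, i + off)}")
        feats.append(f"p_i+{off}={at(pos_tags, i + off)}")
    return feats

def build_index(feature_lists):
    """Map each distinct feature string to a 1-based position (the index table)."""
    index = {}
    for feats in feature_lists:
        for f in feats:
            if f not in index:
                index[f] = len(index) + 1
    return index

def to_sparse(feats, index):
    """Encode one sample as {feature index: 1} for features known to the index."""
    return {index[f]: 1 for f in feats if f in index}

words = ["very", "good", "phone"]
pos_tags = ["d", "a", "n"]
chunk_tags = ["B-ap", "I-ap", "B-np"]
feats = window_features(words, pos_tags, chunk_tags, 2)
index = build_index([feats])
sparse = to_sparse(feats, index)
```

Indices start at 1 because the LibSVM sparse input format numbers features from 1; each sample then becomes a line of `index:value` pairs.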

To handle the unbalanced data set, the invention uses the one-against-one method. Commonly used kernel functions include:

Sigmoid kernel function: K(x, x_i) = tanh(a(x · x_i) + t), where a and t are constants and tanh is the sigmoid function;

Polynomial kernel function: K(x, x_i) = [(x · x_i) + 1]^d, where d is a natural number;

Radial basis kernel function

The polynomial form is simple and makes it easy to compare the classification performance of different feature combinations, so a polynomial of degree d is used as the kernel function.
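As a small worked illustration of the polynomial kernel K(x, x_i) = [(x · x_i) + 1]^d, the sketch below evaluates it on two toy binary feature vectors. The function names and the vectors are assumptions made for the example.

```python
def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, d=2):
    """Polynomial kernel [(x . y) + 1]^d for a natural number d."""
    return (dot(x, y) + 1) ** d

x1 = [1, 0, 1]  # toy binary feature vectors (hypothetical)
x2 = [1, 1, 0]
k = polynomial_kernel(x1, x2, d=2)  # (1 + 1) ** 2
```

In LibSVM the polynomial kernel is selected with `-t 1` and has the form (gamma · u'v + coef0)^degree, so the form above would correspond to `-g 1 -r 1 -d d`. Whether the inventors used exactly these options is not stated in the text.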

After the feature combination and kernel function of the SVM have been selected, LibSVM is used to label the Chinese chunks of all reviews; noun chunks are then extracted as candidate product feature words, and the TF-IDF value of each candidate word is computed for filtering. TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical measure of how important a word is to one document in a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency with which it appears across the corpus. It is computed as follows:

TF-IDF_{i,j} = TF_{i,j} × IDF_i (2)

with TF_{i,j} = n_{i,j} / Σ_k n_{k,j} and IDF_i = log(|D| / |{j : t_i ∈ d_j}|), where n_{i,j} is the number of times term t_i occurs in document d_j and Σ_k n_{k,j} is the total number of word occurrences in that document. |D| is the total number of documents in the corpus, and |{j : t_i ∈ d_j}| is the number of documents that contain term t_i (that is, the number of documents with n_{i,j} ≠ 0). If the term does not occur in the corpus, the divisor would be zero, so 1 + |{j : t_i ∈ d_j}| is normally used instead.

Observation of non-product features and their TF-IDF values shows that most non-product features have TF-IDF values above 0.0045. The invention therefore uses 0.0045 as the filtering threshold; filtering the candidate product feature set with this threshold yields the final product feature set.

The invention uses 5000 product reviews as the data source. Of these, 1000 are labelled manually and together serve as the training set and test set; the remaining 4000 serve as the validation set. The specific steps are as follows:
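The TF-IDF computation of equation (2) can be sketched as below, using the smoothed denominator 1 + |{j : t_i ∈ d_j}| mentioned in the text. The function name and the toy documents are assumptions; the natural logarithm is also an assumption, since the source does not state the base. The sketch only computes the values and leaves the threshold comparison to the caller.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: tf-idf} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))          # count each term once per document
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())       # all word occurrences in the document
        scores.append({
            term: (n / total) * math.log(n_docs / (1 + doc_freq[term]))
            for term, n in counts.items()
        })
    return scores

docs = [
    ["screen", "battery", "screen"],
    ["battery", "price"],
    ["battery", "logistics"],
    ["price"],
]
scores = tf_idf(docs)
```

With the smoothed denominator, a term that appears in almost every document (here "battery", in 3 of 4) gets an IDF of log(4/4) = 0, while a term concentrated in one document ("screen") gets a positive weight.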

Normalize the feature values to [0, 1] with svm-scale:

D:\Libsvm\windows\svm-scale -l 0 -u 1 train.txt > train-to-one.txt

from svmutil import *

# read the labels y and feature vectors x in LibSVM format
y, x = svm_read_problem(r'train-to-one.txt')

# train on the first 840 samples with cost parameter C = 4
m = svm_train(y[:840], x[:840], '-c 4')

# predict on the remaining samples
p_label, p_acc, p_val = svm_predict(y[840:], x[840:], m)

Output: Accuracy = 86.7273%

Candidate product feature words are extracted with LibSVM; their TF-IDF values are then computed and filtered to obtain the product feature words.

Step S4: The sentiment score of each product feature word is computed in four steps. 1) Split the review text: the review is cut into sentences at the most common Chinese sentence delimiters, such as the full stop, semicolon, question mark and exclamation mark; the sense-groups within each sentence (the smallest units that express sentiment) are then separated at commas. 2) Load the Chinese chunks as new words and segment each sense-group with Python jieba. 3) Sentiment analysis based on sentiment dictionaries: the HowNet sentiment dictionary and the Chinese sentiment polarity dictionary (NTUSD) are collected and combined with sentiment words specific to the mobile phone domain. The entries fall into three classes:

Sentiment words, such as "beautiful" and "practical", which express an attitude toward a product feature.

Degree words, such as "very" and "extremely"; when a degree word modifies a sentiment word, it expresses the intensity of the sentiment toward the product feature word.

Negation words, such as "not" and "no". Negation is handled in two cases: the negation word modifies a sentiment word, or it modifies a degree word. The two cases affect the sentiment toward the product with different strength, so the score must be adjusted accordingly.

4) Compute the sentiment score. First, search the sense-group for a product feature word; if one is present, compute the sentiment score, otherwise move on to the next sense-group. Next, look up each word obtained from the segmentation in step 2) in the sentiment dictionary in turn; if it is found, read its sentiment polarity and weight (S = sentiment polarity × weight), otherwise it is not a sentiment word. Then search the words preceding the sentiment word for a degree word, stopping when a degree word or another sentiment word is found; if a degree word is found, record its weight (K). Then search the words preceding the sentiment word for negation words, stopping at the start of the current sense-group or at another sentiment word. Negation words are handled one by one: if the negation word stands before the sentiment word, W_i = -1; if it stands before the degree word, W_i = 0.5; if there is no negation word, W_i = 1. At this point the sense-groups have been split, the sentiment words, degree words and negation words have been extracted, and each has been given its weight. Finally, the sense-group sentiment scores are computed: each occurrence of feature word i scores S × K × W_i, and the score of feature word i is the average of the scores of its occurrences, Score_i = AVG(SUM(S × K × W_i, ...)), where the subscript i denotes the i-th feature word.
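The scoring rules of substep 4) can be sketched as follows. This is a simplified reading, not the inventors' implementation: the lexicon contents and weights are hypothetical, a missing degree word is assumed to give K = 1, and when several negation words precede a sentiment word only the last one examined determines W.

```python
EMOTION = {"good": 1, "bad": -1}         # hypothetical S values (polarity x weight)
DEGREE = {"very": 3, "slightly": 0.5}    # hypothetical degree weights K
NEGATION = {"not"}                       # hypothetical negation words

def group_score(tokens):
    """Score one sense-group as the sum of S * K * W over its sentiment words."""
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok not in EMOTION:
            continue
        s, k, w = EMOTION[tok], 1.0, 1.0
        degree_seen = False
        # Scan the preceding words until another sentiment word or the group start.
        for j in range(i - 1, -1, -1):
            prev = tokens[j]
            if prev in EMOTION:
                break
            if prev in DEGREE and not degree_seen:
                k, degree_seen = DEGREE[prev], True
            elif prev in NEGATION:
                # Negation in front of the degree word weakens (0.5);
                # negation directly in front of the sentiment word flips (-1).
                w = 0.5 if degree_seen else -1.0
        total += s * k * w
    return total

def feature_score(sense_groups, feature):
    """Average the scores of all sense-groups that mention the feature word."""
    scores = [group_score(g) for g in sense_groups if feature in g]
    return sum(scores) / len(scores) if scores else None

groups = [
    ["very", "good", "de", "phone"],     # 1 * 3 * 1   = 3
    ["phone", "not", "very", "good"],    # 1 * 3 * 0.5 = 1.5
    ["logistics", "bad"],                # -1 * 1 * 1  = -1
]
phone_score = feature_score(groups, "phone")
```

Returning `None` for a feature that never occurs keeps "no evidence" distinct from a neutral score of 0.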

Applying the feature-word sentiment scoring above to sample text 1 proceeds as follows:

import re

# Sample review 1 (the original input is the Chinese text; its English gloss is shown here)
sentence = "A very good phone, the workmanship and texture are excellent, the looks are off the charts."

# sentence and sense-group delimiters, full-width and ASCII
re_string = r"[，。；？！,.;?!]"

# split and drop empty pieces such as the one after the final full stop
sens_list = [s.strip() for s in re.split(re_string, sentence) if s.strip()]

Processing yields three sense-groups: ["A very good phone", "the workmanship and texture are excellent", "the looks are off the charts"].

Take the first sense-group, "A very good phone", as the example for the sentiment score calculation. The sense-group is first segmented into words separated by spaces.

Here "very" is a degree word, "good" is a sentiment word, "de" (的) is a stop word, and "mobile phone" is a product feature word. By the formula above: Score("mobile phone") = S × K × W = 1 × 3 × 1 = 3. Finally, the sentiment score of every feature word occurrence is computed, and the scores of each feature word are averaged to give the final product feature score.

Step S5: The product feature word set obtained in step S3 is used as the training data for the word-vector model. The word vectors are obtained in the following substeps: 1) train the model built into the open-source Word2Vec tool; 2) the resulting word vectors are continuous-valued vectors of relatively low dimension, and all word vectors have the same dimension, whose size K is usually set by hand before training; K values of 50 and 100 are the most common. Word2Vec is an open-source word-vector learning tool developed by Google (tool web site: https://code.google.com/p/word2vec/). It implements two language models: the continuous bag-of-words model (CBOW) and the continuous skip-gram model. CBOW predicts the probability distribution of the center word from its known context, whereas skip-gram predicts the probability distribution of the context words from the known center word. Both models take the one-hot representation of words (1 for the current word, 0 for all other words) as input, and once the model has been trained the required word embedding vectors are obtained.

Using the Python implementation of word2vec, the above is implemented as follows:
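The difference between the two models can be made concrete by listing the training examples each one draws from a token sequence. The sketch below only builds these examples and does not train anything; the function name, window size and toy tokens are assumptions.

```python
def training_pairs(tokens, window=2):
    """Return (cbow, skipgram) training examples for one token sequence.

    cbow:     list of (context_words, center_word)  -> context predicts center
    skipgram: list of (center_word, context_word)   -> center predicts context
    """
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        cbow.append((context, center))
        skipgram.extend((center, c) for c in context)
    return cbow, skipgram

tokens = ["screen", "resolution", "very", "high"]
cbow, skipgram = training_pairs(tokens, window=1)
```

CBOW produces one example per position, with the whole context as input, whereas skip-gram produces one example per (center, context) pair.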

import warnings

warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

from gensim.models import word2vec

import logging

logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)

# load the corpus (file name is the English gloss of the original Chinese name)
sentences = word2vec.Text8Corpus(r"feature word set")

# window defaults to 5; 'size' is the gensim < 4.0 name (gensim 4.x uses vector_size)
model = word2vec.Word2Vec(sentences, size=100, min_count=1)

model.save(u"feature word vector model.model")

After the feature word training data has been loaded and the feature word vector model has been obtained by training, the following methods can be called:

Look up the vector of a given feature word: model["feature word"]

Compute the similarity (relatedness) of two words: model.similarity(word1, word2)

List the words most related to a given word: model.most_similar("call", topn=20)  # the 20 most related words

Find a correspondence (analogy): model.most_similar(["memory", "too small"], ["resolution"], topn=10)

Find the word that does not belong: model.doesnt_match("memory RAM capacity screen".split())

Step S6：The subdivided step that commodity are carried out with CSAT cluster analysis is as follows：1) CSAT is specific Introduce：CSAT can be embodied by the size of RATER indexes, and score value is higher, and client is more satisfied, and wherein RATER indexes are distinguished Represent reliability (reliability), assurance (professional degree), tangibles (tangible degree), empathy (similarly spend), Responsiveness (degree of reaction).The information of specific each classification is as follows：

Reliability：Refer to whether an enterprise can consistently fulfil the promise that oneself makes client, when this When a enterprise is truly realized this point, good reputation, the trust of Win Clients will be possessed.

Professional degree：Refer to professional knowledge, technical ability and professionalism that the attendant of enterprise possesses.Including：There is provided excellent The ability of matter service, the skill to the courtesy of client and respect, with client's effective communication.

Tangible degree：Refer to the help and pass of tangible service facility, environment, the instrument and service of attendant to client The tangible performance in bosom.Service is a kind of invisible product in itself, but is provided in clean and tidy service environment, dining room for child special With stewardess that child sings and dances festively etc. is led in seat, McDonald, this immaterial product of service can be made to become to have Shape is got up.

Similarly spend：Refer to that attendant can consider with seeing things as one would if he were in someone else's place at any time for client, veritably sympathetic understanding client Situation, the demand for understanding client.

Degree of reaction：Refer to that attendant gives for the demand of client and respond and can provide rapidly the hope of service in time. When servicing when something goes wrong, respond at once, rapid solution can bring active influence to service quality.As client, it is necessary to Be aggressive attitude.

2) cluster " seed " product features of CSAT are selected：Five classifications defined according to CSAT, will Product features word is mapped in five classifications.The description of commodity Feature Words itself is corresponding " reliability ", takes " mobile phone screen ", " battery The high frequency Feature Words such as capacity ", " speed of service ", " screen resolution " are used as " reliability " cluster " seed " product features word；Customer service Service, after-sale service description are corresponding " professional degree "；Product accessories packaging is corresponding " tangible degree "；Courier, logistics service correspond to " same Reason degree "；Goods return and replacement service is corresponding " degree of reaction "；Similarly, to remaining four classification, selection is corresponding respectively clusters " seed " business Product Feature Words.

3) Similarity clustering of product feature words: from the set of product feature word vectors obtained in step S5, remove the words already chosen as "seed" product feature words. Then traverse the set, call the word vector similarity function to calculate the similarity between each product feature word and every "seed" feature word in each class, and assign the word to the class with the largest average similarity.

4) Calculation of customer satisfaction: based on the emotion score of each product feature word obtained in step S4, sum the emotion scores of the product feature words in each class and take the average, which gives the final score of each of the five classes.

Following the steps described above, the specific implementation of sub-step 2 is mainly as follows:

list_1, list_2, list_3, list_4, list_5 = ['mobile phone screen', 'battery capacity', ...], [...], [...], [...], [...]  # seed feature words of the five classes

list_feature = ['resolution', 'cell phone appearance', ...]  # set of mobile phone feature words

for word in list_feature:
    # calculate the similarity between the word and each class
    score = compute(list_1, list_2, list_3, list_4, list_5, word)
    add_list(score)  # assign the word to the class with the largest similarity
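The loop above can be sketched as runnable Python under some stated assumptions. The helper names `cosine` and `assign_class`, the two-dimensional toy vectors and the feature word "service attitude" are hypothetical and are introduced only for illustration; in the method itself the vectors would come from the word vector model of step S5, and only two of the five classes are shown here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assign_class(word, seeds, vectors):
    """Return the class whose seed words have the largest average similarity to `word`."""
    best_class, best_score = None, float("-inf")
    for cls, seed_words in seeds.items():
        sims = [cosine(vectors[word], vectors[s]) for s in seed_words]
        mean_sim = sum(sims) / len(sims)
        if mean_sim > best_score:
            best_class, best_score = cls, mean_sim
    return best_class

# Toy 2-D vectors invented for this sketch (assumption, not data from the method).
vectors = {
    "mobile phone screen": [1.0, 0.1],
    "battery capacity": [0.9, 0.2],
    "customer service": [0.1, 1.0],
    "after-sales service": [0.2, 0.9],
    "resolution": [0.95, 0.15],
    "service attitude": [0.15, 0.95],  # hypothetical feature word
}

# Seed feature words per class; only two of the five classes are shown.
seeds = {
    "reliability": ["mobile phone screen", "battery capacity"],
    "professionalism": ["customer service", "after-sales service"],
}

# Start each cluster with its seeds, then assign every remaining feature word.
clusters = {cls: list(words) for cls, words in seeds.items()}
for word in ["resolution", "service attitude"]:
    clusters[assign_class(word, seeds, vectors)].append(word)
```

Keeping the seeds in a dictionary keyed by class name, rather than in five separate lists, lets the same function handle any number of classes.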

After all product feature words have been clustered, the customer satisfaction of the product can be obtained by calculation. This more comprehensive, efficient and concise evaluation method makes the results of product recommendation fast and accurate.
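The final calculation, averaging the emotion scores of the feature words within each class, might look like the following minimal sketch. The function name `class_scores`, the example clusters and all score values are assumptions made for illustration; real emotion scores would be those produced in step S4. A plain unweighted mean is used here.

```python
def class_scores(clusters, emotion_scores):
    """Average the emotion scores of the feature words in each class.

    Words without an emotion score are skipped; a class with no scored
    words gets None instead of a number.
    """
    scores = {}
    for cls, words in clusters.items():
        rated = [emotion_scores[w] for w in words if w in emotion_scores]
        scores[cls] = sum(rated) / len(rated) if rated else None
    return scores

# Hypothetical clustering result and emotion scores (illustrative values only).
clusters = {
    "reliability": ["mobile phone screen", "battery capacity", "resolution"],
    "professionalism": ["customer service", "after-sales service"],
}
emotion_scores = {
    "mobile phone screen": 0.8,
    "battery capacity": 0.4,
    "resolution": 0.6,
    "customer service": -0.2,
    "after-sales service": 0.4,
}

satisfaction = class_scores(clusters, emotion_scores)
```

Returning `None` for a class with no scored words keeps an empty class from being mistaken for a neutral score of zero.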

It should be noted that the above content is a detailed description of part of the invention in combination with specific preferred embodiments, and the invention should not be regarded as limited to these descriptions. For a person of ordinary skill in the technical field of the invention, all other specific examples obtained without departing from the concept of the invention shall be considered to fall within the protection scope of the invention.

Claims

1. A machine-learning-based method for analyzing customer satisfaction with e-commerce products, characterized in that it comprises the following steps:

Step S1: obtain e-commerce product review texts from relevant e-commerce platforms, and perform data preprocessing such as word segmentation and part-of-speech tagging;

Step S2: select the Chinese chunk label symbols and manually label the word segmentation results obtained in step S1, as training samples for obtaining the Chinese chunk model;

Step S3: using the training samples obtained in step S2, train with the Lib-SVM tool to obtain a model that can automatically label Chinese chunks in product review texts; then automatically label the Chinese chunks in all reviews, select the nominal Chinese chunks as the candidate product feature word set, and filter the candidate set according to certain rules;

Step S4: build a sentiment dictionary and, using the product feature word set obtained in step S3, calculate the emotion score of each feature of the product;

Step S5: using the product feature word set obtained in step S3, train a word vector model of the feature words to obtain the vector representation of the product feature words;

Step S6: using the product feature word vectors obtained in step S5, perform customer satisfaction cluster analysis on the product feature words based on word vector similarity, and, using the emotion score of each product feature obtained in step S4, calculate the average score of each class as the final score.
2. The machine-learning-based method for analyzing customer satisfaction with e-commerce products according to claim 1, characterized in that, in step S1, the data preprocessing includes review word segmentation, part-of-speech tagging, word frequency statistics, stop word filtering and low-frequency word filtering.
3. The machine-learning-based method for analyzing customer satisfaction with e-commerce products according to claim 1, characterized in that, in step S2, the method of Chinese chunk labeling is: there are 13 kinds of Chinese chunks in total, of which the 8 most common are selected according to importance as label symbols; the Chinese chunks are labeled using the IOB2 tag set; according to the parts of speech and dependency relations of the 2 words before and after each word, a Chinese chunk label is manually assigned to each word; after the Chinese chunk labeling is completed, each word corresponds one-to-one to a Chinese chunk label.
4. The machine-learning-based method for analyzing customer satisfaction with e-commerce products according to claim 1, characterized in that, in step S3, the method of product feature word extraction is: according to the Chinese chunks labeled in step S2, the word and part of speech of each word, together with the words, parts of speech and Chinese chunk labels of the 2 words before and after it, are selected as the input features for training, and a Chinese chunk extraction model is trained with the Lib-SVM tool; the model is used to extract the nominal Chinese chunks from all review texts as the candidate product feature word set; the TF-IDF value of each candidate feature word is calculated, and the candidates are filtered against a given threshold.
5. The machine-learning-based method for analyzing customer satisfaction with e-commerce products according to claim 1, characterized in that, in step S4, the method of calculating the product feature word emotion score is: sentiment dictionaries available online are collected and integrated, mainly the HowNet sentiment dictionary and the sentiment polarity dictionary of National Taiwan University, and numerical values are assigned to words of different categories; in each review, given a certain distance and according to the product feature word set obtained in step S3, the emotion score of each product feature word contained in the review is calculated with reference to the sentiment dictionary.
6. The machine-learning-based method for analyzing customer satisfaction with e-commerce products according to claim 1, characterized in that, in step S5, the method of obtaining the vector representation of words from the product feature word set obtained in step S3 is: the Word2Vec open-source tool is used to train on the product feature word set; the vector representation of the feature words is then obtained, which is a continuous-valued vector of relatively low dimension; every word vector has the same dimension, and the size of the dimension is specified manually as a hyperparameter before training, commonly 50 or 100 dimensions.
7. The machine-learning-based method for analyzing customer satisfaction with e-commerce products according to claim 1, characterized in that, in step S6, the feature word similarity calculation method based on word vectors is: according to the word vector representation of the feature words from step S5, the similarity calculation tool provided with Word2Vec is used to calculate the similarity between each feature word and the remaining feature words; the result obtained is a decimal between 0 and 1, and a larger value indicates greater similarity.
8. The machine-learning-based method for analyzing customer satisfaction with e-commerce products according to claim 1, characterized in that, in step S6, the customer satisfaction clustering method is: according to the five class definitions of customer satisfaction, namely reliability, professionalism, tangibility, empathy and responsiveness, and based on the product feature word set extracted in step S3, the ten product feature words that are most representative of and unique to each class are manually screened out; using the word-vector-based feature word similarity calculation method of step S6, the similarity between each remaining feature word and the ten feature words of each of the five classes is calculated in turn, and the word is finally assigned to the class with the largest average similarity.
9. The machine-learning-based method for analyzing customer satisfaction with e-commerce products according to claim 1, characterized in that, in step S6, the method of calculating the customer satisfaction total score is: by combining the customer satisfaction clustering result of step S6 with the product feature word emotion scores of step S4, the emotion scores of all product feature words in each customer satisfaction class are obtained; a weighted average is then taken over the feature words in each class, as the final score of each customer satisfaction class.