CN108563647A

CN108563647A - A car sales forecasting method based on comment sentiment analysis

Info

Publication number: CN108563647A
Application number: CN201711229414.1A
Authority: CN
Inventors: 周应华; 商楠
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2017-11-29
Filing date: 2017-11-29
Publication date: 2018-09-21

Abstract

The present invention discloses a method for predicting car sales based on sentiment analysis. Comment data is obtained from a car review website and preprocessed, and a multi-label classification method classifies the comments, according to the user's experience, into six aspects: safety, comfort, handling, power, economy, and service. The sentiment factor of each aspect is integrated into the model to build a sentiment-aware prediction model, which predicts car sales, identifies which aspect of car performance consumers pay more attention to, and serves as a guide for future production. In operation, the user inputs past sales data, feeds the data into the model, and obtains the sales forecast for the next quarter. The method improves prediction accuracy.

Description

A car sales forecasting method based on comment sentiment analysis

Technical field

The invention belongs to the field of automobile sales analysis and forecasting, and specifically relates to a car sales forecasting method based on comment sentiment analysis.

Background

Car sales forecasting estimates the sales of an upcoming period from past sales data and other data. Existing car sales forecasting techniques rely mainly on past sales data and use autoregressive models or grey models. The limitation of these methods is that they dig only into past sales data and ignore the influence of user review data. Research indicates that online review data helps improve the accuracy of sales forecasting models.

Forecasting based on car review data is an active research direction, but it faces difficulties, for example in natural language processing: review language today is highly varied and informal, and contains many internet slang terms.

Summary of the invention

The present invention aims to solve the above problems of the prior art and proposes a car sales forecasting method based on comment sentiment analysis that improves forecasting accuracy. The technical scheme of the invention is as follows:

A car sales forecasting method based on comment sentiment analysis, comprising the following steps:

1) Preprocess the car review data, including unifying the format and removing duplicate words;

2) Use the Chinese lexical analysis system of the Chinese Academy of Sciences to segment the preprocessed car review data into words, and remove stop words;

3) Apply multi-label classification to the review data set segmented in step 2);

4) Quantify sentiment with mutual information to obtain the sentiment value of the review text set;

5) Integrate the sentiment value into a regression model to predict car sales for the next period.

Further, step 1) divides the car review data into six aspects: comfort, power, handling, service, economy, and safety. The relationship between a review word and a class label is computed first; the formula is as follows:

where n is the total number of documents; χ² measures the correlation between a word and one aspect l_j of the car; ¬word indicates that the word does not appear in document D_i, and ¬l_j indicates that aspect l_j is absent; p(word, l_j) is the number of times the word appears in documents D_i for which l_ij = 1; and l_j denotes one performance aspect of the car. L = {l_1, l_2, ..., l_j, ..., l_6} denotes the label set made up of six labels, that is, the set of performance aspects covered by the document collection D: comfort, power, handling, service, economy, and safety. j indexes one of these aspects (1 ≤ j ≤ 6), and i indexes the i-th document. p(word) is the number of times the word appears in the documents D_i, p(l_j) is the number of times l_j occurs in the text set, and p(¬word) is the number of times the word does not appear in the documents D_i.
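The formula itself is not reproduced in this text. As a minimal sketch, assuming the standard χ² statistic for a 2×2 word/label contingency table, χ² = n(ad - bc)² / ((a + b)(c + d)(a + c)(b + d)), which is consistent with the symbols defined above, the correlation could be computed as follows. The function name `chi_square` and the count arguments `a`, `b`, `c`, `d` are illustrative and do not come from the patent.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Standard chi-square statistic for a 2x2 word/label contingency table.

    a: documents that contain the word and carry label l_j
    b: documents that contain the word but do not carry l_j
    c: documents that lack the word but carry l_j
    d: documents that lack the word and do not carry l_j
    (These argument names are assumptions made for illustration.)
    """
    n = a + b + c + d  # total number of documents
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # The word or the label is constant across the collection,
        # so the table carries no evidence of correlation.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator
```

A word that occurs only in documents of one aspect scores high, while a word spread evenly across aspects scores zero.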

Further, step 1) uses ICTCLAS3, the Chinese lexical analysis system of the Institute of Computing Technology, Chinese Academy of Sciences. The cell lexicons related to the automobile industry from the Sogou input method are first imported into the lexical analysis system; the UltraEdit editor is used to parse the non-text-format lexicons, after which the format is unified and duplicate words are removed.
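The unify-and-deduplicate step can be sketched as follows, assuming the lexicon has already been exported to one word per line of plain text. The function name `unify_lexicon` is hypothetical, and whitespace stripping stands in for whatever format unification the inventors applied.

```python
def unify_lexicon(lines):
    """Normalize lexicon entries and drop duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for line in lines:
        word = line.strip()  # remove surrounding whitespace and newlines
        if not word or word in seen:
            continue  # skip blank lines and repeated words
        seen.add(word)
        cleaned.append(word)
    return cleaned
```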

Further, step 2) treats numerals, pronouns, quantifiers, onomatopoeia, localizers, conjunctions, interjections, suffix components, and particles as stop words.
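Because the stop words are defined by word class, the filter can work on part-of-speech tags rather than on a word list. The sketch below assumes the segmenter returns (word, tag) pairs with PKU-style tags, where the first letter identifies the class; the tag letters and the names `STOP_TAGS` and `remove_stop_words` are assumptions for illustration, and no real segmenter API is called.

```python
# Assumed PKU-style first letters for the nine stop-word classes:
# m numeral, r pronoun, q quantifier, o onomatopoeia, f localizer,
# c conjunction, e interjection, k suffix component, u particle.
STOP_TAGS = {"m", "r", "q", "o", "f", "c", "e", "k", "u"}


def remove_stop_words(tagged_tokens):
    """Keep only the words whose tag does not begin with a stop-word class letter."""
    return [word for word, tag in tagged_tokens if tag[:1] not in STOP_TAGS]
```

Matching on the first letter means that sub-tags of a class are filtered together with the class itself.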

Further, an average-χ² aggregation strategy is used to measure the value of χ²; the formula is as follows:

The words are sorted by χ² value from high to low and a subset is selected as feature terms, with term frequency as the weight of each feature term. The text is represented with the vector space model to obtain the feature vector d_i of each review document, and an SVM is used to classify the documents.
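The aggregation formula is not reproduced in this text. The following is a minimal sketch that assumes the label-prior-weighted average common in text categorization, that is, the sum over j of P(l_j) × χ²(word, l_j), followed by top-k selection and term-frequency weighting. All function names are hypothetical, and the inventors may have used a different average.

```python
from collections import Counter


def average_chi_square(scores_per_label, label_priors):
    """Weighted average over labels: sum_j P(l_j) * chi2(word, l_j)."""
    return sum(p * s for p, s in zip(label_priors, scores_per_label))


def select_features(word_scores, k):
    """Return the k words with the highest aggregated chi-square value."""
    ranked = sorted(word_scores, key=word_scores.get, reverse=True)
    return ranked[:k]


def tf_vector(tokens, features):
    """Term-frequency vector d_i of one document over the selected feature words."""
    counts = Counter(tokens)
    return [counts[f] for f in features]
```

The resulting vectors d_i, paired with the label vectors, would then be the input to the SVM classifier.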

Further, quantifying the sentiment value in step 4) specifically comprises:

When the rating is less than or equal to 2, the text is regarded as negative and assigned to the negative text set; when the rating is 5, the text is regarded as positive and added to the positive text set. The sentiment value S(word) of each word in the text is computed as:

S(word) = P(word,pos) - P(word,neg)

where f(word,pos) is the frequency with which the word appears in the positive text set; f(word) is the number of times the word appears in the entire text set; f(pos) is the number of positive documents; and M is the size of the entire text set. P(word,neg), the mutual information between the word and the negative documents, is computed in the same way.

The formula for S(word) can be simplified to

where f(neg) is the number of negative documents. The sentiment value S_rev(r_k) of the i-th review is then:

where q is the number of sentiment-dictionary words contained in the i-th review document; that is, the sentiment value of each review text is the sum of the sentiment values of its words.
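The formulas for P(word,pos) and the simplified S(word) are not reproduced in this text. A minimal sketch, assuming pointwise mutual information of the form P(word,pos) = log(f(word,pos) · M / (f(word) · f(pos))): subtracting the negative counterpart cancels M and f(word), leaving S(word) = log(f(word,pos) · f(neg) / (f(word,neg) · f(pos))). The natural logarithm, the function names, and the handling of zero counts are assumptions.

```python
import math


def word_sentiment(f_word_pos, f_word_neg, f_pos, f_neg):
    """S(word) = log( f(word,pos) * f(neg) / (f(word,neg) * f(pos)) )."""
    if min(f_word_pos, f_word_neg, f_pos, f_neg) <= 0:
        # The logarithm is undefined for zero counts; smoothing is left to the caller.
        raise ValueError("all counts must be positive; smooth zero counts first")
    return math.log((f_word_pos * f_neg) / (f_word_neg * f_pos))


def review_sentiment(tokens, lexicon):
    """S_rev: sum of S(word) over the review's words found in the sentiment lexicon."""
    return sum(lexicon[t] for t in tokens if t in lexicon)
```

A word seen more often in positive reviews gets a positive score, and one seen more often in negative reviews gets a negative score of matching magnitude when the ratio is inverted.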

Further, step 5) uses a modified autoregressive (AR) model for prediction. y_t denotes the sales volume in month t, t = 1, 2, ..., n, where n denotes a future month.

where q is the number of months before month t whose sentiment factors are taken into account; w_t is the sentiment influence in month t; α_i are the model parameters obtained by least squares; P is the number of months before month t that are considered; i indexes one of the previous P months; α_0 is the constant term; and ε_t is the error term. The sentiment factors under each label are substituted into the model separately, and comparing the results on the training set reveals which aspect of car performance consumers value more.
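The model equation is not reproduced in this text. The sketch below assumes the form y_t = α_0 + Σ_{i=1..P} α_i · y_{t-i} + Σ_{j=1..q} β_j · w_{t-j} + ε_t, where the sentiment coefficients β_j are a hypothetical symbol, and fits all coefficients by ordinary least squares through the normal equations. The function names are illustrative, and the plain Gaussian elimination assumes a non-singular system.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def design_row(y, w, t, p, q):
    """Regressors for month t: constant, p sales lags, q sentiment lags."""
    sales_lags = [y[t - i] for i in range(1, p + 1)]
    sentiment_lags = [w[t - j] for j in range(1, q + 1)]
    return [1.0] + sales_lags + sentiment_lags


def fit_ar_sentiment(y, w, p, q):
    """Least-squares fit via the normal equations (X^T X) theta = X^T y."""
    start = max(p, q)  # first month with a full set of lags
    rows = [design_row(y, w, t, p, q) for t in range(start, len(y))]
    targets = y[start:]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear(xtx, xty)


def predict_next(y, w, theta, p, q):
    """One-step-ahead forecast for the month after the last observation."""
    row = design_row(y, w, len(y), p, q)
    return sum(coef * x for coef, x in zip(theta, row))
```

The returned list holds α_0 first, then the P sales-lag coefficients, then the q sentiment-lag coefficients; `y` and `w` are expected to cover the same months.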

The advantages and beneficial effects of the present invention are as follows:

1. Unlike traditional forecasting, it uses review data and takes into account how much users like the product, which avoids wasting data.

2. Review data on a single aspect of car performance can be used for prediction separately, revealing which aspect of car performance consumers value more.

3. It makes forecasts more accurate.

Description of drawings

Fig. 1 is the operation flow chart of a preferred embodiment of the present invention;

Fig. 2 shows the multi-label classification results of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the drawings. The described embodiments are only some of the embodiments of the invention.

The technical scheme by which the present invention solves the above technical problems is as follows:

Web reviews are preprocessed with ICTCLAS3, the Chinese lexical analysis system of the Institute of Computing Technology, Chinese Academy of Sciences. First, the cell lexicons related to the automobile industry from the Sogou input method are imported into the analysis system; the UltraEdit editor is used to parse the non-text-format lexicons, after which the format is unified and duplicate words are removed. Stop words are then removed according to the word segmentation results, with numerals, pronouns, quantifiers, onomatopoeia, localizers, conjunctions, interjections, suffix components, and particles treated as stop words.

1) Multi-label classification

The multi-label training data set built from car review texts is denoted (D, T, L). D = {D_1, D_2, ..., D_n} = {(d_1, y_1), (d_2, y_2), ..., (d_n, y_n)} is the multi-label data set made up of n review documents about the car; each document D_i consists of a feature vector d_i and a label vector y_i (1 ≤ i ≤ n). T = (t_1, t_2, ..., t_p) is the feature set made up of p keywords selected from the n review documents. L = {l_1, l_2, ..., l_6} is the label set made up of six labels (comfort, power, handling, service, economy, and safety). In the feature vector d_i = {w_1i, w_2i, ..., w_ji, ..., w_pi}, w_ji is the weight of keyword t_j in document D_i. Each document corresponds to one or more performance labels in the label set L, encoded as a binary vector y_i of 0s and 1s: if D_i contains category l_j, then y_ji = 1; otherwise y_ji = 0.
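As a small illustration of the label encoding described above, the binary vector y_i can be built from the set of aspects a review touches. The English aspect names, their order, and the function name `label_vector` are choices made for this sketch.

```python
# The six aspect labels l_1..l_6, in an order assumed for this sketch.
ASPECTS = ["comfort", "power", "handling", "service", "economy", "safety"]


def label_vector(document_aspects):
    """Binary vector y_i: position j is 1 if the document carries aspect l_j, else 0."""
    return [1 if aspect in document_aspects else 0 for aspect in ASPECTS]
```

A review that discusses both power and economy, for example, carries two labels at once, which is what makes the task multi-label rather than multi-class.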

a) The χ² statistic is used to measure the correlation between a word and a label; the formula is as follows:

where n is the total number of documents and p(word, l_j) is the number of times the word appears in documents D_i (with l_ij = 1); similarly, ¬word indicates that the word does not appear in document D_i.

b) An average-χ² aggregation strategy is used to measure the value of χ²; the formula is as follows:

The words are sorted by χ² value from high to low and a subset is selected as feature terms, with term frequency as the weight of each feature term. The text is represented with the vector space model to obtain the feature vector d_i of each review document.

c) An SVM is used to classify the documents.

3) Determination of the sentiment value

According to the rating system of Sina Auto, a rating of 1 or 2 on an item means the consumer is very dissatisfied with it, while a rating of 5 means the consumer is satisfied with it. For a review text, when the rating is less than or equal to 2, the text is regarded as negative and assigned to the negative text set; when the rating is 5, the text is regarded as positive and added to the positive text set. The sentiment value S(word) of each word in the text is computed as:

S(word) = P(word,pos) - P(word,neg)

where f(word,pos) is the frequency with which the word appears in the positive text set; f(word) is the number of times the word appears in the entire text set; f(pos) is the number of positive documents; and M is the size of the entire text set. P(word,neg) is computed in the same way. The formula for S(word) can be simplified to

The sentiment value S_rev(r_k) of the i-th review is then:

where q is the number of sentiment-dictionary words contained in the i-th review document; that is, the sentiment value of each review text is the sum of the sentiment values of its words.
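The rating rule above (a rating of 2 or less is negative, a rating of 5 is positive) determines the counts f(word,pos), f(word,neg), f(pos), and f(neg). A minimal sketch of gathering them from tokenized, rated reviews follows; the function name `build_counts` and the input layout are hypothetical.

```python
from collections import Counter


def build_counts(reviews):
    """Collect per-polarity word counts and document counts.

    reviews: iterable of (tokens, rating) pairs, with ratings on a 1 to 5 scale.
    """
    word_pos, word_neg = Counter(), Counter()
    n_pos = n_neg = 0
    for tokens, rating in reviews:
        if rating == 5:  # positive text set
            word_pos.update(tokens)
            n_pos += 1
        elif rating <= 2:  # negative text set
            word_neg.update(tokens)
            n_neg += 1
        # Ratings of 3 or 4 belong to neither set and are skipped.
    return word_pos, word_neg, n_pos, n_neg
```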

The review sentiment value of a given car model is then:

that is, the sum of the sentiment values of all its review texts. For the classified texts (six aspects: comfort, power, handling, service, economy, and safety), the sentiment value of each aspect and the overall sentiment value are computed separately.
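The per-aspect and overall totals described above can be sketched as a single pass over the classified reviews. The input layout (a sentiment value paired with the aspects assigned by the classifier) and the function name `aspect_sentiment` are assumptions.

```python
from collections import defaultdict


def aspect_sentiment(reviews):
    """Sum review sentiment per aspect and overall.

    reviews: iterable of (sentiment_value, aspects) pairs.
    """
    per_aspect = defaultdict(float)
    overall = 0.0
    for value, aspects in reviews:
        overall += value
        for aspect in aspects:
            # A multi-label review contributes to each aspect it was assigned.
            per_aspect[aspect] += value
    return dict(per_aspect), overall
```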

4) Forecast

A modified AR model is used for prediction. y_t denotes the sales volume in month t (t = 1, 2, ..., n).

where q is the number of months before month t whose sentiment factors are taken into account, and w_t is the sentiment influence in month t. The sentiment factors under each label are substituted into the model separately, and comparing the results on the training set reveals which aspect of car performance consumers value more. This guides the next stage of car production.
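The comparison across aspects can be sketched as follows, assuming that "comparison on the training set" means picking the aspect whose sentiment-augmented model has the lowest mean squared error on known sales. The error measure and both function names are assumptions.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between observed and fitted sales."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def best_aspect(actual, predictions_by_aspect):
    """Return the aspect whose model fits the training sales best, plus all errors."""
    errors = {
        aspect: mean_squared_error(actual, predicted)
        for aspect, predicted in predictions_by_aspect.items()
    }
    return min(errors, key=errors.get), errors
```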

The above embodiments should be understood as merely illustrating the present invention and not as limiting its scope of protection. After reading the contents of the present invention, a skilled person can make various changes or modifications to it, and these equivalent changes and modifications likewise fall within the scope defined by the claims of the present invention.

Claims

1. A car sales forecasting method based on comment sentiment analysis, characterized by comprising the following steps:

1) Preprocess the car review data, including unifying the format and removing duplicate words;

2) Use the Chinese lexical analysis system of the Chinese Academy of Sciences to segment the preprocessed car review data into words, and remove stop words;

3) Apply multi-label classification to the review data set segmented in step 2);

4) Quantify sentiment with mutual information to obtain the sentiment value of the review text set;

5) Integrate the sentiment value into a regression model to predict car sales for the next period.

2. The car sales forecasting method based on comment sentiment analysis according to claim 1, characterized in that step 1) divides the car review data into six aspects: comfort, power, handling, service, economy, and safety; the relationship between a review word and a class label is computed first, with the formula as follows:

where n is the total number of documents; χ² measures the correlation between a word and one aspect l_j of the car; ¬word indicates that the word does not appear in document D_i, and ¬l_j indicates that aspect l_j is absent; p(word, l_j) is the number of times the word appears in documents D_i for which l_ij = 1; l_j denotes one performance aspect of the car; j indexes one of the aspects (1 ≤ j ≤ 6); and i indexes the i-th document. p(word) is the number of times the word appears in the documents D_i, p(l_j) is the number of times l_j occurs in the text set, and p(¬word) is the number of times the word does not appear in the documents D_i.

3. The car sales forecasting method based on comment sentiment analysis according to claim 1 or 2, characterized in that step 1) uses ICTCLAS3, the Chinese lexical analysis system of the Institute of Computing Technology, Chinese Academy of Sciences; the cell lexicons related to the automobile industry from the Sogou input method are first imported into the lexical analysis system, the UltraEdit editor is used to parse the non-text-format lexicons, and the format is then unified and duplicate words are removed.

4. The car sales forecasting method based on comment sentiment analysis according to claim 3, characterized in that step 2) treats numerals, pronouns, quantifiers, onomatopoeia, localizers, conjunctions, interjections, suffix components, and particles as stop words.

5. The car sales forecasting method based on comment sentiment analysis according to claim 2, characterized in that an average-χ² aggregation strategy is used to measure the value of χ², with the formula as follows:

The words are sorted by χ² value from high to low and a subset is selected as feature terms, with term frequency as the weight of each feature term; the text is represented with the vector space model to obtain the feature vector d_i of each review document, and an SVM is used to classify the documents.

6. The car sales forecasting method based on comment sentiment analysis according to claim 5, characterized in that quantifying the sentiment value in step 4) specifically comprises:

When the rating is less than or equal to 2, the text is regarded as negative and assigned to the negative text set; when the rating is 5, the text is regarded as positive and added to the positive text set. The sentiment value S(word) of each word in the text is computed as:

S(word) = P(word,pos) - P(word,neg)

where f(word,pos) is the frequency with which the word appears in the positive text set; f(word) is the number of times the word appears in the entire text set; f(pos) is the number of positive documents; and M is the size of the entire text set. P(word,neg) is computed in the same way, and represents the pointwise mutual information between the word and the negative documents;

The formula for S(word) can be simplified to

where f(neg) is the number of negative documents. The sentiment value S_rev(r_k) of the i-th review is then:

where q is the number of sentiment-dictionary words contained in the i-th review document; that is, the sentiment value of each review text is the sum of the sentiment values of its words.

7. The car sales forecasting method based on comment sentiment analysis according to claim 6, characterized in that step 5) uses a modified autoregressive (AR) model for prediction, where y_t denotes the sales volume in month t, t = 1, 2, ..., n, and n denotes a future month;

where q is the number of months before month t whose sentiment factors are taken into account; w_t is the sentiment influence in month t; α_i are the model parameters obtained by least squares; P is the number of months before month t that are considered; i indexes one of the previous P months; α_0 is the constant term; and ε_t is the error term. The sentiment factors under each label are substituted into the model separately, and comparing the results on the training set reveals which aspect of car performance consumers value more.