CN114168730A - Consumption tendency analysis method based on BiLSTM and SVM - Google Patents


Info

Publication number
CN114168730A
CN114168730A
Authority
CN
China
Prior art keywords
user
sentence
consumption
judging
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111416830.9A
Other languages
Chinese (zh)
Inventor
贾海涛
唐小龙
周焕来
乔磊崖
林思远
陈泓秀
张博阳
王俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yituo Communications Group Co ltd
Original Assignee
Yituo Communications Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yituo Communications Group Co ltd filed Critical Yituo Communications Group Co ltd
Priority to CN202111416830.9A
Publication of CN114168730A
Legal status: Pending

Classifications

    • G06F16/353 — Information retrieval of unstructured textual data; clustering; classification into predefined classes
    • G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/2411 — Classification based on the proximity to a decision surface, e.g. support vector machines
    • G06F18/254 — Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/259 — Fusion by voting
    • G06F40/295 — Natural language analysis; named entity recognition
    • G06N3/044 — Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/047 — Probabilistic or stochastic networks
    • G06N3/084 — Learning methods; backpropagation, e.g. using gradient descent
    • G06Q30/0201 — Market modelling; market analysis; collecting market data
    • G06Q30/0202 — Market predictions or forecasting for commercial activities
    • G06Q30/0254 — Targeted advertisements based on statistics
    • G06Q30/0282 — Rating or review of business operators or products


Abstract

The invention discloses a consumption tendency analysis method based on BiLSTM and SVM, comprising the following steps: determining whether a single sentence is related to consumption; judging the type of commodity appearing in the sentence through a Bag-of-words model; judging the emotional attitude (support or opposition) expressed in the sentence through BiLSTM and SVM models; combining the commodity type and the emotional attitude to judge the consumption tendency expressed in the sentence; and performing these operations on each sentence of a text in turn, then aggregating the results to obtain the user's consumption tendency over the whole text. With the invention, a user's tendency in online consumption, such as pursuing quality or valuing low prices, can be analyzed from the user's online posts, reviews and similar text, so that commodities can be recommended to the user more accurately.

Description

Consumption tendency analysis method based on BiLSTM and SVM
Technical Field
The invention relates to a consumption tendency analysis method based on BiLSTM and SVM, and belongs to the field of natural language processing.
Background
Today, with the rapid development of the Internet and information technology, electronic commerce has become a comparatively advanced business model that continues to grow vigorously. In electronic commerce, buyers and sellers conduct various business activities online, supported by information technology and application modes such as servers and browsers, enabling online shopping, online payment and other commercial activities for customers.
Personalized recommendation is a newer e-commerce service mode. Its main goal is to analyze users' personal preferences, personalities and habits in light of their needs, accurately provide services or information of interest, and thereby better mitigate the problems caused by missing information and information overload.
Existing solutions mainly include user-based collaborative filtering (UserCF) and item-based collaborative filtering (ItemCF). UserCF recommends items liked by users who share interests and preferences with the target user; ItemCF recommends items similar to those the user previously liked. However, UserCF cannot immediately make personalized recommendations after a new user has acted on only a few items, because the user similarity table is computed offline at intervals, and the algorithm itself struggles to provide recommendation explanations that convince the user; ItemCF cannot recommend new items to a user until the item similarity table is updated offline. Moreover, neither algorithm easily incorporates the side features of items, and neither analyzes the user's online posts and reviews, so important information they may contain is ignored.
Meanwhile, plain sentiment analysis can only judge whether a person holds a positive or negative attitude; because it does not analyze commodity characteristics, it cannot yield user preferences. The invention therefore provides a consumption tendency analysis method based on BiLSTM and SVM, aiming to analyze a user's consumption tendency so that commodities can be recommended to the user more accurately.
Disclosure of Invention
The invention provides a consumption tendency analysis method based on BiLSTM and SVM. The aim of the invention is to provide a method for analyzing a user's consumption tendency, so that commodities can be recommended to the user more accurately.
The technical scheme of the invention is as follows:
step one, data preprocessing is carried out, and whether a single sentence is related to consumption or not is judged through a Bag-of-words model. And if one word in the preprocessed words exists in the consumption-related dictionary, judging that the words are related to consumption.
And step two, carrying out named entity identification by using LTP to find the commodity in the LTP. Then, the type of the commodities appearing in the sentence is judged, and the commodities with the two tendencies are represented by positive scores and negative scores respectively.
And step three, judging the emotional attitude of the character in the sentence through the BilSTM and SVM models, namely supporting or resisting. Firstly, acquiring depth word vector characteristics of emotion classification by using a BilSTM model, and then classifying and judging the depth word vector characteristics by using an SVM model.
And step four, multiplying the scores of the first two steps to judge the consumption tendency of the user in the sentence, namely which product is preferred to buy.
And step five, counting and judging the consumption tendency of the user in the whole text.
The invention has the following beneficial effects. It provides a method for analyzing a user's consumption tendency, addressing two problems: conventional collaborative filtering algorithms struggle to incorporate the side features of items and do not analyze the user's online posts and reviews; and plain sentiment analysis can only judge whether a person's attitude is positive or negative, does not analyze commodity characteristics, and so cannot yield user preferences. Based on the BiLSTM and SVM models, the method jointly analyzes commodity characteristics and the user's emotional attitude to obtain the user's preferences, so that commodities can be recommended to the user more accurately.
Drawings
FIG. 1 is an overall block diagram of the algorithm of the present invention;
FIG. 2 is a block diagram of the bidirectional recurrent neural network BiLSTM;
FIG. 3 illustrates the max-pooling process;
FIG. 4 shows the training process of the BiLSTM-based word vector model for emotion classification;
FIG. 5 is a schematic diagram of judging the emotional attitude of the user in a sentence.
Detailed Description
The idea of the algorithm will be described below, and specific steps of the algorithm will be given.
Step one: determine whether a single sentence is related to consumption
First, read the data and preprocess the input with jieba: word segmentation and removal of stop words and stop symbols. A user-defined dictionary is used to avoid erroneous segmentation. After preprocessing, read the text line by line into a words list. Create a merchandish_list and store all e-commerce thesaurus entries in it in turn: trade names, store names, platforms, holidays, idioms, consumption-sensitive words, and so on. Reading the list uses the Bag-of-words model, i.e. the document is converted into an unordered representation by ignoring grammar and word order.
The principle of the Bag-of-words model is as follows. For a corpus of documents $C=\{doc_1, doc_2, \dots, doc_m\}$, all tokens are collected into a large lexicon $L_c$; in this patent the lexicon corresponds to the e-commerce thesaurus merchandish_list. An arbitrary text $doc_i$ with word segmentation result $W_i$ is represented as a vector $V_i$ with $|V_i|=\mathrm{len}(L_c)$. If the $j$-th entry of the lexicon occurs in $W_i$, the vector component $V_{ij}$ is its term frequency $tf_{ij}$; otherwise it is 0:

$$V_{ij}=\begin{cases} tf_{ij}, & L_c[j]\in W_i \\ 0, & \text{otherwise} \end{cases}$$
The e-commerce thesaurus merchandish_list is searched while reading the words list, and a variable flag stores the number of consumption-related words in the sentence. Each time a word of words is also found in merchandish_list, the flag value is incremented by one. If flag is 0, the sentence is judged not to involve e-commerce; if flag is non-zero, the sentence is judged to involve e-commerce.
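As a minimal sketch of this step (not the patent's actual code), the relevance check reduces to counting thesaurus hits; the thesaurus entries below are illustrative stand-ins for the real merchandish_list:

```python
# Hypothetical sketch of step one: count how many words of a segmented
# sentence appear in the e-commerce thesaurus; a non-zero count means
# the sentence is consumption-related. Entries are illustrative only.

MERCHANDISE_LIST = {"手机", "买", "店铺", "双十一", "优惠券", "buy", "price"}

def is_consumption_related(words):
    """Return (flag, related): flag counts thesaurus hits."""
    flag = sum(1 for w in words if w in MERCHANDISE_LIST)
    return flag, flag != 0

flag, related = is_consumption_related(["我", "想", "买", "手机"])
```

Sentences with flag == 0 are discarded before the later steps run.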
Step two: the type of the commodity appearing in the sentence is judged.
When processing the input text, named entity recognition is performed with LTP. After the data preprocessing above, part-of-speech tagging is applied to the segmented sentence with the Postagger tool and stored in the parameter postags; LTP adopts the BIESO tagging scheme, where B marks an entity-initial word, I an entity-internal word, E an entity-final word, S a single-word entity, and O a word that is not part of a named entity. Named entity recognition is performed after part-of-speech tagging with Postagger. The named entity types provided by LTP are: person name (Nh), place name (Ns) and organization name (Ni). After initialization, Pyltp's named entity recognition model ner_model is loaded to recognize the data, and the results are stored in the parameter netags.
After recognition succeeds, the data is stored as a ternary list zip(words, postags, netags): the first element words is the word, the second postags its part of speech, and the third netags its named entity type. The commodities in the input sentence are then extracted according to a commodity dictionary, which covers the commodities the input may involve.
The consumption attitude of the persons in the ternary list zip(words, postags, netags) is judged by searching a positive dictionary and a negative dictionary. The positive (pos_dic) and negative (neg_dic) dictionaries are two-column lists: the first column is the word and the second its emotion weight. A positive weight means a positive attitude, and the more pronounced the attitude, the larger the weight; a negative weight means a negative attitude, and the more pronounced the attitude, the larger the weight's absolute value. The input sequence is searched against both dictionaries, and every occurrence of a positive or negative word contributes its weight, accumulated into the parameter score.
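The dictionary lookup can be sketched as follows; the pos_dic/neg_dic entries and weights here are made up for illustration (the patent's real dictionaries are far larger):

```python
# Hypothetical sketch of the sentiment-weight accumulation: each word
# found in the positive or negative dictionary adds its emotion weight
# to the sentence's score. Dictionary contents are illustrative.

pos_dic = {"喜欢": 2.0, "好": 1.0}    # positive words, positive weights
neg_dic = {"讨厌": -2.0, "差": -1.0}  # negative words, negative weights

def sentence_score(words):
    score = 0.0
    for w in words:
        score += pos_dic.get(w, 0.0) + neg_dic.get(w, 0.0)
    return score
```

A positive accumulated score signals a supportive attitude, a negative one an opposing attitude.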
Step three: determine the emotional attitude of the user in the sentence
3.1 Building the BiLSTM depth word vector feature model
The model is built with the Keras interface of TensorFlow. It consists of an input layer, a word embedding layer, a BiLSTM layer, an aggregation layer, a max-pooling layer, a fully connected layer and a classification layer, where the output of each layer is the input of the next. The model yields depth word vector features for emotion classification; the structure of the BiLSTM model is shown in FIG. 2.
3.1.1 input layer
This layer is the input part of the model: a piece of text T from the corpus is fed in for subsequent processing. Suppose there is a document whose text T is "the new package feels good". After data preprocessing, the input text may be represented as T = ['new', 'package', 'feel', 'also', 'good'].
3.1.2 word embedding layer
The corpus is trained with word2vec to obtain a context vector list; the word vector of each word of the input text is looked up in this list, and the vectors are stacked. In this way the input sequence T can be expressed as:

$$Z=\begin{bmatrix} z_{11} & z_{12} & \cdots & z_{1m}\\ z_{21} & z_{22} & \cdots & z_{2m}\\ \vdots & \vdots & & \vdots\\ z_{n1} & z_{n2} & \cdots & z_{nm} \end{bmatrix}$$

where the $i$-th row of $Z$ is the $m$-dimensional word vector of the $i$-th word of the input text T.
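The lookup itself is simple table indexing; in this sketch a tiny hand-made table stands in for a word2vec-trained context vector list, and unknown words map to zero vectors (an assumption, since the patent does not specify out-of-vocabulary handling):

```python
# Illustrative embedding lookup: replace each token of the preprocessed
# text with its m-dimensional vector, producing the matrix Z row by row.
# The table and its values are made up for demonstration.

embedding_table = {"new": [0.1, 0.2], "package": [0.0, 0.5],
                   "feel": [0.3, 0.1], "good": [0.9, 0.4]}

def embed(tokens, table, m=2):
    # unknown words fall back to the zero vector of dimension m
    return [table.get(tok, [0.0] * m) for tok in tokens]

Z = embed(["new", "package", "feel", "also", "good"], embedding_table)
```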
3.1.3 BiLSTM layer
This layer performs feature extraction. Two LSTM networks read the same input in opposite directions, which better captures long-range dependencies within the sentence and the deep semantics of the text as a whole. The advantage of the LSTM lies in its three gate functions, the input gate, forget gate and output gate, which control the network's memory. The forward computation of a single LSTM memory cell at time t is as follows.
Forget gate mechanism:

$$f_t=\sigma(W_f\cdot[h_{t-1},x_t]+b_f)$$

Input gate mechanism:

$$i_t=\sigma(W_i\cdot[h_{t-1},x_t]+b_i)$$

$$\tilde{C}_t=\tanh(W_C\cdot[h_{t-1},x_t]+b_C)$$

$$C_t=f_t\odot C_{t-1}+i_t\odot\tilde{C}_t$$

Output gate mechanism:

$$o_t=\sigma(W_o\cdot[h_{t-1},x_t]+b_o)$$

$$Z_t=o_t\odot\tanh(C_t)$$

where $\{W_*,b_*\}$ is the parameter set learned in training and $\sigma(x)=1/(1+e^{-x})$ is the sigmoid activation.
Here $f_t$, $i_t$ and $o_t$ denote the activations of the forget gate, input gate and output gate of the memory cell at time t; $h_{t-1}$ and $x_t$ denote the previous cell output and the current input; $C_t$ denotes the internal state of the memory cell at time t; and $Z_t$ denotes the output of the memory cell at time t. On this basis, the layer computes:
$$\overrightarrow{h}_t=\mathrm{LSTM}_f(x_t),\qquad \overleftarrow{h}_t=\mathrm{LSTM}_b(x_t)$$

where $\mathrm{LSTM}_f$ and $\mathrm{LSTM}_b$ denote the forward and backward passes of the LSTM network, and $\overrightarrow{h}_t$, $\overleftarrow{h}_t$ are the output vectors of the forward and backward LSTM at time t. After the bidirectional LSTM layer, the embedded matrix Z becomes:

$$Z_f=\begin{bmatrix}\overrightarrow{h}_1\\ \vdots\\ \overrightarrow{h}_n\end{bmatrix}\in\mathbb{R}^{n\times c},\qquad Z_b=\begin{bmatrix}\overleftarrow{h}_1\\ \vdots\\ \overleftarrow{h}_n\end{bmatrix}\in\mathbb{R}^{n\times c}$$
the column number c here represents the number of neurons in the LSTM unit.
3.1.4 Aggregation layer
This layer concatenates the forward and backward output vectors produced by the previous layer:

$$Z_t=[\overrightarrow{h}_t;\overleftarrow{h}_t]\in\mathbb{R}^{2c}$$

After the aggregation layer, the outputs $Z_f$ and $Z_b$ of the previous layer are integrated into the form:

$$Z=\begin{bmatrix}Z_1\\ \vdots\\ Z_n\end{bmatrix}\in\mathbb{R}^{n\times 2c}$$
3.1.5 Max-pooling layer
This layer performs a max-pooling operation to extract the most significant feature values in the vectors, which to some extent reduces the impact of data sparsity on classifier performance. Moreover, since different input texts contain different numbers of words, the pooling operation also yields a fixed-length feature vector $M_t$, computed as:

$$M_t=\max_{1\le i\le c} Z_t(i)$$

The concrete operation is shown in FIG. 3: the left rectangle represents the matrix obtained from the aggregation layer; with a pooling window of width and height 2 and stride 2, pooling turns the original matrix into the matrix shown by the right rectangle.
Thus, the feature extraction work of the text data of one document is completed.
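The 2x2, stride-2 pooling of FIG. 3 can be sketched in a few lines; the input values are illustrative:

```python
# Illustrative 2x2 max pooling with stride 2, as depicted in FIG. 3:
# each non-overlapping 2x2 block of the input matrix is reduced to its
# maximum value, halving both dimensions.

def max_pool_2x2(mat):
    rows, cols = len(mat), len(mat[0])
    return [[max(mat[r][c], mat[r][c + 1], mat[r + 1][c], mat[r + 1][c + 1])
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]

pooled = max_pool_2x2([[1, 3, 2, 4],
                       [5, 0, 1, 1],
                       [0, 2, 6, 8],
                       [3, 1, 7, 2]])
# each 2x2 block collapses to its maximum
```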
3.1.6 Fully connected layer
The process above describes the feature extraction of BiLSTM; at the fully connected layer, the features of all documents are gathered into the depth word vector feature M finally used for emotion classification:

$$M=\{M_1,M_2,\dots,M_n\}$$

where $M_i$ ($1\le i\le n$) denotes the depth word vector feature of the $i$-th document.
3.1.7 Classification layer
During training, the classification layer applies a softmax function to the feature M output by the fully connected layer to produce the emotional tendency category (positive 1, neutral 0, negative -1); the network parameters are updated by gradient descent with the backpropagation algorithm.
3.2 Training the BiLSTM depth word vector feature model
The word-embedding results are fed into the model, with the number of iterations and the batch size set. Once the gradient updates of the whole BiLSTM neural network model have converged, the depth word vector features (i.e. the features output by the fully connected layer) can be extracted and used as emotion classification features; the specific processing algorithm is shown in FIG. 4.
3.3 Classification and discrimination of depth word vector features based on SVM model
After the gradient updates of the BiLSTM neural network model converge, the depth word vector features, i.e. the output of the fully connected layer, are obtained as emotion classification features; the SVM then performs model training and classification judgment on the depth word vector features of the training-set and test-set samples. FIG. 5 shows the overall scheme for judging the emotional attitude of the user in a sentence.
The SVM is a classification algorithm based on the structural risk minimization principle: it finds a separating plane that places the data of the two classes on opposite sides. As a predictor with good generalization ability, the SVM is widely used in face recognition, text classification and other fields. However, the emotion classification task here is a three-class problem, for which a single traditional SVM is not directly applicable; therefore one-vs-one SVMs are used, with one SVM classifier designed between each pair of classes, three classifiers in total. When judging the emotional tendency of an unknown sample testcase, the three SVM classifiers vote, and the class with the most votes is the sample's emotion category. The voting procedure is as follows.
pos, Neu and Neg are training samples of positive, neutral and negative categories respectively, and generate three classifier classifiers after training1,classifier2,classifier3
classifier1=SVM(Pos,Neu)
classifier2=SVM(Pos,Neg)
classifier3=SVM(Neu,Neg)
Initializing Pos Neu Neg 0, and predicting the emotion classification of the unknown sample testcase according to the following formula:
Figure BDA0003374242880000061
Figure BDA0003374242880000062
Figure BDA0003374242880000063
the final emotion category of testcase, namely, table, is:
lable=max(Pos,Neg,Neu)
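The voting scheme can be sketched with stand-in pairwise classifiers; plain callables substitute here for the trained SVMs, purely for illustration:

```python
# One-vs-one voting over three pairwise classifiers. Each stand-in
# classifier is a callable returning one of the two classes it was
# (hypothetically) trained on; real ones would be pairwise SVMs.

POS, NEU, NEG = "Pos", "Neu", "Neg"

def vote(testcase, clf_pos_neu, clf_pos_neg, clf_neu_neg):
    votes = {POS: 0, NEU: 0, NEG: 0}
    votes[clf_pos_neu(testcase)] += 1   # classifier1: Pos vs Neu
    votes[clf_pos_neg(testcase)] += 1   # classifier2: Pos vs Neg
    votes[clf_neu_neg(testcase)] += 1   # classifier3: Neu vs Neg
    return max(votes, key=votes.get)    # label with the most votes

label = vote("x", lambda t: POS, lambda t: POS, lambda t: NEU)
```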
step four: calculating and judging consumption tendency of user in sentence
Let the input matrix after data preprocessing have m rows, with the longest row of length n. From the preceding steps, the consumption weight of the jth row (sentence) is

$$score_j=\sum_k w_{jk}$$

i.e. the sum of the consumption weights $w_{jk}$ of all persons in that row. From step three, the emotional tendency of the jth sentence is $x_j$:

$$x_j=\mathrm{emotendency}(line_j),\quad j\in\{1,\dots,m\},\ x_j\in\{-1,0,1\}$$

Let the consumption emotional tendency of the jth sentence be $output_j$; then

$$output_j=score_j\times x_j$$

That is, the consumption emotional tendency of the jth sentence is the product of the sentence's consumption weight and its emotional tendency. A positive output indicates a positive emotional attitude in the sentence, and the larger the result, the more pronounced the attitude; a negative output indicates a negative attitude, and the larger its absolute value, the more pronounced; an output near 0 indicates a neutral attitude.
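In miniature, and assuming the emotional tendency takes the values 1/0/-1 produced by the classification layer, step four reduces to a single multiplication:

```python
# Step four in miniature: a sentence's consumption tendency is the
# product of its commodity-weight sum (positive for one commodity type,
# negative for the other) and its emotional tendency (1/0/-1).
# The numeric values below are illustrative.

def sentence_tendency(commodity_weight_sum, emotion):
    return commodity_weight_sum * emotion

# quality-type commodity (positive weight) + positive attitude
out = sentence_tendency(2.5, 1)
```

The sign of the result indicates which commodity type the sentence leans toward; its magnitude indicates how strongly.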
Step five: calculating and judging consumption tendency of users in the whole text
The whole text is split into sentences; the consumption tendency of each single sentence is computed via step four in turn, and from these per-sentence results the user's consumption tendency over the whole text is obtained.
Two thresholds a and b are set, with 0 < a < b < 1, and the proportion of sentences showing a given tendency (say tendency A) in the full text is computed. If the proportion lies between b and 1, the user is considered inclined toward A; if it lies between a and b, the user's consumption tendency is considered neutral; if it lies between 0 and a, the user is considered averse to tendency A. The values of the thresholds a and b are determined experimentally according to accuracy.
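A sketch of the full-text decision, with placeholder thresholds a = 0.3 and b = 0.6 (the patent determines them experimentally):

```python
# Step five in miniature: classify the whole text by the proportion of
# sentences showing tendency A (positive per-sentence outputs here).
# Threshold values are placeholders, to be tuned experimentally.

def text_tendency(sentence_outputs, a=0.3, b=0.6):
    ratio = sum(1 for o in sentence_outputs if o > 0) / len(sentence_outputs)
    if ratio > b:
        return "prefers A"
    if ratio >= a:
        return "neutral"
    return "dislikes A"
```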
Techniques not described in detail herein are known in the art.
The above embodiments merely illustrate the technical ideas and features of the present invention; their purpose is to enable those skilled in the art to understand and implement the invention, not to limit its scope of protection. All equivalent changes and modifications made according to the spirit of the present invention shall fall within the protection scope of the present invention.

Claims (4)

1. A consumption tendency analysis method based on BiLSTM and SVM, characterized by comprising the following steps:
Step one: preprocess the data and judge through a Bag-of-words model whether a single sentence is related to consumption. The input is first preprocessed with jieba: word segmentation and removal of stop words and stop symbols. If any preprocessed word exists in the consumption-related dictionary, the sentence is judged related to consumption; otherwise it is judged unrelated and no further operation is performed.
Step two: judge the type of the commodities appearing in the sentence. Named entity recognition is performed with LTP to find the commodities. Commodities of the two tendencies are then represented by positive and negative scores respectively; for example, a quality-oriented commodity is given a positive score and a budget-priced commodity a negative score.
Step three: judge through the BiLSTM and SVM models whether the emotional attitude in the sentence is support or opposition. The BiLSTM model first extracts depth word vector features for emotion classification, and the SVM model then classifies and judges those features.
Step four: compute and judge the consumption tendency of the user in the sentence, i.e. which kind of product the user prefers to buy. The scores of the previous two steps are multiplied; a positive result indicates that the user prefers the product type represented by positive scores, and vice versa.
Step five: count and judge the user's consumption tendency over the whole text. The text is split into sentences and the consumption tendency of each sentence is computed via step four in turn. Two thresholds are set: when the proportion of sentences inclined toward a certain product exceeds the larger threshold, the user is judged to prefer that product; when it falls below the smaller threshold, the user is judged to dislike it; when it lies between the two thresholds, the user is judged neutral.
2. The method of claim 1, wherein in step two, after the type of commodity appearing in the sentence is determined, a positive or negative score is assigned according to its tendency. For example, to analyze whether a user pursues quality or low price, two commodity vocabularies are compiled, one for quality-type goods and one for budget-type goods. Each vocabulary has two columns: the first is the word and the second is its score. Quality-type scores are positive, and the higher the price, the larger the score; budget-type scores are negative, and the lower the price, the larger the absolute value of the score. After word segmentation, if a commodity from either vocabulary appears, its corresponding score is retrieved.
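The two-column vocabulary described in claim 2 can be kept as plain text, one word/score pair per line, and loaded into a lookup table; the entries below are illustrative examples, not the patent's actual vocabularies:

```python
# Illustrative two-column vocabulary (word, score) combining quality-type
# entries (positive) and budget-type entries (negative), as in claim 2.
RAW_VOCAB = """旗舰手机 3
名牌包 2
平价手机 -3
特价包 -2"""

def load_vocab(text):
    """Parse 'word score' lines into a dict usable for the step-two lookup."""
    vocab = {}
    for line in text.splitlines():
        word, score = line.split()
        vocab[word] = int(score)
    return vocab
```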
3. The method of claim 1, wherein in step three a bidirectional long short-term memory network (BiLSTM) and an SVM determine the emotional attitude of the sentence, i.e. positive or negative. The BiLSTM model is first trained until its gradient updates converge, and the features output by its fully-connected layer are taken as the deep word-vector features for emotion classification. An SVM then performs model training and classification judgment on the deep word-vector features of the samples in the training and test sets.
4. The method of claim 1, wherein in step four the scores of the two previous steps are multiplied. A positive result indicates that the user prefers the goods represented by positive scores, and the larger the result, the more marked the attitude; a negative result indicates that the user prefers the goods represented by negative scores, and the larger the absolute value of the result, the more marked the attitude.
CN202111416830.9A 2021-11-26 2021-11-26 Consumption tendency analysis method based on BilSTM and SVM Pending CN114168730A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111416830.9A CN114168730A (en) 2021-11-26 2021-11-26 Consumption tendency analysis method based on BilSTM and SVM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111416830.9A CN114168730A (en) 2021-11-26 2021-11-26 Consumption tendency analysis method based on BilSTM and SVM

Publications (1)

Publication Number Publication Date
CN114168730A true CN114168730A (en) 2022-03-11

Family

ID=80480828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111416830.9A Pending CN114168730A (en) 2021-11-26 2021-11-26 Consumption tendency analysis method based on BilSTM and SVM

Country Status (1)

Country Link
CN (1) CN114168730A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015043075A1 (en) * 2013-09-29 2015-04-02 Guangdong University of Technology Microblog-oriented emotional entity search system
US20150095330A1 (en) * 2013-10-01 2015-04-02 TCL Research America Inc. Enhanced recommender system and method
WO2015103695A1 (en) * 2014-01-10 2015-07-16 Cluep Inc. Systems, devices, and methods for automatic detection of feelings in text
US9336192B1 (en) * 2012-11-28 2016-05-10 Lexalytics, Inc. Methods for analyzing text
CN107153642A (en) * 2017-05-16 2017-09-12 华北电力大学 A kind of analysis method based on neural network recognization text comments Sentiment orientation
CN110879938A (en) * 2019-11-14 2020-03-13 中国联合网络通信集团有限公司 Text emotion classification method, device, equipment and storage medium
CN111914096A (en) * 2020-07-06 2020-11-10 同济大学 Public transport passenger satisfaction evaluation method and system based on public opinion knowledge graph
KR20210094461A (en) * 2020-01-21 2021-07-29 김종호 System and method extracting information according to experience of product

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Jianbing; Liu Licai: "Sentiment analysis algorithm for film review texts based on an improved neural network", Computer Engineering and Science, no. 12, 15 December 2019 (2019-12-15) *

Similar Documents

Publication Publication Date Title
CN110609897B (en) Multi-category Chinese text classification method integrating global and local features
CN107608956B (en) Reader emotion distribution prediction algorithm based on CNN-GRNN
CN102831184B (en) According to the method and system text description of social event being predicted to social affection
CN111260437B (en) Product recommendation method based on commodity-aspect-level emotion mining and fuzzy decision
CN110489523B (en) Fine-grained emotion analysis method based on online shopping evaluation
CN114330354B (en) Event extraction method and device based on vocabulary enhancement and storage medium
CN112487189B (en) Implicit discourse text relation classification method for graph-volume network enhancement
CN110119849B (en) Personality trait prediction method and system based on network behaviors
CN112862569B (en) Product appearance style evaluation method and system based on image and text multi-modal data
CN114238577B (en) Multi-task learning emotion classification method integrating multi-head attention mechanism
CN107818173B (en) Vector space model-based Chinese false comment filtering method
CN111966888B (en) Aspect class-based interpretability recommendation method and system for fusing external data
CN112182145A (en) Text similarity determination method, device, equipment and storage medium
CN109584006A (en) A kind of cross-platform goods matching method based on depth Matching Model
CN113326374A (en) Short text emotion classification method and system based on feature enhancement
CN111353044A (en) Comment-based emotion analysis method and system
CN114942974A (en) E-commerce platform commodity user evaluation emotional tendency classification method
CN111061939A (en) Scientific research academic news keyword matching recommendation method based on deep learning
CN111400449A (en) Regular expression extraction method and device
Kim et al. Accurate and prompt answering framework based on customer reviews and question-answer pairs
CN111666410B (en) Emotion classification method and system for commodity user comment text
CN113761910A (en) Comment text fine-grained emotion analysis method integrating emotional characteristics
CN111414755A (en) Network emotion analysis method based on fine-grained emotion dictionary
CN115906824A (en) Text fine-grained emotion analysis method, system, medium and computing equipment
CN114168730A (en) Consumption tendency analysis method based on BilSTM and SVM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination