CN107944014A

CN107944014A - A Chinese text sentiment analysis method based on deep learning

Info

Publication number: CN107944014A
Application number: CN201711307041.5A
Authority: CN
Inventors: 严勤; 丁聪; 陈葛恒; 肖丽莎
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2017-12-11
Filing date: 2017-12-11
Publication date: 2018-04-20

Abstract

The invention discloses a Chinese text sentiment analysis method based on deep learning. Network text is acquired, and a conversion from Chinese sentences to mathematical vectors is designed: a word vector dictionary is first built by combining Chinese word segmentation with a word vector learning tool; sentences are then converted into sentence vectors with an LSTM-MP model; finally, a Softmax classifier assigns each sentence vector a positive or negative sentiment class, which achieves the purpose of sentiment analysis. The algorithm has high classification accuracy, is efficient and flexible, and avoids much of the manual work required by supervised learning methods. It effectively improves the efficiency and accuracy of text sentiment orientation classification, and its high degree of automation and integration saves a large amount of manpower.

Description

A Chinese text sentiment analysis method based on deep learning

Technical field

The present invention relates to a Chinese text sentiment analysis method based on deep learning, and belongs to the technical fields of natural language processing and deep learning.

Background

The rapid development of the internet has made microblogs and social networks popular forms of communication. Every day, hundreds of millions of messages reflecting people's opinions and attitudes are published and shared through platforms such as Twitter and Facebook, which provides an opportunity to monitor and analyze the opinions and sentiment of individuals, enterprises, and the public sphere.

Text sentiment analysis is a technique that analyzes people's opinions, emotions, and attitudes, and their sentiment orientation toward entities such as products, services, organizations, and events, and then summarizes the resulting information. Extracting valuable sentiment and opinions from the massive data produced by online media and analyzing them accurately has practical value in many fields. For example: enterprises can adjust after-sales service and formulate marketing strategy according to the sentiment of feedback on their products; governments can analyze the sentiment of large volumes of text on social platforms to make policies that better meet public needs; in finance, sentiment and opinions extracted from financial news can be mined to predict trends in a given financial market.

Research on (English) text sentiment analysis started earlier abroad, and its results are relatively mature. For example, Turney and Pang each classified the polarity of product and movie reviews with different methods, and Pang and Snyder subsequently tried combinations of several methods. In addition, Pang and Lee extended polarity classification of movie reviews to star-rating prediction, while Snyder performed an in-depth analysis of restaurant reviews to predict ratings for individual aspects of a restaurant, such as food and ambience (5 levels in total). Gruhl et al. predicted book sales trends by analyzing the sentiment of online chat, and Mishne et al. captured sentiment-bearing information from blogs to predict box-office receipts. However, because of the structural differences between English and Chinese text, sentiment analysis of Chinese text is much harder. Together with a late start, limited annotated corpora, and lagging technical methods, this leaves considerable room for improvement in Chinese text sentiment analysis.

At present, Chinese text sentiment analysis methods are mostly rule-based or based on supervised machine learning. Their limitations include the following: (1) because linguistic knowledge and rules differ from person to person, the formulation of sentiment judgment rules is limited by the research level of whoever formulates them; (2) some methods select features manually, based on experience, when extracting sentence features, so the analysis results are strongly influenced by human factors.

Summary of the invention

To solve the above technical problems, the present invention provides a Chinese text sentiment analysis method based on deep learning.

To achieve the above object, the technical solution adopted by the present invention is as follows:

A Chinese text sentiment analysis method based on deep learning comprises the following steps:

Step 1: train the LSTM-MP model and the Softmax classifier.

The detailed process is as follows:

Acquire network text;

Preprocess the acquired network text to obtain the Chinese sentences in it;

Perform Chinese word segmentation on the Chinese sentences and build a word vector dictionary;

Manually label some of the Chinese sentences as LSTM-MP model experiment data, and use the remaining Chinese sentences as LSTM-MP model training data;

Train the LSTM-MP model with the LSTM-MP model training data;

Convert all of the LSTM-MP model experiment data into sentence vectors with the trained LSTM-MP model;

Use some of the sentence vectors as Softmax classifier training data and the remaining sentence vectors as Softmax classifier test data;

Train the Softmax classifier with the Softmax classifier training data, and test the trained Softmax classifier with the Softmax classifier test data.

Step 2: perform sentiment analysis with the trained LSTM-MP model and Softmax classifier.

A multithreaded crawler is designed to acquire network text, as follows:

Select the homepage URLs of suitable websites to initialize the crawler's URL list;

Fetch the HTML document of each website homepage, parse the URLs of the messages in the HTML document, de-duplicate these URLs, and add them to the URL list;

If newly published messages appear, add the URLs of the new messages to the URL list;

Fetch the corresponding HTML document for each URL;

Apply information extraction to each fetched HTML document, extract the body text of the page, and store it in a local database in a predefined format.

The network text is preprocessed by replacing the escape characters in the network text and replacing the nonstandard punctuation marks in the network text.

Chinese word segmentation is performed on the Chinese sentences and a word vector dictionary is built, as follows:

Perform Chinese word segmentation on the Chinese sentences;

Tune the word vector learning tool;

Feed the Chinese words obtained by word segmentation into the word vector learning tool to build the word vector dictionary.

The best matching method is used for Chinese word segmentation.

The LSTM-MP model training data are converted into word vector sequences, which are then used to train the LSTM-MP model;

The LSTM-MP model experiment data are converted into word vector sequences, which the trained LSTM-MP model then converts into sentence vectors.

Sentiment analysis with the trained LSTM-MP model and Softmax classifier proceeds as follows:

Acquire the network text to be analyzed;

Preprocess the network text to be analyzed to obtain the Chinese sentences to be analyzed;

Perform Chinese word segmentation on the Chinese sentences to be analyzed and build a word vector dictionary;

Convert the Chinese sentences to be analyzed into word vector sequences;

Convert the word vector sequences into sentence vectors with the trained LSTM-MP model;

Classify the sentiment of the sentence vectors with the trained Softmax classifier.

Beneficial effects of the present invention: the present invention acquires network text and designs a conversion from Chinese sentences to mathematical vectors. A word vector dictionary is first built by combining Chinese word segmentation with a word vector learning tool; sentences are then converted into sentence vectors with the LSTM-MP model; finally, a Softmax classifier assigns each sentence vector a positive or negative sentiment class, which achieves the purpose of sentiment analysis. The algorithm has high classification accuracy, is efficient and flexible, and avoids much of the manual work required by supervised learning methods. It effectively improves the efficiency and accuracy of text sentiment orientation classification, and its high degree of automation and integration saves a large amount of manpower.

Brief description of the drawings

Fig. 1 is a flow chart of the present invention;

Fig. 2 is a structure diagram of the multithreaded crawler;

Fig. 3 is a structure diagram of the recursive neural network;

Fig. 4 is a structure diagram of the LSTM;

Fig. 5 is a structure diagram of the LSTM-MP model.

Detailed description of the embodiments

The invention is further described below with reference to the accompanying drawings. The following embodiments serve only to illustrate the technical solution of the present invention more clearly and are not intended to limit its scope of protection.

As shown in Fig. 1, a Chinese text sentiment analysis method based on deep learning comprises the following steps:

Step 1: train the LSTM-MP model and the Softmax classifier.

The detailed process is as follows:

A) Design a multithreaded crawler to acquire network text (as shown in Fig. 2). The specific steps are as follows:

A1) Select the homepage URLs of suitable websites to initialize the crawler's URL list.

After a survey, text sources that carry sentiment and opinions were selected, such as Baidu News, Sina Finance, and product reviews from stores such as JD.com, and the crawler's URL list was initialized with the homepage URLs of these websites.

A2) Fetch the HTML document of each website homepage and parse the URLs of the messages in it. To prevent duplicate data from affecting the analysis results, de-duplicate these URLs before adding them to the URL list. If newly published messages appear, add the URLs of the new messages to the URL list.

A3) Fetch the corresponding HTML document for each URL. The queue manager schedules downloads by assigning the URLs in the download queue to threads according to which threads are idle.

A4) Apply information extraction to each fetched HTML document, extract the body text of the page, and store it in a local database in a predefined format.
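The crawl loop of steps A1 to A4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `fetch`, `extract_links`, and `extract_body` are hypothetical caller-supplied functions (the patent does not name its download or extraction tooling), a plain dictionary stands in for the local database, and Python's standard `ThreadPoolExecutor` stands in for the queue manager that hands URLs to idle threads.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl(seed_urls, fetch, extract_links, extract_body,
          max_workers=4, max_rounds=3):
    """Round-based crawl with URL de-duplication.

    fetch(url) -> HTML text, extract_links(html) -> iterable of message URLs,
    extract_body(html) -> body text. All three are hypothetical callables.
    """
    frontier = list(dict.fromkeys(seed_urls))  # A1: initial URL list, no repeats
    seen = set(frontier)                       # every URL ever queued
    store = {}                                 # stands in for the local database

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(max_rounds):
            if not frontier:
                break
            # A3: the pool hands each queued URL to an idle worker thread.
            pages = list(pool.map(fetch, frontier))
            next_frontier = []
            for url, html in zip(frontier, pages):
                store[url] = extract_body(html)    # A4: keep the body text
                for link in extract_links(html):   # A2: parse message URLs
                    if link not in seen:           # A2: de-duplicate
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return store
```

Keeping a single `seen` set means a URL is downloaded at most once even when several pages link to it, which is the de-duplication that step A2 asks for.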

B) Preprocess the acquired network text (HTML text) to obtain the Chinese sentences in it.

Some special characters are escaped so that they can be distinguished from the reserved symbols of HTML, so the escape sequences in the fetched HTML text need to be replaced, as shown in Table 1.

Table 1: Escape symbol replacement

Chinese has no separators between words. In addition, network text is not well standardized: Chinese and English punctuation marks are mixed, as are full-width and half-width characters, which makes word segmentation difficult. To improve segmentation accuracy, the nonstandard punctuation marks in the network text are replaced, as shown in Table 2.

Table 2: Punctuation mark replacement

Punctuation before replacement	Punctuation after replacement
【】	[]
,	,
.	。
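The two replacement passes can be sketched as below. The sketch rests on assumptions: Python's standard `html.unescape` stands in for the Table 1 escape replacements, and only the Table 2 rows whose direction is unambiguous in this text (square brackets and the full stop) are mapped. The names `PUNCTUATION_MAP` and `preprocess` are illustrative.

```python
import html

# Table 2 rows that are unambiguous in this text (assumed mapping).
PUNCTUATION_MAP = str.maketrans({"【": "[", "】": "]", ".": "。"})


def preprocess(raw_text):
    """Undo HTML escaping, then normalise nonstandard punctuation."""
    text = html.unescape(raw_text)        # e.g. "&amp;" -> "&", "&lt;" -> "<"
    return text.translate(PUNCTUATION_MAP)
```

Note that mapping every "." to "。" also rewrites decimal points; a production version would probably need to protect numbers first.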

C) Perform Chinese word segmentation on the Chinese sentences and build a word vector dictionary. The specific steps are as follows:

C1) Use the best matching method for Chinese word segmentation. The best matching method is forward maximum matching with an efficiency optimization. The basic steps of forward maximum matching are: first set the maximum word length to 4; then, scanning from left to right, try to match a 4-character string against the dictionary. If the match succeeds, move on to the next position; if it fails, drop one character from the end and try again, until a match succeeds or only a single character remains. Best matching additionally orders the dictionary entries by probability of occurrence, with more probable words placed first, which makes word lookup more efficient.
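Forward maximum matching as described in C1 can be sketched as follows. This is a minimal sketch, assuming the dictionary is held in a Python set; with hash-based lookup, the frequency ordering of the best matching method is not needed for speed, so it is not reproduced here. The function name `forward_max_match` is illustrative.

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Segment `sentence` left to right, always taking the longest match.

    `dictionary` is any container of known words (a set in this sketch).
    """
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, then drop one character from the end.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # A single remaining character is always accepted as a word.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words
```

For example, with the dictionary {"研究", "研究生", "生命", "起源"}, the sentence "研究生命起源" is segmented as 研究生 / 命 / 起源, which shows the greedy behaviour of the method: the longest prefix wins even when a shorter one would give a better overall split.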

C2) Tune the word vector learning tool.

Before word vectors are learned, the learning tool word2vec needs to be tuned by testing different parameters (sampling rate, number of threads, window size, and so on) repeatedly. The optimal parameters determined during tuning are: window size 5, sampling threshold 500, number of threads 12, and word vector dimension 50.

C3) Feed the Chinese words obtained by word segmentation into the word vector learning tool word2vec to build the word vector dictionary. The specific command is: ./word2vec -train /home/exer/gold.txt -output /home/exer/golddic.txt -cbow 0 -size 50 -window 5 -negative 0 -hs 1 -sample 500 -thread 12 -binary 0
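With `-binary 0`, word2vec writes a text file whose first line holds the vocabulary size and the vector dimension, followed by one line per word: the word, then its vector components separated by spaces. A minimal sketch of loading such a file into a lookup table is shown below; the function name `load_word_vectors` and the sample values are illustrative and not taken from the patent.

```python
import io


def load_word_vectors(lines):
    """Parse word2vec text output into ({word: [float, ...]}, dimension)."""
    it = iter(lines)
    _vocab_size, dim = (int(x) for x in next(it).split())
    vectors = {}
    for line in it:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


# Tiny in-memory stand-in for a file such as golddic.txt (values are made up).
sample = io.StringIO("2 3\n好 0.1 0.2 0.3 \n坏 -0.1 0.0 0.5\n")
vectors, dim = load_word_vectors(sample)
print(dim, vectors["好"])   # 3 [0.1, 0.2, 0.3]
```

For a real output file, the same function can be given an open file object, for example one opened with `encoding="utf-8"`.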

D) Manually label some of the Chinese sentences as LSTM-MP model experiment data, and use the remaining Chinese sentences as LSTM-MP model training data.

For example, given 48786 Chinese sentences, 34150 are used as LSTM-MP model training data and 14636 as LSTM-MP model experiment data, so the experiment data account for about 30% of the total.
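The split sizes in the example follow from taking 30% of 48786 and rounding. A minimal sketch is given below; shuffling with a fixed seed before splitting is an assumption of this sketch, since the patent does not say how the sentences to be labeled are chosen, and the name `split_sentences` is illustrative.

```python
import random


def split_sentences(sentences, experiment_percent=30, seed=0):
    """Return (training_data, experiment_data) for the LSTM-MP model."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)   # assumed: random selection
    n_experiment = round(len(shuffled) * experiment_percent / 100)
    return shuffled[n_experiment:], shuffled[:n_experiment]


training, experiment = split_sentences(range(48786))
print(len(training), len(experiment))   # 34150 14636
```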

E) Train the LSTM-MP model with the LSTM-MP model training data. The specific steps are as follows:

E1) LSTM-MP model design.

The LSTM model is an improvement on the recursive neural network. Each sentence consists of a varying number of words. Suppose a sentence has n words and every word vector has a fixed dimension m; the sentence is then represented by n × m values. Because n varies, n × m is not fixed, so an ordinary feedforward neural network cannot be trained on such input. The recursive neural network solves this problem well: it allows the words of a sentence to be fed in recursively, one after another.

The recursive neural network has its own problem, however: when a sentence is long, the number of recursive layers becomes large enough to cause the gradients of the whole network to vanish or explode. LSTM addresses this problem. It inherits the structure of the recursive network model while compensating for its shortcoming by introducing memory units as computing nodes, which avoids the vanishing gradient problem. A memory unit consists of 4 parts: an input gate, a neuron with a self-connection, an output gate, and a forget gate. The input gate determines whether the input signal can affect and change the current state of the memory unit. The output gate determines whether the state of the current memory unit affects the other units connected to it. The forget gate adjusts the self-connection to determine whether the previous state of the memory unit is forgotten (cleared).
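The interplay of the three gates and the self-connected cell can be sketched with one step of a standard LSTM memory unit. The patent gives no gate equations, so the usual formulation is assumed here: input gate i, forget gate f, and output gate o use a sigmoid, the candidate value g uses tanh, the cell state is c = f * c_prev + i * g, and the output is h = o * tanh(c). The parameter layout and the name `lstm_step` are hypothetical.

```python
import math


def _affine(W, v, b):
    """Return W v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a standard LSTM memory unit (assumed formulation).

    params maps each of "i", "f", "o", "g" to a (W, b) pair, where W has
    len(h_prev) rows and len(h_prev) + len(x) columns.
    """
    z = list(h_prev) + list(x)              # previous output joined with input
    gate = {}
    for name in ("i", "f", "o", "g"):
        W, b = params[name]
        act = math.tanh if name == "g" else _sigmoid
        gate[name] = [act(a) for a in _affine(W, z, b)]
    # Forget gate scales the old state; input gate scales the new candidate.
    c = [f * cp + i * g
         for f, cp, i, g in zip(gate["f"], c_prev, gate["i"], gate["g"])]
    # Output gate decides how much of the state is exposed to other units.
    h = [o * math.tanh(ck) for o, ck in zip(gate["o"], c)]
    return h, c
```

Because the old state enters the new one through a multiplication by the forget gate rather than through a squashing nonlinearity, the state can be carried across many steps, which is how the memory unit mitigates vanishing gradients.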

The structure of the recursive autoencoder is shown in Fig. 3. A sentence is represented as a list of vectors A = (x₁, x₂, x₃, ..., x_n), where x₁, x₂, x₃, ..., x_n are the word vectors of the sentence. First, x_n and x_{n-1} are concatenated into a 2m-dimensional vector (x_{n-1}, x_n), which feeds an input layer of 2m nodes. The hidden layer is defined to have m nodes (the same as the word vector dimension), and the output layer likewise has 2m nodes. The computation yields the hidden layer output h₁. Then h₁ from the first layer is concatenated with x_{n-2} into a new 2m-dimensional vector (x_{n-2}, h₁), which is the input to the second hidden layer. This step is repeated n-1 times, until all n word vectors have been processed, which yields a vector s for the whole sentence whose dimension is the same as that of a word vector.

In this process a sentence is decomposed into a binary tree structure. Each recursion step is a triple (P, C₁, C₂) made up of a parent node and two child nodes, where C₁ and C₂ are the words corresponding to two word vectors, for example x₁ and x₂, and P is the word corresponding to the hidden layer output computed from them. A child node of a triple may be a word vector x_i, i ∈ [1, n], or a non-terminal node. The first node vector h₁ is computed from (C₁, C₂) = (x₄, x₅), that is, from the last two word vectors of a five-word sentence, where the parentheses denote the combination of two word vectors:

P = f(W⁽¹⁾[C₁; C₂] + b⁽¹⁾)

where W⁽¹⁾ is an m × 2m coefficient matrix, b⁽¹⁾ is the bias, the activation function f is the hyperbolic tangent tanh, and P is the hidden layer output. Let [C′₁; C′₂] denote the reconstruction of the input pair; it can be expressed as:

[C′₁; C′₂] = W⁽²⁾P + b⁽²⁾

where C′₁ and C′₂ are the words corresponding to the reconstructed word vectors, W⁽²⁾ is a 2m × m coefficient matrix, and b⁽²⁾ is the reconstruction bias.

During training the reconstruction error is minimized. The error between the input and its reconstruction is measured by the Euclidean distance between them, denoted E_rec(·).
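A form of E_rec consistent with the definitions above can be written as follows. The squared norm is an assumption borrowed from the usual recursive autoencoder formulation; the source states only that the Euclidean distance between input and reconstruction is used.

```latex
E_{rec}\big([C_1; C_2]\big) = \big\lVert [C_1; C_2] - [C'_1; C'_2] \big\rVert^{2},
\qquad [C'_1; C'_2] = W^{(2)} P + b^{(2)}
```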

This completes one recursion step and yields h₁. Then (C₁, C₂) = (x₃, h₁) becomes the next input, and the above computation is repeated until the vector representation s of the sentence is obtained.
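The recursion described above can be sketched in plain Python. The sketch assumes word vectors are lists of floats, matrices are lists of rows, and the reconstruction error is the squared Euclidean distance (an assumption of this sketch). All function names are illustrative.

```python
import math


def _affine(W, v, b):
    """Return W v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def encode(c1, c2, W1, b1):
    """P = tanh(W1 [c1; c2] + b1), with W1 of shape m x 2m."""
    return [math.tanh(a) for a in _affine(W1, c1 + c2, b1)]


def reconstruct(p, W2, b2):
    """[c1'; c2'] = W2 p + b2, with W2 of shape 2m x m."""
    return _affine(W2, p, b2)


def reconstruction_error(c1, c2, p, W2, b2):
    """Squared Euclidean distance between the input pair and its reconstruction."""
    return sum((a - r) ** 2 for a, r in zip(c1 + c2, reconstruct(p, W2, b2)))


def sentence_vector(word_vectors, W1, b1):
    """Fold from the right: (x_{n-1}, x_n) -> h1, then (x_{n-2}, h1) -> h2, ..."""
    h = word_vectors[-1]
    for x in reversed(word_vectors[:-1]):
        h = encode(x, h, W1, b1)
    return h
```

Every call to `encode` maps 2m values back to m values, so the result has the word vector dimension regardless of sentence length, which is the property the text relies on.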

LSTM improves on the recursive autoencoder by introducing memory units to deal with its vanishing gradient defect. The words are fed in with the same input logic as in the recursive autoencoder; once the last word w_n has been fed in, h_n is the output of the LSTM, as shown in Fig. 4. It is generally assumed that this final output, obtained after the last word vector has been fed in, can serve as the vector representation of the whole sentence for classifier training, but experiments showed that the results obtained this way were unsatisfactory. A mean pooling layer is therefore added on top of the LSTM, as shown in Fig. 5. LSTM-MP does not take the final output h_n as the sentence vector; instead, the outputs of all LSTM memory units are fed into the mean pooling layer to obtain the final sentence vector.
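The pooling step can be sketched as follows. It is assumed here that the mean pooling layer takes the element-wise average of the n hidden outputs h_1, ..., h_n, so that s = (h_1 + ... + h_n) / n; this is the usual meaning of mean pooling, and the patent text gives no explicit formula at this point. The name `mean_pool` is illustrative.

```python
def mean_pool(hidden_states):
    """Average the per-word LSTM outputs element-wise into one sentence vector."""
    n = len(hidden_states)
    return [sum(column) / n for column in zip(*hidden_states)]


# Three hypothetical 2-dimensional outputs h_1, h_2, h_3 (values are made up).
outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(outputs))   # [3.0, 4.0]
print(outputs[-1])          # [5.0, 6.0], the final output a plain LSTM would use
```

Averaging lets every word contribute equally to the sentence vector, whereas the final output alone tends to be dominated by the last few words.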

E2) Convert the Chinese sentences into word vector sequences. After a Chinese sentence is segmented, the original word order is kept, giving a sequence of Chinese words. The word vector of each word is then looked up, one by one, in the word vector dictionary built earlier, which converts the sequence of Chinese words into a sequence of word vectors.
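The lookup in E2 can be sketched as below. The patent does not say how words missing from the dictionary are handled, so mapping them to a zero vector is an assumption of this sketch; the name `to_vector_sequence` is illustrative.

```python
def to_vector_sequence(words, dictionary, dim):
    """Map a segmented sentence to its word vectors, keeping word order."""
    unknown = [0.0] * dim   # assumed: out-of-vocabulary words become zeros
    return [dictionary.get(word, unknown) for word in words]
```

For instance, with a dictionary {"好": [0.1, 0.2]} and dimension 2, the word list ["好", "未知"] becomes [[0.1, 0.2], [0.0, 0.0]].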

E3) Train the LSTM-MP model with the word vector sequences converted from the LSTM-MP model training data.

F) Convert the LSTM-MP model experiment data into word vector sequences, then convert them into sentence vectors with the trained LSTM-MP model.

G) Use some of the sentence vectors as Softmax classifier training data and the remaining sentence vectors as Softmax classifier test data. Here 70% are used as Softmax classifier training data and 30% as Softmax classifier test data.

H) Train the Softmax classifier with the Softmax classifier training data, then test the trained Softmax classifier with the Softmax classifier test data: the test data are fed into the Softmax classifier and the classification results are compared with the manual labels. If the accuracy is not lower than a set threshold, training of the Softmax classifier is considered complete.
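Steps G and H can be sketched with a small Softmax (multinomial logistic regression) classifier. This is a minimal sketch under assumptions: training uses plain stochastic gradient descent on the cross-entropy loss, the learning rate and epoch count are arbitrary, and label 1 stands for positive and 0 for negative sentiment. None of these details are specified in the patent.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def _scores(x, W, b):
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def train_softmax(X, y, n_classes=2, lr=0.5, epochs=200):
    """Fit weights W and biases b on sentence vectors X with labels y."""
    dim = len(X[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, label in zip(X, y):
            p = softmax(_scores(x, W, b))
            for k in range(n_classes):
                # Gradient of the cross-entropy loss with respect to score k.
                grad = p[k] - (1.0 if k == label else 0.0)
                b[k] -= lr * grad
                for j in range(dim):
                    W[k][j] -= lr * grad * x[j]
    return W, b


def predict(x, W, b):
    scores = _scores(x, W, b)
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(X, y, W, b):
    """Fraction of test vectors whose predicted class matches the manual label."""
    return sum(predict(x, W, b) == label for x, label in zip(X, y)) / len(y)
```

Training would then be accepted when `accuracy(X_test, y_test, W, b)` is not lower than the chosen threshold, as step H describes.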

Step 2: perform sentiment analysis with the trained LSTM-MP model and Softmax classifier. The specific steps are:

21) Acquire the network text to be analyzed;

22) Preprocess the network text to be analyzed to obtain the Chinese sentences to be analyzed;

23) Perform Chinese word segmentation on the Chinese sentences to be analyzed and build a word vector dictionary;

24) Convert the Chinese sentences to be analyzed into word vector sequences;

25) Convert the word vector sequences into sentence vectors with the trained LSTM-MP model;

26) Classify the sentiment of the sentence vectors with the trained Softmax classifier.

The above method acquires network text and designs a conversion from Chinese sentences to mathematical vectors. A word vector dictionary is first built by combining Chinese word segmentation with a word vector learning tool; sentences are then converted into sentence vectors with the LSTM-MP model; finally, a Softmax classifier assigns each sentence vector a positive or negative sentiment class, which achieves the purpose of sentiment analysis. The algorithm has high classification accuracy, is efficient and flexible, and avoids much of the manual work required by supervised learning methods. It effectively improves the efficiency and accuracy of text sentiment orientation classification, and its high degree of automation and integration saves a large amount of manpower. The overall accuracy of the method is 78.02%. For positive sentiment, the classification accuracy is 77.58%, the recall is 81.19%, and the F-measure is 79.10%; for negative sentiment, the classification accuracy is 78.55%, the recall is 74.59%, and the F-measure is 74.55%.

The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art can make improvements and variations without departing from the technical principles of the present invention, and these improvements and variations shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A Chinese text sentiment analysis method based on deep learning, characterized in that it comprises the following steps:

Step 1: train the LSTM-MP model and the Softmax classifier.

The detailed process is as follows:

Acquire network text;

Preprocess the acquired network text to obtain the Chinese sentences in it;

Perform Chinese word segmentation on the Chinese sentences and build a word vector dictionary;

Manually label some of the Chinese sentences as LSTM-MP model experiment data, and use the remaining Chinese sentences as LSTM-MP model training data;

Train the LSTM-MP model with the LSTM-MP model training data;

Convert all of the LSTM-MP model experiment data into sentence vectors with the trained LSTM-MP model;

Use some of the sentence vectors as Softmax classifier training data and the remaining sentence vectors as Softmax classifier test data;

Train the Softmax classifier with the Softmax classifier training data, and test the trained Softmax classifier with the Softmax classifier test data.

Step 2: perform sentiment analysis with the trained LSTM-MP model and Softmax classifier.
2. The Chinese text sentiment analysis method based on deep learning according to claim 1, characterized in that a multithreaded crawler is designed to acquire network text, as follows:

Select the homepage URLs of suitable websites to initialize the crawler's URL list;

Fetch the HTML document of each website homepage, parse the URLs of the messages in the HTML document, de-duplicate these URLs, and add them to the URL list;

If newly published messages appear, add the URLs of the new messages to the URL list;

Fetch the corresponding HTML document for each URL;

Apply information extraction to each fetched HTML document, extract the body text of the page, and store it in a local database in a predefined format.
3. The Chinese text sentiment analysis method based on deep learning according to claim 1, characterized in that the network text is preprocessed by replacing the escape characters in the network text and replacing the nonstandard punctuation marks in the network text.
4. The Chinese text sentiment analysis method based on deep learning according to claim 1, characterized in that Chinese word segmentation is performed on the Chinese sentences and a word vector dictionary is built, as follows:

Perform Chinese word segmentation on the Chinese sentences;

Tune the word vector learning tool;

Feed the Chinese words obtained by word segmentation into the word vector learning tool to build the word vector dictionary.
5. The Chinese text sentiment analysis method based on deep learning according to claim 4, characterized in that the best matching method is used for Chinese word segmentation.
6. The Chinese text sentiment analysis method based on deep learning according to claim 1, characterized in that the LSTM-MP model training data are converted into word vector sequences, which are then used to train the LSTM-MP model;

The LSTM-MP model experiment data are converted into word vector sequences, which the trained LSTM-MP model then converts into sentence vectors.
7. The Chinese text sentiment analysis method based on deep learning according to claim 1, characterized in that sentiment analysis with the trained LSTM-MP model and Softmax classifier proceeds as follows:

Acquire the network text to be analyzed;

Preprocess the network text to be analyzed to obtain the Chinese sentences to be analyzed;

Perform Chinese word segmentation on the Chinese sentences to be analyzed and build a word vector dictionary;

Convert the Chinese sentences to be analyzed into word vector sequences;

Convert the word vector sequences into sentence vectors with the trained LSTM-MP model;

Classify the sentiment of the sentence vectors with the trained Softmax classifier.