CN108009148B - Text emotion classification representation method based on deep learning
- Publication number
- CN108009148B CN108009148B CN201711137565.4A CN201711137565A CN108009148B CN 108009148 B CN108009148 B CN 108009148B CN 201711137565 A CN201711137565 A CN 201711137565A CN 108009148 B CN108009148 B CN 108009148B
- Authority
- CN
- China
- Prior art keywords
- word
- text
- data
- vector
- emotion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a text emotion classification representation method based on deep learning, which comprises the following steps: preprocessing the text; word vectorization: a. distributed word feature vector representation; b. vectorized representation of shallow word features; fusing the distributed word features with the shallow features to obtain a feature fusion matrix; extracting abstract features with a convolutional neural network; and training a text emotion classification model on the resulting sentence features.
Description
Technical Field
The invention relates to a text emotion classification representation method.
Background
For a computer to process text, the text must first be represented as a mathematical vector. Current text representation models mainly comprise the vector space model, the probability model, and the language model.
The vector space model (VSM) reduces the processing of text content to vector operations in a vector space and expresses semantic similarity between texts as similarity between vectors. The text vectorization process comprises the following steps: 1) word segmentation; 2) stop-word removal; 3) feature term selection; 4) feature term weight calculation; 5) feature normalization. Feature term weights can be computed with Boolean weighting, term frequency weighting, or term frequency-inverse document frequency (TF-IDF); the weight of each term represents its degree of importance.
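By way of illustration, a minimal TF-IDF sketch of steps 1)-5), assuming scikit-learn is available; the two-document corpus is hypothetical:

```python
# Minimal TF-IDF sketch of steps 1)-5): tokenization, stop-word removal,
# feature selection, TF-IDF weighting, and L2 normalization.
# Assumes scikit-learn; the two-document corpus is hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the movie was wonderful and moving",
    "the plot was dull and the acting was poor",
]
vectorizer = TfidfVectorizer(stop_words="english")  # steps 1)-3)
X = vectorizer.fit_transform(corpus)                # steps 4)-5): weights, normalized rows
print(vectorizer.get_feature_names_out())           # surviving feature terms
print(X.toarray())                                  # one weighted vector per text
```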
The probability model is a text representation model based on the probability ranking principle, which states that the best retrieval performance is obtained when documents are ranked in descending order of their probability of relevance. For a query given by a user, the probability model computes a relevance probability for every document and ranks the texts in descending order of that probability. By exploiting conceptual correlations between terms, and between terms and documents, the probability model overcomes the defect that the VSM and Boolean models ignore term correlations.
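A toy sketch of the probability ranking principle; the relevance probabilities below are hypothetical stand-ins for the output of a real probabilistic retrieval model:

```python
# Probability ranking principle: return documents in descending order of
# estimated relevance probability. The scores are hypothetical stand-ins
# for the output of a real probabilistic retrieval model.
relevance = {"doc1": 0.82, "doc2": 0.35, "doc3": 0.91}  # P(relevant | query, doc)
ranking = sorted(relevance, key=relevance.get, reverse=True)
print(ranking)  # ['doc3', 'doc1', 'doc2']
```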
The language model defines a probability distribution over sequences of tokens in natural language. Depending on the design of the particular model, a token may be a word, a character, or even a byte; tokens are discrete entities. The earliest successful language model was the n-gram, a model based on fixed-length token sequences: an n-gram is a sequence of n tokens, and its basic assumption is that the current token depends only on the preceding n-1 tokens. Unlike n-grams, neural network language models learn a distributed representation of words, enabling the model to recognize that two words are similar without losing the ability to encode each word distinctly.
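For concreteness, a maximum-likelihood bigram (n = 2) sketch, estimating P(word | prev) = count(prev, word) / count(prev) over a hypothetical toy corpus:

```python
# Maximum-likelihood bigram (n = 2) estimate over a hypothetical toy corpus:
# P(word | prev) = count(prev, word) / count(prev).
from collections import Counter

tokens = "the cat sat on the mat".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev: str, word: str) -> float:
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("the", "cat"))  # 0.5: "the" occurs twice, once followed by "cat"
```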
How difficult an information processing task is depends greatly on how the information is represented, a basic principle that applies to daily life, scientific computing, and machine learning alike. In machine learning, finding a representation appropriate to the task facilitates model training. Representation learning based on deep learning imposes no explicit conditions on the learned intermediate features, whereas other representation learning algorithms often design the representation explicitly in a specific form. Current deep-learning-based text representation methods exploit the linear expressive power of distributed word vectors, and the deep learning model helps improve the abstraction of text features.
Disclosure of Invention
The invention aims to provide a text representation method based on deep learning, applied to text emotion classification. The representation fuses deep and shallow word features and learns the vector representation of sentences through a convolutional neural network (CNN), so that sentence information can be used effectively and subsequent training of the emotion classification model is facilitated. The technical scheme is as follows:
a text emotion classification representation method based on deep learning comprises the following steps:
1) text pre-processing
a. data description: the task is emotion classification of text, and the data categories comprise positive emotion, neutral emotion and negative emotion;
b. data set construction: after data cleaning, 80% of the data are randomly selected as training data and the remaining 20% serve as test data for classification model performance evaluation; all of the data are used to train the word vector matrix;
2) word vectorization:
a. distributed word feature vector representation: let a text s consist of n words; after word segmentation preprocessing, the word sequence is W = {w1, w2, ..., wn}, and each word is represented by a k-dimensional vector; the part-of-speech sequence is POS = {pos1, pos2, ..., posn}, and the part of speech of each word is represented by an m-dimensional vector; the vector representations of words and parts of speech are obtained by training with the word2vec tool;
b. shallow word feature vectorization: for a text, express the named-entity recognition result of the word sequence obtained after word segmentation preprocessing as a binary vector NEG = {neg1, neg2, ..., negn}, where negi is set to 0 if the i-th word is a named entity and to 1 otherwise; in addition, introduce the position of each word within the text, expressed as P = {p1, p2, ..., pn} = {1, 2, ..., n};
c. fuse the distributed word features with the shallow features: each word is represented as a vector of length k + m + 2; letting l = k + m + 2, each text is represented as an l × n feature fusion matrix;
3) extract abstract features with a convolutional neural network: the network consists of an input layer and a convolutional layer; the input layer is the feature fusion matrix obtained from a text through steps 1) and 2); the convolutional layer is divided into a convolution part and a pooling part; convolution kernels of different lengths convolve the input matrix in turn, and a sigmoid activation function yields convolution results of the corresponding lengths; to normalize the result, max pooling can be adopted, selecting the maximum value of each kernel's convolution output as the local feature under that kernel and using these local features as the abstract features of the text;
4) train a text emotion classification model on the sentence features obtained in step 3).
The invention has the following advantages. It provides a word-level feature selection method based on shallow feature fusion which, unlike traditional feature extraction methods, does not require strong prior knowledge from the user. The vector representation fuses the traditional word vector representation with word-level features, so the final word vectors carry richer information. As shown in FIG. 1, a general framework is proposed in which the sentence vectorization model can be adapted to the specific task, for example by adjusting the model structure or by using a recurrent neural network instead. Likewise, the emotion classifier can be selected according to actual requirements, so the method is flexible to implement and has a certain universality.
Drawings
FIG. 1 is a text emotion classification flow chart based on deep learning
FIG. 2 is a feature fusion process
FIG. 3 is the sentence vectorization process based on the convolutional neural network
Detailed Description
The invention provides a text emotion classification representation method based on deep learning which, in addition to the distributed word vector representation, fuses word-level features to obtain the vector representation of each word in a text, and then extracts abstract features of the text with a deep neural network. This text representation facilitates the training of the subsequent emotion classification model, making emotion analysis more accurate. FIG. 1 shows the process by which the invention implements deep-learning-based text emotion classification. FIG. 2 shows the word-level feature fusion process; after fusion, each word vector has length l = k + m + 2. FIG. 3 shows the convolution process that achieves text feature extraction.
The method specifically comprises the following steps:
1) Text pre-processing
a. Data description: in this patent, emotion classification is performed on text, and the data categories comprise positive emotion, neutral emotion and negative emotion.
b. Data set construction: after data cleaning, 80% of the data are randomly selected as training data. The remaining 20% serve as test data for classification model performance evaluation. All of the data are used to train the word vector matrix.
2) Word vectorization:
a. Distributed word vector representation: let a text s consist of n words; after word segmentation preprocessing, the word sequence is W = {w1, w2, ..., wn}, and each word is represented by a k-dimensional vector; the part-of-speech sequence is POS = {pos1, pos2, ..., posn}, and the part of speech of each word is represented by an m-dimensional vector. The vector representations of words and parts of speech are obtained by training with the word2vec tool.
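By way of illustration, the word and part-of-speech vectors could be trained with gensim's word2vec implementation; the toy corpus, the dimensions k = 50 and m = 10, and the remaining hyperparameters are assumptions, not values fixed by the invention:

```python
# Sketch: training k-dimensional word vectors and m-dimensional
# part-of-speech vectors with gensim's word2vec implementation.
# The toy corpus, k = 50, m = 10, and all other parameters are assumptions.
from gensim.models import Word2Vec

word_sents = [["the", "movie", "is", "great"],
              ["the", "plot", "is", "boring"]]   # segmented texts
pos_sents = [["DT", "NN", "VBZ", "JJ"],
             ["DT", "NN", "VBZ", "JJ"]]          # parallel part-of-speech tags

w2v_words = Word2Vec(word_sents, vector_size=50, window=5, min_count=1, epochs=20)
w2v_pos = Word2Vec(pos_sents, vector_size=10, window=5, min_count=1, epochs=20)

word_vec = w2v_words.wv["movie"]  # k-dimensional word vector
pos_vec = w2v_pos.wv["NN"]        # m-dimensional part-of-speech vector
```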
b. Shallow word feature vectorization: let a text s consist of n words; express the named-entity recognition result of the word sequence obtained after word segmentation preprocessing as a binary vector NEG = {neg1, neg2, ..., negn}, where negi is set to 0 if the i-th word is a named entity and to 1 otherwise. At the same time, introduce the position of each word within the text, expressed as P = {p1, p2, ..., pn} = {1, 2, ..., n}.
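A minimal sketch of constructing the two shallow features under the convention above; the token list and entity set are hypothetical stand-ins for a real segmenter and named-entity recognizer:

```python
# Shallow features: NEG (0 if the word is a named entity, else 1) and the
# position vector P = {1, 2, ..., n}. The token list and entity set are
# hypothetical stand-ins for a real segmenter and named-entity recognizer.
tokens = ["john", "likes", "this", "camera"]
entities = {"john"}  # assumed NER output

neg = [0 if t in entities else 1 for t in tokens]  # NEG = {neg1, ..., negn}
positions = list(range(1, len(tokens) + 1))        # P = {1, 2, ..., n}

print(neg)        # [0, 1, 1, 1]
print(positions)  # [1, 2, 3, 4]
```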
c. Fuse the distributed features with the shallow word features: each word is represented as a vector of length k + m + 2. Letting l = k + m + 2, each text is then represented as an l × n matrix.
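One possible realization of the fusion step in numpy; the vectors below are random stand-ins for the trained ones, and only the l × n shape with l = k + m + 2 is prescribed by the method:

```python
# Feature fusion: stack the per-word pieces into an l x n matrix with
# l = k + m + 2. Random vectors stand in for the trained ones.
import numpy as np

k, m, n = 50, 10, 4
word_vecs = np.random.randn(k, n)              # one k-dim word vector per column
pos_vecs = np.random.randn(m, n)               # one m-dim POS vector per column
neg = np.array([[0, 1, 1, 1]])                 # 1 x n named-entity indicator
positions = np.arange(1, n + 1).reshape(1, n)  # 1 x n word positions

fusion = np.vstack([word_vecs, pos_vecs, neg, positions])
assert fusion.shape == (k + m + 2, n)          # the l x n feature fusion matrix
```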
3) Extract abstract features with a convolutional neural network: the network consists of an input layer and a convolutional layer; the input layer is the matrix obtained from a text through steps 1) and 2); the convolutional layer is divided into a convolution part and a pooling part; convolution kernels of different lengths convolve the input matrix in turn, and a sigmoid activation function yields convolution results of the corresponding lengths. To normalize the result, max pooling can be adopted, selecting the maximum value of each kernel's convolution output as the local feature under that kernel and using these local features as the abstract features of the text.
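A hedged PyTorch sketch of this step; the kernel widths (2, 3, 4) and the number of kernels per width (16) are assumptions:

```python
# Convolutional abstraction sketch (PyTorch): kernels of several widths
# convolve the l x n fusion matrix, a sigmoid activation is applied, and
# max pooling over time keeps one local feature per kernel.
# Widths (2, 3, 4) and 16 kernels per width are assumptions.
import torch
import torch.nn as nn

l, n = 62, 40                 # l = k + m + 2 feature rows, n words
x = torch.randn(1, l, n)      # one text, as a batch of size 1

convs = nn.ModuleList(
    [nn.Conv1d(in_channels=l, out_channels=16, kernel_size=w) for w in (2, 3, 4)]
)
pooled = [torch.sigmoid(conv(x)).max(dim=2).values for conv in convs]
sentence_vec = torch.cat(pooled, dim=1)  # concatenated local features
print(sentence_vec.shape)                # torch.Size([1, 48])
```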
4) And (4) training a text emotion classification model according to the sentence characteristics obtained in the step (3).
In use, the invention is adjusted appropriately to the specific scenario. When training the vectorization matrix based on word2vec, the algorithm hyperparameters, including the vector dimension, the number of iterations over the corpus, and the word2vec training method, should be selected according to the actual situation. In general, English text may be mapped to 50-dimensional vectors and Chinese text to 300-dimensional vectors, and the number of training iterations should be increased when corpus resources are insufficient. For the long-text emotion classification task, the invention uses a CNN to extract abstract sentence features; when sentences are short or vary greatly in length, an RNN can be considered for the sentence encoding instead, as sketched below. The final emotion classification model likewise selects an appropriate classifier according to the actual application scenario.
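For that short or variable-length case, a minimal LSTM sentence-encoder sketch (PyTorch); the hidden size of 64 and the use of the final hidden state as the sentence code are assumptions:

```python
# Alternative sentence encoder for short or variable-length texts: a
# single-layer LSTM over the fused word vectors. The hidden size of 64 and
# the use of the final hidden state as the sentence code are assumptions.
import torch
import torch.nn as nn

l, n = 62, 12              # fused feature size per word, sentence length
x = torch.randn(1, n, l)   # one sentence: n steps of l features

lstm = nn.LSTM(input_size=l, hidden_size=64, batch_first=True)
outputs, (h_n, c_n) = lstm(x)
sentence_vec = h_n[-1]     # 1 x 64 final hidden state as the sentence code
print(sentence_vec.shape)  # torch.Size([1, 64])
```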
Claims (1)
1. A text emotion classification representation method based on deep learning, comprising the following steps:
1) text pre-processing
a. data description: the task is emotion classification of text, and the data categories comprise positive emotion, neutral emotion and negative emotion;
b. data set construction: after data cleaning, 80% of the data are randomly selected as training data and the remaining 20% serve as test data for classification model performance evaluation; all of the data are used to train the word vector matrix;
2) word vectorization:
a. distributed word feature vector representation: let a text s consist of n words; after word segmentation preprocessing, the word sequence is W = {w1, w2, ..., wn}, and each word is represented by a k-dimensional vector; the part-of-speech sequence is POS = {pos1, pos2, ..., posn}, and the part of speech of each word is represented by an m-dimensional vector; the vector representations of words and parts of speech are obtained by training with the word2vec tool;
b. shallow word feature vectorization: for a text, express the named-entity recognition result of the word sequence obtained after word segmentation preprocessing as a binary vector NEG = {neg1, neg2, ..., negn}, where negi is set to 0 if the i-th word is a named entity and to 1 otherwise; in addition, introduce the position of each word within the text, expressed as P = {p1, p2, ..., pn} = {1, 2, ..., n};
c. fuse the distributed word features with the shallow features: each word is represented as a vector of length k + m + 2; letting l = k + m + 2, each text is represented as an l × n feature fusion matrix;
3) extract abstract features with a convolutional neural network: the network consists of an input layer and a convolutional layer; the input layer is the feature fusion matrix obtained from a text through steps 1) and 2); the convolutional layer is divided into a convolution part and a pooling part; convolution kernels of different lengths convolve the input matrix in turn, and a sigmoid activation function yields convolution results of the corresponding lengths; to normalize the result, max pooling is adopted, selecting the maximum value of each kernel's convolution output as the local feature under that kernel and using these local features as the abstract features of the text;
4) train a text emotion classification model on the sentence features obtained in step 3).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711137565.4A CN108009148B (en) | 2017-11-16 | 2017-11-16 | Text emotion classification representation method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108009148A CN108009148A (en) | 2018-05-08 |
CN108009148B (en) | 2021-04-27
Family
ID=62052547
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711137565.4A Expired - Fee Related CN108009148B (en) | 2017-11-16 | 2017-11-16 | Text emotion classification representation method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108009148B (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108845560B (en) * | 2018-05-30 | 2021-07-13 | 国网浙江省电力有限公司宁波供电公司 | Power dispatching log fault classification method |
CN108877801B (en) * | 2018-06-14 | 2020-10-02 | 南京云思创智信息科技有限公司 | Multi-turn dialogue semantic understanding subsystem based on multi-modal emotion recognition system |
CN109036465B (en) * | 2018-06-28 | 2021-05-11 | 南京邮电大学 | Speech emotion recognition method |
CN109190112B (en) * | 2018-08-10 | 2022-12-06 | 合肥工业大学 | Patent classification method, system and storage medium based on dual-channel feature fusion |
CN111241271B (en) * | 2018-11-13 | 2023-04-25 | 网智天元科技集团股份有限公司 | Text emotion classification method and device and electronic equipment |
CN109271493B (en) * | 2018-11-26 | 2021-10-08 | 腾讯科技(深圳)有限公司 | Language text processing method and device and storage medium |
CN109829500B (en) * | 2019-01-31 | 2023-05-02 | 华南理工大学 | Position composition and automatic clustering method |
CN110046250A (en) * | 2019-03-17 | 2019-07-23 | 华南师范大学 | Three embedded convolutional neural networks model and its more classification methods of text |
CN109977414B (en) * | 2019-04-01 | 2023-03-14 | 中科天玑数据科技股份有限公司 | Internet financial platform user comment theme analysis system and method |
CN110750648A (en) * | 2019-10-21 | 2020-02-04 | 南京大学 | Text emotion classification method based on deep learning and feature fusion |
CN110929587B (en) * | 2019-10-30 | 2021-04-20 | 杭州电子科技大学 | Bidirectional reconstruction network video description method based on hierarchical attention mechanism |
CN110851600A (en) * | 2019-11-07 | 2020-02-28 | 北京集奥聚合科技有限公司 | Text data processing method and device based on deep learning |
CN111274401A (en) * | 2020-01-20 | 2020-06-12 | 华中师范大学 | Classroom utterance classification method and device based on multi-feature fusion |
CN111221974B (en) * | 2020-04-22 | 2020-08-14 | 成都索贝数码科技股份有限公司 | Method for constructing news text classification model based on hierarchical structure multi-label system |
CN111858939A (en) * | 2020-07-27 | 2020-10-30 | 上海五节数据科技有限公司 | Text emotion classification method based on context information and convolutional neural network |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8856056B2 (en) * | 2011-03-22 | 2014-10-07 | Isentium, Llc | Sentiment calculus for a method and system using social media for event-driven trading |
US20160034426A1 (en) * | 2014-08-01 | 2016-02-04 | Raytheon Bbn Technologies Corp. | Creating Cohesive Documents From Social Media Messages |
- 2017-11-16: CN application CN201711137565.4A filed, later granted as CN108009148B; status: not active (Expired - Fee Related)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649275A (en) * | 2016-12-28 | 2017-05-10 | 成都数联铭品科技有限公司 | Relation extraction method based on part-of-speech information and convolutional neural network |
CN106951472A (en) * | 2017-03-06 | 2017-07-14 | 华侨大学 | A kind of multiple sensibility classification method of network text |
CN107025284A (en) * | 2017-04-06 | 2017-08-08 | 中南大学 | The recognition methods of network comment text emotion tendency and convolutional neural networks model |
CN107092596A (en) * | 2017-04-24 | 2017-08-25 | 重庆邮电大学 | Text emotion analysis method based on attention CNNs and CCR |
CN107038480A (en) * | 2017-05-12 | 2017-08-11 | 东华大学 | A kind of text sentiment classification method based on convolutional neural networks |
Non-Patent Citations (3)
Title |
---|
Emotion Classification of Chinese Microblog Text via Fusion of BoW and eVector Feature Representations; Chengxin Li et al.; NLPCC 2014; 2014-12-31; pp. 217-228 *
Sarcasm pragmatic discrimination with a hybrid neural network model based on multi-feature fusion; Sun Xiao et al.; Journal of Chinese Information Processing; 2016-11-30; Vol. 30, No. 6; pp. 215-223 *
An ensemble transfer learning algorithm for classification of absolutely imbalanced samples; Yao Susu et al.; http://kns.cnki.net/kcms/detail/11.5602.TP.20170920.1136.002.html; 2017-09-20; pp. 1145-1153 *
Also Published As
Publication number | Publication date |
---|---|
CN108009148A (en) | 2018-05-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108009148B (en) | Text emotion classification representation method based on deep learning | |
CN110866117B (en) | Short text classification method based on semantic enhancement and multi-level label embedding | |
CN109657239B (en) | Chinese named entity recognition method based on attention mechanism and language model learning | |
CN108984526B (en) | Document theme vector extraction method based on deep learning | |
CN109753566B (en) | Model training method for cross-domain emotion analysis based on convolutional neural network | |
CN110245229B (en) | Deep learning theme emotion classification method based on data enhancement | |
CN110969020B (en) | CNN and attention mechanism-based Chinese named entity identification method, system and medium | |
CN107085581B (en) | Short text classification method and device | |
CN111027595B (en) | Double-stage semantic word vector generation method | |
CN111783462A (en) | Chinese named entity recognition model and method based on dual neural network fusion | |
CN110263325B (en) | Chinese word segmentation system | |
CN112100346B (en) | Visual question-answering method based on fusion of fine-grained image features and external knowledge | |
CN106649853A (en) | Short text clustering method based on deep learning | |
CN111984791B (en) | Attention mechanism-based long text classification method | |
CN113255320A (en) | Entity relation extraction method and device based on syntax tree and graph attention machine mechanism | |
CN110263174B (en) | Topic category analysis method based on focus attention | |
CN110489551B (en) | Author identification method based on writing habit | |
CN110781290A (en) | Extraction method of structured text abstract of long chapter | |
CN108733647B (en) | Word vector generation method based on Gaussian distribution | |
CN111078833A (en) | Text classification method based on neural network | |
CN113987187A (en) | Multi-label embedding-based public opinion text classification method, system, terminal and medium | |
CN111125367A (en) | Multi-character relation extraction method based on multi-level attention mechanism | |
CN113220865B (en) | Text similar vocabulary retrieval method, system, medium and electronic equipment | |
CN111914544A (en) | Metaphor sentence recognition method, metaphor sentence recognition device, metaphor sentence recognition equipment and storage medium | |
CN111651993A (en) | Chinese named entity recognition method fusing local-global character level association features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210427; Termination date: 20211116 |