CN108009148B - Text emotion classification representation method based on deep learning - Google Patents

Info

Publication number
CN108009148B
Authority
CN
China
Prior art keywords
word
text
data
vector
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201711137565.4A
Other languages
Chinese (zh)
Other versions
CN108009148A (en)
Inventor
王宝亮
么素素
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201711137565.4A
Publication of CN108009148A
Application granted
Publication of CN108009148B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a text emotion classification representation method based on deep learning, comprising the following steps: preprocessing the text; word vectorization, consisting of (a) distributed word feature vector representation and (b) vectorized representation of shallow word features; fusing the distributed word features with the shallow features to obtain a feature fusion matrix; extracting abstract features with a convolutional neural network; and training a text emotion classification model on the resulting sentence features.

Description

Text emotion classification representation method based on deep learning
Technical Field
The invention relates to a text emotion classification representation method.
Background
For a computer to process text, the text must first be represented as a mathematical vector. Current text representation models fall mainly into three groups: vector space models, probabilistic models, and language models.
The Vector Space Model (VSM) reduces the processing of text content to vector operations in a vector space and expresses the semantic similarity of texts as similarity between vectors. Text vectorization proceeds in five steps: 1) word segmentation; 2) stop-word removal; 3) feature term selection; 4) feature term weight calculation; 5) feature normalization. Feature term weights can be computed as Boolean weights, term frequency (TF) weights, or term frequency-inverse document frequency (TF-IDF) weights. The weight of each term represents its degree of importance.
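The weighting and normalization steps of the VSM pipeline can be illustrated with a minimal sketch. It assumes the texts are already segmented and stripped of stop words, uses the common `tf * log(N / df)` form of TF-IDF (one of several variants, not one specified by the patent), and normalizes each vector to unit length. The function name `tfidf_vectors` and the toy corpus are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return the sorted vocabulary and one L2-normalized TF-IDF vector per document."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Weight = term frequency * inverse document frequency.
        weights = [tf[t] * math.log(n_docs / df[t]) for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        # Feature normalization: scale to unit length (all-zero vectors stay as they are).
        vectors.append([w / norm for w in weights] if norm > 0 else weights)
    return vocab, vectors

# Toy corpus, already segmented and with stop words removed (illustrative data).
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
vocab, vectors = tfidf_vectors(docs)
```

Terms that occur in every document receive weight zero under this variant, which is why smoothed IDF formulas are often preferred in practice.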
The probabilistic model is a text representation model based on the probability ranking principle, which states that the best retrieval performance is obtained when texts are ranked in descending order of probability. For a query given by a user, the model computes a probability for every document and ranks the documents in descending order of that probability. Because it performs information retrieval using the conceptual correlations between terms, and between terms and documents, the probabilistic model overcomes the drawback of the VSM and the Boolean model, which ignore the correlations between terms.
A language model defines a probability distribution over sequences of tokens in natural language. Depending on the design of the particular model, the tokens may be words, characters, or even bytes; in every case they are discrete entities. The earliest successful language models were based on fixed-length token sequences called n-grams. An n-gram is a sequence of n tokens, and the basic assumption is that the current token depends only on the preceding n-1 tokens. Unlike n-gram models, neural network language models learn a distributed representation of words through a neural network, which lets the model recognize that two words are similar without losing the ability to encode each word as a distinct value.
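To make the n-gram assumption concrete, the following sketch estimates a bigram model (n = 2) by maximum likelihood, so that each token depends only on the one token before it. The sentence boundary markers `<s>` and `</s>`, the function name `train_bigram`, and the toy corpus are assumptions made for illustration.

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate P(current | previous) by maximum likelihood from token lists."""
    context_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        # Pad with boundary markers so the first and last tokens have a context.
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1

    def prob(prev, cur):
        # Unseen contexts get probability 0 (no smoothing in this sketch).
        if context_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, cur)] / context_counts[prev]

    return prob

corpus = [["i", "like", "tea"], ["i", "like", "coffee"], ["i", "drink", "tea"]]
prob = train_bigram(corpus)
p_like_given_i = prob("i", "like")  # 2 of the 3 occurrences of "i" are followed by "like"
```

A count-based model of this kind treats every word as an unrelated symbol, which is the limitation that distributed word representations address.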
How difficult an information processing task is depends greatly on how the information is represented. This is a basic principle that applies widely in daily life, scientific computing, and machine learning. In machine learning, finding a representation suited to the task makes the model easier to train. Representation learning based on deep learning does not explicitly impose conditions on the learned intermediate features, whereas other representation learning algorithms are often explicitly designed around one specific form of representation. Current deep-learning-based text representation methods use the linear expressive capability of distributed word vectors together with a deep learning model to improve the abstraction of text features.
Disclosure of Invention
The invention aims to provide a text representation method based on deep learning for use in text emotion classification. The representation method fuses the deep features and shallow features of words and learns a vector representation of each sentence through a Convolutional Neural Network (CNN). It makes effective use of sentence information and facilitates the subsequent training of an emotion classification model. The technical scheme is as follows:
A text emotion classification representation method based on deep learning comprises the following steps:
1) Text preprocessing:
a. Description of data: emotion classification is performed on text, and the data categories are positive emotion, neutral emotion, and negative emotion;
b. Constructing a data set: after data cleaning, 80% of the data is randomly selected as training data, the remaining 20% is used as test data for evaluating the performance of the classification model, and all of the data is used to train the word vector matrix;
2) Word vectorization:
a. Distributed word feature vector representation: let a text s consist of n words. After word segmentation preprocessing, the word sequence is $W = \{w_1, w_2, \ldots, w_n\}$, where each word is represented by a k-dimensional vector; the part-of-speech sequence is $POS = \{pos_1, pos_2, \ldots, pos_n\}$, where the part of speech of each word is represented by an m-dimensional vector. The vector representations of the words and of the parts of speech are obtained by training with the word2vec tool;
b. Shallow word feature vectorization: for a text, the named entity recognition result of the sentence after word segmentation preprocessing is expressed as a binary vector $NEG = \{neg_1, neg_2, \ldots, neg_n\}$, where $neg_i$ is set to 0 if the i-th word is a named entity and to 1 otherwise. The position of each word in the text is also introduced, expressed as $P = \{p_1, p_2, \ldots, p_n\} = \{1, 2, \ldots, n\}$;
c. Feature fusion: the distributed features of the words are fused with the shallow features, so that each word is represented as a vector of length k + m + 2; letting $l = k + m + 2$, each text is represented as an $l \times n$ feature fusion matrix;
3) Extracting abstract features with a convolutional neural network: the network consists of an input layer and a convolutional layer. The input layer is the feature fusion matrix obtained for a text by steps 1) and 2). The convolutional layer is divided into a convolution part and a pooling part: convolution kernels of different lengths are applied in turn to the input matrix, and the corresponding convolution results of different lengths are obtained through a Sigmoid activation function. To normalize the results, pooling selects the maximum value produced by each convolution kernel as the local feature under that kernel, and these local features are used as the abstract features of the text;
4) Training a text emotion classification model on the sentence features obtained in step 3).
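The data set construction in step 1) b can be sketched as follows. This is a minimal illustration, assuming the cleaned data is a list of (text, label) pairs; the function name `split_dataset`, the fixed random seed, and the toy samples are assumptions, not details from the patent.

```python
import random

def split_dataset(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and split it into training and test parts."""
    shuffled = list(samples)
    # A seeded generator keeps the random selection reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Illustrative (text, label) pairs using the three emotion categories of step 1) a.
labels = ["positive", "neutral", "negative"]
samples = [(f"text {i}", labels[i % 3]) for i in range(10)]
train_set, test_set = split_dataset(samples)

# Per step 1) b, word vectors are trained on all texts, not only on the training part.
word_vector_corpus = [text for text, _ in samples]
```

Only the unlabeled texts feed the word vector training, so the test labels are not exposed to the classifier by this step.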
The advantages of the invention are as follows. First, it provides a word-level feature selection method based on shallow feature fusion which, unlike traditional feature extraction methods, does not require the user to have strong prior knowledge. Because the representation fuses the traditional word vector with shallow features of the word, the resulting word vector carries richer information. Second, as shown in FIG. 1, it proposes a general framework in which the sentence vectorization model can be adapted to the specific task, or a recurrent neural network can be used for the representation instead. The emotion classifier can likewise be chosen according to actual requirements, so the method is flexible to implement and has a degree of generality.
Drawings
FIG. 1 is a flow chart of text emotion classification based on deep learning.
FIG. 2 shows the feature fusion process.
FIG. 3 shows the sentence vectorization process based on a convolutional neural network.
Detailed Description
The invention provides a text emotion classification representation method based on deep learning. In addition to the distributed word vector representation, it incorporates shallow features of each word to obtain the vector representation of every word in a text, and it extracts abstract features of the text with a deep neural network. This text representation benefits the training of the subsequent emotion classification model and makes emotion analysis more accurate. FIG. 1 shows the overall process of text emotion classification based on deep learning. FIG. 2 shows the word-level feature fusion process; after fusion the word vectors are:
[Formula image BDA0001470795670000031: fused word vector]
FIG. 3 shows the convolution process that extracts text features.
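The word-level fusion of FIG. 2 can be sketched under one plausible reading of the method: each word contributes its k-dimensional word vector, its m-dimensional part-of-speech vector, its named entity flag, and its position, concatenated in that order into one column of length k + m + 2. The concatenation order, the function name `fuse_features`, and the toy values are assumptions, since the formula image is not available.

```python
def fuse_features(word_vecs, pos_vecs, ne_flags, positions=None):
    """Concatenate per-word features into the columns of an l x n matrix, l = k + m + 2."""
    n = len(word_vecs)
    if positions is None:
        positions = list(range(1, n + 1))  # P = {1, 2, ..., n}
    columns = [
        list(word_vec) + list(pos_vec) + [float(ne_flag), float(position)]
        for word_vec, pos_vec, ne_flag, position
        in zip(word_vecs, pos_vecs, ne_flags, positions)
    ]
    # Transpose so that rows index the l features and columns index the n words.
    return [list(row) for row in zip(*columns)]

# Toy inputs (assumed values): n = 3 words, k = 2, m = 1, so l = 5.
word_vecs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
pos_vecs = [[0.7], [0.8], [0.9]]
ne_flags = [0, 1, 1]  # as stated in the text: 0 if the word is a named entity, else 1
matrix = fuse_features(word_vecs, pos_vecs, ne_flags)
```

In a real pipeline the word and part-of-speech vectors would come from word2vec training rather than from hand-written lists.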
The method specifically comprises the following steps:
1) Text preprocessing:
a. Description of data: in this patent, emotion classification is performed on text, and the data categories are positive emotion, neutral emotion, and negative emotion.
b. Constructing a data set: after data cleaning, 80% of the data is randomly selected as training data. The remaining 20% serves as test data for evaluating the performance of the classification model. All of the data is used to train the word vector matrix.
2) Word vectorization:
a. Distributed word vector representation: let a text s consist of n words. After word segmentation preprocessing, the word sequence is $W = \{w_1, w_2, \ldots, w_n\}$, where each word is represented by a k-dimensional vector; the part-of-speech sequence is $POS = \{pos_1, pos_2, \ldots, pos_n\}$, where the part of speech of each word is represented by an m-dimensional vector. The vector representations of the words and of the parts of speech are obtained by training with the word2vec tool.
b. Shallow word feature vectorization: let a text s consist of n words. The named entity recognition result of the sentence after word segmentation preprocessing is expressed as a binary vector $NEG = \{neg_1, neg_2, \ldots, neg_n\}$, where $neg_i$ is set to 0 if the i-th word is a named entity and to 1 otherwise. The position of each word in the text is also introduced, expressed as $P = \{p_1, p_2, \ldots, p_n\} = \{1, 2, \ldots, n\}$.
c. Feature fusion: the distributed features are fused with the shallow features of the words, so that each word is represented as a vector of length k + m + 2. Letting $l = k + m + 2$, each text is represented as an $l \times n$ matrix.
3) Extracting abstract features with a convolutional neural network: the network consists of an input layer and a convolutional layer. The input layer is the matrix obtained for a text by steps 1) and 2). The convolutional layer is divided into a convolution part and a pooling part: convolution kernels of different lengths are applied in turn to the input matrix, and the corresponding convolution results of different lengths are obtained through a Sigmoid activation function. To normalize the results, pooling selects the maximum value produced by each convolution kernel as the local feature under that kernel, and these local features are used as the abstract features of the text.
4) Training a text emotion classification model on the sentence features obtained in step 3).
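Step 3) can be illustrated with a small pure-Python sketch. It assumes each convolution kernel spans all l feature rows and h consecutive words, slides along the word axis with stride 1, and has no learned weights (the kernel values below are fixed toy numbers). The names `conv_max_pool` and `sentence_features` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conv_max_pool(matrix, kernel, bias=0.0):
    """Slide an l x h kernel along the word axis, apply Sigmoid, then max-pool."""
    l, n = len(matrix), len(matrix[0])
    h = len(kernel[0])
    activations = []
    for start in range(n - h + 1):  # assumes the text has at least h words
        total = bias
        for i in range(l):
            for j in range(h):
                total += kernel[i][j] * matrix[i][start + j]
        activations.append(sigmoid(total))
    # Max pooling keeps one value per kernel, whatever the text length n is.
    return max(activations)

def sentence_features(matrix, kernels):
    """One pooled local feature per kernel; together they form the sentence vector."""
    return [conv_max_pool(matrix, kernel) for kernel in kernels]

# Toy l x n input (l = 2, n = 4) and two kernels of lengths 2 and 3 (assumed values).
matrix = [[1.0, 0.0, -1.0, 2.0],
          [0.0, 1.0, 0.0, -1.0]]
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]],
]
features = sentence_features(matrix, kernels)
```

Kernels of different lengths produce activation sequences of different lengths, and max pooling reduces each of them to a single number, so the sentence vector has a fixed size equal to the number of kernels.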
In use, the invention should be adjusted to the specific application scenario. When training the vectorization matrix with word2vec, the algorithm hyperparameters, including the vector dimension, the number of corpus iterations, and the word2vec training method, should be chosen according to the actual situation. As a general guideline, English text is mapped to 50-dimensional vectors and Chinese text to 300-dimensional vectors, and the number of training iterations should be increased when corpus resources are insufficient. For long-text emotion classification the invention uses a CNN to extract the abstract features of sentences; when sentences are short or vary widely in length, an RNN can be considered for sentence encoding instead. The final emotion classification model should likewise use a classifier suited to the actual application scenario.
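The guidance above can be collected into a small configuration helper. This is only a sketch: the function name `choose_word2vec_params`, the dictionary keys, the base iteration count, the doubling rule, and the threshold for a "small" corpus are all hypothetical values chosen for illustration. Only the 50 and 300 dimension figures come from the text.

```python
def choose_word2vec_params(language, corpus_tokens, method="skip-gram",
                           base_epochs=5, small_corpus_tokens=1_000_000):
    """Pick word vector hyperparameters following the rule of thumb in the text."""
    # Dimensions stated in the description: 50 for English, 300 for Chinese.
    dims = {"english": 50, "chinese": 300}
    if language not in dims:
        raise ValueError(f"no guideline for language: {language}")
    # Assumed rule: double the iterations when the corpus is below the threshold.
    epochs = base_epochs * 2 if corpus_tokens < small_corpus_tokens else base_epochs
    return {"vector_dim": dims[language], "epochs": epochs, "method": method}

params = choose_word2vec_params("chinese", corpus_tokens=500_000)
```

The training method is left as a parameter because the description says it should be selected according to the actual situation.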

Claims (1)

1. A text emotion classification representation method based on deep learning, comprising the following steps:
1) text preprocessing:
a. description of data: emotion classification is performed on text, and the data categories are positive emotion, neutral emotion, and negative emotion;
b. constructing a data set: after data cleaning, 80% of the data is randomly selected as training data, the remaining 20% is used as test data for evaluating the performance of the classification model, and all of the data is used to train the word vector matrix;
2) word vectorization:
a. distributed word feature vector representation: let a text s consist of n words; after word segmentation preprocessing, the word sequence is $W = \{w_1, w_2, \ldots, w_n\}$, each word being represented by a k-dimensional vector; the part-of-speech sequence is $POS = \{pos_1, pos_2, \ldots, pos_n\}$, the part of speech of each word being represented by an m-dimensional vector, wherein the vector representations of the words and of the parts of speech are obtained by training with the word2vec tool;
b. shallow word feature vectorization: for a text, the named entity recognition result of the sentence after word segmentation preprocessing is expressed as a binary vector $NEG = \{neg_1, neg_2, \ldots, neg_n\}$, wherein $neg_i$ is set to 0 if the i-th word is a named entity and to 1 otherwise, and the position of each word in the text is introduced, expressed as $P = \{p_1, p_2, \ldots, p_n\} = \{1, 2, \ldots, n\}$;
c. fusing the distributed features of the words with the shallow features, wherein each word is represented as a vector of length k + m + 2 and, letting $l = k + m + 2$, each text is represented as an $l \times n$ feature fusion matrix;
3) extracting abstract features with a convolutional neural network: the convolutional neural network consists of an input layer and a convolutional layer, wherein the input layer is the feature fusion matrix obtained for a text by steps 1) and 2), the convolutional layer is divided into a convolution part and a pooling part, convolution kernels of different lengths are applied in turn to the matrix of the input layer, and the corresponding convolution results of different lengths are obtained through a Sigmoid activation function; to normalize the results, a pooling method selects the maximum value produced by each convolution kernel as the local feature under that kernel, and the local features are used as the abstract features of the text;
4) training a text emotion classification model on the sentence features obtained in step 3).
CN201711137565.4A 2017-11-16 2017-11-16 Text emotion classification representation method based on deep learning Expired - Fee Related CN108009148B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711137565.4A CN108009148B (en) 2017-11-16 2017-11-16 Text emotion classification representation method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711137565.4A CN108009148B (en) 2017-11-16 2017-11-16 Text emotion classification representation method based on deep learning

Publications (2)

Publication Number Publication Date
CN108009148A CN108009148A (en) 2018-05-08
CN108009148B CN108009148B (en) 2021-04-27

Family

ID=62052547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711137565.4A Expired - Fee Related CN108009148B (en) 2017-11-16 2017-11-16 Text emotion classification representation method based on deep learning

Country Status (1)

Country Link
CN (1) CN108009148B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108845560B (en) * 2018-05-30 2021-07-13 国网浙江省电力有限公司宁波供电公司 Power dispatching log fault classification method
CN108877801B (en) * 2018-06-14 2020-10-02 南京云思创智信息科技有限公司 Multi-turn dialogue semantic understanding subsystem based on multi-modal emotion recognition system
CN109036465B (en) * 2018-06-28 2021-05-11 南京邮电大学 Speech emotion recognition method
CN109190112B (en) * 2018-08-10 2022-12-06 合肥工业大学 Patent classification method, system and storage medium based on dual-channel feature fusion
CN111241271B (en) * 2018-11-13 2023-04-25 网智天元科技集团股份有限公司 Text emotion classification method and device and electronic equipment
CN109271493B (en) * 2018-11-26 2021-10-08 腾讯科技(深圳)有限公司 Language text processing method and device and storage medium
CN109829500B (en) * 2019-01-31 2023-05-02 华南理工大学 Position composition and automatic clustering method
CN110046250A (en) * 2019-03-17 2019-07-23 华南师范大学 Three embedded convolutional neural networks model and its more classification methods of text
CN109977414B (en) * 2019-04-01 2023-03-14 中科天玑数据科技股份有限公司 Internet financial platform user comment theme analysis system and method
CN110750648A (en) * 2019-10-21 2020-02-04 南京大学 Text emotion classification method based on deep learning and feature fusion
CN110929587B (en) * 2019-10-30 2021-04-20 杭州电子科技大学 Bidirectional reconstruction network video description method based on hierarchical attention mechanism
CN110851600A (en) * 2019-11-07 2020-02-28 北京集奥聚合科技有限公司 Text data processing method and device based on deep learning
CN111274401A (en) * 2020-01-20 2020-06-12 华中师范大学 Classroom utterance classification method and device based on multi-feature fusion
CN111221974B (en) * 2020-04-22 2020-08-14 成都索贝数码科技股份有限公司 Method for constructing news text classification model based on hierarchical structure multi-label system
CN111858939A (en) * 2020-07-27 2020-10-30 上海五节数据科技有限公司 Text emotion classification method based on context information and convolutional neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649275A (en) * 2016-12-28 2017-05-10 成都数联铭品科技有限公司 Relation extraction method based on part-of-speech information and convolutional neural network
CN106951472A (en) * 2017-03-06 2017-07-14 华侨大学 A kind of multiple sensibility classification method of network text
CN107025284A (en) * 2017-04-06 2017-08-08 中南大学 The recognition methods of network comment text emotion tendency and convolutional neural networks model
CN107038480A (en) * 2017-05-12 2017-08-11 东华大学 A kind of text sentiment classification method based on convolutional neural networks
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8856056B2 (en) * 2011-03-22 2014-10-07 Isentium, Llc Sentiment calculus for a method and system using social media for event-driven trading
US20160034426A1 (en) * 2014-08-01 2016-02-04 Raytheon Bbn Technologies Corp. Creating Cohesive Documents From Social Media Messages

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Emotion Classification of Chinese Microblog Text via Fusion of BoW and eVector Feature Representations;Chengxin Li等;《NLPCC 2014》;20141231;第217-228页 *
基于多特征融合的混合神经网络模型讽刺语用判别;孙晓等;《中文信息学报》;20161130;第30卷(第6期);第215-223页 *
绝对不平衡样本分类的集成迁移学习算法;么素素等;《http://kns.cnki.net/kcms/detail/11.5602.TP.20170920.1136.002.html》;20170920;第1145-1153页 *

Also Published As

Publication number Publication date
CN108009148A (en) 2018-05-08

Similar Documents

Publication Publication Date Title
CN108009148B (en) Text emotion classification representation method based on deep learning
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN109657239B (en) Chinese named entity recognition method based on attention mechanism and language model learning
CN108984526B (en) Document theme vector extraction method based on deep learning
CN109753566B (en) Model training method for cross-domain emotion analysis based on convolutional neural network
CN110245229B (en) Deep learning theme emotion classification method based on data enhancement
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN107085581B (en) Short text classification method and device
CN111027595B (en) Double-stage semantic word vector generation method
CN111783462A (en) Chinese named entity recognition model and method based on dual neural network fusion
CN110263325B (en) Chinese word segmentation system
CN112100346B (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN106649853A (en) Short text clustering method based on deep learning
CN111984791B (en) Attention mechanism-based long text classification method
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN110263174B (en) Topic category analysis method based on focus attention
CN110489551B (en) Author identification method based on writing habit
CN110781290A (en) Extraction method of structured text abstract of long chapter
CN108733647B (en) Word vector generation method based on Gaussian distribution
CN111078833A (en) Text classification method based on neural network
CN113987187A (en) Multi-label embedding-based public opinion text classification method, system, terminal and medium
CN111125367A (en) Multi-character relation extraction method based on multi-level attention mechanism
CN113220865B (en) Text similar vocabulary retrieval method, system, medium and electronic equipment
CN111914544A (en) Metaphor sentence recognition method, metaphor sentence recognition device, metaphor sentence recognition equipment and storage medium
CN111651993A (en) Chinese named entity recognition method fusing local-global character level association features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20210427; termination date: 20211116)