CN108009148B - Text emotion classification representation method based on deep learning
- Publication number
- CN108009148B CN108009148B CN201711137565.4A CN201711137565A CN108009148B CN 108009148 B CN108009148 B CN 108009148B CN 201711137565 A CN201711137565 A CN 201711137565A CN 108009148 B CN108009148 B CN 108009148B
- Authority
- CN
- China
- Prior art keywords
- word
- text
- data
- vector
- emotion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a text emotion classification representation method based on deep learning, which comprises the following steps: preprocessing the text; word vectorization: a. distributed word feature vector representation; b. vectorized representation of shallow word features; fusing the distributed word features with the shallow features to obtain a feature fusion matrix; extracting abstract features with a convolutional neural network; and training a text emotion classification model on the resulting sentence features.
Description
Technical Field
The invention relates to a text emotion classification representation method.
Background
For a computer to process text, the text must first be represented as a mathematical vector. Current text representation models mainly comprise the vector space model, the probability model, and the language model.
The vector space model (VSM) reduces the processing of text content to vector operations in a vector space and expresses semantic similarity between texts as similarity between vectors. The text vectorization process comprises the following steps: 1) word segmentation; 2) stop-word removal; 3) feature term selection; 4) feature term weight calculation; 5) feature normalization. Feature term weights can be computed with Boolean weighting, term frequency weighting, or term frequency-inverse document frequency (TF-IDF); the weight of each term represents its degree of importance.
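By way of illustration, a minimal TF-IDF sketch of steps 1)-5), assuming scikit-learn is available; the two-document corpus is hypothetical:

```python
# Minimal TF-IDF sketch of steps 1)-5): tokenization, stop-word removal,
# feature selection, TF-IDF weighting, and L2 normalization.
# Assumes scikit-learn; the two-document corpus is hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the movie was wonderful and moving",
    "the plot was dull and the acting was poor",
]
vectorizer = TfidfVectorizer(stop_words="english")  # steps 1)-3)
X = vectorizer.fit_transform(corpus)                # steps 4)-5): weights, normalized rows
print(vectorizer.get_feature_names_out())           # surviving feature terms
print(X.toarray())                                  # one weighted vector per text
```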
The probability model is a text representation model based on the probability ranking principle, which states that the best retrieval performance is obtained when documents are ranked in descending order of their probability of relevance. For a query given by a user, the probability model computes a relevance probability for every document and ranks the texts in descending order of that probability. By exploiting conceptual correlations between terms, and between terms and documents, the probability model overcomes the defect that the VSM and Boolean models ignore term correlations.
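A toy sketch of the probability ranking principle; the relevance probabilities below are hypothetical stand-ins for the output of a real probabilistic retrieval model:

```python
# Probability ranking principle: return documents in descending order of
# estimated relevance probability. The scores are hypothetical stand-ins
# for the output of a real probabilistic retrieval model.
relevance = {"doc1": 0.82, "doc2": 0.35, "doc3": 0.91}  # P(relevant | query, doc)
ranking = sorted(relevance, key=relevance.get, reverse=True)
print(ranking)  # ['doc3', 'doc1', 'doc2']
```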
The language model defines a probability distribution over sequences of tokens in natural language. Depending on the design of the particular model, a token may be a word, a character, or even a byte; tokens are discrete entities. The earliest successful language model was the n-gram, a model based on fixed-length token sequences: an n-gram is a sequence of n tokens, and its basic assumption is that the current token depends only on the preceding n-1 tokens. Unlike n-grams, neural network language models learn a distributed representation of words, enabling the model to recognize that two words are similar without losing the ability to encode each word distinctly.
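For concreteness, a maximum-likelihood bigram (n = 2) sketch, estimating P(word | prev) = count(prev, word) / count(prev) over a hypothetical toy corpus:

```python
# Maximum-likelihood bigram (n = 2) estimate over a hypothetical toy corpus:
# P(word | prev) = count(prev, word) / count(prev).
from collections import Counter

tokens = "the cat sat on the mat".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev: str, word: str) -> float:
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("the", "cat"))  # 0.5: "the" occurs twice, once followed by "cat"
```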
How difficult an information processing task is depends greatly on how the information is represented, a basic principle that applies to daily life, scientific computing, and machine learning alike. In machine learning, finding a representation appropriate to the task facilitates model training. Representation learning based on deep learning imposes no explicit conditions on the learned intermediate features, whereas other representation learning algorithms often design the representation explicitly in a specific form. Current deep-learning-based text representation methods exploit the linear expressive power of distributed word vectors, and the deep learning model helps improve the abstraction of text features.
Disclosure of Invention
The invention aims to provide a text representation method based on deep learning, applied to text emotion classification. The representation fuses deep and shallow word features and learns the vector representation of sentences through a convolutional neural network (CNN), so that sentence information can be used effectively and subsequent training of the emotion classification model is facilitated. The technical scheme is as follows:
a text emotion classification representation method based on deep learning comprises the following steps:
1) text pre-processing
a. data description: the task is emotion classification of text, and the data categories comprise positive emotion, neutral emotion and negative emotion;
b. data set construction: after data cleaning, 80% of the data are randomly selected as training data and the remaining 20% serve as test data for classification model performance evaluation; all of the data are used to train the word vector matrix;
2) word vectorization:
a. distributed word feature vector representation: let a text s consist of n words; after word segmentation preprocessing, the word sequence is W = {w1, w2, ..., wn}, and each word is represented by a k-dimensional vector; the part-of-speech sequence is POS = {pos1, pos2, ..., posn}, and the part of speech of each word is represented by an m-dimensional vector; the vector representations of words and parts of speech are obtained by training with the word2vec tool;
b. shallow word feature vectorization: for a text, express the named-entity recognition result of the word sequence obtained after word segmentation preprocessing as a binary vector NEG = {neg1, neg2, ..., negn}, where negi is set to 0 if the i-th word is a named entity and to 1 otherwise; in addition, introduce the position of each word within the text, expressed as P = {p1, p2, ..., pn} = {1, 2, ..., n};
c. fuse the distributed word features with the shallow features: each word is represented as a vector of length k + m + 2; letting l = k + m + 2, each text is represented as an l × n feature fusion matrix;
3) extract abstract features with a convolutional neural network: the network consists of an input layer and a convolutional layer; the input layer is the feature fusion matrix obtained from a text through steps 1) and 2); the convolutional layer is divided into a convolution part and a pooling part; convolution kernels of different lengths convolve the input matrix in turn, and a sigmoid activation function yields convolution results of the corresponding lengths; to normalize the result, max pooling can be adopted, selecting the maximum value of each kernel's convolution output as the local feature under that kernel and using these local features as the abstract features of the text;
4) train a text emotion classification model on the sentence features obtained in step 3).
The invention has the following advantages. It provides a word-level feature selection method based on shallow feature fusion which, unlike traditional feature extraction methods, does not require strong prior knowledge from the user. The vector representation fuses the traditional word vector representation with word-level features, so the final word vectors carry richer information. As shown in FIG. 1, a general framework is proposed in which the sentence vectorization model can be adapted to the specific task, for example by adjusting the model structure or by using a recurrent neural network instead. Likewise, the emotion classifier can be selected according to actual requirements, so the method is flexible to implement and has a certain universality.
Drawings
FIG. 1 is a text emotion classification flow chart based on deep learning
FIG. 2 is a feature fusion process
FIG. 3 is the sentence vectorization process based on the convolutional neural network
Detailed Description
The invention provides a text emotion classification representation method based on deep learning which, in addition to the distributed word vector representation, fuses word-level features to obtain the vector representation of each word in a text, and then extracts abstract features of the text with a deep neural network. This text representation facilitates the training of the subsequent emotion classification model, making emotion analysis more accurate. FIG. 1 shows the process by which the invention implements deep-learning-based text emotion classification. FIG. 2 shows the word-level feature fusion process; after fusion, each word vector has length l = k + m + 2. FIG. 3 shows the convolution process that achieves text feature extraction.
The method specifically comprises the following steps:
1) Text pre-processing
a. Data description: in this patent, emotion classification is performed on text, and the data categories comprise positive emotion, neutral emotion and negative emotion.
b. Data set construction: after data cleaning, 80% of the data are randomly selected as training data. The remaining 20% serve as test data for classification model performance evaluation. All of the data are used to train the word vector matrix.
2) Word vectorization:
a. Distributed word vector representation: let a text s consist of n words; after word segmentation preprocessing, the word sequence is W = {w1, w2, ..., wn}, and each word is represented by a k-dimensional vector; the part-of-speech sequence is POS = {pos1, pos2, ..., posn}, and the part of speech of each word is represented by an m-dimensional vector. The vector representations of words and parts of speech are obtained by training with the word2vec tool.
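By way of illustration, the word and part-of-speech vectors could be trained with gensim's word2vec implementation; the toy corpus, the dimensions k = 50 and m = 10, and the remaining hyperparameters are assumptions, not values fixed by the invention:

```python
# Sketch: training k-dimensional word vectors and m-dimensional
# part-of-speech vectors with gensim's word2vec implementation.
# The toy corpus, k = 50, m = 10, and all other parameters are assumptions.
from gensim.models import Word2Vec

word_sents = [["the", "movie", "is", "great"],
              ["the", "plot", "is", "boring"]]   # segmented texts
pos_sents = [["DT", "NN", "VBZ", "JJ"],
             ["DT", "NN", "VBZ", "JJ"]]          # parallel part-of-speech tags

w2v_words = Word2Vec(word_sents, vector_size=50, window=5, min_count=1, epochs=20)
w2v_pos = Word2Vec(pos_sents, vector_size=10, window=5, min_count=1, epochs=20)

word_vec = w2v_words.wv["movie"]  # k-dimensional word vector
pos_vec = w2v_pos.wv["NN"]        # m-dimensional part-of-speech vector
```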
b. Shallow word feature vectorization: let a text s consist of n words; express the named-entity recognition result of the word sequence obtained after word segmentation preprocessing as a binary vector NEG = {neg1, neg2, ..., negn}, where negi is set to 0 if the i-th word is a named entity and to 1 otherwise. At the same time, introduce the position of each word within the text, expressed as P = {p1, p2, ..., pn} = {1, 2, ..., n}.
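A minimal sketch of constructing the two shallow features under the convention above; the token list and entity set are hypothetical stand-ins for a real segmenter and named-entity recognizer:

```python
# Shallow features: NEG (0 if the word is a named entity, else 1) and the
# position vector P = {1, 2, ..., n}. The token list and entity set are
# hypothetical stand-ins for a real segmenter and named-entity recognizer.
tokens = ["john", "likes", "this", "camera"]
entities = {"john"}  # assumed NER output

neg = [0 if t in entities else 1 for t in tokens]  # NEG = {neg1, ..., negn}
positions = list(range(1, len(tokens) + 1))        # P = {1, 2, ..., n}

print(neg)        # [0, 1, 1, 1]
print(positions)  # [1, 2, 3, 4]
```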
c. Fuse the distributed features with the shallow word features: each word is represented as a vector of length k + m + 2. Letting l = k + m + 2, each text is then represented as an l × n matrix.
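One possible realization of the fusion step in numpy; the vectors below are random stand-ins for the trained ones, and only the l × n shape with l = k + m + 2 is prescribed by the method:

```python
# Feature fusion: stack the per-word pieces into an l x n matrix with
# l = k + m + 2. Random vectors stand in for the trained ones.
import numpy as np

k, m, n = 50, 10, 4
word_vecs = np.random.randn(k, n)              # one k-dim word vector per column
pos_vecs = np.random.randn(m, n)               # one m-dim POS vector per column
neg = np.array([[0, 1, 1, 1]])                 # 1 x n named-entity indicator
positions = np.arange(1, n + 1).reshape(1, n)  # 1 x n word positions

fusion = np.vstack([word_vecs, pos_vecs, neg, positions])
assert fusion.shape == (k + m + 2, n)          # the l x n feature fusion matrix
```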
3) Extract abstract features with a convolutional neural network: the network consists of an input layer and a convolutional layer; the input layer is the matrix obtained from a text through steps 1) and 2); the convolutional layer is divided into a convolution part and a pooling part; convolution kernels of different lengths convolve the input matrix in turn, and a sigmoid activation function yields convolution results of the corresponding lengths. To normalize the result, max pooling can be adopted, selecting the maximum value of each kernel's convolution output as the local feature under that kernel and using these local features as the abstract features of the text.
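A hedged PyTorch sketch of this step; the kernel widths (2, 3, 4) and the number of kernels per width (16) are assumptions:

```python
# Convolutional abstraction sketch (PyTorch): kernels of several widths
# convolve the l x n fusion matrix, a sigmoid activation is applied, and
# max pooling over time keeps one local feature per kernel.
# Widths (2, 3, 4) and 16 kernels per width are assumptions.
import torch
import torch.nn as nn

l, n = 62, 40                 # l = k + m + 2 feature rows, n words
x = torch.randn(1, l, n)      # one text, as a batch of size 1

convs = nn.ModuleList(
    [nn.Conv1d(in_channels=l, out_channels=16, kernel_size=w) for w in (2, 3, 4)]
)
pooled = [torch.sigmoid(conv(x)).max(dim=2).values for conv in convs]
sentence_vec = torch.cat(pooled, dim=1)  # concatenated local features
print(sentence_vec.shape)                # torch.Size([1, 48])
```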
4) And (4) training a text emotion classification model according to the sentence characteristics obtained in the step (3).
In use, the invention is adjusted appropriately to the specific scenario. When training the vectorization matrix based on word2vec, the algorithm hyperparameters, including the vector dimension, the number of iterations over the corpus, and the word2vec training method, should be selected according to the actual situation. In general, English text may be mapped to 50-dimensional vectors and Chinese text to 300-dimensional vectors, and the number of training iterations should be increased when corpus resources are insufficient. For the long-text emotion classification task, the invention uses a CNN to extract abstract sentence features; when sentences are short or vary greatly in length, an RNN can be considered for the sentence encoding instead, as sketched below. The final emotion classification model likewise selects an appropriate classifier according to the actual application scenario.
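For that short or variable-length case, a minimal LSTM sentence-encoder sketch (PyTorch); the hidden size of 64 and the use of the final hidden state as the sentence code are assumptions:

```python
# Alternative sentence encoder for short or variable-length texts: a
# single-layer LSTM over the fused word vectors. The hidden size of 64 and
# the use of the final hidden state as the sentence code are assumptions.
import torch
import torch.nn as nn

l, n = 62, 12              # fused feature size per word, sentence length
x = torch.randn(1, n, l)   # one sentence: n steps of l features

lstm = nn.LSTM(input_size=l, hidden_size=64, batch_first=True)
outputs, (h_n, c_n) = lstm(x)
sentence_vec = h_n[-1]     # 1 x 64 final hidden state as the sentence code
print(sentence_vec.shape)  # torch.Size([1, 64])
```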
Claims (1)
1. A text emotion classification representation method based on deep learning, comprising the following steps:
1) text pre-processing
a. data description: the task is emotion classification of text, and the data categories comprise positive emotion, neutral emotion and negative emotion;
b. data set construction: after data cleaning, 80% of the data are randomly selected as training data and the remaining 20% serve as test data for classification model performance evaluation; all of the data are used to train the word vector matrix;
2) word vectorization:
a. distributed word feature vector representation: let a text s consist of n words; after word segmentation preprocessing, the word sequence is W = {w1, w2, ..., wn}, and each word is represented by a k-dimensional vector; the part-of-speech sequence is POS = {pos1, pos2, ..., posn}, and the part of speech of each word is represented by an m-dimensional vector; the vector representations of words and parts of speech are obtained by training with the word2vec tool;
b. shallow word feature vectorization: for a text, express the named-entity recognition result of the word sequence obtained after word segmentation preprocessing as a binary vector NEG = {neg1, neg2, ..., negn}, where negi is set to 0 if the i-th word is a named entity and to 1 otherwise; in addition, introduce the position of each word within the text, expressed as P = {p1, p2, ..., pn} = {1, 2, ..., n};
c. fuse the distributed word features with the shallow features: each word is represented as a vector of length k + m + 2; letting l = k + m + 2, each text is represented as an l × n feature fusion matrix;
3) extract abstract features with a convolutional neural network: the network consists of an input layer and a convolutional layer; the input layer is the feature fusion matrix obtained from a text through steps 1) and 2); the convolutional layer is divided into a convolution part and a pooling part; convolution kernels of different lengths convolve the input matrix in turn, and a sigmoid activation function yields convolution results of the corresponding lengths; to normalize the result, max pooling is adopted, selecting the maximum value of each kernel's convolution output as the local feature under that kernel and using these local features as the abstract features of the text;
4) train a text emotion classification model on the sentence features obtained in step 3).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711137565.4A CN108009148B (en) | 2017-11-16 | 2017-11-16 | Text emotion classification representation method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108009148A CN108009148A (en) | 2018-05-08 |
CN108009148B (en) | 2021-04-27
Family
ID=62052547
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711137565.4A Expired - Fee Related CN108009148B (en) | 2017-11-16 | 2017-11-16 | Text emotion classification representation method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108009148B (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108845560B (en) * | 2018-05-30 | 2021-07-13 | 国网浙江省电力有限公司宁波供电公司 | Power dispatching log fault classification method |
CN108877801B (en) * | 2018-06-14 | 2020-10-02 | 南京云思创智信息科技有限公司 | Multi-turn dialogue semantic understanding subsystem based on multi-modal emotion recognition system |
CN109036465B (en) * | 2018-06-28 | 2021-05-11 | 南京邮电大学 | Speech emotion recognition method |
CN109190112B (en) * | 2018-08-10 | 2022-12-06 | 合肥工业大学 | Patent classification method, system and storage medium based on dual-channel feature fusion |
CN111241271B (en) * | 2018-11-13 | 2023-04-25 | 网智天元科技集团股份有限公司 | Text emotion classification method and device and electronic equipment |
CN109271493B (en) * | 2018-11-26 | 2021-10-08 | 腾讯科技(深圳)有限公司 | Language text processing method and device and storage medium |
CN109829500B (en) * | 2019-01-31 | 2023-05-02 | 华南理工大学 | Position composition and automatic clustering method |
CN110046250A (en) * | 2019-03-17 | 2019-07-23 | 华南师范大学 | Three embedded convolutional neural networks model and its more classification methods of text |
CN109977414B (en) * | 2019-04-01 | 2023-03-14 | 中科天玑数据科技股份有限公司 | Internet financial platform user comment theme analysis system and method |
CN110750648A (en) * | 2019-10-21 | 2020-02-04 | 南京大学 | Text emotion classification method based on deep learning and feature fusion |
CN110929587B (en) * | 2019-10-30 | 2021-04-20 | 杭州电子科技大学 | Bidirectional reconstruction network video description method based on hierarchical attention mechanism |
CN110851600A (en) * | 2019-11-07 | 2020-02-28 | 北京集奥聚合科技有限公司 | Text data processing method and device based on deep learning |
CN111274401A (en) * | 2020-01-20 | 2020-06-12 | 华中师范大学 | Classroom utterance classification method and device based on multi-feature fusion |
CN111221974B (en) * | 2020-04-22 | 2020-08-14 | 成都索贝数码科技股份有限公司 | Method for constructing news text classification model based on hierarchical structure multi-label system |
CN111858939A (en) * | 2020-07-27 | 2020-10-30 | 上海五节数据科技有限公司 | Text emotion classification method based on context information and convolutional neural network |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8856056B2 (en) * | 2011-03-22 | 2014-10-07 | Isentium, Llc | Sentiment calculus for a method and system using social media for event-driven trading |
US20160034426A1 (en) * | 2014-08-01 | 2016-02-04 | Raytheon Bbn Technologies Corp. | Creating Cohesive Documents From Social Media Messages |
- 2017-11-16: CN application CN201711137565.4A filed, later granted as CN108009148B; status: not active (Expired - Fee Related)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649275A (en) * | 2016-12-28 | 2017-05-10 | 成都数联铭品科技有限公司 | Relation extraction method based on part-of-speech information and convolutional neural network |
CN106951472A (en) * | 2017-03-06 | 2017-07-14 | 华侨大学 | A kind of multiple sensibility classification method of network text |
CN107025284A (en) * | 2017-04-06 | 2017-08-08 | 中南大学 | The recognition methods of network comment text emotion tendency and convolutional neural networks model |
CN107092596A (en) * | 2017-04-24 | 2017-08-25 | 重庆邮电大学 | Text emotion analysis method based on attention CNNs and CCR |
CN107038480A (en) * | 2017-05-12 | 2017-08-11 | 东华大学 | A kind of text sentiment classification method based on convolutional neural networks |
Non-Patent Citations (3)
Title |
---|
Emotion Classification of Chinese Microblog Text via Fusion of BoW and eVector Feature Representations; Chengxin Li et al.; NLPCC 2014; 2014-12-31; pp. 217-228 *
Sarcasm pragmatic discrimination with a hybrid neural network model based on multi-feature fusion; Sun Xiao et al.; Journal of Chinese Information Processing; 2016-11-30; Vol. 30, No. 6; pp. 215-223 *
An ensemble transfer learning algorithm for classification of absolutely imbalanced samples; Yao Susu et al.; http://kns.cnki.net/kcms/detail/11.5602.TP.20170920.1136.002.html; 2017-09-20; pp. 1145-1153 *
Also Published As
Publication number | Publication date |
---|---|
CN108009148A (en) | 2018-05-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108009148B (en) | Text emotion classification representation method based on deep learning | |
CN110866117B (en) | Short text classification method based on semantic enhancement and multi-level label embedding | |
CN109657239B (en) | Chinese named entity recognition method based on attention mechanism and language model learning | |
CN108984526B (en) | Document theme vector extraction method based on deep learning | |
CN109753566B (en) | Model training method for cross-domain emotion analysis based on convolutional neural network | |
CN110245229B (en) | Deep learning theme emotion classification method based on data enhancement | |
CN110969020B (en) | CNN and attention mechanism-based Chinese named entity identification method, system and medium | |
CN107085581B (en) | Short text classification method and device | |
CN111027595B (en) | Double-stage semantic word vector generation method | |
CN111783462A (en) | Chinese named entity recognition model and method based on dual neural network fusion | |
CN110263325B (en) | Chinese word segmentation system | |
CN112100346B (en) | Visual question-answering method based on fusion of fine-grained image features and external knowledge | |
CN106649853A (en) | Short text clustering method based on deep learning | |
CN111984791B (en) | Attention mechanism-based long text classification method | |
CN113255320A (en) | Entity relation extraction method and device based on syntax tree and graph attention machine mechanism | |
CN110263174B (en) | Topic category analysis method based on focus attention | |
CN110489551B (en) | Author identification method based on writing habit | |
CN110781290A (en) | Extraction method of structured text abstract of long chapter | |
CN108733647B (en) | Word vector generation method based on Gaussian distribution | |
CN111078833A (en) | Text classification method based on neural network | |
CN113987187A (en) | Multi-label embedding-based public opinion text classification method, system, terminal and medium | |
CN111125367A (en) | Multi-character relation extraction method based on multi-level attention mechanism | |
CN113220865B (en) | Text similar vocabulary retrieval method, system, medium and electronic equipment | |
CN111914544A (en) | Metaphor sentence recognition method, metaphor sentence recognition device, metaphor sentence recognition equipment and storage medium | |
CN111651993A (en) | Chinese named entity recognition method fusing local-global character level association features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210427; Termination date: 20211116 |