CN108376133A - Short text sentiment classification method based on emotion word expansion - Google Patents
Short text sentiment classification method based on emotion word expansion
- Publication number
- CN108376133A
- Authority
- CN
- China
- Prior art keywords
- word
- emotion
- short text
- feature
- affective characteristics
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to a short text sentiment classification method based on emotion word expansion and belongs to the technical field of computer and information science. The method first splits the comment text into a set of sentences and performs word segmentation and part-of-speech tagging with the jieba segmentation tool to obtain the preprocessed result. Second, for each short text comment, it obtains a word vector for each word from GloVe trained on a Wikipedia corpus, uses the word vectors to compute the semantic similarity between the other words and the initial sentiment features whose parts of speech are N, V, Adj and Adv, and adds the semantically similar words to the initial sentiment feature set. It then proposes DF-TF-MI, which improves a traditional feature dimensionality reduction method with inter-word statistical features, performs feature dimensionality reduction to obtain a low-dimensional feature set, and weights the sentiment features. Finally, the resulting feature vectors are classified by sentiment orientation with the RADA algorithm, which is a weighted combination of weak classifiers. The invention addresses the problem of out-of-vocabulary words in sentiment dictionaries and the sparsity of sentiment features caused by the small number of effective emotion words in short text comments, and improves the performance and accuracy of sentiment orientation analysis.
Description
Technical field
The present invention relates to a short text sentiment classification method based on emotion word expansion and belongs to the technical field of computer and information science.
Background
Chinese short texts are short and carry little information, yet short-text user comments contain the users' sentiment orientation. Sentiment orientation analysis of short texts can effectively uncover the users' real views and attitudes hidden beneath the surface text. The present invention therefore provides a short text sentiment classification method based on emotion word expansion to improve the accuracy of sentiment classification of comment text and, in turn, the practical value of systems such as user review analysis and product recommendation; it has both theoretical significance and commercial value.
A short text sentiment classification method based on emotion word expansion must solve three basic problems: 1. short text comments are short and contain few sentiment feature words, which easily makes the sentiment features sparse; 2. the feature dimensionality reduction of existing short text sentiment analysis methods performs poorly; 3. in supervised sentiment orientation analysis methods, the choice of classifier strongly affects classification accuracy. Existing short text sentiment orientation analysis methods generally fall into two classes:
1. Sentiment-dictionary-based methods
This is a common classification approach. The general idea is to extract the emotion words in a text according to the emotion words listed in a sentiment dictionary, compute a sentiment score for the text from the polarity and intensity of the extracted emotion words, and then judge the sentiment class of the text from that score. However, every language has an extremely rich stock of emotion-bearing words, some emotion words change meaning with context, and different domains have different emotion words. Many sentiment dictionaries therefore cannot cover all emotion words; out-of-vocabulary words exist, and they degrade the sentiment orientation analysis of text.
2. Supervised-learning-based methods
Supervised learning trains a classifier on training data labeled with class information and then uses the trained classifier to classify the test data. The support vector machine (SVM), naive Bayes (NB), and maximum entropy (ME) classifiers are among the most commonly used. For supervised sentiment orientation classification, the key to the task is selecting and extracting effective text features. The main strategies for text feature extraction include mutual information (MI), information gain (IG), word frequency (WF), document frequency (DF), and term frequency-inverse document frequency (TF-IDF). However, in supervised sentiment orientation analysis methods the choice of classifier strongly affects classification accuracy.
In summary, existing models do not solve the sentiment feature sparsity problem of short texts well; in addition, the feature dimensionality reduction of existing short text sentiment analysis methods performs poorly, and the choice of classifier in supervised methods strongly affects classification accuracy.
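The dictionary-based approach described in this background can be illustrated with a minimal sketch. The toy lexicon, the function names, and the polarity/intensity encoding are all assumptions made for illustration; they are not taken from the patent. The sketch also shows how an out-of-vocabulary word contributes nothing to the score, which is the weakness the background points out.

```python
# Toy sentiment lexicon: word -> (polarity, intensity).
# All entries are illustrative assumptions, not data from the patent.
LEXICON = {
    "good": (1, 1.0),
    "excellent": (1, 2.0),
    "bad": (-1, 1.0),
    "terrible": (-1, 2.0),
}


def lexicon_score(tokens, lexicon=LEXICON):
    """Return (score, unknown_words) for a tokenized text.

    The score is the sum of polarity * intensity over the tokens found in
    the lexicon; tokens missing from the lexicon are collected separately.
    """
    score = 0.0
    unknown = []
    for tok in tokens:
        if tok in lexicon:
            polarity, intensity = lexicon[tok]
            score += polarity * intensity
        else:
            unknown.append(tok)
    return score, unknown


def lexicon_label(tokens, lexicon=LEXICON):
    """Map the sentiment score to a coarse class label."""
    score, _ = lexicon_score(tokens, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A text whose only emotion word is missing from the lexicon, such as `["awesome"]`, is scored 0 and labeled neutral, which is the coverage gap that emotion word expansion is meant to narrow.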
Summary of the invention
The purpose of the present invention is to improve the accuracy of short text sentiment classification by expanding the sentiment features and improving a traditional feature dimensionality reduction method; to this end it proposes a short text sentiment classification method based on emotion word expansion.
The design principle of the invention is as follows. First, the text is preprocessed: word segmentation and part-of-speech tagging are performed with the jieba segmentation tool to obtain a set of tagged sentences. Second, word vectors are trained with GloVe; for each short text comment, the word vector of each word is obtained, the semantic similarity between the other words and the basic emotion words is computed from the word vectors, and the semantically similar words are added to the initial sentiment feature set. Then DF-TF-MI is proposed, which improves a traditional feature dimensionality reduction method with inter-word statistical features; feature dimensionality reduction yields a low-dimensional feature set, and the tf*KL algorithm is then used to weight the sentiment features. Finally, the resulting feature vectors are classified by sentiment orientation with the RADA algorithm, which is a weighted combination of weak classifiers.
The technical solution of the present invention is realized by the following steps:
Step 1, preprocess the text.
Step 1.1, split the text into a set of sentences.
Step 1.2, perform word segmentation and part-of-speech tagging with the jieba segmentation tool.
Step 2, expand the sentiment features.
Step 2.1, generate word vectors with GloVe.
Step 2.2, compute the semantic similarity between the other words and the basic emotion words from the word vectors.
Step 2.3, select similar words according to semantic similarity and add them to the sentiment feature set.
Step 3, reduce the dimensionality of the sentiment features: improve a traditional feature dimensionality reduction method with inter-word statistical features and perform feature dimensionality reduction.
Step 4, weight the sentiment features: assign each sentiment feature a corresponding weight by analyzing the contribution of emotion words to the sentiment orientation.
Step 5, classify the sentiment: classify the resulting feature vectors with the RADA algorithm, which is a weighted combination of weak classifiers.
Beneficial effects
Compared with methods based on a sentiment dictionary, the present invention addresses the incomplete coverage of emotion words in sentiment dictionaries and improves the performance of sentiment orientation analysis.
Compared with traditional supervised learning methods, the present invention effectively addresses the sparsity of sentiment features caused by the small number of effective emotion words in short text comments and improves the accuracy of sentiment classification.
Description of the drawings
Fig. 1 is a schematic diagram of the short text sentiment classification method based on emotion word expansion.
Fig. 2 shows the effect of training set size on classification accuracy for the Chinese mobile phone review corpus.
Fig. 3 shows the effect of training set size on classification accuracy for the Chinese hotel review corpus.
Fig. 4 shows the experimental results for the effect of feature expansion on the Chinese mobile phone review corpus data set.
Fig. 5 shows the experimental results for the effect of feature expansion on the Chinese hotel review corpus data set.
Detailed description
To better illustrate the objects and advantages of the present invention, an embodiment of the method is described in further detail below with reference to an example.
The detailed process is:
Step 1, preprocess the text.
Step 1.1, split the text into a set of sentences.
Step 1.2, perform word segmentation and part-of-speech tagging with the jieba segmentation tool. The experimental data used are shown in Table 1:
Table 1. Experimental data for short text sentiment orientation analysis (number of items)
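Step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the set of sentence delimiters, the function names, and the prefix rule for mapping jieba tags to N, V, Adj and Adv are assumptions. The tagging function relies on the third-party jieba package (`jieba.posseg.cut`) and imports it lazily, so the sentence splitter runs without it.

```python
import re

# Sentence-ending punctuation (Chinese and ASCII). This delimiter set is an
# assumption; the source does not say which delimiters it uses.
_SENTENCE = re.compile(r"[^。！？!?；;]+[。！？!?；;]*")

# jieba part-of-speech tags start with n (noun), v (verb), a (adjective) or
# d (adverb); matching on the first letter is an assumed simplification.
NVAA_PREFIXES = ("n", "v", "a", "d")


def split_sentences(text):
    """Split a comment into sentences, keeping the ending punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def is_nvaa(flag):
    """Return True if a jieba POS tag counts as N, V, Adj or Adv."""
    return flag[:1] in NVAA_PREFIXES


def segment_and_tag(sentence):
    """Segment and POS-tag one sentence with jieba (pip install jieba)."""
    import jieba.posseg as pseg  # lazy import: only needed for tagging

    return [(pair.word, pair.flag) for pair in pseg.cut(sentence)]


def preprocess(text):
    """Return one list of (word, tag) pairs per sentence of the comment."""
    return [segment_and_tag(s) for s in split_sentences(text)]
```

Calling `preprocess` on a review yields one tagged token list per sentence; filtering those tokens with `is_nvaa` gives the candidate pool from which the later steps draw sentiment features.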
Step 2, expand the sentiment features.
Step 2.1, train GloVe to generate word vectors. A word vector represents a word as a real-valued vector: training on a given corpus maps text content to vectors in a vector space. The approach draws on ideas from neural networks, and the cosine similarity in the vector space expresses the semantic similarity of words.
Step 2.2, compute the semantic similarity between the other words and the basic emotion words from the word vectors. The semantic similarity between two word vectors ranges from 0 to 1; a larger value means the two words are more related, and a smaller value means they are less related. For word vectors of the form N = (w_1, w_2, ..., w_n), the semantic similarity of word vectors N_1 and N_2 is computed as shown in formula (1).
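Formula (1) is not reproduced in this text. Since Step 2.1 states that cosine similarity in the vector space is used, the standard definition is presumably what is meant; it is given here under that assumption. The component symbols $w_{1i}$ and $w_{2i}$ for the vectors $N_1$ and $N_2$ are introduced for illustration only.

```latex
\mathrm{sim}(N_1, N_2)
  = \cos\theta
  = \frac{\sum_{i=1}^{n} w_{1i}\, w_{2i}}
         {\sqrt{\sum_{i=1}^{n} w_{1i}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{2i}^{2}}}
```

In general the cosine lies between -1 and 1; the 0 to 1 range stated in the text holds when the value is non-negative, for example when all vector components are non-negative.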
Step 2.3, select similar words according to semantic similarity and add them to the basic sentiment feature set. The features chosen when expanding the sentiment features of a short text comment must not only reflect the emotional cues of the comment correctly but also distinguish one document clearly from the others; such features can be words, word combinations, and so on. When selecting features for each short comment, the combination of N, V, Adj and Adv, denoted NVAA, is used as the sentiment features of the comment. Words whose similarity exceeds a certain threshold are added to the NVAA feature set to expand the sentiment features. The threshold ranges from 0 to 1; it is increased in steps of 0.05 in a loop, and the threshold that gives the highest sentiment classification accuracy is finally chosen. During feature expansion, negation words reverse the meaning, so when a negation word such as "not" or "no" occurs, the negation word and the adjacent following adjective, adverb, or verb form one phrase that is used as a feature, for example "not like" or "not comfortable".
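Step 2.3 can be sketched as below. This is a minimal illustration under stated assumptions: the function names, the `evaluate` callback (a hypothetical function returning classification accuracy for a feature set), the rule that a candidate is added when it exceeds the threshold for at least one seed word, and the exclusion of the endpoints 0 and 1 from the threshold grid are not specified in the source. The Chinese negation words used as defaults are also assumed.

```python
import math


def cosine(u, v):
    """Standard cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_features(seed_words, candidates, vectors, threshold):
    """Add each candidate whose similarity to some seed exceeds threshold."""
    expanded = set(seed_words)
    for cand in candidates:
        if cand in expanded or cand not in vectors:
            continue
        for seed in seed_words:
            if seed in vectors and cosine(vectors[cand], vectors[seed]) > threshold:
                expanded.add(cand)
                break
    return expanded


def threshold_grid(step=0.05):
    """Thresholds step, 2*step, ... strictly below 1 (endpoints excluded)."""
    return [round(i * step, 2) for i in range(1, round(1 / step))]


def pick_threshold(seed_words, candidates, vectors, evaluate):
    """Return the grid threshold whose expanded set scores best.

    evaluate(feature_set) -> accuracy is a hypothetical callback.
    """
    return max(
        threshold_grid(),
        key=lambda t: evaluate(expand_features(seed_words, candidates, vectors, t)),
    )


def merge_negation(tagged, negations=("不", "没")):
    """Merge a negation word with the next adjective, adverb or verb."""
    merged, i = [], 0
    while i < len(tagged):
        word, flag = tagged[i]
        nxt = tagged[i + 1] if i + 1 < len(tagged) else None
        if word in negations and nxt and nxt[1][:1] in ("a", "d", "v"):
            merged.append((word + nxt[0], nxt[1]))
            i += 2
        else:
            merged.append((word, flag))
            i += 1
    return merged
```

With toy vectors, a word close to a seed is added and an unrelated one is left out; `merge_negation` turns the tagged pair for "不" and "喜欢" into a single feature "不喜欢".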
Step 3, reduce the dimensionality of the sentiment features.
Step 3.1, compute the weight of each feature in the NVAA feature set using the proposed DF-TF-MI. TF is the term frequency, and MI measures the statistical association between a feature item and a class. In feature dimensionality reduction, selecting high-frequency feature words can improve text classification performance, so a term-frequency factor is added; the improved formula is shown in (2).
Here t denotes a feature item and c_i denotes a class. Arg(TF) is the mean of the TF values of the feature item over the documents. A is the number of documents in class c_i that contain feature item t. B is the number of documents that contain feature item t but do not belong to class c_i. C is the number of documents that do not contain feature item t but belong to class c_i. D is the number of documents that neither belong to class c_i nor contain feature item t. Class c_i has M = A + C documents, and the training set contains N = A + B + C + D documents in total.
Step 3.2, sort the features by weight in descending order and take the leading features up to a given dimension (150, 200, 250, 300, 500, 1000).
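The counts A, B, C and D and the ranking in Step 3.2 can be sketched as follows. Formula (2) is not reproduced in this text, so the sketch uses the textbook mutual information estimate log(A*N / ((A+C)*(A+B))) and, purely as an assumption, multiplies it by the mean term frequency to add the word-frequency factor. The real DF-TF-MI formula may combine these quantities differently. All function names are hypothetical.

```python
import math


def contingency(docs, term, cls):
    """Return (A, B, C, D) for a term and class.

    docs is a list of (tokens, label) pairs.
    """
    A = B = C = D = 0
    for tokens, label in docs:
        has_term = term in tokens
        if has_term and label == cls:
            A += 1
        elif has_term:
            B += 1
        elif label == cls:
            C += 1
        else:
            D += 1
    return A, B, C, D


def mutual_information(A, B, C, D):
    """Textbook MI estimate; returns 0.0 when the term never occurs in the class."""
    if A == 0:
        return 0.0
    N = A + B + C + D
    return math.log(A * N / ((A + C) * (A + B)))


def average_tf(docs, term):
    """Mean relative frequency of the term over all documents (assumed reading)."""
    if not docs:
        return 0.0
    return sum(tokens.count(term) / len(tokens) for tokens, _ in docs if tokens) / len(docs)


def tf_mi_weight(docs, term, cls):
    """ASSUMED combination: mean TF times MI. Not the patent's formula (2)."""
    return average_tf(docs, term) * mutual_information(*contingency(docs, term, cls))


def top_k_features(docs, vocabulary, classes, k):
    """Rank terms by their best per-class weight and keep the first k."""
    scores = {t: max(tf_mi_weight(docs, t, c) for c in classes) for t in vocabulary}
    return sorted(vocabulary, key=lambda t: scores[t], reverse=True)[:k]
```

In the experiments described above, `k` would take the values 150, 200, 250, 300, 500 and 1000.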
Step 4, weight the sentiment features with the tf*KL algorithm.
Step 4.1, compute the importance of each feature within a given document.
Step 4.2, compute the contribution of each feature to expressing sentiment.
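The source names the two factors of tf*KL but gives no formulas for them. The sketch below is one plausible reading, offered as an assumption only: tf is the relative frequency of the feature in the document, and KL is the Kullback-Leibler divergence between the class distribution of documents containing the feature and the overall class distribution. A feature that appears equally in all classes then gets weight 0. Function names are hypothetical.

```python
import math
from collections import Counter


def class_priors(docs):
    """Overall class distribution P(c) over (tokens, label) documents."""
    counts = Counter(label for _, label in docs)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def kl_contribution(docs, term):
    """ASSUMED KL factor: KL(P(c | term) || P(c)), with 0 * log 0 taken as 0."""
    priors = class_priors(docs)
    with_term = Counter(label for tokens, label in docs if term in tokens)
    total = sum(with_term.values())
    if total == 0:
        return 0.0
    kl = 0.0
    for cls, prior in priors.items():
        p = with_term[cls] / total
        if p > 0:
            kl += p * math.log(p / prior)
    return kl


def tf_kl_vector(tokens, features, docs):
    """Feature vector for one document: relative frequency times KL factor."""
    counts = Counter(tokens)
    n = len(tokens) or 1
    return [counts[f] / n * kl_contribution(docs, f) for f in features]
```

The resulting vectors, one per comment, are the inputs to the classifier in Step 5.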
Step 5, classify the sentiment orientation. The resulting feature vectors are classified with the RADA algorithm, which is a weighted combination of weak classifiers. The experimental results are shown in Table 2:
Table 2. Accuracy (%) of the sentiment classification comparison experiments
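The source describes RADA only as an algorithm built from weighted weak classifiers and does not define it further. As a stand-in, the sketch below implements textbook discrete AdaBoost with one-feature threshold stumps; it illustrates the general idea of a weighted vote of weak classifiers and may differ from RADA in its details. Labels are assumed to be -1 (negative) and +1 (positive), and all names are hypothetical.

```python
import math


def stump_predict(stump, row):
    """A stump is (feature index, threshold, polarity)."""
    j, thr, polarity = stump
    return polarity if row[j] > thr else -polarity


def train_stump(X, y, w):
    """Return (weighted error, stump) for the best single-feature stump."""
    best_err, best_stump = None, None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                stump = (j, thr, polarity)
                err = sum(
                    wi for xi, yi, wi in zip(X, y, w)
                    if stump_predict(stump, xi) != yi
                )
                if best_err is None or err < best_err:
                    best_err, best_stump = err, stump
    return best_err, best_stump


def adaboost_fit(X, y, rounds=10):
    """Discrete AdaBoost; returns a list of (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, stump = train_stump(X, y, w)
        if err >= 0.5:  # no stump beats chance on the current weights
            break
        clipped = max(err, 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1.0 - clipped) / clipped)
        ensemble.append((alpha, stump))
        if err == 0.0:
            break
        # Raise the weight of misclassified samples, lower the rest.
        w = [
            wi * math.exp(-alpha * yi * stump_predict(stump, xi))
            for xi, yi, wi in zip(X, y, w)
        ]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    """Weighted vote of the weak classifiers."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

Each row of `X` would be a weighted feature vector from Step 4, and `adaboost_predict` returns the predicted orientation for a new comment.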
The experimental results show that this method outperforms SCCR overall. On the hotel review corpus, its accuracy reaches 90.91%, which is 5.46% higher than the best SCCR result. On the mobile phone review corpus, its accuracy reaches 93.67%, which is 0.34% higher than the best SCCR result, and 2% higher when the feature dimension is 1000.
Test results: the experiments evaluated the short text sentiment classification method based on emotion word expansion. The feature expansion method it uses, which incorporates word vectors, takes the semantic information between words into account and expands the emotion words with semantically similar words, which effectively improves the accuracy of sentiment orientation analysis.
The specific description above further explains the purpose, technical solution and beneficial effects of the invention. It should be understood that the above is only a specific embodiment of the present invention and does not limit its scope of protection; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (4)
1. A short text sentiment classification method based on emotion word expansion, characterized in that the method comprises the following steps:
Step 1, split the comment text into a set of sentences and perform word segmentation and part-of-speech tagging with the jieba segmentation tool to obtain the preprocessed result;
Step 2, for each short text comment, obtain the word vector of each word from GloVe trained on a Wikipedia corpus, use the word vectors to compute the semantic similarity between the other words and the initial sentiment features whose parts of speech are N, V, Adj and Adv, and add the semantically similar words to the initial sentiment feature set;
Step 3, propose DF-TF-MI, which improves a traditional feature dimensionality reduction method with inter-word statistical features, and perform feature dimensionality reduction to obtain a low-dimensional feature set;
Step 4, weight the sentiment features;
Step 5, classify the resulting feature vectors by sentiment orientation with the RADA algorithm, which is a weighted combination of weak classifiers.
2. The short text sentiment classification method based on emotion word expansion according to claim 1, characterized in that: in step 2, cosine similarity in the vector space is used to represent the semantic similarity of words; the semantic similarity between the other words in a comment and each feature in NVAA is computed from the word vectors, and similar words are then selected according to semantic similarity and added to the NVAA features.
3. The short text sentiment classification method based on emotion word expansion according to claim 1, characterized in that: in step 3, DF-TF-MI is proposed and feature dimensionality reduction is performed using the statistical features between words.
4. The short text sentiment classification method based on emotion word expansion according to claim 1, characterized in that: in step 4, the tf*KL algorithm is used to weight the sentiment features, where tf is the importance of an emotion word feature within a document and KL is the importance of the emotion word feature in expressing sentiment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810234391.1A CN108376133A (en) | 2018-03-21 | 2018-03-21 | Short text sentiment classification method based on emotion word expansion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810234391.1A CN108376133A (en) | 2018-03-21 | 2018-03-21 | Short text sentiment classification method based on emotion word expansion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108376133A true CN108376133A (en) | 2018-08-07 |
Family
ID=63018921
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810234391.1A Pending CN108376133A (en) | 2018-03-21 | 2018-03-21 | Short text sentiment classification method based on emotion word expansion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108376133A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103955451A (en) * | 2014-05-15 | 2014-07-30 | 北京优捷信达信息科技有限公司 | Method for judging emotional tendentiousness of short text |
CN104462409A (en) * | 2014-12-12 | 2015-03-25 | 重庆理工大学 | Cross-language emotional resource data identification method based on AdaBoost |
WO2017024553A1 (en) * | 2015-08-12 | 2017-02-16 | 浙江核新同花顺网络信息股份有限公司 | Information emotion analysis method and system |
CN107193801A (en) * | 2017-05-21 | 2017-09-22 | 北京工业大学 | A kind of short text characteristic optimization and sentiment analysis method based on depth belief network |
Non-Patent Citations (1)
Title |
---|
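Lu Kai (路凯): "Improvement of a mutual information feature selection method based on a comprehensive ratio factor" ("基于综合比率因子的互信息特征选择方法的改进"), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) * |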
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109388800A (en) * | 2018-09-30 | 2019-02-26 | 江苏师范大学 | A kind of short text sentiment analysis method based on adding window term vector feature |
CN109446423A (en) * | 2018-10-26 | 2019-03-08 | 北京捷报数据技术有限公司 | A kind of Judgment by emotion system and method for news and text |
CN109492105A (en) * | 2018-11-10 | 2019-03-19 | 上海文军信息技术有限公司 | A kind of text sentiment classification method based on multiple features integrated study |
CN109492105B (en) * | 2018-11-10 | 2022-11-15 | 上海五节数据科技有限公司 | Text emotion classification method based on multi-feature ensemble learning |
CN109902300A (en) * | 2018-12-29 | 2019-06-18 | 深兰科技(上海)有限公司 | A kind of method, apparatus, electronic equipment and storage medium creating dictionary |
CN109871429A (en) * | 2019-01-31 | 2019-06-11 | 郑州轻工业学院 | Merge the short text search method of Wikipedia classification and explicit semantic feature |
CN111611374A (en) * | 2019-02-25 | 2020-09-01 | 北京嘀嘀无限科技发展有限公司 | Corpus expansion method and device, electronic equipment and storage medium |
CN111221962A (en) * | 2019-11-18 | 2020-06-02 | 重庆邮电大学 | Text emotion analysis method based on new word expansion and complex sentence pattern expansion |
CN111191032A (en) * | 2019-12-24 | 2020-05-22 | 深圳追一科技有限公司 | Corpus expansion method and device, computer equipment and storage medium |
CN111191032B (en) * | 2019-12-24 | 2023-09-12 | 深圳追一科技有限公司 | Corpus expansion method, corpus expansion device, computer equipment and storage medium |
CN111552802A (en) * | 2020-03-09 | 2020-08-18 | 北京达佳互联信息技术有限公司 | Text classification model training method and device |
CN111797898A (en) * | 2020-06-03 | 2020-10-20 | 武汉大学 | Online comment automatic reply method based on deep semantic matching |
CN112507115A (en) * | 2020-12-07 | 2021-03-16 | 重庆邮电大学 | Method and device for classifying emotion words in barrage text and storage medium |
CN112507115B (en) * | 2020-12-07 | 2023-02-03 | 重庆邮电大学 | Method and device for classifying emotion words in barrage text and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108376133A (en) | The short text sensibility classification method expanded based on emotion word | |
CN104391942B (en) | Short essay eigen extended method based on semantic collection of illustrative plates | |
CN105183833B (en) | Microblog text recommendation method and device based on user model | |
CN103207913B (en) | The acquisition methods of commercial fine granularity semantic relation and system | |
CN108052593A (en) | A kind of subject key words extracting method based on descriptor vector sum network structure | |
CN107239439A (en) | Public sentiment sentiment classification method based on word2vec | |
CN107992542A (en) | A kind of similar article based on topic model recommends method | |
CN108388554B (en) | Text emotion recognition system based on collaborative filtering attention mechanism | |
CN104331506A (en) | Multiclass emotion analyzing method and system facing bilingual microblog text | |
CN101295294A (en) | Improved Bayes acceptation disambiguation method based on information gain | |
CN106202032A (en) | A kind of sentiment analysis method towards microblogging short text and system thereof | |
CN103034626A (en) | Emotion analyzing system and method | |
CN107145560B (en) | Text classification method and device | |
CN106599054A (en) | Method and system for title classification and push | |
CN109960799A (en) | A kind of Optimum Classification method towards short text | |
CN109815400A (en) | Personage's interest extracting method based on long text | |
CN108804595B (en) | Short text representation method based on word2vec | |
CN102567308A (en) | Information processing feature extracting method | |
CN109086375A (en) | A kind of short text subject extraction method based on term vector enhancing | |
CN110134799B (en) | BM25 algorithm-based text corpus construction and optimization method | |
CN104899188A (en) | Problem similarity calculation method based on subjects and focuses of problems | |
Man | Feature extension for short text categorization using frequent term sets | |
Sabuna et al. | Summarizing Indonesian text automatically by using sentence scoring and decision tree | |
Do et al. | Korean twitter emotion classification using automatically built emotion lexicons and fine-grained features | |
Cao et al. | Machine learning based detection of clickbait posts in social media |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180807 |