CN108376133A - Short text sentiment classification method based on emotion word expansion - Google Patents

Short text sentiment classification method based on emotion word expansion

Publication number
CN108376133A
Authority
CN
China
Prior art keywords
word
emotion
short text
feature
affective characteristics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810234391.1A
Other languages
Chinese (zh)
Inventor
罗森林
李东超
潘丽敏
毛焱颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a short text sentiment classification method based on emotion word expansion, and belongs to the technical field of computer and information science. The method first splits comment text into a set of sentences and performs word segmentation and part-of-speech tagging with the jieba segmentation tool to obtain the preprocessing result. Second, for each short text comment, GloVe is trained on a Wikipedia corpus to obtain a word vector for each word; the word vectors are used to compute the semantic similarity between other words and the initial sentiment features, whose parts of speech are N, V, Adj, and Adv, and semantically similar words are added to the initial sentiment feature set. Then DF-TF-MI is proposed, which improves the traditional feature dimensionality reduction method by using statistical features between words; it is applied to obtain a low-dimensional feature set, and the sentiment features are weighted. Finally, the resulting feature vectors are classified by sentiment orientation with the RADA algorithm, which is composed of weighted weak classifiers. The invention addresses the problem of out-of-vocabulary words in sentiment dictionaries, effectively alleviates the sentiment feature sparsity caused by the small number of effective emotion words in short text comments, and improves the performance and accuracy of sentiment orientation analysis.

Description

Short text sentiment classification method based on emotion word expansion
Technical field
The present invention relates to a short text sentiment classification method based on emotion word expansion, and belongs to the technical field of computer and information science.
Background
Chinese short texts are brief and carry little information, yet short user comments still express the user's sentiment orientation. Sentiment orientation analysis of short texts can therefore uncover the real opinions and attitudes hidden beneath the surface text. The present invention provides a short text sentiment classification method based on emotion word expansion to improve the accuracy of comment sentiment classification, and thereby the practical value of systems such as user review analysis and product recommendation; it has both theoretical significance and commercial value.
A short text sentiment classification method based on emotion word expansion must solve three basic problems: (1) short text comments are short and contain few sentiment feature words, which easily leads to sparse sentiment features; (2) the feature dimensionality reduction in existing short text sentiment analysis methods performs poorly; (3) in supervised sentiment orientation analysis, the choice of classifier strongly affects classification precision. Existing short text sentiment orientation analysis methods generally fall into two categories:
1. Methods based on a sentiment dictionary
This is a common classification approach. The general idea is to extract the emotion words in a text using the sentiment-bearing words in a sentiment dictionary, compute a sentiment score for the text from the orientation and intensity of the extracted emotion words, and then determine the sentiment category of the text from that score. However, every language has an extremely rich vocabulary of emotion words, some emotion words change meaning with context, and different domains have their own emotion words. As a result, many sentiment dictionaries cannot cover all emotion words, and the out-of-vocabulary words that remain affect the analysis of text sentiment orientation.
2. Methods based on supervised learning
Supervised learning trains a classifier on training data annotated with class labels and then uses the trained classifier to classify test data. Support vector machines (SVM), naive Bayes (NB), and maximum entropy (ME) classifiers are among the most common. For supervised sentiment classification, the key is selecting and extracting effective text features. The main text feature extraction strategies include mutual information (MI), information gain (IG), word frequency (WF), document frequency (DF), and term frequency-inverse document frequency (TF-IDF). However, in supervised sentiment orientation analysis the choice of classifier strongly affects classification precision.
In summary, existing models do not adequately solve the sentiment feature sparsity of short texts, the feature dimensionality reduction in existing short text sentiment analysis methods performs poorly, and the choice of classifier in supervised sentiment orientation analysis strongly affects classification precision.
Summary of the invention
The purpose of the present invention is to improve the accuracy of short text sentiment classification by expanding the sentiment features and improving the traditional feature dimensionality reduction method. To this end, it proposes a short text sentiment classification method based on emotion word expansion.
The design principle of the invention is as follows. First, the text is preprocessed: word segmentation and part-of-speech tagging are performed with the jieba segmentation tool to obtain a set of tagged sentences. Second, word vectors are trained with GloVe; for each short text comment, the word vector of each word is obtained, the semantic similarity between other words and the basic emotion words is computed from the word vectors, and semantically similar words are added to the initial sentiment feature set. Then DF-TF-MI is proposed, which improves the traditional feature dimensionality reduction method using statistical features between words; it is applied to obtain a low-dimensional feature set, and the tf*KL algorithm is then used to weight the sentiment features. Finally, the resulting feature vectors are classified by sentiment orientation with the RADA algorithm, which is composed of weighted weak classifiers.
The technical solution of the present invention is achieved through the following steps:
Step 1, preprocess the text.
Step 1.1, split the text into a set of sentences.
Step 1.2, perform word segmentation and part-of-speech tagging with the jieba segmentation tool.
Step 2, expand the sentiment features.
Step 2.1, generate word vectors with GloVe.
Step 2.2, compute the semantic similarity between other words and the basic emotion words from the word vectors.
Step 2.3, select similar words according to semantic similarity and add them to the sentiment feature set.
Step 3, reduce the dimensionality of the sentiment features, improving the traditional feature dimensionality reduction method with statistical features between words.
Step 4, weight the sentiment features: assign each sentiment feature a weight by analyzing the contribution of emotion words to sentiment orientation.
Step 5, classify sentiment: classify the resulting feature vectors with the RADA algorithm, which is composed of weighted weak classifiers.
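Step 5 relies on a classifier built from weighted weak classifiers. The text does not define RADA further, so the following is only a minimal sketch of generic discrete AdaBoost with decision stumps, which is one well-known way to weight weak classifiers; it is not necessarily the RADA variant. The function names, the stump form, and the labels +1 (positive) and -1 (negative) are assumptions made for illustration.

```python
import math

def stump_predict(stump, x):
    """A stump is (feature index, threshold, polarity); returns +1 or -1."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity

def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                stump = (f, thr, pol)
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(stump, row) != yi)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err

def adaboost_train(X, y, rounds=10):
    """Generic discrete AdaBoost (assumed stand-in for RADA)."""
    n = len(X)
    w = [1.0 / n] * n                      # uniform sample weights
    ensemble = []
    for _ in range(rounds):
        stump, err = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this weak classifier
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified samples, lower the others.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
        if err <= 1e-10:                   # training set already separated
            break
    return ensemble

def adaboost_predict(ensemble, x):
    """Sign of the alpha-weighted vote of all weak classifiers."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

In this sketch each row of `X` would be a weighted sentiment feature vector from step 4, and the ensemble's vote gives the predicted orientation.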
Advantageous effects
Compared with methods based on a sentiment dictionary, the present invention addresses the incomplete coverage of emotion words in sentiment dictionaries and improves the performance of sentiment orientation analysis.
Compared with traditional supervised learning methods, the present invention effectively alleviates the sentiment feature sparsity caused by the small number of effective emotion words in short text comments and improves the accuracy of sentiment classification.
Description of the drawings
Fig. 1 is a schematic diagram of the short text sentiment classification method based on emotion word expansion.
Fig. 2 shows the effect of training set size on classification accuracy for the Chinese mobile phone review corpus.
Fig. 3 shows the effect of training set size on classification accuracy for the Chinese hotel review corpus.
Fig. 4 shows the experimental results of feature expansion on the Chinese mobile phone review corpus.
Fig. 5 shows the experimental results of feature expansion on the Chinese hotel review corpus.
Detailed description of the embodiments
To better illustrate the objects and advantages of the present invention, embodiments of the method are described in further detail below with reference to examples.
The detailed process is as follows:
Step 1, preprocess the text.
Step 1.1, split the text into a set of sentences.
Step 1.2, perform word segmentation and part-of-speech tagging with the jieba segmentation tool. The experimental data used are shown in Table 1:
Table 1. Short text sentiment orientation analysis experimental data (number of items)
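Step 1.1 can be illustrated with a small sentence splitter. This is a sketch under assumptions: the text does not say which punctuation marks delimit sentences, so the set of terminators below (Chinese and ASCII) is illustrative only, and `split_sentences` is a hypothetical helper name. The segmentation and part-of-speech tagging of step 1.2 would then be applied to each returned sentence and is not shown here.

```python
import re

# Assumed sentence terminators; the source does not list them.
_SENTENCE = re.compile(r"[^。！？!?；;]+[。！？!?；;]*")

def split_sentences(comment):
    """Split a comment into sentences, keeping the closing punctuation."""
    parts = (m.strip() for m in _SENTENCE.findall(comment))
    return [p for p in parts if p]
```

A trailing fragment without closing punctuation is kept as its own sentence, so no comment text is dropped before tagging.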
Step 2, expand the sentiment features.
Step 2.1, train GloVe to generate word vectors. A word vector represents a word as a real-valued vector. Training on a given corpus maps text content to vectors in a vector space, drawing on ideas from neural networks, and the cosine similarity in that vector space represents the semantic similarity between words.
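Once GloVe has been trained, its vectors are commonly stored as plain text with one word per line followed by space-separated numbers. The sketch below parses that layout into a dictionary, assuming the trained vectors are available in this text form; `parse_vectors` is a hypothetical helper, and the training itself is outside the scope of the example.

```python
def parse_vectors(lines):
    """Parse 'word v1 v2 ... vn' lines into {word: [v1, ..., vn]}."""
    vectors = {}
    for line in lines:
        fields = line.rstrip().split(" ")
        if len(fields) < 2:        # skip blank or malformed lines
            continue
        vectors[fields[0]] = [float(x) for x in fields[1:]]
    return vectors
```

Because the function accepts any iterable of lines, it can be fed an open file object directly or a list of strings.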
Step 2.2, compute the semantic similarity between other words and the basic emotion words from the word vectors. The semantic similarity between two word vectors ranges from 0 to 1; a larger value means the two words are more related, and a smaller value means they are less related. For a word vector N = (w_1, w_2, ..., w_n), the semantic similarity of word vectors N_1 and N_2 is computed by formula (1).
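Formula (1) is not reproduced in this text. Since step 2.1 and claim 2 both name cosine similarity, the following sketch assumes the standard cosine of the angle between two vectors. Note that plain cosine similarity can be negative, whereas the text states a range of 0 to 1; how the source handles negative values is not stated, so this sketch returns the raw cosine.

```python
import math

def cosine_similarity(u, v):
    """Standard cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0                 # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)
```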
Step 2.3, select similar words according to semantic similarity and add them to the basic sentiment feature set. The features selected when expanding the sentiment features of a short text comment must not only reflect the sentiment cues of the comment correctly but also distinguish one document clearly from others; such features can be words, word combinations, and so on. When selecting features for each short comment, the combination of N, V, Adj, and Adv, denoted NVAA, is used as the comment's sentiment features. Words whose similarity exceeds a threshold are added to the NVAA feature set to expand the sentiment semantic features. The threshold ranges from 0 to 1 and is increased in steps of 0.05; the threshold that yields the highest sentiment classification accuracy is chosen. During feature expansion, negation words change the meaning of the words that follow them. Therefore, when a negation word ("no" or "not") occurs, it is combined with the adjacent adjective, adverb, or verb that follows to form a phrase used as a single feature, for example "not liking" or "uncomfortable".
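The two mechanisms of step 2.3, merging a negation word with the following content word and adding candidates whose similarity exceeds a threshold, can be sketched as follows. Several details are assumptions: the negation tokens `不` and `没`, the part-of-speech tag prefixes `a`, `d`, and `v` for adjective, adverb, and verb, and the `similarity` callable are all illustrative, and the text does not say whether a candidate is compared against every seed feature or only the best match (this sketch accepts a candidate if any seed exceeds the threshold).

```python
NEGATIONS = {"不", "没"}           # assumed negation tokens
CONTENT_TAGS = ("a", "d", "v")     # assumed tag prefixes: adjective, adverb, verb

def merge_negations(tagged):
    """Join a negation word with the next adjective/adverb/verb.

    `tagged` is a list of (word, pos_tag) pairs for one sentence.
    """
    merged, i = [], 0
    while i < len(tagged):
        word, _ = tagged[i]
        nxt = tagged[i + 1] if i + 1 < len(tagged) else None
        if word in NEGATIONS and nxt and nxt[1].startswith(CONTENT_TAGS):
            merged.append(word + nxt[0])   # e.g. one feature for "not" + "like"
            i += 2
        else:
            merged.append(word)
            i += 1
    return merged

def expand_features(seed_features, candidates, similarity, threshold):
    """Add candidates whose similarity to any seed exceeds the threshold."""
    expanded = set(seed_features)
    for word in candidates:
        if word in expanded:
            continue
        if any(similarity(word, seed) > threshold for seed in seed_features):
            expanded.add(word)
    return expanded

def threshold_grid(step=0.05):
    """Thresholds from 0 to 1 in fixed steps, as swept in step 2.3."""
    n = round(1 / step)
    return [round(i * step, 2) for i in range(n + 1)]
```

In use, each value from `threshold_grid()` would be passed to `expand_features`, the classifier evaluated on the expanded set, and the threshold with the highest accuracy kept.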
Step 3, reduce the dimensionality of the sentiment features.
Step 3.1, compute the weight of each feature in the NVAA feature set using the proposed DF-TF-MI. TF is the word frequency, and MI measures the statistical relationship between a feature term and a class. Choosing high-frequency feature words during feature dimensionality reduction can improve text classification performance, so word frequency is added as a factor; the improved formula is formula (2).
Here t denotes a feature term and c_i denotes a class. Arg(TF) is the mean of the TF values of the feature term over the documents. A is the number of documents in class c_i that contain t. B is the number of documents that contain t but do not belong to c_i. C is the number of documents that belong to c_i but do not contain t. D is the number of documents that neither belong to c_i nor contain t. The number of documents in class c_i is M = A + C, and the total number of documents in the training set is N = A + B + C + D.
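Formula (2) is not reproduced in this text, so the exact DF-TF-MI weighting cannot be shown. As a point of reference, the sketch below computes the conventional document-count estimate of mutual information from the counts A, B, C, and D defined above, MI(t, c_i) = log(A·N / ((A + C)(A + B))), together with the mean TF. The final function, which multiplies the two, is purely a hypothetical illustration of adding word frequency as a factor and should not be read as the patent's formula.

```python
import math

def mutual_information(a, b, c, d):
    """Conventional MI estimate from document counts A, B, C, D."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")       # term never occurs in the class
    return math.log(a * n / ((a + c) * (a + b)))

def mean_tf(tf_per_document):
    """Mean term frequency of a feature over the documents."""
    if not tf_per_document:
        return 0.0
    return sum(tf_per_document) / len(tf_per_document)

def tf_weighted_mi(a, b, c, d, tf_per_document):
    # HYPOTHETICAL combination for illustration only; formula (2) of the
    # source is not available and may differ.
    return mean_tf(tf_per_document) * mutual_information(a, b, c, d)
```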
Step 3.2, sort the features by weight in descending order and keep the top-ranked features up to a given dimension (150, 200, 250, 300, 500, or 1000).
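The selection in step 3.2 amounts to a sort followed by a cut-off. A minimal sketch, assuming the weights are held in a dictionary keyed by feature; the helper name and the alphabetical tie-break are illustrative choices, while the dimension list comes from the text.

```python
DIMENSIONS = (150, 200, 250, 300, 500, 1000)   # dimensions listed in step 3.2

def top_k_features(weights, k):
    """Return the k features with the largest weights, in descending order."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]
```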
Step 4, weight the sentiment features with the tf*KL algorithm.
Step 4.1, compute the importance of a feature within a document.
Step 4.2, compute the contribution of the feature to expressing sentiment.
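The text names the two factors of tf*KL but gives no formulas. The sketch below is one plausible reading, stated as an assumption: tf is the relative frequency of the feature in the document, and the KL factor is the Kullback-Leibler divergence between the class distribution given the feature and the prior class distribution, so that a feature concentrated in one sentiment class scores higher. All function and parameter names are hypothetical.

```python
import math

def kl_contribution(term_class_counts, class_priors):
    """KL divergence of P(class | term) from P(class); assumed definition."""
    total = sum(term_class_counts.values())
    if total == 0:
        return 0.0
    kl = 0.0
    for label, prior in class_priors.items():
        p = term_class_counts.get(label, 0) / total
        if p > 0:                  # 0 * log(0) is treated as 0
            kl += p * math.log(p / prior)
    return kl

def tf_kl_weight(count_in_doc, doc_length, term_class_counts, class_priors):
    """Weight = relative term frequency in the document times the KL factor."""
    tf = count_in_doc / doc_length if doc_length else 0.0
    return tf * kl_contribution(term_class_counts, class_priors)
```

Under this reading, a feature that appears equally often in positive and negative comments gets a KL factor of zero and therefore no weight, regardless of its frequency.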
Step 5, classify sentiment orientation. The resulting feature vectors are classified by sentiment orientation with the RADA algorithm, which is composed of weighted weak classifiers. The experimental results are shown in Table 2:
Table 2. Accuracy of the sentiment classification comparison experiment (%)
The experimental results show that this method outperforms SCCR overall. On the hotel review corpus, its accuracy reaches 90.91%, 5.46% higher than the best SCCR result. On the mobile phone review corpus, its accuracy reaches 93.67%, 0.34% higher than the best SCCR result, and 2% higher at a feature dimension of 1000.
Test conclusion: in the tested short text sentiment classification method based on emotion word expansion, the feature expansion method that incorporates word vectors takes the sentiment semantic information between words into account and expands the feature set with semantically similar emotion words, which effectively improves the accuracy of sentiment orientation analysis.
The specific description above further details the purpose, technical solution, and advantageous effects of the invention. It should be understood that the above is only a specific embodiment of the invention and is not intended to limit its scope of protection; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (4)

1. A short text sentiment classification method based on emotion word expansion, characterized in that the method comprises the following steps:
Step 1, split the comment text into a set of sentences, and perform word segmentation and part-of-speech tagging with the jieba segmentation tool to obtain the preprocessing result;
Step 2, for each short text comment, train GloVe on a Wikipedia corpus to obtain the word vector of each word, use the word vectors to compute the semantic similarity between other words and the initial sentiment features whose parts of speech are N, V, Adj, and Adv, and add semantically similar words to the initial sentiment feature set;
Step 3, propose DF-TF-MI, which improves the traditional feature dimensionality reduction method using statistical features between words, and perform feature dimensionality reduction to obtain a low-dimensional feature set;
Step 4, weight the sentiment features;
Step 5, classify the resulting feature vectors by sentiment orientation with the RADA algorithm, which is composed of weighted weak classifiers.
2. The short text sentiment classification method based on emotion word expansion according to claim 1, characterized in that: in step 2, the cosine similarity in the vector space represents the semantic similarity between words; the semantic similarity between the other words in a comment and each feature in NVAA is computed from the word vectors, and similar words are then selected according to semantic similarity and added to the NVAA features.
3. The short text sentiment classification method based on emotion word expansion according to claim 1, characterized in that: in step 3, DF-TF-MI is proposed and feature dimensionality reduction is performed using the statistical features between words.
4. The short text sentiment classification method based on emotion word expansion according to claim 1, characterized in that: in step 4, the sentiment features are weighted with the tf*KL algorithm, where tf is the importance of an emotion word feature within a document and KL is the importance of the emotion word feature in expressing sentiment.
CN201810234391.1A 2018-03-21 2018-03-21 Short text sentiment classification method based on emotion word expansion Pending CN108376133A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810234391.1A CN108376133A (en) 2018-03-21 2018-03-21 Short text sentiment classification method based on emotion word expansion

Publications (1)

Publication Number Publication Date
CN108376133A true CN108376133A (en) 2018-08-07

Family

ID=63018921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810234391.1A Pending CN108376133A (en) 2018-03-21 2018-03-21 Short text sentiment classification method based on emotion word expansion

Country Status (1)

Country Link
CN (1) CN108376133A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955451A (en) * 2014-05-15 2014-07-30 北京优捷信达信息科技有限公司 Method for judging emotional tendentiousness of short text
CN104462409A (en) * 2014-12-12 2015-03-25 重庆理工大学 Cross-language emotional resource data identification method based on AdaBoost
WO2017024553A1 (en) * 2015-08-12 2017-02-16 浙江核新同花顺网络信息股份有限公司 Information emotion analysis method and system
CN107193801A (en) * 2017-05-21 2017-09-22 北京工业大学 A kind of short text characteristic optimization and sentiment analysis method based on depth belief network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
路凯: "基于综合比率因子的互信息特征选择方法的改进", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109388800A (en) * 2018-09-30 2019-02-26 江苏师范大学 A kind of short text sentiment analysis method based on adding window term vector feature
CN109446423A (en) * 2018-10-26 2019-03-08 北京捷报数据技术有限公司 A kind of Judgment by emotion system and method for news and text
CN109492105A (en) * 2018-11-10 2019-03-19 上海文军信息技术有限公司 A kind of text sentiment classification method based on multiple features integrated study
CN109492105B (en) * 2018-11-10 2022-11-15 上海五节数据科技有限公司 Text emotion classification method based on multi-feature ensemble learning
CN109902300A (en) * 2018-12-29 2019-06-18 深兰科技(上海)有限公司 A kind of method, apparatus, electronic equipment and storage medium creating dictionary
CN109871429A (en) * 2019-01-31 2019-06-11 郑州轻工业学院 Merge the short text search method of Wikipedia classification and explicit semantic feature
CN111611374A (en) * 2019-02-25 2020-09-01 北京嘀嘀无限科技发展有限公司 Corpus expansion method and device, electronic equipment and storage medium
CN111221962A (en) * 2019-11-18 2020-06-02 重庆邮电大学 Text emotion analysis method based on new word expansion and complex sentence pattern expansion
CN111191032A (en) * 2019-12-24 2020-05-22 深圳追一科技有限公司 Corpus expansion method and device, computer equipment and storage medium
CN111191032B (en) * 2019-12-24 2023-09-12 深圳追一科技有限公司 Corpus expansion method, corpus expansion device, computer equipment and storage medium
CN111552802A (en) * 2020-03-09 2020-08-18 北京达佳互联信息技术有限公司 Text classification model training method and device
CN111797898A (en) * 2020-06-03 2020-10-20 武汉大学 Online comment automatic reply method based on deep semantic matching
CN112507115A (en) * 2020-12-07 2021-03-16 重庆邮电大学 Method and device for classifying emotion words in barrage text and storage medium
CN112507115B (en) * 2020-12-07 2023-02-03 重庆邮电大学 Method and device for classifying emotion words in barrage text and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180807