CN105956095A - Psychological pre-warning model establishment method based on fine-granularity sentiment dictionary - Google Patents

Info

Publication number
CN105956095A
Authority
CN
China
Prior art keywords
word
dictionary
emotion
sentiment
step
Prior art date
Application number
CN201610286515.1A
Other languages
Chinese (zh)
Other versions
CN105956095B (en)
Inventor
于瑞国
林榆旺
王建荣
于健
喻梅
刘江月
Original Assignee
天津大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 天津大学
Priority to CN201610286515.1A
Publication of CN105956095A
Application granted
Publication of CN105956095B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention discloses a method for establishing a psychological early-warning model based on a fine-grained sentiment dictionary. The method comprises the steps of: 1, obtaining a Chinese dictionary corresponding to the ANEW dictionary by translation; 2, screening the vocabulary and deleting words not suitable for sentiment analysis from the Chinese dictionary obtained in step 1; 3, normalizing the sentiment values to between -1 and 1; 4, expanding the sentiment dictionary based on the Tongyici Cilin (Extended Edition) thesaurus; 5, expanding the dictionary based on an improved SO-PMI algorithm; 6, performing rule-based sentiment orientation analysis on microblog text; and 7, executing a sentiment analysis algorithm based on weight factors. Compared with the prior art, the method is not limited by corpus size, can run fully unsupervised, and is particularly suitable for large volumes of unlabeled microblog data.

Description

A method for constructing a psychological early-warning model based on a fine-grained sentiment dictionary

Technical field

The invention belongs to the field of data mining and information retrieval, and in particular relates to a method for constructing a psychological early-warning model based on a fine-grained sentiment dictionary.

Background

At present, most prior-art research on text analysis addresses sentiment analysis of English text, including polarity dictionaries, context relation transducers, and the like. Chinese, however, has a very large vocabulary, and manually annotating Chinese affective resources would require an enormous amount of work. Research on how to rapidly build a Chinese emotion vocabulary from existing English resources is therefore of great significance.

Building a sentiment dictionary requires a way to measure words. The three-dimensional emotion model PAD (Pleasure-displeasure, Arousal-nonarousal, Dominance-submissiveness), proposed by Mehrabian, is the most widely applied emotion model. P stands for pleasure, A for arousal, and D for dominance. The PAD model can be used to measure the emotional category that a word represents, as shown in Table 1:

Table 1. Examples of the affective types corresponding to each PAD dimension

Margaret M. Bradley and Professor Peter J. Lang, researchers at a psychology research center at the University of Florida, proposed a dictionary that standardizes the emotional ratings of English words: the Affective Norms for English Words (ANEW). The ANEW emotion vocabulary takes PAD as its prototype and rates written material along the three PAD dimensions. Researchers in many countries have built on ANEW to annotate their own languages.

Summary of the invention

In view of the above prior art and its problems, the present invention proposes a method for constructing a psychological early-warning model based on a fine-grained sentiment dictionary. The method covers the construction and expansion of a Chinese sentiment dictionary, the detection of sentiment orientation in microblog text, and psychological early warning. It is basic research for directions such as Chinese text research, sentiment analysis, and Internet public opinion analysis. It supports further research on text by making work on Chinese text more efficient, and it provides a psychological early-warning method that detects the psychological emotion of a text, so that online public opinion and the sentiment orientation of users can be recognized in time.

The present invention proposes a method for constructing a psychological early-warning model based on a fine-grained sentiment dictionary. The method comprises the following steps:

Step 1: obtain the Chinese dictionary corresponding to the ANEW dictionary by translation;

Step 2: screen the vocabulary, deleting from the Chinese dictionary obtained in Step 1 the words that are not suitable for sentiment analysis;

Step 3: normalize the emotion values, mapping the emotion value of each word into the range -1 to 1. The normalization formula is:

$$Y = \frac{X - \mathrm{Avevalue}}{\mathrm{Maxvalue} - \mathrm{Minvalue}}$$

where Avevalue is the mean emotion intensity, Maxvalue is the maximum intensity of the category to which the emotion word belongs, Minvalue is the minimum intensity of that category, X is the emotion intensity of the word, and Y is the emotion intensity after normalization;

Step 4: expand the sentiment dictionary based on the Tongyici Cilin (Extended Edition) thesaurus;

Step 5: expand the dictionary using an improved SO-PMI algorithm, as follows.

According to the following formulas:

$$W_i(\mathrm{word}) = \sum_{j} \gamma \, \mathrm{PMI}(\mathrm{word}, w_{ij})$$

$$\mathrm{SO}(\mathrm{word}) = \max_{i} W_i(\mathrm{word})$$

where γ is an adjustment coefficient, $w_{ij}$ is the j-th benchmark word in the i-th emotion category, and $W_i(\mathrm{word})$ is the SO-PMI value of the new word with respect to the i-th emotion category;

Among the SO-PMI values computed for the new word in the different categories, the maximum is selected, and the new word is assigned to the corresponding category;

Step 6: perform rule-based sentiment orientation analysis on microblog text, including word segmentation, expansion of the text extraction rules, polarity shifting, handling of degree adverbs, and analysis of the structures "negation word + degree adverb + emotion word" and "degree adverb + negation word + emotion word", which are assigned different weights;

Step 7: execute the sentiment analysis algorithm based on weight factors, expressed as:

$$\mathrm{SO}(S) = \max_{j} \sum_{i} \alpha^{C_i W_{ij}}$$

where SO(S) is the sentiment orientation value of sentence S, $W_{ij}$ is the emotion value of the i-th emotion word W, which belongs to emotion category j, $C_i$ is the weight factor that modifies this emotion word, and α is an adjustment coefficient.

Compared with the prior art, the advantages of the above technical solution are that it is not limited by corpus size, can run fully unsupervised, and is particularly suitable for large volumes of unlabeled microblog data.

Brief description of the drawings

Fig. 1 is an overall flow diagram of the method of the present invention for constructing a psychological early-warning model based on a fine-grained sentiment dictionary.

Detailed description of the invention

The technical solution is described in further detail below with reference to the drawing and specific embodiments.

As shown in Fig. 1, the method of the present invention for constructing a psychological early-warning model based on a fine-grained sentiment dictionary comprises the following steps:

Step 1: translate. The Chinese dictionary corresponding to the ANEW dictionary is obtained by combining manual work with machine translation. This step comprises the following processes:

Process 1: collate and merge all lexical entries in the ANEW dictionary, and reject English words in the past tense, which cannot be represented in Chinese. Process 2: obtain a bilingual table by machine translation; during translation, check the words one by one to confirm their accuracy and avoid major ambiguity. Process 3: correct the entries that are inconsistent between the Chinese and English dictionaries, and add the option best suited to sentiment analysis to the dictionary. The result is a Chinese vocabulary of a certain scale;

Step 2: screen the vocabulary by deleting words that are not suitable for sentiment analysis. Specifically, words that are expressed differently in Chinese and English are deleted, because such words would affect the sentiment analysis result;

Step 3: normalize the emotion values, mapping the emotion value of each word into the range -1 to 1. In ANEW, emotion words are rated on a scale of 1 to 8, where values from small to large represent the range from negative to positive emotion intensity. Taking into account the opposing polarities of emotion words and the rating scale of the PAD dimension values, the intensity value of each word is normalized using formula (1):

$$Y = \frac{X - \mathrm{Avevalue}}{\mathrm{Maxvalue} - \mathrm{Minvalue}} \tag{1}$$

where Avevalue is the mean emotion intensity, Maxvalue is the maximum intensity of the category to which the emotion word belongs, Minvalue is the minimum intensity of that category, X is the emotion intensity of the word, and Y is the emotion intensity after normalization;
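The normalization in formula (1) can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `normalize_emotion` and the sample ratings are assumptions introduced here.

```python
def normalize_emotion(x, ave_value, max_value, min_value):
    """Formula (1): Y = (X - Avevalue) / (Maxvalue - Minvalue)."""
    span = max_value - min_value
    if span == 0:
        raise ValueError("max_value and min_value must differ")
    return (x - ave_value) / span


# Hypothetical ratings for the words of one emotion category.
ratings = {"joy": 7.5, "calm": 5.0, "grief": 2.5}

ave = sum(ratings.values()) / len(ratings)
highest = max(ratings.values())
lowest = min(ratings.values())

normalized = {
    word: normalize_emotion(value, ave, highest, lowest)
    for word, value in ratings.items()
}
```

Because the mean lies between the minimum and the maximum, the numerator can never exceed the denominator in magnitude, so every normalized value stays within -1 to 1.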

Step 4: expand the sentiment dictionary based on the Tongyici Cilin (Extended Edition) thesaurus. This step comprises the following processes. Process 1: take microblog data as the corpus and filter out words with an extremely low frequency of occurrence, so as to build a more effective sentiment dictionary. Process 2: compute word similarity using the word similarity algorithm of the Harbin Institute of Technology Tongyici Cilin, combined with existing semantic dictionaries;
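A rough sketch of the two processes in Step 4 follows. The text does not say how a similarity score turns into a dictionary entry, so several things here are assumptions: the `min_count` and `threshold` values, the rule that a candidate inherits the emotion value of its most similar seed word, and the `similarity` callable, which stands in for the Cilin-based similarity algorithm and is supplied by the caller.

```python
from collections import Counter


def filter_low_frequency(tokenized_posts, min_count=2):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(tok for post in tokenized_posts for tok in post)
    return {word for word, count in counts.items() if count >= min_count}


def expand_by_similarity(seed_dict, candidates, similarity, threshold=0.8):
    """Assumed rule: a candidate inherits the value of its most similar seed."""
    expanded = dict(seed_dict)
    for cand in candidates:
        if cand in expanded:
            continue
        best_seed, best_sim = None, 0.0
        for seed in seed_dict:
            score = similarity(cand, seed)
            if score > best_sim:
                best_seed, best_sim = seed, score
        if best_seed is not None and best_sim >= threshold:
            expanded[cand] = seed_dict[best_seed]
    return expanded


# Toy corpus of already segmented posts (hypothetical data).
posts = [["开心", "今天"], ["高兴", "今天"], ["高兴", "罕见词"]]
frequent = filter_low_frequency(posts, min_count=2)


def toy_similarity(a, b):
    """Placeholder for a thesaurus-based similarity measure."""
    return 0.9 if (a, b) == ("高兴", "开心") else 0.1


expanded = expand_by_similarity({"开心": 0.8}, frequent, toy_similarity)
```

In this toy run, "罕见词" is dropped by the frequency filter, and "高兴" enters the dictionary with the value of the seed "开心".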

Step 5: expand the dictionary using an improved SO-PMI algorithm. This step comprises the following processes. Process 1: use new Internet words to expand the sentiment dictionary; the selected benchmark words for positive and negative emotion are denoted PS and NS respectively. Process 2: for a new word, compute its PMI values against the PS set and the NS set, denoted WP and WN respectively. The PMI value between two words is computed as shown in formula (2):

$$\mathrm{PMI}(\mathrm{word}_1, \mathrm{word}_2) \approx \log_2\left(\frac{N \cdot f(\mathrm{word}_1, \mathrm{word}_2)}{f(\mathrm{word}_1) \cdot f(\mathrm{word}_2)}\right) \tag{2}$$

where N is the total number of word occurrences in the corpus, $f(\mathrm{word}_1, \mathrm{word}_2)$ is the frequency with which $\mathrm{word}_1$ and $\mathrm{word}_2$ occur together in the corpus, $f(\mathrm{word}_1)$ is the frequency of $\mathrm{word}_1$ in the corpus, $f(\mathrm{word}_2)$ is the frequency of $\mathrm{word}_2$ in the corpus, and $\log_2$ is the base-2 logarithm. In formula (2), $\mathrm{word}_1$ may be taken as the new word and $\mathrm{word}_2$ as a word from the PS or NS set (or the other way round).
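Formula (2) can be illustrated with a small counting routine. The sketch below assumes that two words "occur together" when they appear in the same post, and it returns 0.0 when a pair never co-occurs (the logarithm is undefined there). Both choices, and the names `build_counts` and `pmi`, are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def build_counts(tokenized_posts):
    """Count single-word frequencies and same-post co-occurrences."""
    word_freq = Counter()
    pair_freq = Counter()
    for post in tokenized_posts:
        word_freq.update(post)
        for a, b in combinations(sorted(set(post)), 2):
            pair_freq[(a, b)] += 1
    return word_freq, pair_freq


def pmi(word1, word2, word_freq, pair_freq, total):
    """Formula (2): log2(N * f(w1, w2) / (f(w1) * f(w2)))."""
    joint = pair_freq[tuple(sorted((word1, word2)))]
    if joint == 0 or word_freq[word1] == 0 or word_freq[word2] == 0:
        return 0.0  # assumed fallback when the pair never co-occurs
    return math.log2(total * joint / (word_freq[word1] * word_freq[word2]))


# Toy corpus of segmented posts (hypothetical data).
posts = [["a", "b"], ["a", "b"], ["a", "c"], ["d", "c"]]
word_freq, pair_freq = build_counts(posts)
total = sum(word_freq.values())  # N: total word occurrences in the corpus

score_ab = pmi("a", "b", word_freq, pair_freq, total)
score_bd = pmi("b", "d", word_freq, pair_freq, total)
```

Here "a" and "b" co-occur twice among 8 word occurrences, so `score_ab` equals log2(8 · 2 / (3 · 2)), while "b" and "d" never share a post and fall back to 0.0.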

When the PMI value is computed between a word and a set of words, formula (3) is used:

$$\mathrm{PMI}(\mathrm{word}_1, \mathrm{WordSet}) \approx \sum_{\mathrm{word}' \in \mathrm{WordSet}} \log_2\left(\frac{N \cdot f(\mathrm{word}_1, \mathrm{word}')}{f(\mathrm{word}_1) \cdot f(\mathrm{word}')}\right) \tag{3}$$

where WordSet is a set of words and word' is a word in WordSet;

Process 3: the SO-PMI value is computed as shown in formula (4):

$$\mathrm{SO}(\mathrm{word}) = \mathrm{PMI}(\mathrm{word}, PS) - \mathrm{PMI}(\mathrm{word}, NS) \tag{4}$$

where SO(word) is the SO-PMI value of the word. If the value is greater than 0, the word is added to the positive dictionary; if it is less than 0, the word is added to the negative dictionary; otherwise it is not added to any dictionary.
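Formulas (3) and (4) and the dictionary assignment rule can be sketched as follows. The pairwise PMI is passed in as a callable so that the example stays independent of how the counts are obtained; the lookup table, the seed words, and the function names are hypothetical.

```python
def set_pmi(word, word_set, pair_pmi):
    """Formula (3): sum of pairwise PMI values over a set of words."""
    return sum(pair_pmi(word, ref) for ref in word_set)


def so_pmi(word, positive_seeds, negative_seeds, pair_pmi):
    """Formula (4): SO(word) = PMI(word, PS) - PMI(word, NS)."""
    return (set_pmi(word, positive_seeds, pair_pmi)
            - set_pmi(word, negative_seeds, pair_pmi))


def assign_dictionary(word, positive_seeds, negative_seeds, pair_pmi):
    """Return 'positive', 'negative', or None according to the sign of SO."""
    score = so_pmi(word, positive_seeds, negative_seeds, pair_pmi)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return None


# Hypothetical precomputed PMI values between a new word and seed words.
PMI_TABLE = {("给力", "好"): 2.0, ("给力", "棒"): 1.0, ("给力", "坏"): 0.5}


def table_pmi(a, b):
    return PMI_TABLE.get((a, b), 0.0)


PS = {"好", "棒"}
NS = {"坏", "差"}

score = so_pmi("给力", PS, NS, table_pmi)
label = assign_dictionary("给力", PS, NS, table_pmi)
```

With these toy values the new word scores 3.0 against PS and 0.5 against NS, so it goes to the positive dictionary. A word with no co-occurrence on either side scores exactly 0 and is left out.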

Process 4: the improved method of the present invention is shown in formulas (5) and (6):

$$W_i(\mathrm{word}) = \sum_{j} \gamma \, \mathrm{PMI}(\mathrm{word}, w_{ij}) \tag{5}$$

$$\mathrm{SO}(\mathrm{word}) = \max_{i} W_i(\mathrm{word}) \tag{6}$$

where γ is an adjustment coefficient, $w_{ij}$ is the j-th benchmark word in the i-th emotion category, and $W_i(\mathrm{word})$ is the SO-PMI value of the new word with respect to the i-th emotion category. Formula (6) shows that, among the SO-PMI values computed for the new word in the different categories, the maximum is selected, which also yields the category to which the word belongs.
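The multi-category selection described by formulas (5) and (6) might look like the sketch below. It treats γ as a plain multiplicative factor on each PMI term, which is one reading of the formula as printed and should be taken as an assumption; the seed lists, the PMI table, and the function names are likewise hypothetical.

```python
def class_scores(word, seeds_by_class, pair_pmi, gamma=1.0):
    """Formula (5): W_i(word) for every emotion category i."""
    return {
        label: sum(gamma * pair_pmi(word, seed) for seed in seeds)
        for label, seeds in seeds_by_class.items()
    }


def classify_new_word(word, seeds_by_class, pair_pmi, gamma=1.0):
    """Formula (6): keep the largest W_i and the category that produced it."""
    scores = class_scores(word, seeds_by_class, pair_pmi, gamma)
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical benchmark words per category and precomputed PMI values.
seeds_by_class = {
    "happiness": ["开心", "快乐"],
    "anger": ["愤怒"],
    "sadness": ["难过"],
}
PMI_TABLE = {("嗨皮", "开心"): 1.5, ("嗨皮", "快乐"): 1.0, ("嗨皮", "愤怒"): 0.25}


def table_pmi(a, b):
    return PMI_TABLE.get((a, b), 0.0)


category, value = classify_new_word("嗨皮", seeds_by_class, table_pmi)
```

Unlike the two-set version in formula (4), this variant needs no subtraction: each fine-grained category gets its own score, and the argmax gives both the SO-PMI value and the category label in one pass.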

Step 6: perform rule-based sentiment orientation analysis on microblog text. This step comprises the following processes. Process 1: segment the text using ICTCLAS, the Chinese lexical analyzer developed by the Institute of Computing Technology, Chinese Academy of Sciences. Process 2: expand the text extraction rules to obtain the rules used by the present invention, as shown in Table 3. Process 3: shift polarity: the emotion value of a word modified by a negation word (such as "not" or "not necessarily") is multiplied by a coefficient of -1; for a sentence containing an adversative (such as "although" or "but"), only the latter half of the sentence is analyzed. Process 4: handle degree adverbs, which are divided into five levels according to intensity, with values between 0.5 and 3. Process 5: analyze the structures "negation word + degree adverb + emotion word" and "degree adverb + negation word + emotion word", and assign them different weights.
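The polarity and degree rules of Step 6 can be sketched on pre-segmented tokens as follows. The word lists, the five degree values, and the damping factor used to distinguish "negation + degree adverb" from "degree adverb + negation" are all assumptions chosen for illustration; the text only states that the values lie between 0.5 and 3 and that the two structures receive different weights.

```python
# Hypothetical lexicons; the patent does not list the actual entries.
NEGATIONS = {"不", "没", "未必"}
ADVERSATIVES = {"但是", "但", "然而"}
DEGREE_ADVERBS = {"极其": 3.0, "非常": 2.5, "很": 2.0, "比较": 1.5, "稍微": 0.5}
NEG_BEFORE_DEGREE_DAMPING = 0.5  # assumed weight for "negation + degree + emotion"


def clause_after_adversative(tokens):
    """Keep only the part of the sentence after the last adversative."""
    for idx in range(len(tokens) - 1, -1, -1):
        if tokens[idx] in ADVERSATIVES:
            return tokens[idx + 1:]
    return tokens


def modifier_weight(modifiers):
    """Combine the modifiers that directly precede an emotion word, in order."""
    weight = 1.0
    seen_negation = False
    for tok in modifiers:
        if tok in NEGATIONS:
            weight *= -1.0
            seen_negation = True
        elif tok in DEGREE_ADVERBS:
            factor = DEGREE_ADVERBS[tok]
            if seen_negation:  # e.g. "not very happy": a softened negation
                factor *= NEG_BEFORE_DEGREE_DAMPING
            weight *= factor
    return weight


def weighted_emotions(tokens, emotion_dict):
    """Return (emotion word, weighted value) pairs for one sentence."""
    results, pending = [], []
    for tok in clause_after_adversative(tokens):
        if tok in NEGATIONS or tok in DEGREE_ADVERBS:
            pending.append(tok)
        elif tok in emotion_dict:
            results.append((tok, modifier_weight(pending) * emotion_dict[tok]))
            pending = []
        else:
            pending = []
    return results


emotion_dict = {"高兴": 0.8, "难过": -0.6}
very_unhappy = weighted_emotions(["很", "不", "高兴"], emotion_dict)
not_very_happy = weighted_emotions(["不", "很", "高兴"], emotion_dict)
after_but = weighted_emotions(["虽然", "难过", "但是", "非常", "高兴"], emotion_dict)
```

Under these assumed weights, "很不高兴" (very unhappy) comes out more negative than "不很高兴" (not very happy), and the sentence with "但是" keeps only the emotion word after the adversative.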

Step 7: execute the sentiment analysis algorithm based on weight factors. The text sentiment orientation classification algorithm based on weighting factors (WF-SO) of the present invention is shown in formula (7):

$$\mathrm{SO}(S) = \max_{j} \sum_{i} \alpha^{C_i W_{ij}} \tag{7}$$

where SO(S) is the sentiment orientation value of sentence S, $W_{ij}$ is the emotion value of the i-th emotion word W, which belongs to emotion category j, $C_i$ is the weight factor that modifies this emotion word, and α is an adjustment coefficient. When α is 1, the orientation of the text is the category of the emotion word that occurs most often in the sentence; when α tends to infinity, the orientation of the text is the category of the word with the greatest emotion intensity in the sentence.
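The behavior described for α can be checked with a short sketch. It reads α as the base of an exponent whose power is the weighted emotion value, an interpretation suggested by the two limiting cases stated above (α = 1 counts words, large α favors the strongest word) rather than something the text spells out. The function name, the input layout, and the sample values are assumptions, and the emotion values are taken as non-negative intensities within each category.

```python
from collections import defaultdict


def wf_so(emotion_hits, alpha=2.0):
    """Pick the category with the largest sum of alpha ** (C_i * W_ij).

    emotion_hits: iterable of (category, emotion_value, weight_factor).
    Returns (best_category, score), or (None, 0.0) if there are no hits.
    """
    totals = defaultdict(float)
    for category, value, weight in emotion_hits:
        totals[category] += alpha ** (weight * value)
    if not totals:
        return None, 0.0
    best = max(totals, key=totals.get)
    return best, totals[best]


# Hypothetical sentence: two mild "happiness" words, one strong "anger" word.
hits = [
    ("happiness", 0.3, 1.0),
    ("happiness", 0.2, 1.0),
    ("anger", 0.9, 2.0),
]

by_count = wf_so(hits, alpha=1.0)        # every term equals 1, so counts decide
by_intensity = wf_so(hits, alpha=1000.0)  # the largest exponent dominates
```

With α = 1 the sentence is labeled "happiness" because that category has two emotion words; with α = 1000 the single strongly weighted "anger" word outweighs both, which matches the two limiting cases described for formula (7).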

Table 2. Comparison of macro-average experimental results

Table 3. Text extraction rules

The present invention uses the data provided by the NLP&CC (Natural Language Processing & Chinese Computing) 2013 Chinese microblog orientation analysis evaluation. Following the NLP&CC requirements, emotion sentences are identified and classified. The experiment yields a precision of 0.3420, a recall of 0.8873, and an F value of 0.4935. Although the precision is relatively low, the sentiment dictionary construction method of the present invention has the following advantages: it is not limited by corpus size, can run fully unsupervised, and is particularly suitable for large volumes of unlabeled microblog data.

The macro-average is the arithmetic mean of the performance indicators of each emotion class; the micro-average is the arithmetic mean of the performance indicators over each instance document. In the microblog emotion classification experiment (like, happiness, anger, sadness, fear, disgust, surprise), the present invention obtains micro-average precision, recall, and F value of 0.3332, 0.2959, and 0.3134 respectively, and macro-average precision, recall, and F value of 0.3411, 0.2232, and 0.2698 respectively. Overall, the results are satisfactory. Table 2 compares the method of the present invention with other methods on the macro-average indicators.
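The evaluation measures can be sketched as follows. The micro-average here pools the counts of all classes, which is the usual formulation and an assumption about what the evaluation did. The macro F value is derived from the averaged precision and recall, because the reported macro figures are consistent with that relationship. The per-class counts are hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def macro_average(counts_by_class):
    """Average per-class precision and recall, then derive F from the averages."""
    pairs = [precision_recall(*c) for c in counts_by_class.values()]
    precision = sum(p for p, _ in pairs) / len(pairs)
    recall = sum(r for _, r in pairs) / len(pairs)
    return precision, recall, f_measure(precision, recall)


def micro_average(counts_by_class):
    """Pool true positives, false positives, and false negatives over classes."""
    tp = sum(c[0] for c in counts_by_class.values())
    fp = sum(c[1] for c in counts_by_class.values())
    fn = sum(c[2] for c in counts_by_class.values())
    precision, recall = precision_recall(tp, fp, fn)
    return precision, recall, f_measure(precision, recall)


# Hypothetical (tp, fp, fn) counts for two emotion classes.
counts = {"happiness": (8, 2, 2), "anger": (1, 1, 3)}
macro = macro_average(counts)
micro = micro_average(counts)

# Consistency check against the figures reported in the text.
reported = [(0.3420, 0.8873, 0.4935), (0.3332, 0.2959, 0.3134), (0.3411, 0.2232, 0.2698)]
gaps = [abs(f_measure(p, r) - f) for p, r, f in reported]
```

Each reported F value agrees with the harmonic mean of its precision and recall to within rounding, which also indicates that the figure translated as "accuracy" in some renderings of this text is precision.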

Claims (1)

1. A method for constructing a psychological early-warning model based on a fine-grained sentiment dictionary, characterized in that the method comprises the following steps:
Step (1): obtain the Chinese dictionary corresponding to the ANEW dictionary by translation;
Step (2): screen the vocabulary, deleting from the Chinese dictionary obtained in Step (1) the words that are not suitable for sentiment analysis;
Step (3): normalize the emotion values, mapping the emotion value of each word into the range -1 to 1, the normalization formula being:
$$Y = \frac{X - \mathrm{Avevalue}}{\mathrm{Maxvalue} - \mathrm{Minvalue}}$$
where Avevalue is the mean emotion intensity, Maxvalue is the maximum intensity of the category to which the emotion word belongs, Minvalue is the minimum intensity of that category, X is the emotion intensity of the word, and Y is the emotion intensity after normalization;
Step (4): expand the sentiment dictionary based on the Tongyici Cilin (Extended Edition) thesaurus; filter out the words in the corpus with an extremely low frequency of occurrence, so as to build a more effective sentiment dictionary; compute word similarity using the word similarity algorithm of the Harbin Institute of Technology Tongyici Cilin, combined with existing semantic dictionaries;
Step (5): expand the dictionary using an improved SO-PMI algorithm, as follows:
According to the following formulas:
$$W_i(\mathrm{word}) = \sum_{j} \gamma \, \mathrm{PMI}(\mathrm{word}, w_{ij})$$
$$\mathrm{SO}(\mathrm{word}) = \max_{i} W_i(\mathrm{word})$$
where γ is an adjustment coefficient, $w_{ij}$ is the j-th benchmark word in the i-th emotion category, and $W_i(\mathrm{word})$ is the SO-PMI value of the new word with respect to the i-th emotion category;
Among the SO-PMI values computed for the new word in the different categories, the maximum is selected, and the new word is assigned to the corresponding category;
Step (6): perform rule-based sentiment orientation analysis on microblog text, including word segmentation, expansion of the text extraction rules, polarity shifting, handling of degree adverbs, and analysis of the structures "negation word + degree adverb + emotion word" and "degree adverb + negation word + emotion word", which are assigned different weights;
Step (7): execute the sentiment analysis algorithm based on weight factors, expressed as:
$$\mathrm{SO}(S) = \max_{j} \sum_{i} \alpha^{C_i W_{ij}}$$
where SO(S) is the sentiment orientation value of sentence S, $W_{ij}$ is the emotion value of the i-th emotion word W, which belongs to emotion category j, $C_i$ is the weight factor that modifies this emotion word, and α is an adjustment coefficient.
CN201610286515.1A 2016-04-29 2016-04-29 A kind of psychological Early-warning Model construction method based on fine granularity sentiment dictionary CN105956095B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610286515.1A CN105956095B (en) 2016-04-29 2016-04-29 A kind of psychological Early-warning Model construction method based on fine granularity sentiment dictionary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610286515.1A CN105956095B (en) 2016-04-29 2016-04-29 A kind of psychological Early-warning Model construction method based on fine granularity sentiment dictionary

Publications (2)

Publication Number Publication Date
CN105956095A (en) 2016-09-21
CN105956095B (en) 2019-11-05

Family

ID=56914867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610286515.1A CN105956095B (en) 2016-04-29 2016-04-29 A kind of psychological Early-warning Model construction method based on fine granularity sentiment dictionary

Country Status (1)

Country Link
CN (1) CN105956095B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106708805A (en) * 2016-12-30 2017-05-24 深圳天珑无线科技有限公司 Text statistics-based psychoanalysis method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104090864A (en) * 2014-06-09 2014-10-08 合肥工业大学 Emotion dictionary building and emotion calculation method
US20150213002A1 (en) * 2014-01-24 2015-07-30 International Business Machines Corporation Personal emotion state monitoring from social media

Also Published As

Publication number Publication date
CN105956095B (en) 2019-11-05

Similar Documents

Publication Publication Date Title
Pasupat et al. Compositional semantic parsing on semi-structured tables
US20150347385A1 (en) Systems and Methods for Determining Lexical Associations Among Words in a Corpus
US10372741B2 (en) Apparatus for automatic theme detection from unstructured data
US20160189029A1 (en) Displaying Quality of Question Being Asked a Question Answering System
Hashimi et al. Selection criteria for text mining approaches
US9336306B2 (en) Automatic evaluation and improvement of ontologies for natural language processing tasks
US20160162492A1 (en) Confidence Ranking of Answers Based on Temporal Semantics
CN103885934B (en) Method for automatically extracting key phrases of patent documents
Fagan Experiments in automatic phrase indexing for document retrieval: a comparison of syntactic and non-syntactic methods
Hassan et al. Semantic relatedness using salient semantic analysis
Gorman et al. Quantitative analysis
Alowibdi et al. Empirical evaluation of profile characteristics for gender classification on twitter
CN102662952B (en) Chinese text parallel data mining method based on hierarchy
CN101872349B (en) Method and device for treating natural language problem
Sharifi et al. Summarizing microblogs automatically
US9558264B2 (en) Identifying and displaying relationships between candidate answers
US8356025B2 (en) Systems and methods for detecting sentiment-based topics
US10147036B2 (en) Analyzing concepts over time
CN104573046A (en) Comment analyzing method and system based on term vector
US8370129B2 (en) System and methods for quantitative assessment of information in natural language contents
US9183274B1 (en) System, methods, and data structure for representing object and properties associations
US9378204B2 (en) Context based synonym filtering for natural language processing systems
US20170011289A1 (en) Learning word embedding using morphological knowledge
Herbelot et al. Building a shared world: Mapping distributional to model-theoretic semantic spaces
US7899816B2 (en) System and method for the triage and classification of documents

Legal Events

Date Code Title Description
PB01 Publication
C06 Publication
SE01 Entry into force of request for substantive examination
C10 Entry into substantive examination
GR01 Patent grant
GR01 Patent grant