CN109376251A - A method for constructing a microblog Chinese sentiment dictionary based on a word vector learning model - Google Patents
- Publication number
- CN109376251A CN201811143903.XA
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a method for constructing a microblog Chinese sentiment dictionary based on a word vector learning model, comprising: (1) obtaining a training corpus suited to the characteristics of current microblog data; (2) preprocessing the training corpus; (3) constructing a candidate dictionary; (4) constructing a seed sentiment dictionary; (5) selecting and defining the training parameters and configuration; (6) training the word vector learning model; (7) evaluating the training result of the word vector learning model; (8) repeating step (6) until all parameter settings have been trained; (9) selecting the word vectors with the best evaluation result; (10) training a word-level sentiment polarity classifier; (11) applying the word-level sentiment polarity classifier to obtain the final target sentiment dictionary. The invention designs a word vector learning model that combines semantic and sentiment information and, on that basis, a Chinese sentiment dictionary construction method for microblogs, which can improve the efficiency and quality of obtaining a Chinese sentiment dictionary.
Description
Technical field
The present invention relates to word vector learning technology, in particular to a method for constructing a microblog Chinese sentiment dictionary based on a word vector learning model, and belongs to the technical field of natural language processing.
Background art
Sentiment analysis, also known as opinion mining or orientation analysis, is an important branch of natural language processing. Its task is to use computing resources to help people quickly obtain, organize, and analyze relevant review information, and to analyze, process, summarize, and reason about subjective text that carries emotional coloring. In recent years, with the spread and development of the Internet, and especially the rise of social networks of all kinds, users publish and propagate hundreds of millions of messages every day. A large share of this massive body of text expresses users' opinions and sentiment tendencies. Such sentiment-bearing text is a valuable opinion resource: it contains people's differing views and positions on all kinds of social phenomena, on topics spanning politics, economics, the military, entertainment, daily life, and many other fields. Individuals and organizations pay increasing attention to users' sentiment tendencies and opinions and use the results of analyzing them in decision-making. Automatic analysis of this text with computer technology therefore has very broad application in public opinion analysis, precision marketing, sales forecasting, and similar fields, and has drawn wide attention from enterprises, researchers, and government agencies.
Sentiment analysis is a relatively new and popular research field that began after 2000. Its sub-directions include sentiment classification, opinion extraction, opinion question answering, and opinion summarization. Classifying the sentiment tendency of a text can be regarded as sentiment analysis in the narrow sense and falls within text classification; extracting elements such as the opinion holder and the evaluation target is an information extraction problem; and determining the opinions about some object from a large amount of text can be regarded as an information retrieval problem.
For sentiment analysis, a high-quality sentiment dictionary is of great help. A sentiment dictionary differs from an ordinary dictionary in that, for each word, it records not the word's meaning or its foreign-language translation but its sentiment polarity. The polarity label may be coarse-grained ("positive", "negative", "neutral") or fine-grained ("anger", "fear", "liking", and so on). Beyond the polarity category, the strength of the polarity can also be given to express how intense the word's sentiment is. Sentiment dictionaries fall into three classes: basic sentiment dictionaries, expanded sentiment dictionaries, and domain sentiment dictionaries. A basic sentiment dictionary contains basic, widely accepted sentiment words such as "fine", "beautiful", "villain", and "ugly". An expanded sentiment dictionary is obtained by expanding a basic one, mainly by extending the sentiment words through a synonym dictionary. A basic sentiment dictionary alone is not enough to identify the sentiment words in a sentence, because words absent from it may still carry a sentiment tendency in certain domains. For example, in "this phone always blue-screens", "blue screen" is a word with negative sentiment in the domain of digital products such as mobile phones, so a domain dictionary is also needed.
A complete sentiment dictionary is a necessary but not sufficient condition for sentiment analysis; the sentiment of a text is strongly correlated with the sentiment of the words it contains. With a sentiment dictionary, one can judge whether each word in a sentence has a positive or negative tendency, or obtain finer-grained sentiment coloring and strength, which provides a reference basis for judging the sentiment tendency of a sentence or a document. How to automatically construct a sentiment dictionary with wide coverage and high quality is therefore of great research significance. At present, English sentiment dictionaries have produced many good results, whereas Chinese sentiment dictionaries, although some exist, still need improvement in quality.
Summary of the invention
The present invention is a method for constructing a microblog Chinese sentiment dictionary based on a word vector learning model. Aimed at the task of constructing a microblog Chinese sentiment dictionary, and drawing on existing word vector learning methods and the characteristics of the Chinese language, it proposes a word vector learning model that combines semantic and sentiment information and, taking into account the characteristics of microblog data, a method that uses this model to construct a microblog Chinese sentiment dictionary. During construction, microblog sentences are preprocessed in a targeted way and the training process of the word vector learning model is optimized, which improves the semantic and sentiment expressiveness of the resulting word vectors and ultimately the quality of the resulting Chinese sentiment dictionary.
The method of the present invention for constructing a microblog Chinese sentiment dictionary based on a word vector learning model is characterized by comprising the following steps:
Step (1): obtain a training corpus suited to the characteristics of current microblog data;
Step (2): preprocess the acquired microblog training corpus;
Step (3): construct a candidate dictionary, used as the dictionary when training the word vector learning model;
Step (4): construct a seed sentiment dictionary;
Step (5): select and define the training parameters and configuration of the word vector learning model;
Step (6): train the word vector learning model;
Step (7): evaluate the training result of the word vector learning model;
Step (8): repeat step (6) until all parameter settings have been trained;
Step (9): select the word vectors with the best evaluation result as the final vector features of the words;
Step (10): train a word-level sentiment polarity classifier;
Step (11): apply the word-level sentiment polarity classifier and obtain the final target sentiment dictionary.
Specifically, step (1) obtains a training corpus suited to the characteristics of current microblog data.
Step (2) preprocesses the acquired microblog training corpus to obtain, from the raw corpus, data that can be used directly for model training. It mainly comprises the following sub-steps:
Step (2.1): data normalization, which extracts the useful information from microblog sentences;
Step (2.2): sentiment parsing of microblog sentences, which labels each sentence with its sentiment polarity;
Step (2.3): Chinese word segmentation of microblog sentences;
Step (2.4): stop word setup, which filters out words that are meaningless for training the word vector learning model.
Step (3) constructs the candidate dictionary from the preprocessed microblog corpus.
Step (4) constructs the seed sentiment dictionary, which is used in the subsequent word vector learning process and in training the final word-level sentiment polarity classifier. It mainly comprises the following sub-steps:
Step (4.1): construct a basic seed sentiment dictionary from the candidate dictionary;
Step (4.2): starting from the seed sentiment dictionary, enlarge it using a synonym expansion method.
Step (5) selects and defines the parameters and configuration of the word vector learning model. It mainly comprises the following sub-steps:
Step (5.1): set up the non-compositional word list, which filters out words whose constituent characters' semantics need not be considered when training the word vector learning model;
Step (5.2): set the training parameters of the word vector learning model;
Step (5.3): set the evaluation criteria of the word vector learning model.
Step (6) trains the word vector learning model on the training corpus and obtains the corresponding word vectors.
Step (7) evaluates the word vectors obtained by training. The evaluation uses the word vectors in a word sentiment polarity classification task and judges their quality by accuracy and macro-averaged F-score.
Step (8) adjusts the training parameters and repeats step (6) until all parameter settings have been iterated over.
Step (9) selects, according to the evaluation results under the different training parameters, the word vectors with the best result as the final vector features of the words.
Step (10) trains the word-level sentiment polarity classifier using the obtained word vectors.
Step (11) applies the word-level sentiment polarity classifier to infer the sentiment polarity of the words in the candidate dictionary, forming the final target sentiment dictionary.
Compared with the prior art, the notable advantages of the present invention are as follows. Techniques such as regular expressions are used to strip irrelevant information from microblog sentences, avoiding its influence on the training result of the word vector learning model. Stop words are used to remove the influence of meaningless words on word vector training, reducing noise words and computational complexity. A non-compositional word list is used to remove the influence, on word vector training, of words whose character semantics need not be considered, reducing computational complexity. A word vector learning model that combines semantic and sentiment information is used to train the word vectors. The model combines three kinds of information: the semantic information of a word's context and of the characters composing the word, the sentiment polarity of the sentence, and the sentiment polarity of the word. Word vectors obtained with this model can better express the semantic and sentiment features of words.
Brief description of the drawings
Fig. 1: Flow of constructing the microblog Chinese sentiment dictionary based on the word vector learning model
Fig. 2: Flow of expanding the seed sentiment dictionary
Fig. 3: Number of words in each class of the seed sentiment dictionary
Fig. 4: Word vector learning model combining semantic and sentiment information
Fig. 5: Flow chart of word vector evaluation
Fig. 6: Flow chart of training the word-level sentiment polarity classifier
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in detail below with reference to the drawings and specific embodiments.
The object of the present invention is to provide an efficient and accurate method for constructing a Chinese sentiment dictionary; to that end it proposes a method for constructing a microblog Chinese sentiment dictionary based on a word vector learning model. Valid data is filtered out with regular expressions, sentiment is labeled using emoticons, the seed sentiment dictionary is built semi-automatically, and a novel word vector learning model combining semantic and sentiment information is used to train the word vectors; together these measures improve the quality of the resulting Chinese sentiment dictionary. The invention mainly comprises the following steps:
Step (1): obtain a training corpus suited to the characteristics of current microblog data;
Step (2): preprocess the acquired microblog training corpus;
Step (3): construct a candidate dictionary, used as the dictionary when training the word vector learning model;
Step (4): construct a seed sentiment dictionary;
Step (5): select and define the training parameters and configuration of the word vector learning model;
Step (6): train the word vector learning model;
Step (7): evaluate the training result of the word vector learning model;
Step (8): repeat step (6) until all parameter settings have been trained;
Step (9): select the word vectors with the best evaluation result as the final vector features of the words;
Step (10): train a word-level sentiment polarity classifier;
Step (11): apply the word-level sentiment polarity classifier and obtain the final target sentiment dictionary.
The detailed workflow of the above method is shown in Fig. 1. Each step is described in detail below.
1. Because the ways users express themselves on microblogs are ever-changing and new words emerge endlessly, the corpus is chosen, as far as possible, to be rich in sentiment expression and characteristic of the current period, so as to maximize the accuracy and timeliness of the sentiment dictionary finally obtained. In practice, users increasingly express opinions through emoticons, and a large number of microblog sentences contain emoticons rich in sentiment. Therefore, when crawling microblog data, only sentences that contain emoticons are collected.
2. To obtain valuable information from the microblog training corpus and discard meaningless content, the data must be preprocessed. This comprises the following sub-steps:
(2.1) Data normalization. The microblog data collected consists of users' everyday posts and lacks the regularity of formal documents. Part of its content has no value for model training: punctuation (commas, periods, exclamation marks, etc.), web links that may appear in the text, and other forms of information in the text (such as #topic#, @user, and other special symbols). In addition, because the goal is to construct a Chinese sentiment dictionary, English words in microblog sentences must be removed. Finally, regular expressions are used to strip this redundant information from the microblog text, leaving only the valuable text.
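The normalization described in (2.1) could be sketched as follows. The concrete patterns are assumptions, since the text names only the kinds of content to remove, and keeping only characters in the basic CJK range is a simplification chosen for this illustration.

```python
import re

# Hypothetical patterns: the source names the categories, not the expressions.
URL_RE = re.compile(r"https?://[A-Za-z0-9./?=&%_:~-]+")  # web links
TOPIC_RE = re.compile(r"#[^#]*#")                         # #topic#
MENTION_RE = re.compile(r"@[^\s:：]+")                    # @user
# Anything outside the basic CJK ideograph range: punctuation, Latin
# letters (English words), digits, and other special symbols.
NON_CHINESE_RE = re.compile(r"[^\u4e00-\u9fff]+")


def normalize(text: str) -> str:
    """Strip links, topics, and mentions, then all non-Chinese runs."""
    for pattern in (URL_RE, TOPIC_RE, MENTION_RE):
        text = pattern.sub(" ", text)
    # Each removed run becomes one space so clause boundaries survive.
    return NON_CHINESE_RE.sub(" ", text).strip()
```

Because this sketch also removes square brackets, emoticon tags would have to be read off before it runs; the source does not say how the two steps interact.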
(2.2) Labeling the sentiment polarity of microblog sentences. Only data containing emoticons is collected when obtaining microblog data. When labeling a sentence, the following strategy is used: if a microblog sentence contains only emoticons that express positive sentiment, its sentiment polarity is positive; conversely, its sentiment polarity is negative. In implementation, two sets of emoticons must be compiled: for example, [relative], [heart], and [struggle] express positive sentiment, while [sad], [cursing in rage], and [terrified] express negative sentiment. Using this strategy and these emoticon sets, the sentiment information of the microblog text is parsed by regular-expression matching, and each microblog sentence is finally labeled with its sentiment polarity.
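A minimal sketch of this labeling strategy is given below. The emoticon names are the examples from the text; the `[name]` tag syntax and the decision to leave mixed or unrecognized sentences unlabeled are assumptions made for the illustration.

```python
import re
from typing import Optional

# Example sets from the text; real sets would be far larger.
POSITIVE = {"relative", "heart", "struggle"}
NEGATIVE = {"sad", "cursing in rage", "terrified"}
EMOTICON_RE = re.compile(r"\[([^\[\]]+)\]")  # assumes emoticons appear as [name]


def label_sentence(text: str) -> Optional[str]:
    """Return 'positive', 'negative', or None for one microblog sentence."""
    names = set(EMOTICON_RE.findall(text))
    has_positive = bool(names & POSITIVE)
    has_negative = bool(names & NEGATIVE)
    if has_positive and not has_negative:
        return "positive"
    if has_negative and not has_positive:
        return "negative"
    # Assumption: sentences with mixed or unknown emoticons stay unlabeled.
    return None
```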
(2.3) Chinese word segmentation. Because the goal of this method is to construct a microblog Chinese sentiment dictionary, the Chinese word is its basic unit of operation. In implementation, the Jieba segmentation tool is used. It offers three segmentation modes: precise mode, full mode, and search engine mode. Precise mode splits the sentence as accurately as possible and is suited to text analysis; full mode outputs every word that can be scanned from the sentence and is prone to ambiguity; search engine mode produces segmentation suited to search engines. Given these characteristics, this method uses precise mode. Finally, the microblog sentences are segmented one by one.
(2.4) Stop word setup. The natural-language text in which users express sentiment takes many forms and contains a large number of pronouns, conjunctions, and interjections, such as "etc.", "oh", "I", and "then". These words are meaningless for training the word vector learning model and for the sentiment dictionary finally constructed. A stop word list is therefore set up before model training, and words in the list are removed during training to reduce their negative effect. In implementation, words in the segmented microblog sentences that appear in the stop word list are removed, and the filtered sentences serve as the corpus for subsequent training.
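As a small illustration of (2.4), the filter below operates on sentences that have already been segmented into tokens. The function name and the sample stop word list are hypothetical.

```python
from typing import Iterable, List, Set


def remove_stop_words(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Keep only the tokens that are not in the stop word list."""
    return [token for token in tokens if token not in stop_words]


def filter_corpus(sentences: Iterable[List[str]], stop_words: Set[str]) -> List[List[str]]:
    """Apply the filter to every segmented sentence, dropping emptied ones."""
    filtered = (remove_stop_words(sentence, stop_words) for sentence in sentences)
    return [sentence for sentence in filtered if sentence]
```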
3. Construct the candidate dictionary, the word set needed in the subsequent steps. The words in the preprocessed corpus are first sorted by frequency of occurrence, and words that occur too rarely are removed: a frequency threshold MIN_FREQUENCY is set, all words below the threshold are removed, and the remaining words form the candidate dictionary used in the subsequent steps. In the actual implementation the frequency threshold is set to 10.
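The frequency cut-off can be sketched in a few lines. `MIN_FREQUENCY` and its value 10 come from the text; the function name and the token-list input format are assumptions.

```python
from collections import Counter
from typing import Iterable, List

MIN_FREQUENCY = 10  # threshold value reported in the text


def build_candidate_dictionary(
    sentences: Iterable[List[str]], min_frequency: int = MIN_FREQUENCY
) -> List[str]:
    """Return words sorted by descending frequency, dropping rare ones."""
    counts = Counter(token for sentence in sentences for token in sentence)
    # Words whose count is below the threshold are removed.
    return [word for word, count in counts.most_common() if count >= min_frequency]
```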
4. Construct the seed sentiment dictionary, which is used in the subsequent word vector learning process and in training the final word-level sentiment polarity classifier. This comprises the following sub-steps:
(4.1) Construct the basic seed sentiment dictionary from the candidate dictionary. The basic seed sentiment dictionary is a small, highly reliable dictionary. In implementation it is built by manual annotation: five annotators are selected, the 500 most frequent words are taken from the candidate dictionary, and each annotator independently labels the sentiment polarity of these words. The labels fall into three classes: positive, negative, and other. Words whose labels agree across all five annotations enter the basic dictionary directly; where the labels differ, the majority-vote label is taken as the annotation result.
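Merging the five annotations by majority vote might look like the sketch below. The text does not say how a tie (for example 2-2-1) is resolved; this illustration assumes the label seen first among the tied ones wins.

```python
from collections import Counter
from typing import Dict, List


def merge_annotations(labels_per_word: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each word to the label most annotators chose.

    Unanimous labels are a special case of the majority. Ties go to the
    label encountered first, which is an assumption of this sketch.
    """
    merged = {}
    for word, labels in labels_per_word.items():
        merged[word] = Counter(labels).most_common(1)[0][0]
    return merged
```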
(4.2) Expand the seed sentiment dictionary. Although the polarity labels of the basic seed sentiment dictionary are highly reliable, the dictionary is small and manual annotation is inefficient, so an automated method is needed to expand it. The expansion flow is shown in Fig. 2. The steps are as follows. For each sentiment word w already in the basic seed sentiment dictionary (excluding sentiment-irrelevant words, i.e. words of the "other" class), look up its near-synonym set S in the Harbin Institute of Technology (HIT) Chinese synonym thesaurus. For each word w_new in S, look up its near-synonym set M in the same thesaurus and count the numbers n1, n2, n3 of positive, negative, and other-class words in M. If n1 > n2 + threshold_pos and n1 > n3 + threshold_pos, assign w_new to the positive class; the negative and other classes are handled analogously. The algorithm stops once every word in S has been examined. In actual expansion experience, setting the three thresholds to 1, 0, and 0 respectively gave the best expanded dictionary. The class sizes of the final seed sentiment dictionary are shown in Fig. 3.
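The expansion rule can be sketched as follows. The thesaurus is replaced by a plain mapping from a word to its near-synonyms, which is a hypothetical stand-in, and the threshold order (positive, negative, other) is an assumption, since the text lists the values 1, 0, 0 without naming them.

```python
from collections import Counter
from typing import Dict, Mapping, Set, Tuple

CLASSES = ("positive", "negative", "other")


def expand_seed_dictionary(
    seed: Mapping[str, str],
    synonyms: Mapping[str, Set[str]],
    thresholds: Tuple[int, int, int] = (1, 0, 0),
) -> Dict[str, str]:
    """Add near-synonyms of seed sentiment words to the seed dictionary."""
    expanded = dict(seed)
    for w, w_class in seed.items():
        if w_class == "other":  # only sentiment words are expanded
            continue
        for w_new in synonyms.get(w, set()):  # the set S
            if w_new in seed:
                continue
            # Count the seed classes among the near-synonyms M of w_new.
            counts = Counter(seed[m] for m in synonyms.get(w_new, set()) if m in seed)
            n = [counts[c] for c in CLASSES]  # n1, n2, n3
            for i, label in enumerate(CLASSES):
                others = (j for j in range(len(CLASSES)) if j != i)
                if all(n[i] > n[j] + thresholds[i] for j in others):
                    expanded[w_new] = label
                    break
    return expanded
```

With the default thresholds, a new word needs at least two more positive neighbors than neighbors of any other class to be labeled positive, while a single-vote lead is enough for the other two classes.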
5. Select and define the training parameters and configuration of the word vector learning model. This comprises the following sub-steps:
(5.1) Set up the non-compositional word list. Training of the subsequent word vector learning model takes into account the features of the characters that compose each word. However, not every character in a Chinese word is meaningful. Examples are transliterated loanwords such as "chocolate" and "sofa", which come from transliterating the pronunciation of English words; the individual characters inside them cannot express the meaning of the word itself. The same holds for many entity nouns, such as personal names, place names, and organization names. The character features of these words need not be considered during model training. In implementation, the non-compositional words are identified and extracted by manually reviewing the candidate dictionary.
(5.2) Set the training parameters of the word vector learning model, which determines the number of training runs to be carried out. Because what is needed is a by-product of model training, namely the word vectors, the influence of different training parameters on the quality of the resulting word vectors must be considered; how word vector quality is evaluated is set in the next sub-step. The parameters of main concern during training are window size, vector dimension, and initial learning rate.
(5.3) Set the evaluation criteria. Because what is needed is an intermediate product of model training, the final concern is not the result of model training itself but the word vector data that is output, so what must be evaluated is the quality of the word vectors. Here the word vectors are used as features in a word-level sentiment polarity classification task, and the classification results serve as the evaluation criterion. The evaluation procedure is described in detail in the model evaluation step below.
6. Train the word vector learning model. Once the microblog corpus has been preprocessed, the word vector learning model combining semantic and sentiment information is used to train the corresponding word vector features. The network structure of the model is shown in Fig. 4. The model jointly trains the word vectors on three kinds of linguistic information: the semantic information of a word's context and of the characters composing the word, the sentiment polarity of the sentence, and the sentiment polarity of the word. During training, each of the three kinds of information is trained separately, and the objective function is optimized with the Negative Sampling method. The objective functions of the three parts are given by formulas f1, f2, and f3 respectively.
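For orientation only, the sketch below computes the generic negative-sampling loss for one (center, context) pair in the style commonly used for word vector training. It is not the patent's f1, f2, or f3, which this text does not show; the function and argument names are hypothetical.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(
    center: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
) -> float:
    """Loss = -log s(c.p) - sum over sampled negatives of log s(-c.n)."""
    # Pull the true pair together.
    loss = -math.log(sigmoid(dot(center, positive)))
    # Push each sampled negative away from the center vector.
    for negative in negatives:
        loss -= math.log(sigmoid(-dot(center, negative)))
    return loss
```

The point of negative sampling is that each update touches only the true pair and a handful of sampled negatives instead of the whole vocabulary.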
7. Evaluate the training result of the word vector learning model. The object of evaluation is the word vectors obtained. The overall idea is to use the word vectors as features in a word-level sentiment polarity classification task and finally choose the parameter combination with the best classification performance. In implementation, an SVM classifier is built for the classification task; the whole evaluation process is shown in Fig. 5. The procedure is as follows. First, prepare the training and test sets: the expanded seed sentiment word set is used as the data, and, following the idea of k-fold cross-validation, the whole set is divided evenly into 5 parts. Second, in each round one part is selected as the test set and the remaining 4 parts are used as the training set for the SVM classifier. Third, the second step is repeated 5 times so that every part serves as the test set once. Training on each training set yields a model, which is tested on the corresponding test set, and its evaluation metrics are computed and saved. Fourth, the average of the 5 test results is computed as the estimate of model accuracy and taken as the performance of the model under the current k-fold cross-validation.
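The four-step procedure above can be sketched without any particular classifier library. The classifier is passed in as a callable, the folds are formed by simple interleaving without shuffling, and all names are hypothetical; the metrics are accuracy and macro-averaged F-score as named in the summary.

```python
from typing import Callable, List, Sequence, Tuple


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cross_validate(
    X: Sequence, y: Sequence[str],
    train_and_predict: Callable[[List, List[str], List], List[str]],
    k: int = 5,
) -> Tuple[float, float]:
    """Return mean accuracy and mean macro-F1 over k folds."""
    folds = [list(range(i, len(X), k)) for i in range(k)]  # interleaved split
    results = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        # Train on the other k-1 parts, predict on the held-out part.
        predictions = train_and_predict(
            [X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        truth = [y[i] for i in test_idx]
        results.append((accuracy(truth, predictions), macro_f1(truth, predictions)))
    return (sum(r[0] for r in results) / k, sum(r[1] for r in results) / k)
```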
8. Adjust the parameters of the word vector learning model and repeat model training and evaluation of the training result, recording the performance metrics obtained in the evaluation under each parameter setting.
9. Based on the results of the preceding model training and evaluation, decide which parameter combination to use for the final training of the word vector learning model. The word vectors produced by that training run serve as the basic word features for the subsequent construction of the sentiment dictionary. Following the actual training and evaluation process, the final parameter settings chosen for the word vector learning model are: window size 5, word vector dimension 200, and initial learning rate 0.025.
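Steps 8 and 9 together amount to a search over parameter combinations. The sketch below assumes an exhaustive grid; the text names the three parameters and the final values but not the candidate values tried, so the grid contents, `train`, and `evaluate` are all hypothetical placeholders.

```python
from itertools import product
from typing import Any, Callable, Dict, Sequence, Tuple


def select_best_parameters(
    grid: Dict[str, Sequence[Any]],
    train: Callable[..., Any],
    evaluate: Callable[[Any], float],
) -> Tuple[Dict[str, Any], float]:
    """Train once per parameter combination and keep the best-scoring one."""
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        vectors = train(**params)   # step 6: train the word vector model
        score = evaluate(vectors)   # step 7: score the resulting vectors
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```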
10. Train the word-level sentiment polarity classifier. The preceding training of the term vector learning
model yields a set of term vector features that capture both semantic and sentiment associations; these are
combined with the seed sentiment dictionary to build the word-level sentiment polarity classifier. The
sentiment polarity of a word is divided into three classes: positive, negative, and other. In the specific
implementation, an SVM classifier performs this polarity classification; the overall training process is
shown in Fig. 6. The seed sentiment word set and the term vector features are the input to model training,
with the seed sentiment word set split into a training set and a test set. Model training uses K-fold
cross-validation to obtain the classification performance indicators of each model. The model's own
hyperparameters are then adjusted to iterate training, and by comparing the indicators of the different
models, the classifier with the best classification performance is chosen as the final sentiment dictionary
inference engine. The parameters finally confirmed by this process are: the kernel function is the Gaussian
kernel (Radial Basis Function, RBF), the penalty coefficient C is 1, and the gamma parameter is 1/k, where k
is the number of features of a word, i.e. the dimension of the term vector.
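As an illustration of step 10, the classifier configuration can be sketched with scikit-learn, which is an assumption of this example and is not named in the original. `SVC(kernel="rbf", C=1.0, gamma="auto")` sets gamma to 1 divided by the number of features, which matches the 1/k setting above. The synthetic vectors, the number of folds, and the helper names `make_classifier` and `cross_validate` are hypothetical stand-ins for the real seed words and term vectors.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC


def make_classifier() -> SVC:
    # RBF kernel, penalty coefficient C = 1, gamma = 1 / n_features ("auto").
    return SVC(kernel="rbf", C=1.0, gamma="auto")


def cross_validate(X: np.ndarray, y: np.ndarray, n_splits: int = 5, seed: int = 0) -> np.ndarray:
    """Return the macro-averaged F1 score of each cross-validation fold."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(make_classifier(), X, y, cv=cv, scoring="f1_macro")


# Synthetic stand-in for seed-word term vectors (assumption: 200 dimensions,
# 30 seed words per class; real input would come from the trained model).
rng = np.random.default_rng(0)
dim = 200
centers = rng.normal(size=(3, dim))
X = np.vstack([c + 0.5 * rng.normal(size=(30, dim)) for c in centers])
y = np.repeat([0, 1, 2], 30)  # 0 = positive, 1 = negative, 2 = other

fold_scores = cross_validate(X, y)
clf = make_classifier().fit(X, y)  # final classifier trained on all seed words
```

Stratified folds are used here so that each fold keeps roughly the same share of the three polarity classes; the original only specifies K-fold cross-validation, so this is a design choice of the sketch.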
11. Apply the trained word-level sentiment polarity classifier. The word set to be inferred is the candidate
dictionary obtained earlier. Based on the term vector features of each word, the classifier assigns the word
to one of the three classes, yielding the sets of words whose sentiment polarity is positive, negative, and
other. Merging the positive and negative word sets produces the final target Chinese sentiment dictionary.
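The inference in step 11 can be sketched as below. The function name, the label strings, the `predict` callback, and the choice to skip candidate words that have no term vector are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, Optional, Sequence

POSITIVE, NEGATIVE, OTHER = "positive", "negative", "other"


def build_sentiment_dictionary(
    candidate_words: Iterable[str],
    vectors: Dict[str, Sequence[float]],
    predict: Callable[[Sequence[float]], str],
) -> Dict[str, str]:
    """Classify every candidate word and keep only the polar ones.

    `predict` is a hypothetical wrapper around the trained word-level
    classifier that maps one term vector to one of the three labels.
    """
    buckets = {POSITIVE: set(), NEGATIVE: set(), OTHER: set()}
    for word in candidate_words:
        vector: Optional[Sequence[float]] = vectors.get(word)
        if vector is None:
            continue  # assumption: words without a trained vector are skipped
        buckets[predict(vector)].add(word)
    # The target dictionary merges the positive and negative sets; words
    # classified as "other" are discarded.
    return {
        word: label
        for label in (POSITIVE, NEGATIVE)
        for word in sorted(buckets[label])
    }
```

Returning a word-to-polarity mapping rather than two separate sets is a convenience of this sketch; either form carries the same information.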
The microblog Chinese sentiment dictionary construction method based on a term vector learning model
implemented according to the present invention has been described in detail above with reference to the
accompanying drawings. The present invention has the following advantages: regular expressions and similar
techniques are used to parse out and remove irrelevant information in microblog sentences, avoiding its
influence on the training results of the term vector learning model; stop words are used to remove the
influence of meaningless words on term vector training, reducing noise words and computational complexity;
non-combined words are used to remove the influence of words whose contextual semantics need not be
considered, reducing computational complexity; and term vectors are obtained by training a term vector
learning model that combines semantic and sentiment information. The model combines three kinds of
information: the semantic information of a word's context and of the characters that compose the word, the
sentiment polarity information of the sentence, and the sentiment polarity information of the word. The
term vectors obtained from this model can therefore better express the semantic and sentiment features of
words.
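The cleaning and stop-word filtering mentioned among the advantages can be illustrated with a short sketch. The specific patterns (URLs, @mentions, `#topic#` markers) and the function names are assumptions chosen for the example; the original does not list which kinds of irrelevant information are removed.

```python
import re
from typing import Iterable, List, Set

# Hypothetical patterns for typical microblog noise.
URL_RE = re.compile(r"https?://[A-Za-z0-9./?=&_%#-]+")
MENTION_RE = re.compile(r"@[\w\-]+")
TOPIC_RE = re.compile(r"#([^#]+)#")
SPACE_RE = re.compile(r"\s+")


def clean_post(text: str) -> str:
    """Strip URLs and @mentions and unwrap #topic# markers from one post."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = TOPIC_RE.sub(r"\1", text)  # assumption: keep the topic text itself
    return SPACE_RE.sub(" ", text).strip()


def filter_stop_words(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Drop tokens that appear in the stop-word list, preserving order."""
    return [token for token in tokens if token not in stop_words]
```

Cleaning runs on the raw sentence before word segmentation, while stop-word filtering runs on the segmented tokens, which follows the order of the preprocessing sub-steps in the method.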
It should be made clear that the invention is not limited to the specific configurations and processes
described above and shown in the figures. For brevity, detailed descriptions of known methods and techniques
are omitted here. The present embodiments are to be considered in all respects as illustrative and not
restrictive; the scope of the invention is defined by the appended claims rather than by the foregoing
description, and all changes that fall within the meaning and range of equivalents of the claims are
included within the scope of the invention.
Claims (12)
1. A microblog Chinese sentiment dictionary construction method based on a term vector learning model,
characterized in that a term vector learning model fusing contextual semantics and sentiment information is
designed and trained to obtain the corresponding term vector features, and a word-level sentiment classifier
is then trained for the sentiment inference of the final Chinese words, comprising the following steps:
Step (1): obtain a corresponding training corpus according to the characteristics of current microblog data;
Step (2): preprocess the obtained microblog training corpus;
Step (3): construct a candidate dictionary, used as the dictionary in the subsequent training of the term vector learning model;
Step (4): construct a seed sentiment dictionary;
Step (5): select and define the training parameters and configuration of the term vector learning model;
Step (6): train the term vector learning model;
Step (7): evaluate the training results of the term vector learning model;
Step (8): iterate step (6) until training has traversed all parameters;
Step (9): select the term vectors under the best evaluation result as the final vector features of the words;
Step (10): train a word-level sentiment polarity classifier;
Step (11): apply the word-level sentiment polarity classifier and obtain the final target sentiment dictionary.
2. The microblog Chinese sentiment dictionary construction method based on a term vector learning model
according to claim 1, characterized in that step (1) obtains a corresponding training corpus according to
the characteristics of current microblog data.
3. The microblog Chinese sentiment dictionary construction method based on a term vector learning model
according to claim 1, characterized in that step (2) preprocesses the obtained microblog training corpus to
obtain, from the raw corpus, data that can be used directly for model training, mainly comprising the
following sub-steps:
Step (2.1): data normalization, extracting the useful information in microblog sentences;
Step (2.2): sentiment parsing of microblog sentences, labeling the sentiment polarity information of each sentence;
Step (2.3): Chinese word segmentation of microblog sentences;
Step (2.4): setting stop words, filtering out words that are meaningless for training the term vector learning model.
4. The microblog Chinese sentiment dictionary construction method based on a term vector learning model
according to claim 1, characterized in that step (3) constructs the candidate dictionary from the
preprocessed microblog corpus.
5. The microblog Chinese sentiment dictionary construction method based on a term vector learning model
according to claim 1, characterized in that step (4) constructs a seed sentiment dictionary, which is used
in the subsequent term vector learning process and in the training of the final word-level sentiment
polarity classifier, mainly comprising the following sub-steps:
Step (4.1): construct a basic seed sentiment dictionary based on the candidate dictionary;
Step (4.2): based on the seed sentiment dictionary, enlarge it using a synonym expansion method.
6. The microblog Chinese sentiment dictionary construction method based on a term vector learning model
according to claim 1, characterized in that step (5) selects and defines the relevant parameters and
configuration of the term vector learning model, mainly comprising the following sub-steps:
Step (5.1): set a non-combined word list, filtering out words whose contextual semantics need not be
considered during training of the term vector learning model;
Step (5.2): set the training parameters of the term vector learning model;
Step (5.3): set the evaluation criteria of the term vector learning model.
7. The microblog Chinese sentiment dictionary construction method based on a term vector learning model
according to claim 1, characterized in that step (6) trains the term vector learning model on the training
corpus and obtains the corresponding term vectors.
8. The microblog Chinese sentiment dictionary construction method based on a term vector learning model
according to claim 1, characterized in that step (7) evaluates the term vectors obtained by training; the
evaluation uses the term vectors in a word sentiment polarity classification task and determines the
quality of the term vectors by accuracy and macro-averaged F value.
9. The microblog Chinese sentiment dictionary construction method based on a term vector learning model
according to claim 1, characterized in that step (8) adjusts the training parameters and iterates step (6)
until all parameters have been iterated.
10. The microblog Chinese sentiment dictionary construction method based on a term vector learning model
according to claim 1, characterized in that step (9), according to the evaluation results under different
training parameters of the term vector learning model, selects the term vectors under the best result as
the final vector features of the words.
11. The microblog Chinese sentiment dictionary construction method based on a term vector learning model
according to claim 1, characterized in that step (10) trains the word-level sentiment polarity classifier
using the obtained term vectors.
12. The microblog Chinese sentiment dictionary construction method based on a term vector learning model
according to claim 1, characterized in that step (11) applies the word-level sentiment polarity classifier
to infer the sentiment polarity of the words in the candidate dictionary and forms the final target
sentiment dictionary.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811143903.XA CN109376251A (en) | 2018-09-25 | 2018-09-25 | A kind of microblogging Chinese sentiment dictionary construction method based on term vector learning model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811143903.XA CN109376251A (en) | 2018-09-25 | 2018-09-25 | A kind of microblogging Chinese sentiment dictionary construction method based on term vector learning model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109376251A true CN109376251A (en) | 2019-02-22 |
Family
ID=65402988
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811143903.XA Pending CN109376251A (en) | 2018-09-25 | 2018-09-25 | A kind of microblogging Chinese sentiment dictionary construction method based on term vector learning model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109376251A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150278195A1 (en) * | 2014-03-31 | 2015-10-01 | Abbyy Infopoisk Llc | Text data sentiment analysis method |
CN106503049A (en) * | 2016-09-22 | 2017-03-15 | 南京理工大学 | A kind of microblog emotional sorting technique for merging multiple affection resources based on SVM |
CN107193801A (en) * | 2017-05-21 | 2017-09-22 | 北京工业大学 | A kind of short text characteristic optimization and sentiment analysis method based on depth belief network |
Non-Patent Citations (1)
Title |
---|
杨玉凡: "中文情感词典构建中词向量学习技术的研究与应用", 《中国知网》 * |
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109858034B (en) * | 2019-02-25 | 2023-02-03 | 武汉大学 | Text emotion classification method based on attention model and emotion dictionary |
CN109858034A (en) * | 2019-02-25 | 2019-06-07 | 武汉大学 | A kind of text sentiment classification method based on attention model and sentiment dictionary |
CN110083825A (en) * | 2019-03-21 | 2019-08-02 | 昆明理工大学 | A kind of Laotian sentiment analysis method based on GRU model |
CN110263321A (en) * | 2019-05-06 | 2019-09-20 | 成都数联铭品科技有限公司 | A kind of sentiment dictionary construction method and system |
CN110570941A (en) * | 2019-07-17 | 2019-12-13 | 北京智能工场科技有限公司 | System and device for assessing psychological state based on text semantic vector model |
CN110570941B (en) * | 2019-07-17 | 2020-08-14 | 北京智能工场科技有限公司 | System and device for assessing psychological state based on text semantic vector model |
CN110597997A (en) * | 2019-07-19 | 2019-12-20 | 中国人民解放军国防科技大学 | Military scenario text event extraction corpus iterative construction method and device |
CN110597997B (en) * | 2019-07-19 | 2022-03-22 | 中国人民解放军国防科技大学 | Military scenario text event extraction corpus iterative construction method and device |
CN110569354A (en) * | 2019-07-22 | 2019-12-13 | 中国农业大学 | Barrage emotion analysis method and device |
CN110569354B (en) * | 2019-07-22 | 2022-08-09 | 中国农业大学 | Barrage emotion analysis method and device |
CN110750648A (en) * | 2019-10-21 | 2020-02-04 | 南京大学 | Text emotion classification method based on deep learning and feature fusion |
CN111061876B (en) * | 2019-12-10 | 2023-06-13 | 中国建设银行股份有限公司 | Event public opinion data analysis method and device |
CN111061876A (en) * | 2019-12-10 | 2020-04-24 | 中国建设银行股份有限公司 | Event public opinion data analysis method and device |
CN111191463A (en) * | 2019-12-30 | 2020-05-22 | 杭州远传新业科技有限公司 | Emotion analysis method and device, electronic equipment and storage medium |
WO2021147298A1 (en) * | 2020-01-21 | 2021-07-29 | 中国银联股份有限公司 | Sentiment lexicon construction method and system, sentiment recognition method and system, and storage medium |
CN111353044B (en) * | 2020-03-09 | 2022-11-11 | 重庆邮电大学 | Comment-based emotion analysis method and system |
CN111353044A (en) * | 2020-03-09 | 2020-06-30 | 重庆邮电大学 | Comment-based emotion analysis method and system |
CN111400496A (en) * | 2020-03-18 | 2020-07-10 | 江苏海洋大学 | Public praise emotion analysis method for user behavior analysis |
CN111400496B (en) * | 2020-03-18 | 2023-05-09 | 江苏海洋大学 | Public praise emotion analysis method for user behavior analysis |
CN111522913A (en) * | 2020-04-16 | 2020-08-11 | 山东贝赛信息科技有限公司 | Emotion classification method suitable for long text and short text |
CN111881676B (en) * | 2020-07-03 | 2024-03-15 | 南京航空航天大学 | Emotion classification method based on word vector and emotion part of speech |
CN111881676A (en) * | 2020-07-03 | 2020-11-03 | 南京航空航天大学 | Emotion classification method based on word vectors and emotion part of speech |
CN112765350A (en) * | 2021-01-15 | 2021-05-07 | 西华大学 | Microblog comment emotion classification method based on emoticons and text information |
CN113191135A (en) * | 2021-01-26 | 2021-07-30 | 北京联合大学 | Multi-category emotion extraction method fusing facial characters |
CN113111655A (en) * | 2021-05-12 | 2021-07-13 | 数库(上海)科技有限公司 | Construction method of separation dictionary, word segmentation method and device based on separation dictionary |
CN113420151A (en) * | 2021-07-13 | 2021-09-21 | 上海明略人工智能(集团)有限公司 | Emotion polarity intensity classification method, system, electronic device and medium |
CN116340511A (en) * | 2023-02-16 | 2023-06-27 | 深圳市深弈科技有限公司 | Public opinion analysis method combining deep learning and language logic reasoning |
CN116340511B (en) * | 2023-02-16 | 2023-09-15 | 深圳市深弈科技有限公司 | Public opinion analysis method combining deep learning and language logic reasoning |
CN116450840A (en) * | 2023-03-22 | 2023-07-18 | 武汉理工大学 | Deep learning-based field emotion dictionary construction method |
CN117217218A (en) * | 2023-11-08 | 2023-12-12 | 中国科学技术信息研究所 | Emotion dictionary construction method and device, electronic equipment and storage medium |
CN117217218B (en) * | 2023-11-08 | 2024-01-23 | 中国科学技术信息研究所 | Emotion dictionary construction method and device for science and technology risk event related public opinion |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109376251A (en) | A kind of microblogging Chinese sentiment dictionary construction method based on term vector learning model | |
Saeed et al. | An ensemble approach for spam detection in Arabic opinion texts | |
CN108446271B (en) | Text emotion analysis method of convolutional neural network based on Chinese character component characteristics | |
CN108573047A (en) | A kind of training method and device of Module of Automatic Chinese Documents Classification | |
CN111767741A (en) | Text emotion analysis method based on deep learning and TFIDF algorithm | |
CN105183717B (en) | A kind of OSN user feeling analysis methods based on random forest and customer relationship | |
CN107315734B (en) | A kind of method and system to be standardized based on time window and semantic variant word | |
KR20120109943A (en) | Emotion classification method for analysis of emotion immanent in sentence | |
Zhao et al. | ZYJ123@ DravidianLangTech-EACL2021: Offensive language identification based on XLM-RoBERTa with DPCNN | |
US11030533B2 (en) | Method and system for generating a transitory sentiment community | |
CN111339772B (en) | Russian text emotion analysis method, electronic device and storage medium | |
CN112860896A (en) | Corpus generalization method and man-machine conversation emotion analysis method for industrial field | |
CN107463703A (en) | English social media account number classification method based on information gain | |
CN106569996B (en) | A kind of Sentiment orientation analysis method towards Chinese microblogging | |
Mohandas et al. | Domain specific sentence level mood extraction from malayalam text | |
Xiao et al. | Chinese text sentiment analysis based on improved Convolutional Neural Networks | |
CN112069312A (en) | Text classification method based on entity recognition and electronic device | |
CN112434164A (en) | Network public opinion analysis method and system considering topic discovery and emotion analysis | |
Katyayan et al. | Sarcasm detection approaches for English language | |
US11605004B2 (en) | Method and system for generating a transitory sentiment community | |
KR20130103249A (en) | Method of classifying emotion from multi sentence using context information | |
Ilavarasan | A Survey on Sarcasm detection and challenges | |
Walha et al. | A Lexicon approach to multidimensional analysis of tweets opinion | |
CN116911286A (en) | Dictionary construction method, emotion analysis device, dictionary construction equipment and storage medium | |
Bhatia et al. | Analysing cyberbullying using natural language processing by understanding jargon in social media |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190222 |