CN105512687A - Emotion classification model training and textual emotion polarity analysis method and system - Google Patents

Emotion classification model training and textual emotion polarity analysis method and system

Info

Publication number
CN105512687A
Authority
CN
China
Prior art keywords
data
classification model
training
raw data
sentiment classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510931457.9A
Other languages
Chinese (zh)
Inventor
张建华
刘鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruian Technology Co Ltd
Original Assignee
Beijing Ruian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruian Technology Co Ltd filed Critical Beijing Ruian Technology Co Ltd
Priority to CN201510931457.9A priority Critical patent/CN105512687A/en
Publication of CN105512687A publication Critical patent/CN105512687A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G06F18/2111 Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Abstract

The invention provides a method and system for training a sentiment classification model and a method and system for analyzing the sentiment polarity of text. The training method comprises: collecting data from a corpus to obtain raw data; preprocessing the raw data to obtain preprocessed data; extracting word vectors from the preprocessed data through a neural network model; fusing the word vectors according to a preset fusion rule to generate sentence vector features; and training a sentiment classification model on the sentence vector features to obtain the trained sentiment classification model. The neural network model represents each word as a vector in a low-dimensional space, the low-dimensional word vectors are fused into sentence vector features according to the preset rule, and the sentiment classification model is obtained by training a learning model on those features. As a result, the dimension of the word vectors is effectively reduced, the curse of dimensionality is avoided, the associations between words can be mined, and the semantic accuracy of the vectors is improved.

Description

Method and system for training a sentiment classification model and for text sentiment polarity analysis
Technical field
The present invention relates to the field of data mining technology, and in particular to a method and system for training a sentiment classification model and a method and system for text sentiment polarity analysis.
Background
Sentiment analysis, also known as sentiment classification, is the process of analyzing, processing, summarizing, and reasoning about subjective texts that carry emotional coloring. Common forms of sentiment analysis include opinion extraction, opinion mining, sentiment mining, and subjectivity analysis.
In financial information analysis, investors have long and widely accepted that financial markets are driven by human traits such as fear and greed, but they have lacked a technique or data for quantifying people's actual emotions objectively and comprehensively. Performing sentiment analysis on social data opens a window onto those emotions for investors who have long been puzzled by irrational movements in financial markets, and market trends can be predicted by analyzing the public's sentiment toward market information.
In merchandise sales, after a new product has been on sale for some time, sentiment analysis can be applied to the product reviews for individual attributes or for a mix of attributes; the results are then summarized, and representative reviews are presented together with their sentiment. For merchants, this saves a large amount of market research and can also be used to analyze user experience in order to keep improving later products. For users, it allows a purchasing strategy to be formed from a comprehensive view of the reviews left by previous buyers.
In enterprise public opinion analysis, a large amount of open social data is used to analyze the public's attitudes and views on hot topics related to the enterprise, so that a corresponding public relations strategy can be formulated.
In the prior art, sentiment analysis generally proceeds as follows:
First, determine whether a word is positive or negative and whether it is subjective or objective, relying mainly on a dictionary;
Second, identify whether a sentence is positive or negative and whether it is subjective or objective;
Third, move up from sentiment mining to opinion mining.
The usual traditional approach builds a dictionary tree from a sentiment lexicon. After the content to be analyzed has been segmented into words, each word is looked up in the dictionary tree to count the positive words and the negative words in the content, and the polarity of the content, that is, positive or negative, is then determined by comparing the two counts. This approach judges sentiment polarity only by the number or frequency of individual sentiment words and ignores the co-occurrence patterns between words. For this reason, many sentiment classification models have been tried.
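The count-based rule described above can be sketched as follows. This is a minimal illustration only: the two word sets are hypothetical stand-ins for a real sentiment lexicon, a plain set lookup replaces the dictionary tree, and returning "neutral" on a tie is an assumption, since the text does not say how ties are handled.

```python
# Hypothetical mini-lexicons; a real system would load a full sentiment dictionary.
POSITIVE_WORDS = {"good", "great", "happy"}
NEGATIVE_WORDS = {"bad", "poor", "angry"}


def lexicon_polarity(tokens):
    """Compare the number of positive and negative lexicon hits in a token list."""
    positives = sum(1 for token in tokens if token in POSITIVE_WORDS)
    negatives = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"  # assumption: a tie is reported as neutral
```

Because only individual hits are counted, word order and co-occurrence play no role in the result, which is the limitation the paragraph above points out.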
In a sentiment analysis model, words are represented as vectors. The simplest form is the one-hot representation, which represents a word with a very long vector: the length of the vector equals the size of the dictionary, exactly one component is 1 and all the others are 0, and the position of the 1 corresponds to the index of the word in the dictionary. Such word vectors, however, easily suffer from the curse of dimensionality. Mapping the words into a new space and representing each with a multi-dimensional vector of continuous real numbers is called a "word representation" or "word embedding". After long experimentation, practice has gradually moved from the original sparse word vector representation to the present dense representation in a low-dimensional space, because the sparse representation often runs into the curse of dimensionality when solving practical problems, cannot express semantic information, and cannot reveal the latent relationships between words.
Summary of the invention
To address the above shortcomings of existing sentiment classification models, the invention provides a method and system for training a sentiment classification model and a method and system for text sentiment polarity analysis. By using a neural network model to represent word vectors in a low-dimensional space, the invention avoids the curse of dimensionality, mines the associations between words, and improves the semantic accuracy of the vectors.
In a first aspect, an embodiment of the present invention provides a method for training a sentiment classification model, comprising:
Collecting data from a corpus to obtain raw data;
Preprocessing the raw data to obtain preprocessed data;
Extracting word vectors from the preprocessed data through a neural network model;
Fusing the word vectors according to a preset fusion rule to generate sentence vector features;
Training a sentiment classification model on the sentence vector features to obtain the trained sentiment classification model.
In a second aspect, an embodiment of the present invention provides a method for text sentiment polarity analysis, comprising:
Extracting sentence vector features from a target text;
Analyzing the sentiment polarity of the target text according to the sentence vector features and the trained sentiment classification model obtained by the above method for training a sentiment classification model.
In a third aspect, an embodiment of the present invention provides a system for training a sentiment classification model, comprising:
A data acquisition unit, for collecting data from a corpus to obtain raw data;
A raw data preprocessing unit, for preprocessing the raw data to obtain preprocessed data;
A word vector extraction unit, for extracting word vectors from the preprocessed data through a neural network model;
A sentence vector feature generation unit, for fusing the word vectors according to a preset fusion rule to generate sentence vector features; and
A sentiment classification model training unit, for training a sentiment classification model on the sentence vector features to obtain the trained sentiment classification model.
In a fourth aspect, an embodiment of the present invention provides a system for text sentiment polarity analysis, comprising:
A sentence vector feature extraction unit, for extracting sentence vector features from a target text; and
A sentiment polarity analysis unit, for analyzing the sentiment polarity of the target text according to the sentence vector features and the trained sentiment classification model obtained by the above method for training a sentiment classification model.
Beneficial effects of the technical solution provided by the invention:
In summary, in these embodiments a computer can collect raw data from a corpus with a crawler tool, preprocess the raw data to obtain preprocessed data, extract word vectors from the preprocessed data with a neural network model, fuse the word vectors by a fusion rule such as superposition to generate the corresponding sentence vector features, and train on the sentence vector features to obtain a robust sentiment classification model. This training method effectively reduces the dimension of the word vectors, avoids the curse of dimensionality, and mines the associations between words, thereby improving the semantic accuracy of the vectors.
Likewise, the text sentiment polarity analysis method, which extracts sentence vector features from the target text in the same way and then applies the above sentiment classification model to analyze the target text, effectively reduces the dimension of the word vectors, avoids the curse of dimensionality, mines the associations between words, and improves the semantic accuracy of the vectors.
Brief description of the drawings
Figure 1A is a schematic flowchart of the method for training a sentiment classification model provided by Embodiment one of the present invention;
Figure 1B is a schematic diagram of the neural network model used to extract word vectors in the technical solution of the present invention;
Fig. 2 is a schematic flowchart of the method for preprocessing raw data provided by Embodiment two of the present invention;
Fig. 3 is a schematic flowchart of the method for cleaning raw data provided by Embodiment three of the present invention;
Fig. 4 is a schematic flowchart of the method for text sentiment polarity analysis provided by Embodiment four of the present invention;
Fig. 5 is an architecture diagram of the system for training a sentiment classification model provided by Embodiment five of the present invention;
Fig. 6 is an architecture diagram of the raw data preprocessing unit provided by Embodiment six of the present invention;
Fig. 7 is an architecture diagram of the cleaning subunit provided by Embodiment seven of the present invention;
Fig. 8 is an architecture diagram of the system for text sentiment polarity analysis provided by Embodiment eight of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. In addition, for ease of description, the drawings show only the parts relevant to the invention rather than the entire content.
Embodiment one
Referring to Figure 1A, the solution of this embodiment can be executed by a computer and, specifically, implemented by a software program configured on the computer. The method for training a sentiment classification model comprises the following steps:
S110: Collect data from a corpus to obtain raw data.
For example, the raw data for analysis can be obtained by crawling the content of the corpus with a crawler tool, or by other means of data collection.
A crawler is a program that automatically retrieves web page content and is also an important component of a search engine. A search engine uses crawlers to discover web content. HTML (HyperText Markup Language) documents on the network are connected by hyperlinks and so form something like a web; the crawler moves along this web, fetching each page it reaches with a fetching program, extracting the content, and at the same time extracting the hyperlinks as leads for further crawling. The crawler tool may be an open-source crawler tool, a purchased non-open-source crawler tool, a separately developed crawler tool, or a crawler tool obtained by secondary development based on an open-source or purchased one.
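The "extract the content and the hyperlinks" step of a crawler can be illustrated with Python's standard `html.parser` module. This is only a sketch of that single step, under the assumption that the page has already been downloaded into a string; the class and function names are hypothetical, and fetching, scheduling, and deduplication are left out.

```python
from html.parser import HTMLParser


class LinkAndTextExtractor(HTMLParser):
    """Collect visible text fragments and the href targets of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)  # lead for further crawling

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def extract_text_and_links(html):
    """Return (page text, list of hyperlinks) for one already-fetched page."""
    parser = LinkAndTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links
```

For instance, `extract_text_and_links('<p>Nice phone</p><a href="/next">more</a>')` yields the page text together with the single link `/next`, which a crawler would queue for a later visit.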
The content of the corpus may be sentences from user comments or messages on major websites, and the raw data is the series of sentences carrying key information that the crawler extracts from the corpus.
S120: Preprocess the raw data to obtain preprocessed data.
For example, preprocessing a sentence in the raw data may consist of dividing the sentence into multiple words, in which case the preprocessed data is a series of words. Because the purpose of sentiment analysis varies, the manner and means of preprocessing may vary as well. Many preprocessing methods for raw data are in common use, for example data cleaning, data integration, data fusion, data transformation, and data reduction. This embodiment places no restriction on how preprocessing is implemented, but Embodiment two of the present invention provides a preferred implementation.
S130: Extract word vectors from the preprocessed data through a neural network.
To bring out the advantages of the word vectors in the technical solution of this embodiment, the limitations of word vector representation in traditional sentiment classification models are described first.
Among traditional sentiment classification models, the simplest uses the one-hot representation. This model collects all the words into a dictionary tree and represents each word with a very long vector: the length of the vector equals the size of the dictionary, exactly one component is 1 and all the others are 0, and the position of the 1 corresponds to the index of the word in the dictionary. Such word vectors easily suffer from the curse of dimensionality. For example, in a dictionary of 10 words, each word must be represented with a 10-dimensional vector: "happily" in the dictionary is represented by V('happily') = [1,0,0,0,0,0,0,0,0,0], "anger" by V('anger') = [0,1,0,0,0,0,0,0,0,0], and so on. This representation has the following defects. When the vocabulary of the dictionary is very large, for example on the order of ten thousand words or more, each word must be represented with a vector of that many dimensions, which readily leads to the curse of dimensionality. At the same time, such a representation can hardly reflect the relationships between words. For example, "happiness" and "happily" in the dictionary are similar, but with this model it is hard to measure the similarity between them.
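The two defects can be seen in a small sketch of the one-hot representation. The 10-word vocabulary below is hypothetical apart from "happily", "anger", and "happiness", which come from the example above; the helper names are illustrative.

```python
def one_hot(word, vocabulary):
    """Return a 0-1 vector whose single 1 sits at the word's index in the vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 10-word dictionary; the vector length always equals its size.
VOCAB = ["happily", "anger", "happiness", "sad", "good",
         "bad", "like", "hate", "calm", "fear"]
```

The vector length grows with the dictionary, and the inner product of any two different words is 0, so `dot(one_hot("happily", VOCAB), one_hot("happiness", VOCAB))` carries no hint that the two words are related.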
To address this, the technical solution of this embodiment improves the word vector representation as follows. First, each word in the preprocessed data is converted into a 0-1 vector. As shown in the upper part of Figure 1B, the input features are the 0-1 vectors of the c words before $W_i$ and the 0-1 vectors of the c words after $W_i$, where $W_i$ denotes the 0-1 vector of the i-th word. These 2c 0-1 vectors are not concatenated into one high-dimensional vector; instead they are added element by element to obtain the hidden-layer node value $W_{neu1}$. The newly generated feature serves as the input layer of the neural network model. The model is a neural network with a single hidden layer; the hidden layer has a chosen number of units, and the activation function is the sigmoid function. The output layer of a traditional neural network is softmax; in this embodiment, as shown in the lower part of Figure 1B, it is a Huffman coding tree, which outputs the word vector feature value $W_{syn1}$ and the corresponding word vector $W$. Compared with the original high-dimensional one-hot word vectors, using a Huffman coding tree as the output layer of the neural network model effectively reduces the word vector dimension, so that the neural network model learns, without supervision, the low-dimensional word vector corresponding to each word.
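The text does not say how the Huffman coding tree is constructed. A minimal sketch, assuming the tree is built from word frequencies in the usual Huffman manner (an assumption, as are the function name and the sample counts), shows why such a tree is compact: frequent words end up close to the root and receive short codes.

```python
import heapq
from itertools import count


def huffman_codes(frequencies):
    """Map each word to a binary code string; frequent words get shorter codes."""
    if not frequencies:
        return {}
    tiebreak = count()  # unique counter so the heap never compares the dicts
    heap = [(freq, next(tiebreak), {word: ""}) for word, freq in frequencies.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {word: "0" for word in heap[0][2]}
    while len(heap) > 1:
        # Merge the two least frequent subtrees under a new internal node.
        freq_a, _, codes_a = heapq.heappop(heap)
        freq_b, _, codes_b = heapq.heappop(heap)
        merged = {word: "0" + code for word, code in codes_a.items()}
        merged.update({word: "1" + code for word, code in codes_b.items()})
        heapq.heappush(heap, (freq_a + freq_b, next(tiebreak), merged))
    return heap[0][2]
```

With hypothetical counts such as `{"good": 8, "bad": 4, "fine": 2, "awful": 1}`, the most frequent word receives a one-bit code and the rarest a three-bit code, so reaching a word in the tree takes far fewer decisions than scoring every dictionary entry.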
Preferably, this embodiment obtains word vectors of 100 dimensions; other embodiments may use word vectors of 20, 50, 150, 200, or another number of dimensions. The input is the 0-1 vectors of the c words before and the c words after $W_i$, and the 0-1 vector of $W_i$ itself supervises the output. No hard labels are required, and the nonlinear relationship between the surrounding words and the center word can be extracted effectively.
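How the input is formed from the surrounding words can be sketched as below. The function name, the `window` parameter (standing for c), and the choice to use fewer neighbors at the ends of a sentence are assumptions made for illustration; the network itself is not shown.

```python
def context_sum(tokens, position, window, vocabulary):
    """Add up the 0-1 vectors of up to `window` words on each side of tokens[position]."""
    index = {word: i for i, word in enumerate(vocabulary)}
    summed = [0] * len(vocabulary)
    start = max(0, position - window)
    end = min(len(tokens), position + window + 1)
    for j in range(start, end):
        if j != position:  # the center word is the target, not part of the input
            summed[index[tokens[j]]] += 1
    return summed
```

The summed vector keeps the length of a single 0-1 vector no matter how many context words are used, which is the point of adding rather than concatenating; the 0-1 vector of `tokens[position]` then serves as the training target.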
Therefore, by representing words as vectors in a low-dimensional space, the method of this embodiment places words of similar meaning at nearby positions. Because word vectors are generally real-valued, they can be extracted by unsupervised training on the large amount of text in the corpus. Such word vectors are convenient for clustering, and two words of similar meaning can be identified using Euclidean distance or cosine similarity.
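Both similarity measures named above are short to write down. The sketch below uses plain Python lists as vectors; returning 0.0 for a zero-length vector is an assumption made to avoid division by zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: similarity with a zero vector is treated as 0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors; smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Words with similar meaning should show a cosine similarity near 1 and a small Euclidean distance, whereas one-hot vectors of different words always have cosine similarity 0.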
S140: Fuse the word vectors according to a preset fusion rule to generate sentence vector features.
For example, the word vectors of the words into which a sentence has been divided can be superposed according to a preset rule to obtain the sentence vector of that sentence; they can also be fused in other ways, for example by concatenation.
For example, if a sentence $S$ contains $n$ words, then $S = w_1, w_2, \dots, w_i, \dots, w_n$, where $w_i$ denotes the i-th word. In this embodiment, the word vector of each word $w_i$ is a vector of length 100, namely $V(w_i) = [v_{i,1}, v_{i,2}, \dots, v_{i,100}]$, where each component is the value of the word along one abstract dimension. By the superposition principle, the sentence vector of the sentence $S$ is $V(S) = \sum_{i=1}^{n} V(w_i)$. Thus, for example, if every sentence is represented by a 100-dimensional sentence vector, the number of dimensions is fixed, so the curse of dimensionality is avoided entirely while the associations between words are still reflected. Superposition fuses the low-dimensional vectors of these words to represent the sentence that contains them, and deeper learning methods then extract the deeper features of the sentence; compared with earlier shallow learning methods, deep learning methods achieve a higher recognition rate.
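Fusion by superposition amounts to an element-wise sum. The sketch below assumes the word vectors are held in a dictionary and that words without a vector are skipped; both choices, and the function name, are illustrative rather than taken from the text.

```python
def sentence_vector(tokens, word_vectors, dimensions):
    """Element-wise sum of the word vectors of all tokens that have a vector."""
    total = [0.0] * dimensions
    for token in tokens:
        vector = word_vectors.get(token)
        if vector is None:
            continue  # assumption: words without a vector are skipped
        for k in range(dimensions):
            total[k] += vector[k]
    return total
```

Whatever the number of words in the sentence, the result has exactly `dimensions` components (100 in this embodiment), which is why the sentence features stay at a fixed size.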
S150: Train a sentiment classification model on the sentence vector features to obtain the trained sentiment classification model.
For example, a sentiment classification model is trained with a learning model on the features of the input sentence vectors; training is generally done with either a shallow learning model or a deep learning model. More preferably, the invention trains with a deep learning model, for example a convolutional neural network, a decision tree, linear regression, or a similar model, and obtains a robust sentiment classification model through a large amount of training.
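As an illustration of supervised training on sentence vectors, the sketch below fits a logistic-regression classifier by stochastic gradient descent. This is a deliberately simple stand-in, not the deep learning model the embodiment prefers; the function names, the hyperparameters, and the label convention (1 for positive, 0 for negative) are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_classifier(features, labels, epochs=200, learning_rate=0.5):
    """Fit logistic-regression weights on sentence vectors by gradient descent.

    features: list of equal-length sentence vectors; labels: 1 = positive, 0 = negative.
    """
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            predicted = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            error = predicted - y  # gradient of the log loss w.r.t. the score
            weights = [w - learning_rate * error * v for w, v in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def predict(model, x):
    """Return 1 (positive) or 0 (negative) for one sentence vector."""
    weights, bias = model
    score = bias + sum(w * v for w, v in zip(weights, x))
    return 1 if sigmoid(score) >= 0.5 else 0
```

A usage sketch: `model = train_classifier(vectors, labels)` followed by `predict(model, new_vector)`. A convolutional network or another learner could take the same fixed-length inputs.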
In summary, in this embodiment a computer collects raw data from a corpus with a crawler tool, preprocesses the raw data to obtain preprocessed data, extracts word vectors from the preprocessed data with a neural network model, fuses the word vectors by a preset fusion rule (superposition in this embodiment, with concatenation as an alternative) to generate the corresponding sentence vector features, and then trains a sentiment classification model on those features with a supervised deep learning model to obtain a robust sentiment classification model. The sentiment classification model obtained in this embodiment avoids the curse of dimensionality, mines the associations between words, and improves the semantic accuracy of the vectors.
Embodiment two
On the basis of Embodiment one, this embodiment further provides a preferred implementation of step S120 of Embodiment one, that is, preprocessing the raw data to obtain preprocessed data.
With reference to Embodiment one and as shown in Fig. 2, step S120, preprocessing the raw data to obtain preprocessed data, may comprise:
S121: Clean the raw data to obtain cleaned data.
For example, unrecognizable data, non-text characters, and the like are removed from the raw data previously obtained with the crawler tool, giving cleaned data and facilitating the subsequent word segmentation, stop-word removal, and word vector extraction.
S122: Perform word segmentation and stop-word removal on the cleaned data to obtain preprocessed data.
For example, an open-source word segmentation tool or a purchased non-open-source one can be used to divide each sentence in the cleaned data into multiple words; typically the sentence is divided into words by part of speech, such as verbs or adjectives, or is split into words at spaces. The stop words in the sentence can also be filtered out according to a stop-word list, giving the preprocessed data.
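Step S122 can be sketched for the simplest case mentioned above, splitting at spaces. The stop-word list and the function name are hypothetical, lowercasing is an added assumption, and Chinese text would need a dedicated word segmentation tool in place of `split()`.

```python
# Hypothetical stop-word list; a real system would load a full list from a file.
STOP_WORDS = {"the", "a", "is", "of"}


def segment_and_filter(sentence, stop_words=STOP_WORDS):
    """Split a sentence at whitespace, then drop the stop words."""
    words = sentence.lower().split()  # assumption: case is not significant
    return [word for word in words if word not in stop_words]
```

For instance, "The screen of the phone is great" is reduced to the three content words "screen", "phone", and "great".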
In summary, on the basis of Embodiment one, this embodiment further provides a preferred implementation of step S120. Step S121 cleans the non-text data out of the raw data to give cleaned data. Step S122 filters the stop words out of the cleaned data to give the required preprocessed data.
Embodiment three
On the basis of Embodiment two, this embodiment further provides a preferred implementation of step S121 in the technical solution of Embodiment two, that is, cleaning the raw data to obtain cleaned data.
With reference to Embodiment two and as shown in Fig. 3, step S121, cleaning the raw data to obtain cleaned data, may comprise:
S1211: Delete the HTML tags and URLs in the raw data.
For example, HyperText Markup Language (HTML) tags, Uniform Resource Locators (URLs), and the like in the raw data are unrelated to the sentences themselves and do not form words, so they need to be deleted by software to facilitate the subsequent word vector extraction.
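A minimal sketch of step S1211 with regular expressions follows. The two patterns are simplifying assumptions: they handle ordinary tags and URLs that begin with `http://`, `https://`, or `www.`, and are not a full HTML parser.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")                # anything between < and >
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")  # URL up to the next whitespace


def strip_html_and_urls(text):
    """Remove HTML tags and URLs, then collapse the leftover whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    without_urls = URL_PATTERN.sub(" ", without_tags)
    return " ".join(without_urls.split())
```

Tags are removed first so that URLs inside attributes such as `href` disappear together with their tag; the remaining text is then ready for word segmentation.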
S1212: When the content of the corpus is Chinese, convert the traditional Chinese characters in the raw data to simplified Chinese characters.
For example, when the text in the corpus is Chinese, the text in the raw data is Chinese text. Where some of that text is in traditional characters, a traditional-to-simplified converter is used to convert the traditional characters to the corresponding simplified characters, so that subsequent processing handles the data uniformly and word vector extraction is easier.
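Character-level conversion can be sketched with a translation table. The four-entry mapping below is a hypothetical excerpt used only for illustration; a real converter needs a complete table and rules for characters whose mapping depends on context.

```python
# Hypothetical excerpt of a traditional-to-simplified table.
TRADITIONAL_TO_SIMPLIFIED = {"愛": "爱", "開": "开", "歡": "欢", "樂": "乐"}


def to_simplified(text, table=TRADITIONAL_TO_SIMPLIFIED):
    """Replace each traditional character that appears in the table; keep the rest."""
    return text.translate(str.maketrans(table))
```

Characters that are absent from the table, including text that is already simplified or not Chinese at all, pass through unchanged.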
It should be noted that the execution order of steps S1211 and S1212 can be exchanged; no restriction is placed here on the order of the two steps.
In summary, on the basis of Embodiment two, this embodiment further provides a preferred implementation of step S121. Step S1211 deletes the HTML tags, URLs, and the like from the raw data to give cleaned data. Step S1212 converts the traditional Chinese characters in the raw data to the corresponding simplified characters.
The following is an embodiment of the text sentiment polarity analysis method provided by the embodiments of the present invention. This embodiment uses the above method for training a sentiment classification model, and the sentiment classification model obtained by those embodiments, to analyze the sentiment polarity of a target text. Details not described in this embodiment can therefore be found in the embodiments of the above training method.
Embodiment four
On the basis of the above embodiments and referring to Fig. 4, the solution of this embodiment can be executed by a computer and, specifically, implemented by a software program configured on the computer. The method for text sentiment polarity analysis comprises the following steps:
S410: Extract sentence vector features from the target text.
For example, sentence vector features are extracted from the target text to be analyzed by the method provided by any of Embodiments one to three.
S420: Analyze the sentiment polarity of the target text according to the sentence vector features and the trained sentiment classification model obtained by the method for training a sentiment classification model.
For example, the trained sentiment classification model obtained in any of Embodiments one to three is applied to the sentence vector features to analyze the sentiment polarity of the target text. The sentiment polarity may include, for example, positive or negative, and subjective or objective.
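Steps S410 and S420 can be put together in a short sketch. A linear scorer with given weights stands in here for the trained sentiment classification model; the function name, the toy vectors and weights, and the rule that a non-negative score means "positive" are all assumptions for illustration.

```python
def analyze_polarity(tokens, word_vectors, weights, bias=0.0):
    """Sum the word vectors of a tokenized text and score the result linearly."""
    sentence = [0.0] * len(weights)
    for token in tokens:
        vector = word_vectors.get(token)
        if vector is not None:  # assumption: words without a vector are skipped
            sentence = [s + v for s, v in zip(sentence, vector)]
    score = bias + sum(w * s for w, s in zip(weights, sentence))
    return "positive" if score >= 0 else "negative"
```

With toy two-dimensional vectors `{"great": [1.0, 0.0], "awful": [0.0, 1.0]}` and weights `[1.0, -1.0]`, a text containing "great" scores above zero and one containing "awful" scores below it.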
In summary, in this embodiment sentence vector features are extracted from the target text by the method of any of Embodiments one to three, and the sentiment classification model obtained in any of Embodiments one to three is then used to analyze the sentiment polarity of the target text. The text sentiment polarity analysis method of this embodiment avoids the curse of dimensionality, mines the associations between words, and improves the semantic accuracy of the vectors.
The following is an embodiment of the system for training a sentiment classification model provided by the embodiments of the present invention. It belongs to the same concept as the embodiments of the above method for training a sentiment classification model; details not described in the system embodiment can be found in those method embodiments.
Embodiment five
Referring to Fig. 5, the system 500 for training a sentiment classification model of this embodiment corresponds to the method of Embodiment one. The system 500 comprises a data acquisition unit 510, a raw data preprocessing unit 520, a word vector extraction unit 530, a sentence vector feature generation unit 540, and a sentiment classification model training unit 550. Wherein:
The data acquisition unit 510 is for collecting data from a corpus to obtain raw data;
The raw data preprocessing unit 520 is for preprocessing the raw data to obtain preprocessed data;
The word vector extraction unit 530 is for extracting word vectors from the preprocessed data through a neural network model;
The sentence vector feature generation unit 540 is for fusing the word vectors according to a preset fusion rule to generate sentence vector features; and
The sentiment classification model training unit 550 is for training a sentiment classification model on the sentence vector features to obtain the trained sentiment classification model.
In summary, the sentiment classification model obtained with the technical solution of this embodiment avoids the curse of dimensionality, mines the associations between words, and improves the semantic accuracy of the vectors.
Embodiment six
On the basis of Embodiment five, this embodiment further provides a preferred implementation of the raw data preprocessing unit 520.
As shown in Fig. 6, the raw data preprocessing unit 520 may comprise:
A cleaning subunit 521, for cleaning the raw data to obtain cleaned data.
A word segmentation and stop-word removal subunit 522, for performing word segmentation and stop-word removal on the cleaned data to obtain preprocessed data.
In summary, in the technical solution of this embodiment the cleaning subunit 521 cleans the non-text data out of the raw data to give cleaned data, and the word segmentation and stop-word removal subunit 522 performs word segmentation and stop-word removal on the cleaned data to give the preprocessed data.
Embodiment seven
On the basis of embodiment six of the present invention, this embodiment further provides a preferred implementation of the cleaning subunit 521.
As shown in Fig. 7, the cleaning subunit 521 may comprise:
A deletion sub-subunit 5211, configured to delete HTML tags and URLs from the raw data.
A conversion sub-subunit 5212, configured to convert traditional Chinese characters in the raw data into simplified Chinese characters when the text in the corpus is Chinese.
In summary, according to the technical scheme of this embodiment, the deletion sub-subunit 5211 deletes HTML tags, URLs, and the like from the raw data, and the conversion sub-subunit 5212 converts traditional Chinese characters in the raw data into simplified Chinese characters.
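By way of non-limiting illustration, the deletion sub-subunit 5211 and the conversion sub-subunit 5212 may be sketched as follows. The regular expressions, the four-character conversion table, and the function names are assumptions made for illustration only; a practical system would use a complete traditional-to-simplified conversion table.

```python
import re

# Assumed patterns: anything between angle brackets is treated as an HTML tag,
# and a URL is "http://" or "https://" followed by printable ASCII characters.
TAG_PATTERN = re.compile(r"<[^>]+>")
URL_PATTERN = re.compile(r"https?://[\x21-\x7e]+")

# Tiny illustrative conversion table (hypothetical, far from complete).
TRADITIONAL_TO_SIMPLIFIED = str.maketrans(
    {"這": "这", "電": "电", "歡": "欢", "開": "开"}
)


def delete_html_and_urls(raw_text):
    """Delete HTML tags first, then URLs, then collapse leftover whitespace."""
    text = TAG_PATTERN.sub("", raw_text)
    text = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def to_simplified(text):
    """Convert the traditional characters covered by the table to simplified ones."""
    return text.translate(TRADITIONAL_TO_SIMPLIFIED)


def clean_raw_data(raw_text, corpus_is_chinese=True):
    """Clean raw data; character conversion applies only to a Chinese corpus."""
    cleaned = delete_html_and_urls(raw_text)
    return to_simplified(cleaned) if corpus_is_chinese else cleaned
```

Tags are removed before URLs so that a URL appearing inside a tag attribute disappears together with the tag, and a closing tag that directly follows a URL is not absorbed into the URL match.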
The following is an embodiment of the text sentiment polarity analysis system provided by the embodiments of the present invention. This embodiment belongs to the same inventive concept as the text sentiment polarity analysis method described above; for details not described in the system embodiment, reference may be made to the embodiment of that method.
Embodiment eight
See Fig. 8. The system 800 for text sentiment polarity analysis according to this embodiment corresponds to the method of embodiment four. The system comprises a sentence vector feature extraction unit 810 and a sentiment polarity analysis unit 820, wherein:
The sentence vector feature extraction unit 810 is configured to extract sentence vector features from a target text by means of the system provided in any one of embodiments five to seven of the present invention; and
The sentiment polarity analysis unit 820 is configured to analyze the sentiment polarity of the target text according to the sentence vector features and the trained sentiment classification model obtained by the system provided in any one of embodiments five to seven of the present invention.
In summary, the text sentiment polarity analysis system according to the technical scheme of this embodiment avoids the curse of dimensionality, mines the associations between words, and improves the semantic accuracy of the vectors.
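By way of non-limiting illustration, the sentiment polarity analysis unit 820 may be sketched as follows. The sketch assumes that the trained sentiment classification model is a logistic-regression model; the word vectors, the parameters `WEIGHTS` and `BIAS`, and the function names are hypothetical values chosen for illustration and are not results of this embodiment.

```python
import math

# Hypothetical word vectors and trained model parameters.
WORD_VECTORS = {
    "好看": [1.0, 0.25],
    "无聊": [-0.75, 0.5],
    "电影": [0.0, 0.5],
}
WEIGHTS, BIAS = [2.0, 0.0], 0.0


def sentence_vector(tokens):
    """Sum the vectors of the known words (an assumed fusion rule)."""
    total = [0.0, 0.0]
    for token in tokens:
        for i, value in enumerate(WORD_VECTORS.get(token, [0.0, 0.0])):
            total[i] += value
    return total


def analyze_polarity(tokens):
    """Return a polarity label and the model's probability of positive polarity."""
    x = sentence_vector(tokens)
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    probability = 1.0 / (1.0 + math.exp(-z))
    return ("positive" if probability >= 0.5 else "negative"), probability
```

The target text is assumed to have already passed through the same cleaning, word segmentation, and stop-word removal as the training data, so that its tokens match the vocabulary of the word vectors.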
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the embodiments of the present invention may be modified and varied in various ways. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.

Claims (12)

1. A method for training a sentiment classification model, comprising:
Collecting data from a corpus to obtain raw data;
Preprocessing the raw data to obtain preprocessed data;
Extracting word vectors from the preprocessed data by means of a neural network model;
Fusing the word vectors according to a preset fusion rule to generate sentence vector features;
Training a sentiment classification model according to the sentence vector features to obtain the trained sentiment classification model.
2. The method of claim 1, wherein extracting word vectors from the preprocessed data by means of a neural network model comprises:
Converting the words in the preprocessed data into 0-1 vectors, adding the vectors element-wise, and using the result as the input layer of the neural network model;
Using a Huffman coding tree as the output layer of the neural network model;
Causing the neural network model to learn, without supervision, the word vectors corresponding to the words.
3. The method of claim 1, wherein fusing according to a preset fusion rule comprises:
Superposing according to a preset rule.
4. The method of claim 1, wherein training a sentiment classification model according to the sentence vector features comprises:
Performing training on the sentence vector features by means of a learning model to obtain the sentiment classification model.
5. The method of any one of claims 1 to 4, wherein collecting data from a corpus comprises:
Collecting the data by crawling the content of the corpus with a crawler tool.
6. The method of any one of claims 1 to 4, wherein preprocessing the raw data to obtain preprocessed data comprises:
Cleaning the raw data to obtain cleaned data;
Performing word segmentation and stop-word removal on the cleaned data to obtain the preprocessed data.
7. The method of claim 6, wherein cleaning the raw data comprises:
Deleting HTML tags and URLs from the raw data;
When the content in the corpus is Chinese, converting traditional Chinese characters in the raw data into simplified Chinese characters.
8. A method for text sentiment polarity analysis, comprising:
Extracting sentence vector features from a target text;
Analyzing the sentiment polarity of the target text according to the sentence vector features and the trained sentiment classification model obtained by the method for training a sentiment classification model of any one of claims 1 to 7.
9. A system for training a sentiment classification model, characterized by comprising:
A data acquisition unit, configured to collect data from a corpus to obtain raw data;
A raw data preprocessing unit, configured to preprocess the raw data to obtain preprocessed data;
A word vector extraction unit, configured to extract word vectors from the preprocessed data by means of a neural network model;
A sentence vector feature generation unit, configured to fuse the word vectors according to a preset fusion rule to generate sentence vector features; and
A sentiment classification model training unit, configured to train a sentiment classification model according to the sentence vector features to obtain the trained sentiment classification model.
10. The system of claim 9, wherein the raw data preprocessing unit comprises:
A cleaning subunit, configured to clean the raw data to obtain cleaned data; and
A word segmentation and stop-word removal subunit, configured to perform word segmentation and stop-word removal on the cleaned data to obtain the preprocessed data.
11. The system of claim 10, wherein the cleaning subunit comprises:
A deletion sub-subunit, configured to delete HTML tags and URLs from the raw data; and
A conversion sub-subunit, configured to convert traditional Chinese characters in the raw data into simplified Chinese characters when the text in the corpus is Chinese.
12. A system for text sentiment polarity analysis, comprising:
A sentence vector feature extraction unit, configured to extract sentence vector features from a target text; and
A sentiment polarity analysis unit, configured to analyze the sentiment polarity of the target text according to the sentence vector features and the trained sentiment classification model obtained by the system for training a sentiment classification model of any one of claims 9 to 11.
CN201510931457.9A 2015-12-15 2015-12-15 Emotion classification model training and textual emotion polarity analysis method and system Pending CN105512687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510931457.9A CN105512687A (en) 2015-12-15 2015-12-15 Emotion classification model training and textual emotion polarity analysis method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510931457.9A CN105512687A (en) 2015-12-15 2015-12-15 Emotion classification model training and textual emotion polarity analysis method and system

Publications (1)

Publication Number Publication Date
CN105512687A true CN105512687A (en) 2016-04-20

Family

ID=55720653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510931457.9A Pending CN105512687A (en) 2015-12-15 2015-12-15 Emotion classification model training and textual emotion polarity analysis method and system

Country Status (1)

Country Link
CN (1) CN105512687A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160420