CN107315797A - Internet news acquisition and text emotion prediction system - Google Patents

Internet news acquisition and text emotion prediction system Download PDF

Info

Publication number
CN107315797A
Authority
CN
China
Prior art keywords
news
feature
text
keyword
votes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710463295.XA
Other languages
Chinese (zh)
Inventor
黄江林
周继强
王丽峰
贠周会
吴斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Hongdu Aviation Industry Group Co Ltd
Original Assignee
Jiangxi Hongdu Aviation Industry Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Hongdu Aviation Industry Group Co Ltd filed Critical Jiangxi Hongdu Aviation Industry Group Co Ltd
Priority to CN201710463295.XA priority Critical patent/CN107315797A/en
Publication of CN107315797A publication Critical patent/CN107315797A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

An Internet news acquisition and text emotion prediction system uses news texts crawled from the web as a training set and applies a text classification algorithm to build a training model. News texts awaiting publication are classified against the model and automatically emotion-labeled, predicting the influence the news text is likely to have on public sentiment. The system thus models how social news affects public emotion; predicting the public opinion a single news item may trigger is useful for network security work.

Description

Internet news acquisition and text emotion prediction system
Technical field
The present invention relates to the field of intelligent application technology, and in particular to an Internet news acquisition and text emotion prediction system.
Background art
With the rapid development of the Internet, the web has become an important source of information and lets people follow social developments more easily. In real life, however, readers spontaneously develop a corresponding emotion after reading a news item: most people become angry at news of wrongdoing and are moved by news of people bravely doing the right thing. The news body itself may contain no emotion words at all (such as "gloomy", "sad", "happy"), yet reading such news still induces a certain emotional inclination, and that inclination follows a clear distribution pattern: most people react to a given news item with largely consistent emotions.
As the volume of data grows, neither government agencies nor website maintainers can predict in advance the public mood an Internet news item may arouse or its social impact. A text emotion prediction system is therefore urgently needed that can predict the emotions readers are likely to develop after reading a news item, so that the sentiment tendency of public opinion can be predicted and analyzed ahead of time, enabling early prevention and intervention.
Most existing text emotion prediction systems analyze and mine subjective texts (such as comments, opinions, and reviews): they locate emotion keywords in the subjective text to recover the user emotion the text reflects, usually distinguishing only positive and negative classes. They cannot analyze the news body itself and thereby uncover the factors hidden in a news item that influence readers' emotions.
Summary of the invention
The technical problem solved by the present invention is to provide an Internet news acquisition and text emotion prediction system that overcomes the shortcomings of the background art described above.
The technical problem solved by the present invention is realized by the following technical solution:
An Internet news acquisition and text emotion prediction system uses news texts crawled from the web as a training set, applies a text classification algorithm to build a training model, classifies news texts awaiting publication against the model, performs automatic emotion labeling, and predicts the influence a news text to be published is likely to have on public sentiment. The specific steps are:
One) Use news texts crawled from the web as the training set
Web pages are crawled in bulk by a crawler, and the news body and vote counts are parsed during crawling. At the same time the body is preprocessed and matched against the configured keywords to build a corpus, and the body is automatically emotion-labeled according to the vote counts, so that corpus material meeting the requirements is obtained and stored locally;
1. Acquiring social news
Social news websites that carry emotion vote counts are crawled. The crawler first analyzes the site structure and extracts the news-related content to be crawled, such as the news link URL, from the page source code. After the URL of a news item is obtained, a request is sent with HttpClient, the response is received, and HtmlParser is used to parse the response to obtain the content of that news item, such as the title, body, and vote counts. Filtering is applied while crawling: if neither the body nor the title of a parsed page contains a keyword or an approximately synonymous word (given by the user), the news item is considered unrelated to the keyword and is discarded;
2. Corpus construction and data storage
MySQL is chosen for the corpus; tables are created and the crawled texts related to the configured keywords are stored in the corpus;
i) Create the news table with fields news link news_url, news title news_title, news body news_content, and news vote count news_vote; with the news link as the primary key, the content crawled by the crawler is stored in the news table;
ii) Create the keyword table with fields keyword sequence number keyword_id and keyword keyword; the user-defined keywords are read and stored in the keyword table with the keyword as the primary key;
iii) Create the index table with fields sequence number id, keyword keyword, news title news_title, news body news_content, and news vote count news_vote; the news items in the news table whose text contains a keyword are selected and stored in the index table, indexed by keyword;
3. Automatic labeling of emotion categories
The votes of each news item are obtained by parsing the response corresponding to its URL, and the automatic labeling scheme is set up on the basis of the vote counts:
a) A total-vote threshold N is user-defined; if the total number of votes of a news item is below N, the item is skipped;
b) A gap threshold M is user-defined; if the difference between the largest and the second-largest vote count of a news item is below M, the item does not take part in building the corpus;
c) If a news item passes both thresholds N and M, it is labeled with the class that received the most votes, completing the automatic labeling; after labeling, each news text is stored, together with its class, in the corresponding training-set table;
Two) Text preprocessing
The news texts in the training set are preprocessed, including word segmentation and stop-word removal. Segmentation is completed through the ICTCLAS2015 (Chinese Academy of Sciences) and Lucene segmentation interfaces. A user-defined stop-word list is allowed, and the default stop-word list may also be used, in order to filter out words that contribute little to class discrimination and carry little semantic information;
Three) Feature selection and feature weighting
Feature selection and weight setting are carried out on the preprocessed training-set news texts. Feature selection removes from the feature set those features that do not represent useful information well, in order to improve classification accuracy and reduce computational complexity; weight setting uses statistics of the news texts to assign a weight to each feature item;
1) Building the text vector space model
First, the training-set news texts are converted into a computer-readable format, turning unstructured text into structured text: each news document is converted into a vector whose every dimension holds a feature weight. A feature dictionary is built by feature selection; if the dictionary contains N terms, each news text is represented as an N-dimensional vector, and the weight of each dimension is calculated with the weighting method, yielding the text vector space model;
2) Feature selection
Features are extracted at three granularities: unigrams, bigrams, and topics; after selection, the features are stored in a HashMap. When text features are extracted, the chi-square statistic is used to measure the correlation between a word and a document class: the higher a word's chi-square value for a class, the better the word represents that class, i.e. the more class-discriminative information it carries. For multi-class problems, the chi-square value of a word is first computed for every class, and the maximum of these values is taken as the word's chi-square value over the whole corpus;
3) Feature weighting
A feature weight measures the importance, or discriminative power, of a feature item in the text representation. Weights are computed with TF-IDF, where TF is the term frequency, measuring how well the word describes the document content, and IDF is the inverse document frequency, measuring how well the word distinguishes between documents;
Four) Building the training model
Using the SVM training method, a kernel function applies a nonlinear transformation to the feature-weighted chi-square feature vectors, mapping the input nonlinear feature vectors into a high-dimensional feature space; an optimal linear separating hyperplane is then found in that space to separate the text classes, and the training model is established;
I) Training-set vector model
The feature dimensionality is user-defined; features are extracted according to the feature selection method and the weights under each granularity are set. If the dimensionality is too large, training becomes slow, the model overfits, and excessive noise is introduced; if it is too small, not enough textual information is carried. Either case harms classification performance, so training models are built under several feature dimensionalities and the classification accuracy under cross-validation, or on a held-out test set, is used to determine the optimal input dimensionality and establish the training-set vector model;
II) Input normalization
Because the raw values of the training-set vector model may span a range that is too large or too small, they are first rescaled to a proper range, i.e. input normalization, which speeds up both training and prediction;
III) Cross-validated parameter optimization
Grid search is used, allowing user-defined initial values and step sizes for the loss (penalty) parameter and the kernel gamma parameter; 5-fold cross-validation evaluates the quality of the training model for each combination of loss parameter and gamma, which avoids interference from random factors, and the optimal loss parameter and kernel parameter are obtained to establish the SVM model. In 5-fold cross-validation the initial sample is split into 5 subsamples; one subsample is held out as validation data, the other 4 are used for training, the procedure is repeated 5 times so that each subsample is used for validation exactly once, and the 5 results are averaged to give a single estimate;
Five) Prediction output
The web pages crawled in bulk by the crawler are input-normalized and loaded into the training vector model, the SVM model predicts the class of each text to be classified, and the predicted class label is output.
Beneficial effects: the present invention uses news texts crawled from the web as a training set, applies a text classification algorithm to build a training model, classifies news texts awaiting publication against the model, and performs automatic emotion labeling, thereby predicting the influence a news text to be published is likely to have on public sentiment. It builds a text emotion prediction system for the effect of social news on public emotion; predicting the public opinion a news item may trigger provides a useful tool for network security.
Brief description of the drawings
Fig. 1 is a flow chart of the preferred embodiment of the present invention.
Embodiment
In order to make the technical means, creative features, objects and advantages of the present invention easy to understand, the present invention is further explained below with reference to the specific illustrations.
Referring to Fig. 1, an Internet news acquisition and text emotion prediction system operates in the following specific steps:
One) Build the training set from news texts crawled from the web
Web pages are crawled in bulk by a crawler, and the news body and vote counts are parsed during crawling. At the same time the body is preprocessed and matched against the configured keywords to build a corpus, and the body is automatically emotion-labeled according to the vote counts, so that corpus material meeting the requirements is obtained and stored locally;
1. Acquiring social news
Social news websites that carry emotion vote counts are crawled. The crawler first analyzes the site structure, extracts the news-related content to be crawled (such as the news link URL) from the page source code, and filters out useless links such as advertisements. After the URL of a news item is obtained, a request is sent with HttpClient, the response is received, and HtmlParser is used to parse the response to obtain the content of that news item, such as the title, body, and vote counts. Filtering is applied while crawling: if neither the body nor the title of a parsed page contains a keyword or an approximately synonymous word (given by the user), the news item is considered unrelated to the keyword and is discarded, as in the sketch below;
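The acquisition step above is described in terms of Java components (HttpClient for requests, HtmlParser for parsing). Below is a minimal Python sketch of the same flow, using the requests and BeautifulSoup libraries as stand-ins; the CSS selectors and example keywords are assumptions, since the real selectors depend on the structure of the target news site.

```python
import requests
from bs4 import BeautifulSoup

KEYWORDS = {"地震", "救援"}  # user-defined keywords (hypothetical examples)

def fetch_news(url):
    """Fetch a news page and extract title, body and vote counts.

    The CSS selectors below are placeholders; real sites need
    site-specific parsing ("first analyze the site structure").
    """
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.select_one("h1").get_text(strip=True)
    body = " ".join(p.get_text(strip=True) for p in soup.select("div.article p"))
    votes = [int(v.get_text(strip=True)) for v in soup.select("span.vote-count")]
    return {"url": url, "title": title, "body": body, "votes": votes}

def is_relevant(news, keywords=KEYWORDS):
    """Keep a news item only if its title or body contains a configured keyword."""
    text = news["title"] + news["body"]
    return any(kw in text for kw in keywords)
```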
2. Corpus construction and data storage
MySQL is chosen for the corpus; tables are created and the crawled texts related to the configured keywords are stored in the corpus (see the sketch after this list);
i) Create the news table with fields news link news_url, news title news_title, news body news_content, and news vote count news_vote; the news link is the primary key, which prevents duplicate insertions, and the content crawled by the crawler is stored in the news table;
ii) Create the keyword table with fields keyword sequence number keyword_id and keyword keyword; the user-defined keywords are read and stored in the keyword table with the keyword as the primary key;
iii) Create the index table with fields sequence number id, keyword keyword, news title news_title, news body news_content, and news vote count news_vote; the news items in the news table whose text contains a keyword are selected and stored in the index table, indexed by keyword;
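A minimal sketch of the three tables, assuming the mysql-connector-python package and a local MySQL instance; the column types, credentials, and the table name keyword_index (chosen because INDEX is a reserved word in MySQL) are assumptions, while the field names follow the description above.

```python
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root",
                               password="secret", database="corpus")
cur = conn.cursor()

# News table: one row per crawled news item, keyed by its URL.
cur.execute("""
CREATE TABLE IF NOT EXISTS news (
    news_url     VARCHAR(512) PRIMARY KEY,
    news_title   VARCHAR(512),
    news_content TEXT,
    news_vote    VARCHAR(255)   -- per-class vote counts; storage format is an assumption
)""")

# Keyword table: the user-defined keywords, keyed by the keyword itself.
cur.execute("""
CREATE TABLE IF NOT EXISTS keyword (
    keyword_id INT,
    keyword    VARCHAR(128) PRIMARY KEY
)""")

# Index table: news items that contain a keyword, indexed by keyword.
cur.execute("""
CREATE TABLE IF NOT EXISTS keyword_index (
    id           INT AUTO_INCREMENT PRIMARY KEY,
    keyword      VARCHAR(128),
    news_title   VARCHAR(512),
    news_content TEXT,
    news_vote    VARCHAR(255),
    INDEX (keyword)
)""")
conn.commit()
```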
3. Automatic labeling of emotion categories
The votes of each news item are obtained by parsing the response corresponding to its URL, and the automatic labeling scheme is set up on the basis of the vote counts (a sketch of the rule follows this list):
a) A total-vote threshold N is user-defined; if the total number of votes of a news item is below N, the item is skipped;
b) A gap threshold M is user-defined; if the difference between the largest and the second-largest vote count of a news item is below M, the item does not take part in building the corpus;
c) If a news item passes both thresholds N and M, it is labeled with the class that received the most votes, completing the automatic labeling; after labeling, each news text is stored, together with its class, in the corresponding training-set table;
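A small sketch of the labeling rule in steps a)-c), assuming each news item carries one vote count per emotion class; the threshold values shown are illustrative, not values given by the patent.

```python
def auto_label(votes, total_threshold_n=100, gap_threshold_m=20):
    """Return the index of the emotion class with the most votes, or None.

    votes: list of per-class vote counts for one news item.
    The item is skipped when its total vote count is below N (step a),
    or when the gap between the largest and second-largest counts is
    below M, i.e. the item is ambiguous (step b).
    """
    if sum(votes) < total_threshold_n:
        return None
    ranked = sorted(votes, reverse=True)
    if len(ranked) > 1 and ranked[0] - ranked[1] < gap_threshold_m:
        return None
    return votes.index(max(votes))  # step c): label with the most-voted class

# Example: vote counts for (moved, angry, sad) -> labelled as class 1 ("angry")
print(auto_label([10, 180, 25]))  # -> 1
```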
Two) Text preprocessing
The news texts in the training set are preprocessed, including word segmentation and stop-word removal. Segmentation is completed through the ICTCLAS2015 (Chinese Academy of Sciences) and Lucene segmentation interfaces. A user-defined stop-word list is allowed, and the default stop-word list may also be used, in order to filter out words that contribute little to class discrimination and carry little semantic information (for example, common function words);
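The patent performs segmentation through the ICTCLAS2015 and Lucene interfaces; the sketch below uses the jieba segmenter as a stand-in (a different tool) to illustrate the same segmentation-plus-stop-word step, and the stop words listed are illustrative.

```python
import jieba

# Default stop words; a user-defined list can extend them (illustrative values).
STOP_WORDS = {"的", "了", "是", "在", "和"}

def preprocess(text, extra_stop_words=()):
    """Segment a news text and drop stop words with little class-discriminative value."""
    stop = STOP_WORDS | set(extra_stop_words)
    return [w for w in jieba.cut(text) if w.strip() and w not in stop]

print(preprocess("今天的新闻让很多人感动"))
```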
Three) Feature selection and feature weighting
Feature selection and weight setting are carried out on the preprocessed training-set news texts. Feature selection removes from the feature set those features that do not represent useful information well, in order to improve classification accuracy and reduce computational complexity; weight setting uses statistics of the news texts to assign a weight to each feature item;
1) Building the text vector space model
First, the training-set news texts are converted into a computer-readable format, turning unstructured text into structured text that the computer can process. This embodiment uses a vector space model, i.e. each news document is converted into a vector whose every dimension holds a feature weight;
Specifically: let a news document = {t1, w1; ...; tm, wm}, where tn is the n-th feature item and wn is the weight of the n-th dimension. Taking the present text as an example, a feature dictionary is built by feature selection; if the dictionary contains N terms, the news text is represented as an N-dimensional vector and the weight of each dimension is calculated with the weighting method, yielding the text vector space model;
2) Feature selection
Features are extracted at three granularities: unigrams, bigrams, and topics. Taking bigram-granularity extraction as an example: using a Skip-Bigrams bigram feature dictionary, the training-set news content is scanned word by word with a sliding window whose maximum gap is 2, forming word-pair fragments of length 2 that are then stored in a HashMap. This produces bigram features with clear emotional tendency. For the sentence "I | love | China" ("我|爱|中国"), the Skip-Bigram dictionary yields the pairs "I/love", "love/China", and "I/China", among which "love/China" is a semantically rich feature word. Over a large corpus many such co-occurrence relations are obtained, segmented, and stored in a HashMap, as sketched below;
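A sketch of one reading of the Skip-Bigram extraction described above: every ordered word pair whose positions in the segmented text differ by at most the window size is kept as a bigram feature and counted in a dictionary (the HashMap of the description).

```python
from collections import Counter

def skip_bigrams(words, max_gap=2):
    """Yield word pairs 'w1/w2' whose positions differ by at most max_gap."""
    for i, w1 in enumerate(words):
        for j in range(i + 1, min(i + 1 + max_gap, len(words))):
            yield f"{w1}/{words[j]}"

features = Counter(skip_bigrams(["我", "爱", "中国"]))
print(features)  # Counter({'我/爱': 1, '我/中国': 1, '爱/中国': 1})
```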
After the feature set is built, text features are extracted; this embodiment takes the chi-square statistic as an example. By comparing theoretical and observed counts, it tests whether the independence hypothesis holds and thereby measures the correlation between a word and a document class. Assuming word t and class c follow a chi-square distribution with one degree of freedom, the higher the chi-square value of a word for a class, the better the word represents that class, i.e. the more class-discriminative information it carries. The chi-square formula is:
chi2(t, c) = sum x (A x D - B x C)^2 / [(A + B) x (C + D) x (A + C) x (B + D)]
where A is the number of documents in class c that contain word t, B the number of documents outside class c that contain t, C the number of documents in class c that do not contain t, D the number of documents outside class c that do not contain t, and sum the total number of documents;
For multi-class problems, the chi-square value of word t is first computed for every class, and the maximum of these values is taken as t's chi-square value over the whole corpus;
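A sketch of the chi-square score defined above, computed from the document counts A, B, C, D, with the multi-class score taken as the maximum over all classes; the example counts are made up.

```python
def chi_square(a, b, c, d):
    """chi2(t, c) for one word/class pair.

    a: docs in class c containing t        b: docs outside c containing t
    c: docs in class c not containing t    d: docs outside c not containing t
    """
    total = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return total * (a * d - b * c) ** 2 / denom if denom else 0.0

def chi_square_multiclass(counts_per_class):
    """Take the maximum chi2 over all classes as the word's score on the corpus."""
    return max(chi_square(*counts) for counts in counts_per_class)

# word appears in 30 of 40 docs of one class and in 10 of 60 docs of the others
print(chi_square_multiclass([(30, 10, 10, 50)]))
```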
3) Feature weighting
A feature weight measures the importance, or discriminative power, of a feature item in the text representation. This embodiment computes weights with TF-IDF, where TF is the term frequency, measuring how well the word describes the document content, and IDF is the inverse document frequency, measuring how well the word distinguishes between documents;
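A minimal TF-IDF weighting sketch over the feature dictionary; the log-scaled, add-one-smoothed IDF used here is a common variant and an assumption, since the patent does not spell out the exact formula, and the example terms are illustrative.

```python
import math
from collections import Counter

def tfidf_vector(doc_words, feature_dict, doc_freq, num_docs):
    """Weight each feature-dictionary term by term frequency x inverse document frequency."""
    tf = Counter(doc_words)
    return [tf[t] * math.log((num_docs + 1) / (doc_freq.get(t, 0) + 1))
            for t in feature_dict]

feature_dict = ["爱/中国", "感动", "愤怒"]           # N-term feature dictionary (illustrative)
doc_freq = {"爱/中国": 3, "感动": 40, "愤怒": 25}     # documents containing each term
print(tfidf_vector(["感动", "爱/中国", "感动"], feature_dict, doc_freq, num_docs=100))
```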
Four) Building the training model
This embodiment takes the SVM training method as an example. Its basic mode is to apply a nonlinear transformation through a kernel function, mapping the input nonlinear feature vectors into a high-dimensional feature space, and then to find the optimal linear separating hyperplane in that space, so that the text classes are separated and the training model is established;
I) Training-set vector model
The feature dimensionality is user-defined; features are extracted according to the feature selection method, the weights under the bigram granularity are set, and the text vector space model is built. If the dimensionality is too large, training becomes slow, the model overfits, and excessive noise is introduced; if it is too small, not enough textual information is carried. Either case harms classification performance, so training models are built under several feature dimensionalities and the classification accuracy under cross-validation, or on a held-out test set, is used to determine the optimal input dimensionality and establish the training-set vector model;
II) Input normalization
Because the raw values of the training set may span a range that is too large or too small, the raw training data are first rescaled to a proper range, i.e. input normalization, which speeds up both training and prediction;
III) Cross-validated parameter optimization
Several important parameters in the SVM, such as the loss (penalty) parameter C and the kernel parameter gamma (G), must be set for the overall generalization performance to be good. This embodiment uses grid search, allowing user-defined initial values and step sizes for C and G, and evaluates the quality of the training model for each combination of C and G with 5-fold cross-validation, which avoids interference from random factors; the optimal C and G are obtained and used to establish the SVM model. In 5-fold cross-validation the initial sample is split into 5 subsamples; one subsample is held out as validation data, the other 4 are used for training, the procedure is repeated 5 times so that each subsample is used for validation exactly once, and the 5 results are averaged to give a single estimate;
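A sketch of steps I)-III) and the prediction step using scikit-learn as a stand-in: the feature vectors are rescaled to a fixed range, and the SVM penalty C (the loss parameter above) and the RBF kernel's gamma are tuned by grid search with 5-fold cross-validation; the data and the parameter grid are placeholders.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# X: TF-IDF weighted feature vectors of the training news, y: auto-labelled classes.
rng = np.random.default_rng(0)
X = rng.random((200, 50))          # placeholder for the real training vectors
y = rng.integers(0, 3, 200)        # placeholder for the real emotion labels

pipeline = Pipeline([
    ("scale", MinMaxScaler()),     # step II): rescale inputs to a proper range
    ("svm", SVC(kernel="rbf")),
])

param_grid = {                     # step III): custom values and steps for C and gamma
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(pipeline, param_grid, cv=5)  # 5-fold cross-validation
search.fit(X, y)

print(search.best_params_)
# step V): predict the emotion class of news texts to be published
new_vectors = rng.random((5, 50))
print(search.predict(new_vectors))
```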
Five) Prediction output
The web pages crawled in bulk by the crawler are input-normalized and loaded into the training vector model, the SVM model predicts the class of each text to be classified, and the predicted class label is output.
The general principle, principal features and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited to the above embodiments; the above embodiments and the description merely illustrate the principle of the invention, and various changes and improvements may be made without departing from the spirit and scope of the invention, all of which fall within the scope of the claimed invention. The scope of protection of the invention is defined by the appended claims and their equivalents.

Claims (4)

1. An Internet news acquisition and text emotion prediction system, characterized in that news texts crawled from the web are used as a training set, a text classification algorithm is applied to build a training model, news texts awaiting publication are classified against the training model, and automatic emotion labeling is performed, with the following specific steps:
One) Use news texts crawled from the web as the training set
Web pages are crawled in bulk by a crawler, and the news body and vote counts are parsed during crawling; at the same time the body is preprocessed and matched against the configured keywords to build a corpus, and the body is automatically emotion-labeled according to the vote counts, so that corpus material meeting the requirements is obtained and stored to local disk;
Two) Text preprocessing
The news texts in the training set are preprocessed, including word segmentation and stop-word removal; segmentation is completed through the ICTCLAS2015 (Chinese Academy of Sciences) and Lucene segmentation interfaces;
Three) Feature selection and feature weighting
Feature selection and weight setting are carried out on the preprocessed training-set news texts; feature selection removes from the feature set those features that do not represent useful information well, in order to improve classification accuracy and reduce computational complexity; weight setting uses statistics of the news texts to assign a weight to each feature item;
1) Build the text vector space model
First, the training-set news texts are converted into a computer-readable format, turning unstructured text into structured text; each news document is converted into a vector whose every dimension holds a feature weight; a feature dictionary is built by feature selection; if the dictionary contains N terms, each news text is represented as an N-dimensional vector and the weight of each dimension is calculated with the weighting method, yielding the text vector space model;
2) Feature selection
Features are extracted at three granularities: unigrams, bigrams, and topics; after selection the features are stored in a HashMap; when text features are extracted, the chi-square statistic measures the correlation between a word and a document class, and the higher a word's chi-square value for a class, the better the word represents that class, i.e. the more class-discriminative information it carries; for multi-class problems, the chi-square value of a word is first computed for every class, and the maximum of these values is taken as the word's chi-square value over the whole corpus;
3) Feature weighting
Feature weights are computed with TF-IDF, where TF is the term frequency, measuring how well the word describes the document content, and IDF is the inverse document frequency, measuring how well the word distinguishes between documents;
Four) Build the training model
Using the SVM training method, a kernel function applies a nonlinear transformation to the feature-weighted chi-square feature vectors, mapping the input nonlinear feature vectors into a high-dimensional feature space; an optimal linear separating hyperplane is then found in that space to separate the text classes, and the training model is established;
I) Training-set vector model
The feature dimensionality is user-defined; features are extracted according to the feature selection method, the weights under each granularity are set, and the training-set vector model is built;
II) Input normalization
Because the raw values of the training-set vector model may span a range that is too large or too small, they are first rescaled to a proper range, i.e. input normalization;
III) Cross-validated parameter optimization
Grid search is used, allowing user-defined initial values and step sizes for the loss parameter and the kernel gamma parameter; 5-fold cross-validation evaluates the quality of the training model for each combination of loss parameter and gamma, and the optimal loss and kernel parameters are obtained to establish the SVM model;
Five) Prediction output
The web pages crawled in bulk by the crawler are input-normalized and loaded into the training vector model, the SVM model predicts the class of each text to be classified, and the predicted class label is output.
2. The Internet news acquisition and text emotion prediction system according to claim 1, characterized in that the training set is constructed as follows:
1. Acquiring social news
Social news websites that carry emotion vote counts are crawled; the crawler first analyzes the site structure and extracts the news-related content to be crawled, such as the news link URL, from the page source code; after the URL of a news item is obtained, a request is sent with HttpClient, the response is received, and HtmlParser is used to parse the response to obtain the content of that news item, such as the title, body, and vote counts; filtering is applied while crawling: if neither the body nor the title of a parsed page contains a keyword or an approximately synonymous word, the news item is considered unrelated to the keyword and is discarded;
2. Corpus construction and data storage
MySQL is chosen for the corpus; tables are created and the crawled texts related to the configured keywords are stored in the corpus;
i) Create the news table with fields news link news_url, news title news_title, news body news_content, and news vote count news_vote; with the news link as the primary key, the content crawled by the crawler is stored in the news table;
ii) Create the keyword table with fields keyword sequence number keyword_id and keyword keyword; the user-defined keywords are read and stored in the keyword table with the keyword as the primary key;
iii) Create the index table with fields sequence number id, keyword keyword, news title news_title, news body news_content, and news vote count news_vote; the news items in the news table whose text contains a keyword are selected and stored in the index table, indexed by keyword;
3. Automatic labeling of emotion categories
The votes of each news item are obtained by parsing the response corresponding to its URL, the automatic labeling of emotion categories is set up on the basis of the vote counts, and after labeling each news text is stored, together with its class, in the corresponding training-set table.
3. The Internet news acquisition and text emotion prediction system according to claim 2, characterized in that the automatic labeling of emotion categories is carried out as follows:
a) A total-vote threshold N is user-defined; if the total number of votes of a news item is below N, the item is skipped;
b) A gap threshold M is user-defined; if the difference between the largest and the second-largest vote count of a news item is below M, the item does not take part in building the corpus;
c) If a news item passes both thresholds N and M, it is labeled with the class that received the most votes, completing the automatic labeling.
4. The Internet news acquisition and text emotion prediction system according to claim 1, characterized in that the 5-fold cross-validation comprises the following specific steps: the initial sample is split into 5 subsamples; one subsample is held out as validation data and the other 4 are used for training; the cross-validation is repeated 5 times so that each subsample is used for validation exactly once, and the 5 results are averaged to give a single estimate.
CN201710463295.XA 2017-06-19 2017-06-19 Internet news acquisition and text emotion prediction system Pending CN107315797A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710463295.XA CN107315797A (en) 2017-06-19 2017-06-19 Internet news acquisition and text emotion prediction system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710463295.XA CN107315797A (en) 2017-06-19 2017-06-19 Internet news acquisition and text emotion prediction system

Publications (1)

Publication Number Publication Date
CN107315797A true CN107315797A (en) 2017-11-03

Family

ID=60181878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710463295.XA Pending CN107315797A (en) Internet news acquisition and text emotion prediction system

Country Status (1)

Country Link
CN (1) CN107315797A (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885833A (en) * 2017-11-09 2018-04-06 山东师范大学 Method and system for rapidly detecting earth surface coverage change based on Web news text
CN108153853A (en) * 2017-12-22 2018-06-12 齐鲁工业大学 Chinese Concept Vectors generation method and device based on Wikipedia link structures
CN108363699A (en) * 2018-03-21 2018-08-03 浙江大学城市学院 A kind of netizen's school work mood analysis method based on Baidu's mhkc
CN108389082A (en) * 2018-03-15 2018-08-10 火烈鸟网络(广州)股份有限公司 A kind of game intelligence ranking method and system
CN108509629A (en) * 2018-04-09 2018-09-07 南京大学 Text emotion analysis method based on emotion dictionary and support vector machine
CN108595704A (en) * 2018-05-10 2018-09-28 成都信息工程大学 A kind of the emotion of news and classifying importance method based on soft disaggregated model
CN108829898A (en) * 2018-06-29 2018-11-16 无码科技(杭州)有限公司 HTML content page issuing time extracting method and system
CN109376244A (en) * 2018-10-25 2019-02-22 山东省通信管理局 A kind of swindle website identification method based on tagsort
CN109409537A (en) * 2018-09-29 2019-03-01 深圳市元征科技股份有限公司 A kind of Maintenance Cases classification method and device
CN109471942A (en) * 2018-11-07 2019-03-15 合肥工业大学 Chinese comment sensibility classification method and device based on evidential reasoning rule
CN109522927A (en) * 2018-10-09 2019-03-26 北京奔影网络科技有限公司 Sentiment analysis method and device for user message
CN109657057A (en) * 2018-11-22 2019-04-19 天津大学 A kind of short text sensibility classification method of combination SVM and document vector
CN109710825A (en) * 2018-11-02 2019-05-03 成都三零凯天通信实业有限公司 Webpage harmful information identification method based on machine learning
CN109783800A (en) * 2018-12-13 2019-05-21 北京百度网讯科技有限公司 Acquisition methods, device, equipment and the storage medium of emotion keyword
CN110298403A (en) * 2019-07-02 2019-10-01 郭刚 The sentiment analysis method and system of enterprise dominant in a kind of financial and economic news
TWI681308B (en) * 2018-11-01 2020-01-01 財團法人資訊工業策進會 Apparatus and method for predicting response of an article
CN110728139A (en) * 2018-06-27 2020-01-24 鼎复数据科技(北京)有限公司 Key information extraction model and construction method thereof
CN112100372A (en) * 2020-08-20 2020-12-18 西南电子技术研究所(中国电子科技集团公司第十研究所) Head news prediction classification method
CN112131384A (en) * 2020-08-27 2020-12-25 科航(苏州)信息科技有限公司 News classification method and computer-readable storage medium
WO2020258481A1 (en) * 2019-06-28 2020-12-30 平安科技(深圳)有限公司 Method and apparatus for intelligently recommending personalized text, and computer-readable storage medium
CN112201225A (en) * 2020-09-30 2021-01-08 北京大米科技有限公司 Corpus obtaining method and device, readable storage medium and electronic equipment
CN112819023A (en) * 2020-06-11 2021-05-18 腾讯科技(深圳)有限公司 Sample set acquisition method and device, computer equipment and storage medium
CN113127595A (en) * 2021-04-26 2021-07-16 数库(上海)科技有限公司 Method, device, equipment and storage medium for extracting viewpoint details of research and report abstract

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104331506A (en) * 2014-11-20 2015-02-04 北京理工大学 Multiclass emotion analyzing method and system facing bilingual microblog text
CN104572613A (en) * 2013-10-21 2015-04-29 富士通株式会社 Data processing device, data processing method and program
CN105183717A (en) * 2015-09-23 2015-12-23 东南大学 OSN user emotion analysis method based on random forest and user relationship
CN105589941A (en) * 2015-12-15 2016-05-18 北京百分点信息科技有限公司 Emotional information detection method and apparatus for web text
CN105824922A (en) * 2016-03-16 2016-08-03 重庆邮电大学 Emotion classifying method fusing intrinsic feature and shallow feature

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572613A (en) * 2013-10-21 2015-04-29 富士通株式会社 Data processing device, data processing method and program
CN104331506A (en) * 2014-11-20 2015-02-04 北京理工大学 Multiclass emotion analyzing method and system facing bilingual microblog text
CN105183717A (en) * 2015-09-23 2015-12-23 东南大学 OSN user emotion analysis method based on random forest and user relationship
CN105589941A (en) * 2015-12-15 2016-05-18 北京百分点信息科技有限公司 Emotional information detection method and apparatus for web text
CN105824922A (en) * 2016-03-16 2016-08-03 重庆邮电大学 Emotion classifying method fusing intrinsic feature and shallow feature

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘泽光: "Research on Key Technologies of Online Public Opinion Analysis", China Master's Theses Full-text Database, Information Science and Technology *
叶升阳: "Research on Sentiment Orientation Analysis Based on Online Comments", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885833A (en) * 2017-11-09 2018-04-06 山东师范大学 Method and system for rapidly detecting earth surface coverage change based on Web news text
CN107885833B (en) * 2017-11-09 2020-05-05 山东师范大学 Method and system for rapidly detecting earth surface coverage change based on Web news text
CN108153853A (en) * 2017-12-22 2018-06-12 齐鲁工业大学 Chinese Concept Vectors generation method and device based on Wikipedia link structures
CN108153853B (en) * 2017-12-22 2022-02-01 齐鲁工业大学 Chinese concept vector generation method and device based on Wikipedia link structure
CN108389082A (en) * 2018-03-15 2018-08-10 火烈鸟网络(广州)股份有限公司 A kind of game intelligence ranking method and system
CN108389082B (en) * 2018-03-15 2021-07-06 火烈鸟网络(广州)股份有限公司 Intelligent game rating method and system
CN108363699A (en) * 2018-03-21 2018-08-03 浙江大学城市学院 A kind of netizen's school work mood analysis method based on Baidu's mhkc
CN108509629A (en) * 2018-04-09 2018-09-07 南京大学 Text emotion analysis method based on emotion dictionary and support vector machine
CN108509629B (en) * 2018-04-09 2022-05-13 南京大学 Text emotion analysis method based on emotion dictionary and support vector machine
CN108595704A (en) * 2018-05-10 2018-09-28 成都信息工程大学 A kind of the emotion of news and classifying importance method based on soft disaggregated model
CN110728139A (en) * 2018-06-27 2020-01-24 鼎复数据科技(北京)有限公司 Key information extraction model and construction method thereof
CN108829898B (en) * 2018-06-29 2020-11-20 无码科技(杭州)有限公司 HTML content page release time extraction method and system
CN108829898A (en) * 2018-06-29 2018-11-16 无码科技(杭州)有限公司 HTML content page issuing time extracting method and system
CN109409537A (en) * 2018-09-29 2019-03-01 深圳市元征科技股份有限公司 A kind of Maintenance Cases classification method and device
CN109522927A (en) * 2018-10-09 2019-03-26 北京奔影网络科技有限公司 Sentiment analysis method and device for user message
CN109376244A (en) * 2018-10-25 2019-02-22 山东省通信管理局 A kind of swindle website identification method based on tagsort
TWI681308B (en) * 2018-11-01 2020-01-01 財團法人資訊工業策進會 Apparatus and method for predicting response of an article
CN109710825A (en) * 2018-11-02 2019-05-03 成都三零凯天通信实业有限公司 Webpage harmful information identification method based on machine learning
CN109471942B (en) * 2018-11-07 2021-09-07 合肥工业大学 Chinese comment emotion classification method and device based on evidence reasoning rule
CN109471942A (en) * 2018-11-07 2019-03-15 合肥工业大学 Chinese comment sensibility classification method and device based on evidential reasoning rule
CN109657057A (en) * 2018-11-22 2019-04-19 天津大学 A kind of short text sensibility classification method of combination SVM and document vector
CN109783800B (en) * 2018-12-13 2024-04-12 北京百度网讯科技有限公司 Emotion keyword acquisition method, device, equipment and storage medium
CN109783800A (en) * 2018-12-13 2019-05-21 北京百度网讯科技有限公司 Acquisition methods, device, equipment and the storage medium of emotion keyword
WO2020258481A1 (en) * 2019-06-28 2020-12-30 平安科技(深圳)有限公司 Method and apparatus for intelligently recommending personalized text, and computer-readable storage medium
CN110298403B (en) * 2019-07-02 2023-12-12 北京金融大数据有限公司 Emotion analysis method and system for enterprise main body in financial news
CN110298403A (en) * 2019-07-02 2019-10-01 郭刚 The sentiment analysis method and system of enterprise dominant in a kind of financial and economic news
CN112819023A (en) * 2020-06-11 2021-05-18 腾讯科技(深圳)有限公司 Sample set acquisition method and device, computer equipment and storage medium
CN112819023B (en) * 2020-06-11 2024-02-02 腾讯科技(深圳)有限公司 Sample set acquisition method, device, computer equipment and storage medium
CN112100372A (en) * 2020-08-20 2020-12-18 西南电子技术研究所(中国电子科技集团公司第十研究所) Head news prediction classification method
CN112131384A (en) * 2020-08-27 2020-12-25 科航(苏州)信息科技有限公司 News classification method and computer-readable storage medium
CN112201225A (en) * 2020-09-30 2021-01-08 北京大米科技有限公司 Corpus obtaining method and device, readable storage medium and electronic equipment
CN112201225B (en) * 2020-09-30 2024-02-02 北京大米科技有限公司 Corpus acquisition method and device, readable storage medium and electronic equipment
CN113127595B (en) * 2021-04-26 2022-08-16 数库(上海)科技有限公司 Method, device, equipment and storage medium for extracting viewpoint details of research and report abstract
CN113127595A (en) * 2021-04-26 2021-07-16 数库(上海)科技有限公司 Method, device, equipment and storage medium for extracting viewpoint details of research and report abstract

Similar Documents

Publication Publication Date Title
CN107315797A (en) Internet news acquisition and text emotion prediction system
Gupta et al. Study of Twitter sentiment analysis using machine learning algorithms on Python
Ahuja et al. The impact of features extraction on the sentiment analysis
CN104951548B (en) A kind of computational methods and system of negative public sentiment index
WO2021051518A1 (en) Text data classification method and apparatus based on neural network model, and storage medium
CN107193801A (en) A kind of short text characteristic optimization and sentiment analysis method based on depth belief network
CN110134792B (en) Text recognition method and device, electronic equipment and storage medium
CN109255012B (en) Method and device for machine reading understanding and candidate data set size reduction
CN111767716A (en) Method and device for determining enterprise multilevel industry information and computer equipment
CN110795525A (en) Text structuring method and device, electronic equipment and computer readable storage medium
CN112256861B (en) Rumor detection method based on search engine return result and electronic device
CN109670014A (en) A kind of Authors of Science Articles name disambiguation method of rule-based matching and machine learning
WO2012096388A1 (en) Unexpectedness determination system, unexpectedness determination method, and program
KR20200007713A (en) Method and Apparatus for determining a topic based on sentiment analysis
CN115796181A (en) Text relation extraction method for chemical field
CN108228612B (en) Method and device for extracting network event keywords and emotional tendency
CN115017303A (en) Method, computing device and medium for enterprise risk assessment based on news text
CN110134777A (en) Problem De-weight method, device, electronic equipment and computer readable storage medium
Chun et al. Detecting Political Bias Trolls in Twitter Data.
CN115329085A (en) Social robot classification method and system
Lim et al. Examining machine learning techniques in business news headline sentiment analysis
Asha et al. Fake news detection using n-gram analysis and machine learning algorithms
Al Mostakim et al. Bangla content categorization using text based supervised learning methods
Wambsganss et al. Improving Explainability and Accuracy through Feature Engineering: A Taxonomy of Features in NLP-based Machine Learning.
Anjum et al. Exploring humor in natural language processing: a comprehensive review of JOKER tasks at CLEF symposium 2023

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171103

RJ01 Rejection of invention patent application after publication