CN107315797A - Internet news acquisition and text emotion prediction system - Google Patents

Internet news acquisition and text emotion prediction system Download PDF

Info

Publication number
CN107315797A
Authority
CN
China
Prior art keywords
news
feature
text
keyword
votes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710463295.XA
Other languages
Chinese (zh)
Inventor
黄江林
周继强
王丽峰
贠周会
吴斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Hongdu Aviation Industry Group Co Ltd
Original Assignee
Jiangxi Hongdu Aviation Industry Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Hongdu Aviation Industry Group Co Ltd filed Critical Jiangxi Hongdu Aviation Industry Group Co Ltd
Priority to CN201710463295.XA priority Critical patent/CN107315797A/en
Publication of CN107315797A publication Critical patent/CN107315797A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

An Internet news acquisition and text emotion prediction system uses news texts crawled from the web as a training set and applies a text classification algorithm to build a training model. News texts awaiting publication are classified against the model and automatically emotion-labeled, predicting the influence the news text is likely to have on public sentiment. The system thus models how social news affects public emotion; predicting the public opinion a single news item may trigger is useful for network security work.

Description

Internet news acquisition and text emotion prediction system
Technical field
The present invention relates to the field of intelligent application technology, and in particular to an Internet news acquisition and text emotion prediction system.
Background art
With the rapid development of the Internet, the web has become an important source of information and lets people follow social developments more easily. In real life, however, readers spontaneously develop a corresponding emotion after reading a news item: most people become angry at news of wrongdoing and are moved by news of people bravely doing the right thing. The news body itself may contain no emotion words at all (such as "gloomy", "sad", "happy"), yet reading such news still induces a certain emotional inclination, and that inclination follows a clear distribution pattern: most people react to a given news item with largely consistent emotions.
As the volume of data grows, neither government agencies nor website maintainers can predict in advance the public mood an Internet news item may arouse or its social impact. A text emotion prediction system is therefore urgently needed that can predict the emotions readers are likely to develop after reading a news item, so that the sentiment tendency of public opinion can be predicted and analyzed ahead of time, enabling early prevention and intervention.
Most existing text emotion prediction systems analyze and mine subjective texts (such as comments, opinions, and reviews): they locate emotion keywords in the subjective text to recover the user emotion the text reflects, usually distinguishing only positive and negative classes. They cannot analyze the news body itself and thereby uncover the factors hidden in a news item that influence readers' emotions.
Summary of the invention
The technical problem solved by the present invention is to provide an Internet news acquisition and text emotion prediction system that overcomes the shortcomings of the background art described above.
The technical problem solved by the present invention is realized by the following technical solution:
An Internet news acquisition and text emotion prediction system uses news texts crawled from the web as a training set, applies a text classification algorithm to build a training model, classifies news texts awaiting publication against the model, performs automatic emotion labeling, and predicts the influence a news text to be published is likely to have on public sentiment. The specific steps are:
One) Use news texts crawled from the web as the training set
Web pages are crawled in bulk by a crawler, and the news body and vote counts are parsed during crawling. At the same time the body is preprocessed and matched against the configured keywords to build a corpus, and the body is automatically emotion-labeled according to the vote counts, so that corpus material meeting the requirements is obtained and stored locally;
1. Acquiring social news
Social news websites that carry emotion vote counts are crawled. The crawler first analyzes the site structure and extracts the news-related content to be crawled, such as the news link URL, from the page source code. After the URL of a news item is obtained, a request is sent with HttpClient, the response is received, and HtmlParser is used to parse the response to obtain the content of that news item, such as the title, body, and vote counts. Filtering is applied while crawling: if neither the body nor the title of a parsed page contains a keyword or an approximately synonymous word (given by the user), the news item is considered unrelated to the keyword and is discarded;
2. Corpus construction and data storage
MySQL is chosen for the corpus; tables are created and the crawled texts related to the configured keywords are stored in the corpus;
i) Create the news table with fields news link news_url, news title news_title, news body news_content, and news vote count news_vote; with the news link as the primary key, the content crawled by the crawler is stored in the news table;
ii) Create the keyword table with fields keyword sequence number keyword_id and keyword keyword; the user-defined keywords are read and stored in the keyword table with the keyword as the primary key;
iii) Create the index table with fields sequence number id, keyword keyword, news title news_title, news body news_content, and news vote count news_vote; the news items in the news table whose text contains a keyword are selected and stored in the index table, indexed by keyword;
3. Automatic labeling of emotion categories
The votes of each news item are obtained by parsing the response corresponding to its URL, and the automatic labeling scheme is set up on the basis of the vote counts:
a) A total-vote threshold N is user-defined; if the total number of votes of a news item is below N, the item is skipped;
b) A gap threshold M is user-defined; if the difference between the largest and the second-largest vote count of a news item is below M, the item does not take part in building the corpus;
c) If a news item passes both thresholds N and M, it is labeled with the class that received the most votes, completing the automatic labeling; after labeling, each news text is stored, together with its class, in the corresponding training-set table;
Two) Text preprocessing
The news texts in the training set are preprocessed, including word segmentation and stop-word removal. Segmentation is completed through the ICTCLAS2015 (Chinese Academy of Sciences) and Lucene segmentation interfaces. A user-defined stop-word list is allowed, and the default stop-word list may also be used, in order to filter out words that contribute little to class discrimination and carry little semantic information;
Three) Feature selection and feature weighting
Feature selection and weight setting are carried out on the preprocessed training-set news texts. Feature selection removes from the feature set those features that do not represent useful information well, in order to improve classification accuracy and reduce computational complexity; weight setting uses statistics of the news texts to assign a weight to each feature item;
1) Building the text vector space model
First, the training-set news texts are converted into a computer-readable format, turning unstructured text into structured text: each news document is converted into a vector whose every dimension holds a feature weight. A feature dictionary is built by feature selection; if the dictionary contains N terms, each news text is represented as an N-dimensional vector, and the weight of each dimension is calculated with the weighting method, yielding the text vector space model;
2) Feature selection
Features are extracted at three granularities: unigrams, bigrams, and topics; after selection, the features are stored in a HashMap. When text features are extracted, the chi-square statistic is used to measure the correlation between a word and a document class: the higher a word's chi-square value for a class, the better the word represents that class, i.e. the more class-discriminative information it carries. For multi-class problems, the chi-square value of a word is first computed for every class, and the maximum of these values is taken as the word's chi-square value over the whole corpus;
3) Feature weighting
A feature weight measures the importance, or discriminative power, of a feature item in the text representation. Weights are computed with TF-IDF, where TF is the term frequency, measuring how well the word describes the document content, and IDF is the inverse document frequency, measuring how well the word distinguishes between documents;
Four) Building the training model
Using the SVM training method, a kernel function applies a nonlinear transformation to the feature-weighted chi-square feature vectors, mapping the input nonlinear feature vectors into a high-dimensional feature space; an optimal linear separating hyperplane is then found in that space to separate the text classes, and the training model is established;
I) Training-set vector model
The feature dimensionality is user-defined; features are extracted according to the feature selection method and the weights under each granularity are set. If the dimensionality is too large, training becomes slow, the model overfits, and excessive noise is introduced; if it is too small, not enough textual information is carried. Either case harms classification performance, so training models are built under several feature dimensionalities and the classification accuracy under cross-validation, or on a held-out test set, is used to determine the optimal input dimensionality and establish the training-set vector model;
II) Input normalization
Because the raw values of the training-set vector model may span a range that is too large or too small, they are first rescaled to a proper range, i.e. input normalization, which speeds up both training and prediction;
III) Cross-validated parameter optimization
Grid search is used, allowing user-defined initial values and step sizes for the loss (penalty) parameter and the kernel gamma parameter; 5-fold cross-validation evaluates the quality of the training model for each combination of loss parameter and gamma, which avoids interference from random factors, and the optimal loss parameter and kernel parameter are obtained to establish the SVM model. In 5-fold cross-validation the initial sample is split into 5 subsamples; one subsample is held out as validation data, the other 4 are used for training, the procedure is repeated 5 times so that each subsample is used for validation exactly once, and the 5 results are averaged to give a single estimate;
Five) Prediction output
The web pages crawled in bulk by the crawler are input-normalized and loaded into the training vector model, the SVM model predicts the class of each text to be classified, and the predicted class label is output.
Beneficial effects: the present invention uses news texts crawled from the web as a training set, applies a text classification algorithm to build a training model, classifies news texts awaiting publication against the model, and performs automatic emotion labeling, thereby predicting the influence a news text to be published is likely to have on public sentiment. It builds a text emotion prediction system for the effect of social news on public emotion; predicting the public opinion a news item may trigger provides a useful tool for network security.
Brief description of the drawings
Fig. 1 is a flow chart of the preferred embodiment of the present invention.
Embodiment
In order to make the technical means, creative features, objects and advantages of the present invention easy to understand, the present invention is further explained below with reference to the specific illustrations.
Referring to Fig. 1, an Internet news acquisition and text emotion prediction system operates in the following specific steps:
One) Build the training set from news texts crawled from the web
Web pages are crawled in bulk by a crawler, and the news body and vote counts are parsed during crawling. At the same time the body is preprocessed and matched against the configured keywords to build a corpus, and the body is automatically emotion-labeled according to the vote counts, so that corpus material meeting the requirements is obtained and stored locally;
1. Acquiring social news
Social news websites that carry emotion vote counts are crawled. The crawler first analyzes the site structure, extracts the news-related content to be crawled (such as the news link URL) from the page source code, and filters out useless links such as advertisements. After the URL of a news item is obtained, a request is sent with HttpClient, the response is received, and HtmlParser is used to parse the response to obtain the content of that news item, such as the title, body, and vote counts. Filtering is applied while crawling: if neither the body nor the title of a parsed page contains a keyword or an approximately synonymous word (given by the user), the news item is considered unrelated to the keyword and is discarded, as in the sketch below;
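The acquisition step above is described in terms of Java components (HttpClient for requests, HtmlParser for parsing). Below is a minimal Python sketch of the same flow, using the requests and BeautifulSoup libraries as stand-ins; the CSS selectors and example keywords are assumptions, since the real selectors depend on the structure of the target news site.

```python
import requests
from bs4 import BeautifulSoup

KEYWORDS = {"地震", "救援"}  # user-defined keywords (hypothetical examples)

def fetch_news(url):
    """Fetch a news page and extract title, body and vote counts.

    The CSS selectors below are placeholders; real sites need
    site-specific parsing ("first analyze the site structure").
    """
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.select_one("h1").get_text(strip=True)
    body = " ".join(p.get_text(strip=True) for p in soup.select("div.article p"))
    votes = [int(v.get_text(strip=True)) for v in soup.select("span.vote-count")]
    return {"url": url, "title": title, "body": body, "votes": votes}

def is_relevant(news, keywords=KEYWORDS):
    """Keep a news item only if its title or body contains a configured keyword."""
    text = news["title"] + news["body"]
    return any(kw in text for kw in keywords)
```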
2. Corpus construction and data storage
MySQL is chosen for the corpus; tables are created and the crawled texts related to the configured keywords are stored in the corpus (see the sketch after this list);
i) Create the news table with fields news link news_url, news title news_title, news body news_content, and news vote count news_vote; the news link is the primary key, which prevents duplicate insertions, and the content crawled by the crawler is stored in the news table;
ii) Create the keyword table with fields keyword sequence number keyword_id and keyword keyword; the user-defined keywords are read and stored in the keyword table with the keyword as the primary key;
iii) Create the index table with fields sequence number id, keyword keyword, news title news_title, news body news_content, and news vote count news_vote; the news items in the news table whose text contains a keyword are selected and stored in the index table, indexed by keyword;
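A minimal sketch of the three tables, assuming the mysql-connector-python package and a local MySQL instance; the column types, credentials, and the table name keyword_index (chosen because INDEX is a reserved word in MySQL) are assumptions, while the field names follow the description above.

```python
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root",
                               password="secret", database="corpus")
cur = conn.cursor()

# News table: one row per crawled news item, keyed by its URL.
cur.execute("""
CREATE TABLE IF NOT EXISTS news (
    news_url     VARCHAR(512) PRIMARY KEY,
    news_title   VARCHAR(512),
    news_content TEXT,
    news_vote    VARCHAR(255)   -- per-class vote counts; storage format is an assumption
)""")

# Keyword table: the user-defined keywords, keyed by the keyword itself.
cur.execute("""
CREATE TABLE IF NOT EXISTS keyword (
    keyword_id INT,
    keyword    VARCHAR(128) PRIMARY KEY
)""")

# Index table: news items that contain a keyword, indexed by keyword.
cur.execute("""
CREATE TABLE IF NOT EXISTS keyword_index (
    id           INT AUTO_INCREMENT PRIMARY KEY,
    keyword      VARCHAR(128),
    news_title   VARCHAR(512),
    news_content TEXT,
    news_vote    VARCHAR(255),
    INDEX (keyword)
)""")
conn.commit()
```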
3. Automatic labeling of emotion categories
The votes of each news item are obtained by parsing the response corresponding to its URL, and the automatic labeling scheme is set up on the basis of the vote counts (a sketch of the rule follows this list):
a) A total-vote threshold N is user-defined; if the total number of votes of a news item is below N, the item is skipped;
b) A gap threshold M is user-defined; if the difference between the largest and the second-largest vote count of a news item is below M, the item does not take part in building the corpus;
c) If a news item passes both thresholds N and M, it is labeled with the class that received the most votes, completing the automatic labeling; after labeling, each news text is stored, together with its class, in the corresponding training-set table;
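A small sketch of the labeling rule in steps a)-c), assuming each news item carries one vote count per emotion class; the threshold values shown are illustrative, not values given by the patent.

```python
def auto_label(votes, total_threshold_n=100, gap_threshold_m=20):
    """Return the index of the emotion class with the most votes, or None.

    votes: list of per-class vote counts for one news item.
    The item is skipped when its total vote count is below N (step a),
    or when the gap between the largest and second-largest counts is
    below M, i.e. the item is ambiguous (step b).
    """
    if sum(votes) < total_threshold_n:
        return None
    ranked = sorted(votes, reverse=True)
    if len(ranked) > 1 and ranked[0] - ranked[1] < gap_threshold_m:
        return None
    return votes.index(max(votes))  # step c): label with the most-voted class

# Example: vote counts for (moved, angry, sad) -> labelled as class 1 ("angry")
print(auto_label([10, 180, 25]))  # -> 1
```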
Two) Text preprocessing
The news texts in the training set are preprocessed, including word segmentation and stop-word removal. Segmentation is completed through the ICTCLAS2015 (Chinese Academy of Sciences) and Lucene segmentation interfaces. A user-defined stop-word list is allowed, and the default stop-word list may also be used, in order to filter out words that contribute little to class discrimination and carry little semantic information (for example, common function words);
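The patent performs segmentation through the ICTCLAS2015 and Lucene interfaces; the sketch below uses the jieba segmenter as a stand-in (a different tool) to illustrate the same segmentation-plus-stop-word step, and the stop words listed are illustrative.

```python
import jieba

# Default stop words; a user-defined list can extend them (illustrative values).
STOP_WORDS = {"的", "了", "是", "在", "和"}

def preprocess(text, extra_stop_words=()):
    """Segment a news text and drop stop words with little class-discriminative value."""
    stop = STOP_WORDS | set(extra_stop_words)
    return [w for w in jieba.cut(text) if w.strip() and w not in stop]

print(preprocess("今天的新闻让很多人感动"))
```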
Three) Feature selection and feature weighting
Feature selection and weight setting are carried out on the preprocessed training-set news texts. Feature selection removes from the feature set those features that do not represent useful information well, in order to improve classification accuracy and reduce computational complexity; weight setting uses statistics of the news texts to assign a weight to each feature item;
1) Building the text vector space model
First, the training-set news texts are converted into a computer-readable format, turning unstructured text into structured text that the computer can process. This embodiment uses a vector space model, i.e. each news document is converted into a vector whose every dimension holds a feature weight;
Specifically: let a news document = {t1, w1; ...; tm, wm}, where tn is the n-th feature item and wn is the weight of the n-th dimension. Taking the present text as an example, a feature dictionary is built by feature selection; if the dictionary contains N terms, the news text is represented as an N-dimensional vector and the weight of each dimension is calculated with the weighting method, yielding the text vector space model;
2) Feature selection
Features are extracted at three granularities: unigrams, bigrams, and topics. Taking bigram-granularity extraction as an example: using a Skip-Bigrams bigram feature dictionary, the training-set news content is scanned word by word with a sliding window whose maximum gap is 2, forming word-pair fragments of length 2 that are then stored in a HashMap. This produces bigram features with clear emotional tendency. For the sentence "I | love | China" ("我|爱|中国"), the Skip-Bigram dictionary yields the pairs "I/love", "love/China", and "I/China", among which "love/China" is a semantically rich feature word. Over a large corpus many such co-occurrence relations are obtained, segmented, and stored in a HashMap, as sketched below;
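A sketch of one reading of the Skip-Bigram extraction described above: every ordered word pair whose positions in the segmented text differ by at most the window size is kept as a bigram feature and counted in a dictionary (the HashMap of the description).

```python
from collections import Counter

def skip_bigrams(words, max_gap=2):
    """Yield word pairs 'w1/w2' whose positions differ by at most max_gap."""
    for i, w1 in enumerate(words):
        for j in range(i + 1, min(i + 1 + max_gap, len(words))):
            yield f"{w1}/{words[j]}"

features = Counter(skip_bigrams(["我", "爱", "中国"]))
print(features)  # Counter({'我/爱': 1, '我/中国': 1, '爱/中国': 1})
```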
After the feature set is built, text features are extracted; this embodiment takes the chi-square statistic as an example. By comparing theoretical and observed counts, it tests whether the independence hypothesis holds and thereby measures the correlation between a word and a document class. Assuming word t and class c follow a chi-square distribution with one degree of freedom, the higher the chi-square value of a word for a class, the better the word represents that class, i.e. the more class-discriminative information it carries. The chi-square formula is:
chi2(t, c) = sum x (A x D - B x C)^2 / [(A + B) x (C + D) x (A + C) x (B + D)]
where A is the number of documents in class c that contain word t, B the number of documents outside class c that contain t, C the number of documents in class c that do not contain t, D the number of documents outside class c that do not contain t, and sum the total number of documents;
For multi-class problems, the chi-square value of word t is first computed for every class, and the maximum of these values is taken as t's chi-square value over the whole corpus;
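A sketch of the chi-square score defined above, computed from the document counts A, B, C, D, with the multi-class score taken as the maximum over all classes; the example counts are made up.

```python
def chi_square(a, b, c, d):
    """chi2(t, c) for one word/class pair.

    a: docs in class c containing t        b: docs outside c containing t
    c: docs in class c not containing t    d: docs outside c not containing t
    """
    total = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return total * (a * d - b * c) ** 2 / denom if denom else 0.0

def chi_square_multiclass(counts_per_class):
    """Take the maximum chi2 over all classes as the word's score on the corpus."""
    return max(chi_square(*counts) for counts in counts_per_class)

# word appears in 30 of 40 docs of one class and in 10 of 60 docs of the others
print(chi_square_multiclass([(30, 10, 10, 50)]))
```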
3) Feature weighting
A feature weight measures the importance, or discriminative power, of a feature item in the text representation. This embodiment computes weights with TF-IDF, where TF is the term frequency, measuring how well the word describes the document content, and IDF is the inverse document frequency, measuring how well the word distinguishes between documents;
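A minimal TF-IDF weighting sketch over the feature dictionary; the log-scaled, add-one-smoothed IDF used here is a common variant and an assumption, since the patent does not spell out the exact formula, and the example terms are illustrative.

```python
import math
from collections import Counter

def tfidf_vector(doc_words, feature_dict, doc_freq, num_docs):
    """Weight each feature-dictionary term by term frequency x inverse document frequency."""
    tf = Counter(doc_words)
    return [tf[t] * math.log((num_docs + 1) / (doc_freq.get(t, 0) + 1))
            for t in feature_dict]

feature_dict = ["爱/中国", "感动", "愤怒"]           # N-term feature dictionary (illustrative)
doc_freq = {"爱/中国": 3, "感动": 40, "愤怒": 25}     # documents containing each term
print(tfidf_vector(["感动", "爱/中国", "感动"], feature_dict, doc_freq, num_docs=100))
```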
Four) Building the training model
This embodiment takes the SVM training method as an example. Its basic mode is to apply a nonlinear transformation through a kernel function, mapping the input nonlinear feature vectors into a high-dimensional feature space, and then to find the optimal linear separating hyperplane in that space, so that the text classes are separated and the training model is established;
I) Training-set vector model
The feature dimensionality is user-defined; features are extracted according to the feature selection method, the weights under the bigram granularity are set, and the text vector space model is built. If the dimensionality is too large, training becomes slow, the model overfits, and excessive noise is introduced; if it is too small, not enough textual information is carried. Either case harms classification performance, so training models are built under several feature dimensionalities and the classification accuracy under cross-validation, or on a held-out test set, is used to determine the optimal input dimensionality and establish the training-set vector model;
II) Input normalization
Because the raw values of the training set may span a range that is too large or too small, the raw training data are first rescaled to a proper range, i.e. input normalization, which speeds up both training and prediction;
III) Cross-validated parameter optimization
Several important parameters in the SVM, such as the loss (penalty) parameter C and the kernel parameter gamma (G), must be set for the overall generalization performance to be good. This embodiment uses grid search, allowing user-defined initial values and step sizes for C and G, and evaluates the quality of the training model for each combination of C and G with 5-fold cross-validation, which avoids interference from random factors; the optimal C and G are obtained and used to establish the SVM model. In 5-fold cross-validation the initial sample is split into 5 subsamples; one subsample is held out as validation data, the other 4 are used for training, the procedure is repeated 5 times so that each subsample is used for validation exactly once, and the 5 results are averaged to give a single estimate;
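A sketch of steps I)-III) and the prediction step using scikit-learn as a stand-in: the feature vectors are rescaled to a fixed range, and the SVM penalty C (the loss parameter above) and the RBF kernel's gamma are tuned by grid search with 5-fold cross-validation; the data and the parameter grid are placeholders.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# X: TF-IDF weighted feature vectors of the training news, y: auto-labelled classes.
rng = np.random.default_rng(0)
X = rng.random((200, 50))          # placeholder for the real training vectors
y = rng.integers(0, 3, 200)        # placeholder for the real emotion labels

pipeline = Pipeline([
    ("scale", MinMaxScaler()),     # step II): rescale inputs to a proper range
    ("svm", SVC(kernel="rbf")),
])

param_grid = {                     # step III): custom values and steps for C and gamma
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(pipeline, param_grid, cv=5)  # 5-fold cross-validation
search.fit(X, y)

print(search.best_params_)
# step V): predict the emotion class of news texts to be published
new_vectors = rng.random((5, 50))
print(search.predict(new_vectors))
```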
Five) Prediction output
The web pages crawled in bulk by the crawler are input-normalized and loaded into the training vector model, the SVM model predicts the class of each text to be classified, and the predicted class label is output.
The general principle, principal features and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited to the above embodiments; the above embodiments and the description merely illustrate the principle of the invention, and various changes and improvements may be made without departing from the spirit and scope of the invention, all of which fall within the scope of the claimed invention. The scope of protection of the invention is defined by the appended claims and their equivalents.

Claims (4)

1. An Internet news acquisition and text emotion prediction system, characterized in that news texts crawled from the web are used as a training set, a text classification algorithm is applied to build a training model, news texts awaiting publication are classified against the training model, and automatic emotion labeling is performed, with the following specific steps:
One) Use news texts crawled from the web as the training set
Web pages are crawled in bulk by a crawler, and the news body and vote counts are parsed during crawling; at the same time the body is preprocessed and matched against the configured keywords to build a corpus, and the body is automatically emotion-labeled according to the vote counts, so that corpus material meeting the requirements is obtained and stored to local disk;
Two) Text preprocessing
The news texts in the training set are preprocessed, including word segmentation and stop-word removal; segmentation is completed through the ICTCLAS2015 (Chinese Academy of Sciences) and Lucene segmentation interfaces;
Three) Feature selection and feature weighting
Feature selection and weight setting are carried out on the preprocessed training-set news texts; feature selection removes from the feature set those features that do not represent useful information well, in order to improve classification accuracy and reduce computational complexity; weight setting uses statistics of the news texts to assign a weight to each feature item;
1) Build the text vector space model
First, the training-set news texts are converted into a computer-readable format, turning unstructured text into structured text; each news document is converted into a vector whose every dimension holds a feature weight; a feature dictionary is built by feature selection; if the dictionary contains N terms, each news text is represented as an N-dimensional vector and the weight of each dimension is calculated with the weighting method, yielding the text vector space model;
2) Feature selection
Features are extracted at three granularities: unigrams, bigrams, and topics; after selection the features are stored in a HashMap; when text features are extracted, the chi-square statistic measures the correlation between a word and a document class, and the higher a word's chi-square value for a class, the better the word represents that class, i.e. the more class-discriminative information it carries; for multi-class problems, the chi-square value of a word is first computed for every class, and the maximum of these values is taken as the word's chi-square value over the whole corpus;
3) Feature weighting
Feature weights are computed with TF-IDF, where TF is the term frequency, measuring how well the word describes the document content, and IDF is the inverse document frequency, measuring how well the word distinguishes between documents;
Four) Build the training model
Using the SVM training method, a kernel function applies a nonlinear transformation to the feature-weighted chi-square feature vectors, mapping the input nonlinear feature vectors into a high-dimensional feature space; an optimal linear separating hyperplane is then found in that space to separate the text classes, and the training model is established;
I) Training-set vector model
The feature dimensionality is user-defined; features are extracted according to the feature selection method, the weights under each granularity are set, and the training-set vector model is built;
II) Input normalization
Because the raw values of the training-set vector model may span a range that is too large or too small, they are first rescaled to a proper range, i.e. input normalization;
III) Cross-validated parameter optimization
Grid search is used, allowing user-defined initial values and step sizes for the loss parameter and the kernel gamma parameter; 5-fold cross-validation evaluates the quality of the training model for each combination of loss parameter and gamma, and the optimal loss and kernel parameters are obtained to establish the SVM model;
Five) Prediction output
The web pages crawled in bulk by the crawler are input-normalized and loaded into the training vector model, the SVM model predicts the class of each text to be classified, and the predicted class label is output.
2. The Internet news acquisition and text emotion prediction system according to claim 1, characterized in that the training set is constructed as follows:
1. Acquiring social news
Social news websites that carry emotion vote counts are crawled; the crawler first analyzes the site structure and extracts the news-related content to be crawled, such as the news link URL, from the page source code; after the URL of a news item is obtained, a request is sent with HttpClient, the response is received, and HtmlParser is used to parse the response to obtain the content of that news item, such as the title, body, and vote counts; filtering is applied while crawling: if neither the body nor the title of a parsed page contains a keyword or an approximately synonymous word, the news item is considered unrelated to the keyword and is discarded;
2. Corpus construction and data storage
MySQL is chosen for the corpus; tables are created and the crawled texts related to the configured keywords are stored in the corpus;
i) Create the news table with fields news link news_url, news title news_title, news body news_content, and news vote count news_vote; with the news link as the primary key, the content crawled by the crawler is stored in the news table;
ii) Create the keyword table with fields keyword sequence number keyword_id and keyword keyword; the user-defined keywords are read and stored in the keyword table with the keyword as the primary key;
iii) Create the index table with fields sequence number id, keyword keyword, news title news_title, news body news_content, and news vote count news_vote; the news items in the news table whose text contains a keyword are selected and stored in the index table, indexed by keyword;
3. Automatic labeling of emotion categories
The votes of each news item are obtained by parsing the response corresponding to its URL, the automatic labeling of emotion categories is set up on the basis of the vote counts, and after labeling each news text is stored, together with its class, in the corresponding training-set table.
3. The Internet news acquisition and text emotion prediction system according to claim 2, characterized in that the automatic labeling of emotion categories is carried out as follows:
a) A total-vote threshold N is user-defined; if the total number of votes of a news item is below N, the item is skipped;
b) A gap threshold M is user-defined; if the difference between the largest and the second-largest vote count of a news item is below M, the item does not take part in building the corpus;
c) If a news item passes both thresholds N and M, it is labeled with the class that received the most votes, completing the automatic labeling.
4. The Internet news acquisition and text emotion prediction system according to claim 1, characterized in that the 5-fold cross-validation comprises the following specific steps: the initial sample is split into 5 subsamples; one subsample is held out as validation data and the other 4 are used for training; the cross-validation is repeated 5 times so that each subsample is used for validation exactly once, and the 5 results are averaged to give a single estimate.
CN201710463295.XA 2017-06-19 2017-06-19 Internet news acquisition and text emotion prediction system Pending CN107315797A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710463295.XA CN107315797A (en) 2017-06-19 2017-06-19 Internet news acquisition and text emotion prediction system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710463295.XA CN107315797A (en) 2017-06-19 2017-06-19 Internet news acquisition and text emotion prediction system

Publications (1)

Publication Number Publication Date
CN107315797A true CN107315797A (en) 2017-11-03

Family

ID=60181878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710463295.XA Pending CN107315797A (en) Internet news acquisition and text emotion prediction system

Country Status (1)

Country Link
CN (1) CN107315797A (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885833A (en) * 2017-11-09 2018-04-06 山东师范大学 Method and system for rapidly detecting earth surface coverage change based on Web news text
CN108153853A (en) * 2017-12-22 2018-06-12 齐鲁工业大学 Chinese Concept Vectors generation method and device based on Wikipedia link structures
CN108363699A (en) * 2018-03-21 2018-08-03 浙江大学城市学院 A kind of netizen's school work mood analysis method based on Baidu's mhkc
CN108389082A (en) * 2018-03-15 2018-08-10 火烈鸟网络(广州)股份有限公司 A kind of game intelligence ranking method and system
CN108509629A (en) * 2018-04-09 2018-09-07 南京大学 Text emotion analysis method based on emotion dictionary and support vector machine
CN108595704A (en) * 2018-05-10 2018-09-28 成都信息工程大学 A kind of the emotion of news and classifying importance method based on soft disaggregated model
CN108829898A (en) * 2018-06-29 2018-11-16 无码科技(杭州)有限公司 HTML content page issuing time extracting method and system
CN109376244A (en) * 2018-10-25 2019-02-22 山东省通信管理局 A kind of swindle website identification method based on tagsort
CN109409537A (en) * 2018-09-29 2019-03-01 深圳市元征科技股份有限公司 A kind of Maintenance Cases classification method and device
CN109471942A (en) * 2018-11-07 2019-03-15 合肥工业大学 Chinese comment sensibility classification method and device based on evidential reasoning rule
CN109522927A (en) * 2018-10-09 2019-03-26 北京奔影网络科技有限公司 Sentiment analysis method and device for user message
CN109657057A (en) * 2018-11-22 2019-04-19 天津大学 A kind of short text sensibility classification method of combination SVM and document vector
CN109710825A (en) * 2018-11-02 2019-05-03 成都三零凯天通信实业有限公司 Webpage harmful information identification method based on machine learning
CN109783800A (en) * 2018-12-13 2019-05-21 北京百度网讯科技有限公司 Acquisition methods, device, equipment and the storage medium of emotion keyword
CN110298403A (en) * 2019-07-02 2019-10-01 郭刚 The sentiment analysis method and system of enterprise dominant in a kind of financial and economic news
TWI681308B (en) * 2018-11-01 2020-01-01 財團法人資訊工業策進會 Apparatus and method for predicting response of an article
CN110728139A (en) * 2018-06-27 2020-01-24 鼎复数据科技(北京)有限公司 Key information extraction model and construction method thereof
CN112100372A (en) * 2020-08-20 2020-12-18 西南电子技术研究所(中国电子科技集团公司第十研究所) Head news prediction classification method
CN112131384A (en) * 2020-08-27 2020-12-25 科航(苏州)信息科技有限公司 News classification method and computer-readable storage medium
WO2020258481A1 (en) * 2019-06-28 2020-12-30 平安科技(深圳)有限公司 Method and apparatus for intelligently recommending personalized text, and computer-readable storage medium
CN112201225A (en) * 2020-09-30 2021-01-08 北京大米科技有限公司 Corpus obtaining method and device, readable storage medium and electronic equipment
CN112819023A (en) * 2020-06-11 2021-05-18 腾讯科技(深圳)有限公司 Sample set acquisition method and device, computer equipment and storage medium
CN113127595A (en) * 2021-04-26 2021-07-16 数库(上海)科技有限公司 Method, device, equipment and storage medium for extracting viewpoint details of research and report abstract

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104331506A (en) * 2014-11-20 2015-02-04 北京理工大学 Multiclass emotion analyzing method and system facing bilingual microblog text
CN104572613A (en) * 2013-10-21 2015-04-29 富士通株式会社 Data processing device, data processing method and program
CN105183717A (en) * 2015-09-23 2015-12-23 东南大学 OSN user emotion analysis method based on random forest and user relationship
CN105589941A (en) * 2015-12-15 2016-05-18 北京百分点信息科技有限公司 Emotional information detection method and apparatus for web text
CN105824922A (en) * 2016-03-16 2016-08-03 重庆邮电大学 Emotion classifying method fusing intrinsic feature and shallow feature

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572613A (en) * 2013-10-21 2015-04-29 富士通株式会社 Data processing device, data processing method and program
CN104331506A (en) * 2014-11-20 2015-02-04 北京理工大学 Multiclass emotion analyzing method and system facing bilingual microblog text
CN105183717A (en) * 2015-09-23 2015-12-23 东南大学 OSN user emotion analysis method based on random forest and user relationship
CN105589941A (en) * 2015-12-15 2016-05-18 北京百分点信息科技有限公司 Emotional information detection method and apparatus for web text
CN105824922A (en) * 2016-03-16 2016-08-03 重庆邮电大学 Emotion classifying method fusing intrinsic feature and shallow feature

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘泽光: "Research on Key Technologies of Online Public Opinion Analysis", China Master's Theses Full-text Database, Information Science and Technology *
叶升阳: "Research on Sentiment Orientation Analysis Based on Online Comments", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885833A (en) * 2017-11-09 2018-04-06 山东师范大学 Method and system for rapidly detecting earth surface coverage change based on Web news text
CN107885833B (en) * 2017-11-09 2020-05-05 山东师范大学 Method and system for rapidly detecting earth surface coverage change based on Web news text
CN108153853A (en) * 2017-12-22 2018-06-12 齐鲁工业大学 Chinese Concept Vectors generation method and device based on Wikipedia link structures
CN108153853B (en) * 2017-12-22 2022-02-01 齐鲁工业大学 Chinese concept vector generation method and device based on Wikipedia link structure
CN108389082A (en) * 2018-03-15 2018-08-10 火烈鸟网络(广州)股份有限公司 A kind of game intelligence ranking method and system
CN108389082B (en) * 2018-03-15 2021-07-06 火烈鸟网络(广州)股份有限公司 Intelligent game rating method and system
CN108363699A (en) * 2018-03-21 2018-08-03 浙江大学城市学院 A kind of netizen's school work mood analysis method based on Baidu's mhkc
CN108509629A (en) * 2018-04-09 2018-09-07 南京大学 Text emotion analysis method based on emotion dictionary and support vector machine
CN108509629B (en) * 2018-04-09 2022-05-13 南京大学 Text emotion analysis method based on emotion dictionary and support vector machine
CN108595704A (en) * 2018-05-10 2018-09-28 成都信息工程大学 A kind of the emotion of news and classifying importance method based on soft disaggregated model
CN110728139A (en) * 2018-06-27 2020-01-24 鼎复数据科技(北京)有限公司 Key information extraction model and construction method thereof
CN108829898B (en) * 2018-06-29 2020-11-20 无码科技(杭州)有限公司 HTML content page release time extraction method and system
CN108829898A (en) * 2018-06-29 2018-11-16 无码科技(杭州)有限公司 HTML content page issuing time extracting method and system
CN109409537A (en) * 2018-09-29 2019-03-01 深圳市元征科技股份有限公司 A kind of Maintenance Cases classification method and device
CN109522927A (en) * 2018-10-09 2019-03-26 北京奔影网络科技有限公司 Sentiment analysis method and device for user message
CN109376244A (en) * 2018-10-25 2019-02-22 山东省通信管理局 A kind of swindle website identification method based on tagsort
TWI681308B (en) * 2018-11-01 2020-01-01 財團法人資訊工業策進會 Apparatus and method for predicting response of an article
CN109710825A (en) * 2018-11-02 2019-05-03 成都三零凯天通信实业有限公司 Webpage harmful information identification method based on machine learning
CN109471942B (en) * 2018-11-07 2021-09-07 合肥工业大学 Chinese comment emotion classification method and device based on evidence reasoning rule
CN109471942A (en) * 2018-11-07 2019-03-15 合肥工业大学 Chinese comment sensibility classification method and device based on evidential reasoning rule
CN109657057A (en) * 2018-11-22 2019-04-19 天津大学 A kind of short text sensibility classification method of combination SVM and document vector
CN109783800B (en) * 2018-12-13 2024-04-12 北京百度网讯科技有限公司 Emotion keyword acquisition method, device, equipment and storage medium
CN109783800A (en) * 2018-12-13 2019-05-21 北京百度网讯科技有限公司 Acquisition methods, device, equipment and the storage medium of emotion keyword
WO2020258481A1 (en) * 2019-06-28 2020-12-30 平安科技(深圳)有限公司 Method and apparatus for intelligently recommending personalized text, and computer-readable storage medium
CN110298403B (en) * 2019-07-02 2023-12-12 北京金融大数据有限公司 Emotion analysis method and system for enterprise main body in financial news
CN110298403A (en) * 2019-07-02 2019-10-01 郭刚 The sentiment analysis method and system of enterprise dominant in a kind of financial and economic news
CN112819023A (en) * 2020-06-11 2021-05-18 腾讯科技(深圳)有限公司 Sample set acquisition method and device, computer equipment and storage medium
CN112819023B (en) * 2020-06-11 2024-02-02 腾讯科技(深圳)有限公司 Sample set acquisition method, device, computer equipment and storage medium
CN112100372A (en) * 2020-08-20 2020-12-18 西南电子技术研究所(中国电子科技集团公司第十研究所) Head news prediction classification method
CN112131384A (en) * 2020-08-27 2020-12-25 科航(苏州)信息科技有限公司 News classification method and computer-readable storage medium
CN112201225A (en) * 2020-09-30 2021-01-08 北京大米科技有限公司 Corpus obtaining method and device, readable storage medium and electronic equipment
CN112201225B (en) * 2020-09-30 2024-02-02 北京大米科技有限公司 Corpus acquisition method and device, readable storage medium and electronic equipment
CN113127595B (en) * 2021-04-26 2022-08-16 数库(上海)科技有限公司 Method, device, equipment and storage medium for extracting viewpoint details of research and report abstract
CN113127595A (en) * 2021-04-26 2021-07-16 数库(上海)科技有限公司 Method, device, equipment and storage medium for extracting viewpoint details of research and report abstract

Similar Documents

Publication Publication Date Title
CN107315797A (en) Internet news acquisition and text emotion prediction system
Gupta et al. Study of Twitter sentiment analysis using machine learning algorithms on Python
Ahuja et al. The impact of features extraction on the sentiment analysis
CN104951548B (en) A kind of computational methods and system of negative public sentiment index
WO2021051518A1 (en) Text data classification method and apparatus based on neural network model, and storage medium
CN107193801A (en) A kind of short text characteristic optimization and sentiment analysis method based on depth belief network
CN110134792B (en) Text recognition method and device, electronic equipment and storage medium
CN109255012B (en) Method and device for machine reading understanding and candidate data set size reduction
CN111767716A (en) Method and device for determining enterprise multilevel industry information and computer equipment
CN110795525A (en) Text structuring method and device, electronic equipment and computer readable storage medium
CN112256861B (en) Rumor detection method based on search engine return result and electronic device
CN109670014A (en) A kind of Authors of Science Articles name disambiguation method of rule-based matching and machine learning
WO2012096388A1 (en) Unexpectedness determination system, unexpectedness determination method, and program
KR20200007713A (en) Method and Apparatus for determining a topic based on sentiment analysis
CN115796181A (en) Text relation extraction method for chemical field
CN108228612B (en) Method and device for extracting network event keywords and emotional tendency
CN115017303A (en) Method, computing device and medium for enterprise risk assessment based on news text
CN110134777A (en) Problem De-weight method, device, electronic equipment and computer readable storage medium
Chun et al. Detecting Political Bias Trolls in Twitter Data.
CN115329085A (en) Social robot classification method and system
Lim et al. Examining machine learning techniques in business news headline sentiment analysis
Asha et al. Fake news detection using n-gram analysis and machine learning algorithms
Al Mostakim et al. Bangla content categorization using text based supervised learning methods
Wambsganss et al. Improving Explainability and Accuracy through Feature Engineering: A Taxonomy of Features in NLP-based Machine Learning.
Anjum et al. Exploring humor in natural language processing: a comprehensive review of JOKER tasks at CLEF symposium 2023

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171103

RJ01 Rejection of invention patent application after publication