CN107045497A

CN107045497A - Rapid sentiment analysis system and method for news text content

Info

Publication number: CN107045497A
Application number: CN201710309000.3A
Authority: CN
Inventors: 余军; 卢品吟; 刘盾; 张汨
Original assignee: Chengdu Hua Seiun Technology Co Ltd
Current assignee: Chengdu Hua Seiun Technology Co Ltd
Priority date: 2017-05-04
Filing date: 2017-05-04
Publication date: 2017-08-15

Abstract

The invention discloses a rapid sentiment analysis system and method for news text content, comprising the following modules: a news acquisition module, which crawls news documents from news portals, forums, and microblogs and performs preliminary deduplication of the text; a news text preprocessing module, which performs preliminary text feature processing, including word segmentation, stop-word removal, and additional marking of negation phrases; a news text sentiment calculation module, which performs TextRank calculation and per-word sentiment calculation, normalizes the calculated values, and combines them to obtain the sentiment index of the document; and a data storage module, which stores the calculated results. The invention can quickly calculate sentiment indices in scenarios with large volumes of public opinion.

Description

A rapid sentiment analysis system and method for news text content

Technical field

The present invention relates to the field of domestic news, and in particular to a rapid sentiment analysis system and method for news text content.

Background art

With the rapid development of the internet, online public opinion has an ever greater influence on society. Whether for government monitoring of online public opinion or for enterprises conducting brand communication and brand public relations, an urgent problem in public opinion analysis is how to analyze the sentiment orientation of public opinion rapidly when its volume is large, so as to provide timely decision support and opinion guidance and to respond to a quickly changing public opinion environment. Conventional sentiment analysis requires complex analysis and cannot achieve low-latency processing when faced with large volumes of public opinion.

Summary of the invention

The object of the present invention is to overcome the deficiencies of the prior art and to provide a news user sentiment analysis system and a method for quickly calculating sentiment indices in scenarios with large volumes of public opinion.

The purpose of the present invention is achieved through the following technical solutions：

A rapid sentiment analysis system for news text content, comprising the following modules:

News acquisition module: crawls news documents from news portals, forums, and microblogs, and performs preliminary deduplication of the text;

News text preprocessing module: performs preliminary text feature processing, including word segmentation, stop-word removal, and additional marking of negation phrases;

News text sentiment calculation module: performs TextRank calculation and per-word sentiment calculation, normalizes the calculated values, and combines them to obtain the sentiment index of the document;

Data storage module: stores the calculated results.
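The preprocessing module's stop-word removal and negation marking could be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes the text has already been segmented into tokens, and the names `STOP_WORDS`, `NEGATION_WORDS`, and `preprocess` as well as the English word lists are hypothetical.

```python
# Hypothetical word lists; a real system would load domain-specific ones.
STOP_WORDS = {"the", "a", "is", "of"}
NEGATION_WORDS = {"not", "no", "never"}


def preprocess(tokens):
    """Drop stop words and flag the first kept token after a negation cue.

    Returns a list of (word, negated) pairs. How the negation mark is used
    later is not specified by the source; here it is only recorded.
    """
    result = []
    negate_next = False
    for token in tokens:
        word = token.lower()
        if word in NEGATION_WORDS:
            # Remember the cue and mark the next content word instead.
            negate_next = True
            continue
        if word in STOP_WORDS:
            continue
        result.append((word, negate_next))
        negate_next = False
    return result
```

Recording the mark as a boolean next to each word keeps the preprocessing step independent of whatever the sentiment calculation later does with negated words.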

A rapid sentiment analysis method for news text content, comprising the following steps:

S01: Crawl news from internet news portals, forums, and microblogs, and deduplicate the text;

S02: Extract text information, mainly the source, author, title, and body text;

S03: Segment the title and body text into words and remove stop words;

S04: Calculate the weight of each word using TextRank;

S05: At the same time, obtain the sentiment orientation and sentiment intensity S of each word from a sentiment dictionary;

S06: Finally, multiply each word's weight by its sentiment intensity, sum the products, and normalize the result to obtain the sentiment index of the document.
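The source does not say how the deduplication in step S01 is carried out. One simple possibility, shown here purely as an assumption, is to hash the whitespace-normalized text and keep only the first document for each hash; the function name `deduplicate` is hypothetical.

```python
import hashlib
import re


def deduplicate(documents):
    """Keep the first occurrence of each text, ignoring whitespace differences.

    Assumption: exact-match deduplication on normalized text. Near-duplicate
    detection would need a different technique.
    """
    seen = set()
    unique = []
    for text in documents:
        # Collapse runs of whitespace so trivially reformatted copies match.
        normalized = re.sub(r"\s+", " ", text).strip()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Storing digests rather than full texts keeps memory use small when many documents are crawled.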

Further, calculating the weight of each word using TextRank in step S04 specifically includes:

Additionally weighting the words of the title, with the weighting formula wt = n × wd, where wt is the weight of a title word, wd is the weight of a body-text word with a value range of [0, 100], and n is the weighting factor with a value range of [2, 10];

Filtering the words by part of speech, retaining only nouns and verbs;

Calculating the weight of each word using the TextRank algorithm;

Normalizing the calculation result with the formula wt = wt / (max(wt) + 1), where wt is the word weight calculated by TextRank and max(wt) is the largest weight in the document.
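The weighting steps above can be sketched as follows. This is one reading under stated assumptions, not the patent's implementation: the input words are assumed to be already segmented and filtered to nouns and verbs, the title boost wt = n × wd is assumed to be applied to the TextRank scores before normalization (the source does not fix the order), and the function name `textrank_weights` and the parameters `window`, `damping`, and `iterations` are hypothetical.

```python
from collections import defaultdict


def textrank_weights(body_words, title_words, n=2, window=3,
                     damping=0.85, iterations=50):
    """Return word weights: TextRank score, title boost, then w / (max(w) + 1)."""
    words = list(title_words) + list(body_words)

    # Undirected co-occurrence graph: words within `window` positions are linked.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Standard TextRank iteration on the unweighted graph.
    vocab = sorted(set(words))
    scores = {w: 1.0 for w in vocab}
    for _ in range(iterations):
        new_scores = {}
        for w in vocab:
            rank = sum(scores[v] / len(neighbors[v]) for v in neighbors[w])
            new_scores[w] = (1 - damping) + damping * rank
        scores = new_scores

    # Title boost (wt = n * wd), assumed to apply to the TextRank score.
    title_set = set(title_words)
    boosted = {w: (n * s if w in title_set else s) for w, s in scores.items()}

    # Normalization from the text: wt = wt / (max(wt) + 1).
    top = max(boosted.values(), default=0.0)
    return {w: s / (top + 1) for w, s in boosted.items()}
```

Because the denominator is max(wt) + 1, every normalized weight is strictly below 1, and the division is defined even when all weights are zero.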

Further, the sentiment index of the document is calculated from the words in step S06 as follows:

Sd = ∑(wt × St) × C/n

where Sd is the sentiment index of the document, wt is the weight of each word, St is the sentiment index of each word with a value range of [-100, 100], C is a constant with a value range of [1, 5], and n is the number of words in the document.
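The formula Sd = ∑(wt × St) × C/n translates directly into a few lines of code. The sketch below assumes that n counts the distinct weighted words and that words missing from the sentiment dictionary contribute 0; both are assumptions, and the function name `document_sentiment_index` is hypothetical.

```python
def document_sentiment_index(weights, sentiments, c=1.0):
    """Compute Sd = sum(wt * St) * C / n.

    weights:    word -> normalized weight wt
    sentiments: word -> sentiment index St in [-100, 100]
    c:          the constant C, in [1, 5] according to the text
    """
    n = len(weights)  # assumption: n is the number of distinct weighted words
    if n == 0:
        return 0.0
    # Words absent from the sentiment dictionary are treated as neutral (0).
    total = sum(wt * sentiments.get(word, 0.0) for word, wt in weights.items())
    return total * c / n
```

For example, `document_sentiment_index({"good": 0.8, "bad": 0.5, "news": 0.2}, {"good": 60, "bad": -40}, c=2)` evaluates (48 - 20 + 0) × 2 / 3.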

The beneficial effects of the invention are as follows: the invention obtains the corresponding sentiment index analysis result through simple text processing and calculation alone, which solves the problem of low-latency processing under large volumes of public opinion.

Brief description of the drawings

Fig. 1 is a structural diagram of the system of the present invention;

Fig. 2 is a flow chart of the method of the present invention.

Detailed description of the embodiments

The technical solution of the present invention is described in further detail below with reference to the accompanying drawings, but the scope of protection of the present invention is not limited to the following description.

As shown in Fig. 1:

Data storage module: stores the calculated results.

As shown in Fig. 2:

S03: Segment the title and body text into words and remove stop words;

S04: Calculate the weight of each word using TextRank;

S05: At the same time, obtain the sentiment orientation and sentiment intensity of each word from a sentiment dictionary;

In operation, the text is first crawled and deduplicated, and the text information is extracted, including the source, date, title, body text, and author. The title and body text are then segmented into words and processed in two ways: first, TextRank is used to calculate the weight of each word, and the weights are normalized; second, the sentiment orientation and sentiment intensity S of each word are obtained by dictionary lookup (the specific numerical range of the sentiment intensity S is not given in this passage).

Calculating the weight of each word using TextRank in step S04 specifically includes:

Additionally weighting the words of the title, with the weighting formula wt = n × wd, where wt is the weight of a title word, wd is the weight of a body-text word with a value range of [0, 100], and n is the weighting factor with a value range of [2, 10];

Calculating the weight of each word using the TextRank algorithm;

The sentiment index of the document is calculated from the words in step S06 as follows:

Sd = ∑(wt × St) × C/n

where Sd is the sentiment index of the document, wt is the weight of each word, St is the sentiment index of each word with a value range of [-100, 100], C is a constant with a value range of [1, 5], and n is the number of words in the document.
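As a worked illustration of this formula, take hypothetical values that do not come from the source: a document with n = 3 words whose weights wt are 0.8, 0.5, and 0.2, whose word sentiment indices St are 60, -40, and 10, and a constant C = 2.

```latex
\begin{aligned}
\sum (w_t \times S_t) &= 0.8 \times 60 + 0.5 \times (-40) + 0.2 \times 10 = 48 - 20 + 2 = 30 \\
S_d &= 30 \times \frac{C}{n} = 30 \times \frac{2}{3} = 20
\end{aligned}
```

The positive result reflects that the highest-weighted word carries positive sentiment, which outweighs the negative word with a lower weight.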

The above is only a preferred embodiment of the present invention. It should be understood that the invention is not limited to the form disclosed herein, which should not be regarded as excluding other embodiments; the invention may be used in various other combinations, modifications, and environments, and may be modified within the scope contemplated herein through the above teachings or the skill or knowledge of the relevant art. Changes and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the scope of protection of the appended claims.

Claims

1. A rapid sentiment analysis system for news text content, characterized by comprising the following modules:

Data storage module: stores the calculated results.

2. A rapid sentiment analysis method for news text content, characterized by comprising the following steps:

S03: Segment the title and body text into words and remove stop words;

S04: Calculate the weight of each word using TextRank;

3. The rapid sentiment analysis method for news text content according to claim 2, characterized in that calculating the weight of each word using TextRank in step S04 specifically includes:

Additionally weighting the words of the title, with the weighting formula wt = n × wd, where wt is the weight of a title word, wd is the weight of a body-text word with a value range of [0, 100], and n is the weighting factor with a value range of [2, 10];

Calculating the weight of each word using the TextRank algorithm;

Normalizing the calculation result with the formula wt = wt / (max(wt) + 1), where wt is the word weight calculated by TextRank and max(wt) is the largest weight in the document.

4. The rapid sentiment analysis method for news text content according to claim 2, characterized in that the sentiment index of the document is calculated from the words in step S06 as follows:

Sd = ∑(wt × St) × C/n

where Sd is the sentiment index of the document, wt is the weight of each word, St is the sentiment index of each word with a value range of [-100, 100], C is a constant with a value range of [1, 5], and n is the number of words in the document.