CN108197259A

CN108197259A - A method for online big-data detection of network topics

Info

Publication number: CN108197259A
Application number: CN201711489608.5A
Authority: CN
Inventors: 马永军; 柴梦瑶; 刘洋
Original assignee: Tianjin University of Science and Technology
Current assignee: Tianjin University of Science and Technology
Priority date: 2017-12-30
Filing date: 2017-12-30
Publication date: 2018-06-22
Anticipated expiration: 2037-12-30
Also published as: CN108197259B

Abstract

The present invention relates to a method for online big-data detection of network topics. Its main technical features are: crawling big-data network text online; extracting and representing text features; and performing topic detection with the Single-Pass clustering algorithm using multiple similarity factors. The design is sound: building on the existing Single-Pass algorithm, the invention analyzes text features, applies several similarity measures, assigns each a weight factor, and obtains the text similarity as their weighted combination. This lowers the miss rate, the false detection rate, and the cost function value, and noticeably improves the clustering result.

Description

A method for online big-data detection of network topics

Technical field

The invention belongs to the technical field of computer data modeling, and in particular relates to a method for online big-data detection of network topics.

Background art

Compared with traditional channels of information transmission, the network is more open and more virtual. With the arrival of the we-media era, differing viewpoints, remarks, and emotions ferment and amplify in cyberspace, and some of them develop into network public opinion events. Against the background of China's current effort to build itself into a cyber power, Internet public opinion analysis receives close attention. As for the development of public opinion research, research abroad dates from the mid-19th century, while research in China began somewhat later. Network public opinion refers to the differing emotions, attitudes, and opinions that the public expresses and spreads via the Internet.

Topic discovery is a key link in Internet public opinion analysis. Current research concentrates on the choice of text clustering algorithm, for example partition-based, hierarchy-based, density-based, and grid-based clustering algorithms; the most commonly used is Single-Pass clustering. Because the Single-Pass clustering algorithm uses a single similarity measure and does not take the structural features of text into account, clustering accuracy suffers, and the miss rate and error rate are high.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by providing a method for online big-data detection of network topics that solves the high miss rate and error rate of existing clustering algorithms.

The invention solves its technical problem with the following technical solution:

A method for online big-data detection of network topics, comprising the following steps:

Step 1: crawl big-data network text online;

Step 2: extract and represent text features;

Step 3: perform topic detection using the Single-Pass clustering algorithm with multiple similarity factors.

Further, step 1 is implemented as follows: a Hadoop distributed cluster is built, CentOS and a distributed topic crawler are installed on every machine, and a big-data acquisition platform is thereby set up.

Further, step 2 is implemented through the following steps:

(1) Text preprocessing: features are selected with the document frequency technique, which picks from the candidate set of feature words those that characterize the text most strongly and uses them as feature terms; the classical weighting method TF-IDF then assigns a weight to each selected feature term;

(2) Building the text representation model: text is represented with the vector space model, whose mathematical expression is:

VSM(d) = <(t₁,ω₁); (t₂,ω₂); (t₃,ω₃); …; (t_n,ω_n)>

where n is the number of feature terms, t_i (1≤i≤n) is a feature term of the text, and ω_i (1≤i≤n) is the weight of feature term t_i. Building the VSM represents each text as a vector in an n-dimensional space.

Further, the multiple similarity factors chosen in step 3 include a time factor, a place factor, and a source factor.

Further, the time factor is:

In the formula, Sim_time(d_i,d_j) denotes the time distance between documents d_i and d_j, t = |t_i - t_j|, and m is an automatically set time interval.

Further, the place factor is:

In the formula, Sim_place(p_i,p_j) is the similarity of two place names, deep(p_i∩p_j) is the common depth of place names p_i and p_j from the root node of the geographic tree, deep(p_i) is the depth of place name p_i from the root node, and deep(p_j) is the depth of place name p_j from the root node.

Further, the source factor is:

where PR(p) denotes the PR value of website p, d is the damping coefficient, a is the weighting factor that depends on whether the out-linking site is an off-site link, V1 is the set of out-linking pages that do not belong to the same website as page p, C_i is the total number of outgoing links of page i, V2 is the set of out-linking pages that belong to the same website as page p, and C_j is the total number of outgoing links of page j.

Further, the damping coefficient d is 0.85 and the weighting factor a is 0.75.

Further, topic detection in step 3 comprises the following steps:

(1) Input a news document d;

(2) Determine whether d is the first news report; if so, go to step (3), otherwise go to step (4);

(3) Create a new topic and add text d to it, then go to step (7);

(4) Preprocess text d and build its vector space model;

(5) Compute the similarity between document d and each text of the existing topics, record the maximum similarity S_max, and find the corresponding topic class T;

(6) If the maximum similarity S_max exceeds the preset threshold T_c, cluster document d into topic class T; otherwise go to step (3);

(7) This round of clustering ends.

The advantages and positive effects of the invention are:

Building on the existing Single-Pass algorithm, the invention analyzes text features, applies several similarity measures, assigns each a weight factor, and obtains the text similarity as their weighted combination. This lowers the miss rate, the false detection rate, and the cost function value, and noticeably improves the clustering result.

Brief description of the drawings

Fig. 1 is the detection flow chart of the invention;

Fig. 2 shows the false detection rate versus miss rate curves of the invention and several detection methods;

Fig. 3 compares the invention and several detection methods on each evaluation metric.

Detailed description of the embodiments

Embodiments of the invention are further described below with reference to the drawings.

A method for online big-data detection of network topics, as shown in Fig. 1, comprises the following steps:

Step 1: crawl big-data network text online.

In this step, a Hadoop distributed cluster is built, 64-bit CentOS and a distributed topic crawler are installed on every machine, and a big-data acquisition platform is thereby set up.

Step 2: extract and represent text features.

In this step, the word-segmented data undergo feature selection and weighting, and a vector space model is built to convert the text into structured data that can be computed on. The specific method is:

(1) Text preprocessing: features are selected with the document frequency technique, which picks from the candidate set of feature words those that characterize the text most strongly and uses them as feature terms; the classical weighting method TF-IDF then assigns a weight to each selected feature term.

(2) Building the text representation model: text is represented with the vector space model (VSM), whose mathematical expression is:

VSM(d) = <(t₁,ω₁); (t₂,ω₂); (t₃,ω₃); …; (t_n,ω_n)>
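The two sub-steps above can be illustrated with a minimal sketch. It assumes documents are already word-segmented into token lists; the cutoff `min_df`, the function names, and the particular TF-IDF variant (relative term frequency times the natural log of inverse document frequency) are illustrative choices, not details given in the patent.

```python
import math
from collections import Counter

def select_features(docs, min_df=2):
    """Keep terms whose document frequency is at least min_df (hypothetical cutoff)."""
    df = Counter(term for doc in docs for term in set(doc))
    return sorted(t for t, n in df.items() if n >= min_df)

def tfidf_vector(doc, features, docs):
    """Return the weights (w_1..w_n) of VSM(d) for one tokenized, non-empty document."""
    tf = Counter(doc)
    n_docs = len(docs)
    weights = []
    for term in features:
        df = sum(1 for d in docs if term in d)
        idf = math.log(n_docs / df)  # df >= 1 because features are drawn from docs
        weights.append(tf[term] / len(doc) * idf)
    return weights

# Toy corpus (made-up tokens) standing in for segmented news text.
docs = [
    ["flood", "tianjin", "rescue"],
    ["flood", "rescue", "team"],
    ["stock", "market", "tianjin"],
]
features = select_features(docs)
# Pair each feature term t_i with its weight w_i, mirroring VSM(d).
vsm = list(zip(features, tfidf_vector(docs[0], features, docs)))
```

Each entry of `vsm` is a (t_i, ω_i) pair, so the list has the same shape as the VSM(d) expression; a term that occurs in every document receives weight 0 under this variant.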

The invention uses the Single-Pass clustering algorithm and selects time, place, and source as the similarity factors, where:

(1) Time factor: the concept of time distance is introduced.

Sim_time(d_i,d_j) denotes the time distance between documents d_i and d_j, where t = |t_i - t_j| is the difference between the times of the two documents and m is an automatically set time interval.

(2) Place factor: using the "Chinese Gazetteer" (《中国地名录》) and the "National Place Name and Feature Dictionary" data provided by Data Hall, a geographic tree is built with China as the root node; the subordination relationships between place names are used to represent each place name as a node in the tree.

Computing the similarity of two place names in the geographic tree must take into account that each child node is a branch of its parent node. Three quantities are involved: the distance between the two child nodes, the common depth of the two child nodes, and the depth of each node from the root node. Considering the influence of these three factors on the similarity of two place names, the calculation formula is defined as follows:

Sim_place(p_i,p_j) is the similarity of two place names, where deep(p_i∩p_j) is the common depth of place names p_i and p_j from the root node of the geographic tree, deep(p_i) is the depth of place name p_i from the root node, and deep(p_j) is the depth of place name p_j from the root node.
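The three depth quantities named above can be computed from a parent map, as the following sketch shows. The tiny tree, the names `PARENT`, `deep`, and `common_deep`, and the convention that the root has depth 0 are all assumptions made for illustration; the sketch stops at the depths and does not attempt the similarity formula itself.

```python
# Hypothetical parent map: each place name points to the place it belongs to.
PARENT = {
    "China": None,
    "Tianjin": "China",
    "Hebei": "China",
    "Binhai": "Tianjin",
    "Heping": "Tianjin",
    "Shijiazhuang": "Hebei",
}

def path_to_root(place):
    """Return the list of nodes from `place` up to the root, inclusive."""
    path = []
    while place is not None:
        path.append(place)
        place = PARENT[place]
    return path

def deep(place):
    """Depth of a place name from the root; the root has depth 0 (assumed convention)."""
    return len(path_to_root(place)) - 1

def common_deep(p_i, p_j):
    """Depth of the deepest node shared by both root paths, i.e. deep(p_i ∩ p_j)."""
    ancestors_i = set(path_to_root(p_i))
    for node in path_to_root(p_j):  # walks upward, so the first hit is the deepest
        if node in ancestors_i:
            return deep(node)
    raise ValueError("places are not in the same tree")
```

With this tree, two districts of the same city share a deeper common node than a district and a city in another province, which is the property a place similarity built on these depths relies on.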

(3) Source factor: the PR value of a page is computed with an improved PageRank algorithm, using the following formula:

PR(p) denotes the PR value of website p. d is the damping coefficient, usually set to 0.85. a is the weighting factor that depends on whether the out-linking site is an off-site link; compared with on-site pages, off-site pages better reflect the importance of the website a page belongs to, so a is set to 0.75. V1 is the set of out-linking pages that do not belong to the same website as page p, C_i is the total number of outgoing links of page i, V2 is the set of out-linking pages that belong to the same website as page p, and C_j is the total number of outgoing links of page j.
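The formula itself is not reproduced in this text, so the following is only a sketch of one form consistent with the symbol definitions: pages linking to p from other websites (V1) contribute with weight a, and pages from the same website (V2) contribute with weight 1 - a. That combination, the constant term 1 - d, the reading of V1 and V2 as pages that link to p, and the function name are all assumptions.

```python
def site_weighted_pagerank(links, site_of, d=0.85, a=0.75, iterations=50):
    """Iterate PR values with off-site in-links weighted by a and on-site by 1 - a.

    links:   dict mapping each page to the list of pages it links to
             (every page must appear as a key, possibly with an empty list)
    site_of: dict mapping each page to its website identifier
    """
    pr = {p: 1.0 for p in links}
    out_count = {p: len(targets) for p, targets in links.items()}
    for _ in range(iterations):
        new_pr = {}
        for p in links:
            external = internal = 0.0
            for q, targets in links.items():
                if p not in targets:
                    continue
                share = pr[q] / out_count[q]  # PR(q) / C_q
                if site_of[q] != site_of[p]:
                    external += share  # q is treated as a member of V1
                else:
                    internal += share  # q is treated as a member of V2
            # Assumed combination; the source text does not show the formula.
            new_pr[p] = (1 - d) + d * (a * external + (1 - a) * internal)
        pr = new_pr
    return pr

# Made-up example: page "a" on site s1 links to "x" (site s2) and "y" (site s1).
links = {"a": ["x", "y"], "x": [], "y": []}
site_of = {"a": "s1", "x": "s2", "y": "s1"}
pr = site_weighted_pagerank(links, site_of)
```

Under these assumptions the page reached through an off-site link ends up with a higher PR value than the page reached through an on-site link from the same source, which matches the stated motivation for a = 0.75.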

In this step, once the time factor, the place factor, and the source factor have been chosen as the multiple similarity factors, online topic detection is performed. The input is the set of news report documents and the similarity threshold T_c; the output is a set of topic classes. The procedure is as follows:

(1) Input a news document d.

(2) Determine whether d is the first news report; if so, go to step (3), otherwise go to step (4).

(3) Create a new topic and add text d to it, then go to step (7).

(4) Preprocess text d and build its vector space model.

(5) Compute the similarity between document d and each text of the existing topics, record the maximum similarity S_max, and find the corresponding topic class T.

(6) If the maximum similarity S_max exceeds the preset threshold T_c, cluster document d into topic class T; otherwise go to step (3).

(7) This round of clustering ends.
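Steps (1) to (7) can be sketched as a single loop over documents in arrival order. The sketch assumes each document has already been turned into a weight vector (step 4) and uses cosine similarity as a stand-in for the document similarity; the patent does not name the base text similarity, so `cosine` and the function names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors (assumed text similarity)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def single_pass(vectors, threshold, similarity=cosine):
    """Assign each vector, in arrival order, to a topic; return the list of topics."""
    topics = []  # each topic is a list of member vectors
    for d in vectors:
        best_sim, best_topic = -1.0, None
        for topic in topics:
            # Step (5): compare d with every text of the topic and keep the maximum.
            sim = max(similarity(d, member) for member in topic)
            if sim > best_sim:
                best_sim, best_topic = sim, topic
        if best_topic is not None and best_sim > threshold:
            best_topic.append(d)  # step (6): join topic class T
        else:
            topics.append([d])    # step (3): first document, or no topic is close enough
    return topics

# Made-up two-dimensional vectors: the first two are near each other, the third is not.
topics = single_pass([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], threshold=0.8)
```

Passing a different `similarity` callable is how a multi-factor similarity could be plugged in without changing the clustering loop.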

The invention combines several similarity measures covering time, place name, and source; by assigning a value to each weight factor, the weighted combination yields the total similarity of two texts. Fig. 2 shows the false detection rate versus miss rate curves of the invention and the Single-Pass clustering method, and Fig. 3 compares the two methods on each evaluation metric. As can be seen, the invention lowers the miss rate, the false detection rate, and the cost function value, and its detection performance is better.
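The weighted combination can be sketched as a normalized weighted sum. The patent does not state the weight values or whether a content-based text similarity is one of the terms, so the factor names, the scores, and the weights below are all hypothetical.

```python
def combined_similarity(sims, weights):
    """Weighted sum of per-factor similarities; weights are normalized to sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * sims[name] for name in weights) / total

# Hypothetical factor scores for one document pair, and hypothetical weights.
sims = {"text": 0.62, "time": 0.90, "place": 0.50, "source": 0.40}
weights = {"text": 0.4, "time": 0.2, "place": 0.2, "source": 0.2}
score = combined_similarity(sims, weights)
```

The resulting `score` plays the role of the similarity that is compared against the threshold T_c during clustering.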

It should be emphasized that the embodiments described are illustrative rather than limiting. The invention therefore includes, but is not limited to, the embodiments described in the detailed description; any other embodiment that a person skilled in the art derives from the technical solution of the invention likewise falls within the scope of protection of the invention.

Claims

1. A method for online big-data detection of network topics, characterized by comprising the following steps:

Step 1: crawl big-data network text online;

Step 2: extract and represent text features;

Step 3: perform topic detection using the Single-Pass clustering algorithm with multiple similarity factors.

2. The method for online big-data detection of network topics according to claim 1, characterized in that step 1 is implemented as follows: a Hadoop distributed cluster is built, CentOS and a distributed topic crawler are installed on every machine, and a big-data acquisition platform is thereby set up.

3. The method for online big-data detection of network topics according to claim 1, characterized in that step 2 is implemented through the following steps:

(1) Text preprocessing: features are selected with the document frequency technique, which picks from the candidate set of feature words those that characterize the text most strongly and uses them as feature terms; the classical weighting method TF-IDF then assigns a weight to each selected feature term;

(2) Building the text representation model: text is represented with the vector space model, whose mathematical expression is:

VSM(d) = <(t₁,ω₁); (t₂,ω₂); (t₃,ω₃); …; (t_n,ω_n)>

where n is the number of feature terms, t_i (1≤i≤n) is a feature term of the text, and ω_i (1≤i≤n) is the weight of feature term t_i. Building the VSM represents each text as a vector in an n-dimensional space.

4. The method for online big-data detection of network topics according to claim 1, characterized in that the multiple similarity factors chosen in step 3 include a time factor, a place factor, and a source factor.

5. The method for online big-data detection of network topics according to claim 4, characterized in that the time factor is:

In the formula, Sim_time(d_i,d_j) denotes the time distance between documents d_i and d_j, t = |t_i - t_j|, and m is an automatically set time interval.

6. The method for online big-data detection of network topics according to claim 4, characterized in that the place factor is:

In the formula, Sim_place(p_i,p_j) is the similarity of two place names, deep(p_i∩p_j) is the common depth of place names p_i and p_j from the root node of the geographic tree, deep(p_i) is the depth of place name p_i from the root node, and deep(p_j) is the depth of place name p_j from the root node.

7. The method for online big-data detection of network topics according to claim 4, characterized in that the source factor is:

where PR(p) denotes the PR value of website p, d is the damping coefficient, a is the weighting factor that depends on whether the out-linking site is an off-site link, V1 is the set of out-linking pages that do not belong to the same website as page p, C_i is the total number of outgoing links of page i, V2 is the set of out-linking pages that belong to the same website as page p, and C_j is the total number of outgoing links of page j.

8. The method for online big-data detection of network topics according to claim 7, characterized in that the damping coefficient d is 0.85 and the weighting factor a is 0.75.

9. The method for online big-data detection of network topics according to claim 1, characterized in that topic detection in step 3 comprises the following steps:

(1) Input a news document d;

(2) Determine whether d is the first news report; if so, go to step (3), otherwise go to step (4);

(3) Create a new topic and add text d to it, then go to step (7);

(4) Preprocess text d and build its vector space model;

(5) Compute the similarity between document d and each text of the existing topics, record the maximum similarity S_max, and find the corresponding topic class T;

(6) If the maximum similarity S_max exceeds the preset threshold T_c, cluster document d into topic class T; otherwise go to step (3);

(7) This round of clustering ends.