CN108197259B

CN108197259B - Online topic big data detection method for network

Info

Publication number: CN108197259B
Application number: CN201711489608.5A
Authority: CN
Inventors: 马永军; 柴梦瑶; 刘洋
Original assignee: Tianjin University of Science and Technology
Current assignee: Tianjin University of Science and Technology
Priority date: 2017-12-30
Filing date: 2017-12-30
Publication date: 2024-03-05
Anticipated expiration: 2037-12-30
Also published as: CN108197259A

Abstract

The invention relates to a network online topic big data detection method, which mainly comprises the following steps: online crawling of big data network text; text feature extraction and representation; and topic detection using a Single-Pass clustering algorithm with a plurality of similarity factors. The design is reasonable: building on the existing Single-Pass algorithm, the method analyzes the characteristics of the text, calculates several similarities, assigns a different weight to each, and combines them by weighting to obtain the similarity of two texts. This reduces the miss rate, the false detection rate and the cost function value, and clearly improves the clustering effect.

Description

Online topic big data detection method for network

Technical Field

The invention belongs to the technical field of computer data modeling, and particularly relates to a network online topic big data detection method.

Background

Compared with traditional information channels, the network is more open and more virtual. With the arrival of the media age, viewpoints, opinions and sentiments continuously ferment and are amplified in cyberspace, forming network public opinion events. Against the background of China's current network construction, the analysis of network public opinion has drawn great attention. Research on network public opinion started abroad in the middle of the 19th century and began later in China. Network public opinion refers to the expression and propagation of different sentiments, attitudes and opinions by the public through the internet.

Topic discovery is a key link in network public opinion analysis. Current research focuses mainly on the choice of text clustering algorithm, such as partition-based, hierarchical, density-based and grid-based clustering algorithms, among which Single-Pass clustering is the most commonly used. Because the Single-Pass clustering algorithm uses a single similarity calculation method and does not consider the structural characteristics of the text, its clustering accuracy suffers and its miss rate and false detection rate are high.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a network online topic big data detection method that solves the problem of the high miss rate and high false detection rate of existing clustering algorithms.

The invention solves the technical problems by adopting the following technical scheme:

a network online topic big data detection method comprises the following steps:

step 1, online crawling of a big data network text;

step 2, text feature extraction and representation;

step 3, selecting a plurality of similarity factors and detecting topics by using a Single-Pass clustering algorithm.

Further, step 1 is implemented as follows: a Hadoop distributed cluster is built, CentOS and a distributed topic crawler are installed on each machine, and a big data acquisition platform is thereby constructed.

Further, the specific implementation method of the step 2 includes the following steps:

preprocessing the text: selecting the feature items of the text by the document frequency method, that is, selecting words with strong text characterization capability from the candidate set of feature words as feature items, and assigning weights to the selected feature items by the classical weight estimation method TF-IDF;

building a text representation model: the text is represented by a vector space model, and the expression of the mathematical model is as follows:

VSM(d) = <(t_1, ω_1); (t_2, ω_2); (t_3, ω_3); …; (t_n, ω_n)>

wherein n represents the number of feature items, t_i (1 ≤ i ≤ n) is a text feature item, and ω_i (1 ≤ i ≤ n) is the weight corresponding to feature item t_i. By building the VSM, the text is represented as a vector in an n-dimensional space.

Further, the selecting a plurality of similarity factors in the step 3 includes: time factor, place factor, and source factor.

Further, the time factor is:

in the formula, sim_time(d_i, d_j) represents the time similarity of documents d_i and d_j, t = |t_i - t_j|, and m is an automatically set time interval.

Further, the location factor is:

in the formula, sim_place(p_i, p_j) is the similarity of two place names, deep(p_i ∩ p_j) is the common depth of place name p_i and place name p_j from the root node of the geographical tree, deep(p_i) is the depth of place name p_i from the root node, and deep(p_j) is the depth of place name p_j from the root node.

Further, the source factor is:

wherein PR(p) is the PR value of page p, d is the damping coefficient, a is the specific gravity coefficient, which weights off-site links against on-site links, V1 is the set of out-linking pages that do not belong to the same site as page p, C_i is the number of out-links of page i, V2 is the set of out-linking pages that belong to the same site as page p, and C_j is the number of out-links of page j.

Further, the damping coefficient d is 0.85, and the specific gravity coefficient a is 0.75.

Further, the method for topic detection in the step 3 includes the following steps:

(1) inputting a news document d;

(2) judging whether d is the first news report; if so, turning to step (3), otherwise turning to step (4);

(3) creating a new topic, adding the text d to the new topic, and turning to step (7);

(4) preprocessing the text d and constructing a vector space model;

(5) calculating the similarity between the document d and each text of the existing topics, recording the maximum similarity S_max, and finding the corresponding topic class T;

(6) if the maximum similarity S_max is greater than a preset threshold T_c, clustering the document d into topic class T; otherwise turning to step (3);

(7) finishing one round of clustering.

The invention has the advantages and positive effects that:

Building on the existing Single-Pass algorithm, the method analyzes the characteristics of the text, calculates several similarities, assigns a different weight to each, and combines them by weighting to obtain the similarity of two texts. This reduces the miss rate, the false detection rate and the cost function value, and clearly improves the clustering effect.

Drawings

FIG. 1 is a flow chart of the detection of the present invention;

FIG. 2 is a plot of miss rate versus false detection rate for the present invention and other detection methods;

FIG. 3 is a graph comparing various evaluation indexes of the invention with various detection methods.

Detailed Description

Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

A network online topic big data detection method, as shown in figure 1, comprises the following steps:

Step 1, online crawling of the big data network text.

In this step, a Hadoop distributed cluster is built, 64-bit CentOS and a distributed topic crawler are installed on each machine, and a big data acquisition platform is constructed.

Step 2, text feature extraction and representation.

In this step, after feature selection and weighting are applied to the word-segmented data, a vector space model is constructed to convert the text into computable structured data. The specific method is as follows:

(1) Text preprocessing: the feature items of the text are selected by the document frequency method, that is, words with strong text characterization capability are selected from the candidate set of feature words as feature items, and weights are assigned to the selected feature items by the classical weight estimation method TF-IDF.

(2) Constructing a text representation model: the text is represented using a Vector Space Model (VSM), whose mathematical model is expressed as follows:

VSM(d) = <(t_1, ω_1); (t_2, ω_2); (t_3, ω_3); …; (t_n, ω_n)>
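As an illustration of step 2, the following minimal Python sketch selects feature items by document frequency and weights them with TF-IDF. The function name `build_vsm`, the `min_df` cut-off and the particular TF-IDF variant (relative term frequency times `log(N/df) + 1`) are assumptions made for this example; the text above names the techniques but does not fix these details.

```python
import math
from collections import Counter

def build_vsm(docs, min_df=2):
    """Build sparse VSM vectors {feature item t_i: weight w_i} from tokenised docs.

    min_df and the TF-IDF variant are illustrative assumptions.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Feature selection: keep terms whose document frequency reaches min_df.
    vocab = sorted(t for t, c in df.items() if c >= min_df)
    keep = set(vocab)
    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in keep)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append({t: (c / total) * (math.log(n_docs / df[t]) + 1.0)
                        for t, c in tf.items()})
    return vocab, vectors

docs = [["flood", "tianjin", "rescue"],
        ["flood", "rescue", "team"],
        ["football", "match"]]
vocab, vectors = build_vsm(docs)
```

Each returned dictionary corresponds to one VSM(d): its keys are the feature items t_i and its values the weights ω_i. A document that contains no selected feature item yields an empty vector.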

Step 3, selecting a plurality of similarity factors and detecting topics by using the Single-Pass clustering algorithm. Time, place and source are selected as the similarity factors, wherein:

(1) Time factor: the concept of temporal distance is introduced.

sim_time(d_i, d_j) represents the time similarity of documents d_i and d_j, where t = |t_i - t_j| is the difference between the times of the two documents and m is an automatically set time interval.
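The text defines t and m, but the expression for the time factor itself is not reproduced here. Purely to illustrate how such a factor could be computed, the sketch below ASSUMES a linear decay, 1 - t/m for t < m and 0 otherwise. This form is hypothetical and may differ from the formula of the invention.

```python
from datetime import datetime, timedelta

def sim_time(t_i, t_j, m):
    """Time similarity of two documents published at t_i and t_j.

    ASSUMPTION: linear decay over the interval m; the actual formula
    is not given in the text, so this form is only illustrative.
    """
    t = abs(t_i - t_j)      # t = |t_i - t_j|
    if t >= m:
        return 0.0          # documents further apart than m are unrelated in time
    return 1.0 - t / m      # timedelta / timedelta yields a float

s = sim_time(datetime(2017, 12, 1), datetime(2017, 12, 2), timedelta(days=4))
```

Any monotonically decreasing function of t/m would fit the description equally well; the linear form is chosen only because it is the simplest.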

(2) Location factor: a geographical tree with China as the root node is built from the national dictionary of place names and geographic features provided by Datatang, and each place name is represented as a node in the tree according to the subordinate relations among place names.

Calculating the similarity of two place names in the geographical tree must take into account that each child node is a branch of its parent node, together with the distance between the two nodes, their common depth, and the depth of each node from the root node. Considering the combined influence of these three factors on the similarity of two place names, the calculation formula is defined as follows:

sim_place(p_i, p_j) is the similarity of two place names, where deep(p_i ∩ p_j) is the common depth of place name p_i and place name p_j from the root node of the geographical tree, deep(p_i) is the depth of place name p_i from the root node, and deep(p_j) is the depth of place name p_j from the root node.
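The following sketch shows how the three depths can be read off a geographical tree. Since the formula itself is not reproduced in the text, the sketch ASSUMES the common form 2·deep(p_i ∩ p_j) / (deep(p_i) + deep(p_j)), with the root counted as depth 1. The toy `PARENT` table and both conventions are hypothetical choices for this example.

```python
# Toy geographical tree as a child -> parent map (illustrative data only).
PARENT = {
    "China": None,
    "Tianjin": "China",
    "Hebei": "China",
    "Binhai": "Tianjin",
    "Heping": "Tianjin",
}

def path_to_root(place):
    """Return the nodes from the place up to the root, place first."""
    path = []
    while place is not None:
        path.append(place)
        place = PARENT[place]
    return path

def deep(place):
    """Depth from the root node; ASSUMPTION: the root has depth 1."""
    return len(path_to_root(place))

def common_deep(p_i, p_j):
    """Depth of the deepest node shared by both root paths."""
    ancestors_i = set(path_to_root(p_i))
    for node in path_to_root(p_j):   # walks upward, so the first hit is the deepest
        if node in ancestors_i:
            return deep(node)
    return 0

def sim_place(p_i, p_j):
    """ASSUMED form: 2 * common depth / sum of the two depths."""
    return 2.0 * common_deep(p_i, p_j) / (deep(p_i) + deep(p_j))
```

Under these assumptions, two districts of the same city score higher than a district and a different province, and identical place names score 1.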

(3) Source factor: the PR value of a page is calculated with an improved PageRank algorithm, and the calculation formula is as follows:

PR(p) is the PR value of page p, where d is the damping coefficient, usually 0.85, and a is the specific gravity coefficient, which weights off-site links against on-site links. Compared with on-site pages, off-site pages better reflect the importance of the site to which the page belongs, so a is set to 0.75. V1 is the set of out-linking pages that do not belong to the same site as page p, C_i is the number of out-links of page i, V2 is the set of out-linking pages that belong to the same site as page p, and C_j is the number of out-links of page j.
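A sketch of a site-aware PageRank consistent with the symbols described above follows. The formula is not reproduced in the text, so the update rule used here, PR(p) = (1 - d) + d·(a·Σ_{i∈V1} PR(i)/C_i + (1 - a)·Σ_{j∈V2} PR(j)/C_j), is an ASSUMPTION, as is reading V1 and V2 as the pages that link to p. The function names and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

def site_of(url):
    """Identify a page's site by its host name (an illustrative choice)."""
    return urlparse(url).netloc

def weighted_pagerank(links, d=0.85, a=0.75, iterations=50):
    """links maps each page URL to the set of URLs it links out to.

    ASSUMED update rule: off-site in-links are weighted by a,
    on-site in-links by (1 - a).
    """
    pages = set(links) | {q for outs in links.values() for q in outs}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            off_site = on_site = 0.0
            for q, outs in links.items():
                if p in outs:
                    share = pr[q] / len(outs)      # PR(q) / C_q
                    if site_of(q) != site_of(p):
                        off_site += share          # q belongs to V1
                    else:
                        on_site += share           # q belongs to V2
            new_pr[p] = (1 - d) + d * (a * off_site + (1 - a) * on_site)
        pr = new_pr
    return pr

links = {
    "http://a.example/1": {"http://c.example/1"},   # off-site link
    "http://c.example/2": {"http://c.example/3"},   # on-site link
}
pr = weighted_pagerank(links)
```

With a = 0.75, the page reached through an off-site link ends up with a higher PR value than the page reached through an on-site link, which matches the stated motivation for the coefficient.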

In this step, after the time factor, the place factor and the source factor are selected as the multiple similarity factors, online topic detection is performed. The input is a document set of news reports and a similarity threshold T_c; the output is a number of topic classes. The specific flow is as follows:

(1) Input a news document d.

(2) Judge whether d is the first news report; if so, go to step (3), otherwise go to step (4).

(3) Create a new topic, add the text d to the new topic, and go to step (7).

(4) Preprocess the text d and construct a vector space model.

(5) Calculate the similarity between the document d and each text of the existing topics, record the maximum similarity S_max, and find the corresponding topic class T.

(6) If the maximum similarity S_max is greater than the preset threshold T_c, cluster the document d into topic class T; otherwise go to step (3).

(7) One round of clustering is finished.
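Steps (1) to (7) can be sketched in Python as follows. The sketch assumes the documents have already been preprocessed into sparse VSM vectors (step (4) is therefore done up front), and it uses plain cosine similarity as a stand-in for the similarity measure. The names `cosine` and `single_pass` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def single_pass(vectors, threshold, similarity=cosine):
    """Cluster documents in arrival order; returns topics as lists of indices."""
    topics = []
    for idx, vec in enumerate(vectors):              # step (1): next document d
        best_sim, best_topic = -1.0, None
        for topic in topics:                         # step (5): compare with every text
            for member in topic:                     # of every existing topic
                s = similarity(vec, vectors[member])
                if s > best_sim:
                    best_sim, best_topic = s, topic  # track S_max and its topic T
        if best_topic is not None and best_sim > threshold:
            best_topic.append(idx)                   # step (6): join topic T
        else:
            topics.append([idx])                     # steps (2)-(3): open a new topic
    return topics

vectors = [{"flood": 1.0, "rescue": 1.0},
           {"flood": 1.0, "rescue": 0.5},
           {"football": 1.0}]
topics = single_pass(vectors, threshold=0.5)
```

The first document always opens a new topic because no topic exists yet, which covers step (2) without a separate check. Passing a different `similarity` function is how a combined multi-factor similarity would be plugged in.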

The multi-similarity calculation integrates the time, place-name and source factors: the overall similarity of two texts is obtained by assigning a different weight to each factor and combining them by weighting. Fig. 2 shows the miss rate versus false detection rate curves of the clustering method of the invention and of the Single-Pass clustering method, and Fig. 3 compares their evaluation indexes. It can be seen that the method of the invention lowers the miss rate, the false detection rate and the cost function value, and achieves a good detection effect.
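The weighted combination can be sketched as a normalised weighted sum. The text does not state the weight values, nor whether a content similarity term is included alongside the three factors, so the factor names, the weights and the similarity values below are all hypothetical placeholders.

```python
def combined_similarity(sims, weights):
    """Weighted combination of per-factor similarities.

    sims and weights are dicts keyed by factor name; the weights are
    normalised so that they need not sum to 1.
    """
    if set(sims) != set(weights):
        raise ValueError("sims and weights must cover the same factors")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[k] * sims[k] for k in sims) / total

# Placeholder values: neither the weights nor the scores come from the text.
sims = {"content": 0.8, "time": 0.5, "place": 1.0, "source": 0.2}
weights = {"content": 0.4, "time": 0.2, "place": 0.2, "source": 0.2}
total_sim = combined_similarity(sims, weights)
```

The resulting value is what would be compared against the threshold T_c in step (6) of the clustering flow.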

It should be emphasized that the examples described herein are illustrative rather than limiting, and therefore the invention includes, but is not limited to, the examples described in the detailed description, as other embodiments derived from the technical solutions of the invention by a person skilled in the art are equally within the scope of the invention.

Claims

1. The online topic big data detection method for the network is characterized by comprising the following steps of:

step 1, online crawling of a big data network text;

step 2, text feature extraction and representation;

step 3, selecting a plurality of similarity factors and detecting topics by using a Single-Pass clustering algorithm;

the plurality of similarity factors selected in step 3 includes: a time factor, a place factor, and a source factor;

the time factor is:

in the formula, sim_time(d_i, d_j) represents the time similarity of documents d_i and d_j, t = |t_i - t_j|, and m is an automatically set time interval;

the location factor is:

in the formula, sim_place(p_i, p_j) is the similarity of two place names, deep(p_i ∩ p_j) is the common depth of place name p_i and place name p_j from the root node of the geographical tree, deep(p_i) is the depth of place name p_i from the root node, and deep(p_j) is the depth of place name p_j from the root node;

the source factor is:

wherein PR(p) is the PR value of page p, d is the damping coefficient, a is the specific gravity coefficient, which weights off-site links against on-site links, V_1 is the set of out-linking pages that do not belong to the same site as page p, C_i is the number of out-links of page i, V_2 is the set of out-linking pages that belong to the same site as page p, and C_j is the number of out-links of page j.

2. The method for detecting online topic big data of a network according to claim 1, wherein: step 1 is implemented as follows: a Hadoop distributed cluster is built, CentOS and a distributed topic crawler are installed on each machine, and a big data acquisition platform is thereby constructed.

3. The method for detecting online topic big data of a network according to claim 1, wherein: the specific implementation method of the step 2 comprises the following steps:

building a text representation model: the text is represented by a vector space model, and the expression of the mathematical model is as follows: VSM(d) = <(t_1, ω_1); (t_2, ω_2); (t_3, ω_3); …; (t_n, ω_n)>, wherein n represents the number of feature items, t_i (1 ≤ i ≤ n) is a text feature item, and ω_i (1 ≤ i ≤ n) is the weight corresponding to feature item t_i; by building the VSM, the text is represented as a vector in an n-dimensional space.

4. The method for detecting online topic big data of a network according to claim 1, wherein: the damping coefficient d takes a value of 0.85, and the specific gravity coefficient a takes a value of 0.75.

5. The method for detecting online topic big data of a network according to claim 1, wherein: the method for topic detection in the step 3 comprises the following steps:

inputting a news document d;

preprocessing the text d and constructing a vector space model;

calculating the similarity between the document d and each text of the existing topics, recording the maximum similarity S_max, and finding the corresponding topic class T;

and finishing one round of clustering.