CN108197259B - Online topic big data detection method for network - Google Patents
- Publication number
- CN108197259B (application CN201711489608.5A)
- Authority
- CN
- China
- Prior art keywords
- text
- topic
- similarity
- big data
- factors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a network online topic big data detection method whose main technical features are the following steps: crawling big-data network text online; extracting text features and representing the text; and detecting topics with a Single-Pass clustering algorithm that uses multiple similarity factors. The design is sound: building on the existing Single-Pass algorithm, the method analyzes the characteristics of the text, computes several similarities, and combines them with different weight factors to obtain the overall text similarity. This reduces the missed-detection rate, the false-detection rate, and the cost function value, and markedly improves the clustering result.
Description
Technical Field
The invention belongs to the technical field of computer data modeling, and particularly relates to a network online topic big data detection method.
Background
Compared with traditional channels of information propagation, the network is more open and more virtual. With the arrival of the media age, viewpoints, discussions, and moods are continuously fermented and amplified in network space, giving rise to network public opinion events. Against the background of network construction in China today, the analysis of network public opinion receives close attention. Research on public opinion abroad dates to the middle of the 19th century, while research on internet public opinion in China began later. Internet public opinion refers to the expression and propagation of different moods, attitudes, and opinions by the public through the internet.
Topic discovery is a key link in network public opinion analysis. Current research focuses mainly on the choice of text clustering algorithm, such as partition-based, hierarchy-based, density-based, and grid-based clustering algorithms, of which Single-Pass clustering is the most commonly used. Because the Single-Pass clustering algorithm relies on a single similarity calculation and ignores the structural characteristics of the text, its clustering accuracy suffers, and its missed-detection rate and error rate are high.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a network online topic big data detection method that addresses the high missed-detection rate and high error rate of existing clustering algorithms.
The invention solves the technical problems by adopting the following technical scheme:
A network online topic big data detection method comprises the following steps:
Step 1: crawling big-data network text online;
Step 2: extracting text features and representing the text;
Step 3: detecting topics using a Single-Pass clustering algorithm with multiple similarity factors.
Further, step 1 is implemented as follows: a Hadoop distributed cluster is built, CentOS and a distributed topic crawler are installed on each machine, and a big data acquisition platform is thereby constructed.
Further, step 2 is implemented through the following steps:
Text preprocessing: feature items are selected using the document frequency method, choosing from the candidate set of feature words those with the stronger ability to characterize the text, and weights are assigned to the selected feature items using the classical TF-IDF weighting method;
Building the text representation model: the text is represented by a vector space model, whose mathematical expression is:
$$\mathrm{VSM}(d) = \langle (t_1, \omega_1); (t_2, \omega_2); (t_3, \omega_3); \ldots; (t_n, \omega_n) \rangle$$
where $n$ is the number of feature items, $t_i$ ($1 \le i \le n$) is a text feature item, and $\omega_i$ ($1 \le i \le n$) is the weight of feature item $t_i$. Building the VSM represents the text as a vector in an $n$-dimensional space.
Further, the similarity factors selected in step 3 include a time factor, a place factor, and a source factor.
Further, the time factor is given by a formula in which $Sim_{time}(d_i, d_j)$ denotes the time similarity of documents $d_i$ and $d_j$, $t = |t_i - t_j|$ is the difference between the times of the two documents, and $m$ is an automatically set time interval.
Further, the place factor is given by a formula in which $Sim_{place}(p_i, p_j)$ is the similarity of two place names, $deep(p_i \cap p_j)$ is the common depth of place names $p_i$ and $p_j$ from the root node of the geographic tree, $deep(p_i)$ is the depth of place name $p_i$ from the root node, and $deep(p_j)$ is the depth of place name $p_j$ from the root node.
Further, the source factor is given by a formula in which $PR(p)$ is the PR value of page $p$, $d$ is the damping coefficient, $a$ is the weight coefficient that distinguishes whether an out-linking page is an off-site link, $V_1$ is the set of out-linking pages that do not belong to the same site as page $p$, $C_i$ is the total number of out-links of page $i$, $V_2$ is the set of out-linking pages that belong to the same site as page $p$, and $C_j$ is the total number of out-links of page $j$.
Further, the damping coefficient $d$ is 0.85 and the weight coefficient $a$ is 0.75.
Further, topic detection in step 3 comprises the following steps:
(1) input a news document d;
(2) determine whether d is the first news report; if so, go to step (3), otherwise go to step (4);
(3) create a new topic, add text d to it, and go to step (7);
(4) preprocess text d and construct its vector space model;
(5) compute the similarity between document d and each text of the existing topics, record the maximum similarity $S_{max}$, and find the corresponding topic class T;
(6) if the maximum similarity $S_{max}$ is greater than the preset threshold $T_c$, cluster document d into topic class T; otherwise go to step (3);
(7) end this clustering pass.
The invention has the advantages and positive effects that:
Building on the existing Single-Pass algorithm, the method analyzes the characteristics of the text, computes several similarities, and combines them with different weight factors to obtain the overall text similarity. This reduces the missed-detection rate, the false-detection rate, and the cost function value, and markedly improves the clustering result.
Drawings
FIG. 1 is a flow chart of the detection of the present invention;
FIG. 2 is a plot of missed-detection rate versus false-detection rate for the present invention and other detection methods;
FIG. 3 is a graph comparing various evaluation indexes of the invention with various detection methods.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
A network online topic big data detection method, as shown in FIG. 1, comprises the following steps:
Step 1: crawling big-data network text online.
In this step, a Hadoop distributed cluster is built, 64-bit CentOS and a distributed topic crawler are installed on each machine, and a big data acquisition platform is thereby constructed.
Step 2: extracting text features and representing the text.
In this step, after feature selection and weighting are applied to the word-segmented data, a vector space model is built to convert the text into computable structured data. The specific method is as follows:
(1) Text preprocessing: feature items of the text are selected using the document frequency method, choosing from the candidate set of feature words those with the stronger ability to characterize the text, and weights are assigned to the selected feature items using the classical TF-IDF weighting method.
(2) Constructing a text representation model: the text is represented using a Vector Space Model (VSM), whose mathematical expression is:
$$\mathrm{VSM}(d) = \langle (t_1, \omega_1); (t_2, \omega_2); (t_3, \omega_3); \ldots; (t_n, \omega_n) \rangle$$
where $n$ is the number of feature items, $t_i$ ($1 \le i \le n$) is a text feature item, and $\omega_i$ ($1 \le i \le n$) is the weight of feature item $t_i$. Building the VSM represents the text as a vector in an $n$-dimensional space.
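As a rough illustration of the representation step, the sketch below builds sparse TF-IDF vectors of the form (term, weight) for a few tokenized documents. The specific TF-IDF variant (relative term frequency times log(N/df)), the cosine comparison, and the names `build_vsm` and `cosine` are assumptions made for this example; the source only states that TF-IDF weights and a vector space model are used.

```python
import math
from collections import Counter

def build_vsm(docs):
    """Return one {term: tf-idf weight} mapping per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Assumed variant: relative term frequency times log(N / df).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors (an assumed content measure)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical, already segmented documents.
docs = [
    ["flood", "rescue", "city"],
    ["flood", "rescue", "team"],
    ["stock", "market", "city"],
]
vectors = build_vsm(docs)
```

With these toy documents, the two flood reports share two weighted terms and therefore score higher against each other than either does against the stock report.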
Step 3: detecting topics using a Single-Pass clustering algorithm with multiple similarity factors.
The invention uses the Single-Pass clustering algorithm and selects time, place, and source as similarity factors, where:
(1) Time factor: the concept of time distance is introduced.
$Sim_{time}(d_i, d_j)$ denotes the time similarity of documents $d_i$ and $d_j$, where $t = |t_i - t_j|$ is the difference between the times of the two documents and $m$ is the automatically set time interval.
(2) Place factor: a geographic tree with China as the root node is built from the national place-name and geographic-feature dictionary data provided by the data hall, and each place name is represented as a node in the tree according to the subordination relations among place names.
Computing the similarity of two place names in the geographic tree must take into account that each child node is a branch of its parent node, as well as the distance between the two nodes, their common depth, and the depth of each node from the root node. The calculation formula is defined by considering the combined influence of these three factors on the similarity of the two place names.
$Sim_{place}(p_i, p_j)$ is the similarity of two place names, where $deep(p_i \cap p_j)$ is the common depth of place names $p_i$ and $p_j$ from the root node of the geographic tree, $deep(p_i)$ is the depth of place name $p_i$ from the root node, and $deep(p_j)$ is the depth of place name $p_j$ from the root node.
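The three depth quantities named above can be computed directly on a parent-pointer tree, as the following sketch shows. The tiny tree, the convention that the root has depth 1, and in particular the ratio used in `place_similarity` (a Wu-Palmer-style form, 2 times the common depth divided by the sum of the two depths) are assumptions for illustration; the place formula itself is not reproduced in this text, so this should not be read as the patent's formula.

```python
# Hypothetical child -> parent table for a tiny geographic tree rooted at China.
PARENT = {
    "China": None,
    "Hebei": "China",
    "Shandong": "China",
    "Shijiazhuang": "Hebei",
    "Tangshan": "Hebei",
    "Jinan": "Shandong",
}

def path_from_root(place):
    """Return the list of nodes from the root down to `place`."""
    path = []
    node = place
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]

def depth(place):
    """deep(p): number of nodes on the root-to-p path (assumed: root = 1)."""
    return len(path_from_root(place))

def common_depth(p_i, p_j):
    """deep(p_i ∩ p_j): depth of the deepest ancestor shared by both names."""
    shared = 0
    for a, b in zip(path_from_root(p_i), path_from_root(p_j)):
        if a != b:
            break
        shared += 1
    return shared

def place_similarity(p_i, p_j):
    """Assumed Wu-Palmer-style ratio; NOT taken from the patent text."""
    return 2 * common_depth(p_i, p_j) / (depth(p_i) + depth(p_j))
```

Under these assumptions, two cities in the same province (Shijiazhuang and Tangshan) score higher than two cities that share only the root (Shijiazhuang and Jinan), which is the qualitative behavior the description asks for.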
(3) Source factor: the PR value of a page is computed with an improved PageRank algorithm.
$PR(p)$ is the PR value of page $p$, where $d$ is the damping coefficient, usually 0.85, and $a$ is the weight coefficient that distinguishes whether an out-linking page is an off-site link. Because off-site pages reflect the importance of the site to which a page belongs better than on-site pages do, $a$ is set to 0.75. $V_1$ is the set of out-linking pages that do not belong to the same site as page $p$, $C_i$ is the total number of out-links of page $i$, $V_2$ is the set of out-linking pages that belong to the same site as page $p$, and $C_j$ is the total number of out-links of page $j$.
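One way to read the symbols above is sketched below. Since the improved PageRank formula is not reproduced in this text, the update rule used here, PR(p) = (1 - d) + d * (a * sum over V1 of PR(i)/C_i + (1 - a) * sum over V2 of PR(j)/C_j), is an assumption, as are the function name `site_aware_pagerank`, the fixed iteration count, and the toy link graph.

```python
def site_aware_pagerank(links, site_of, d=0.85, a=0.75, iterations=50):
    """Iterate an ASSUMED site-aware PageRank update.

    links:   {page: list of pages it links out to}
    site_of: {page: site identifier}
    """
    pages = list(links)
    out_count = {p: len(links[p]) for p in pages}  # C_i for every page i
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            off_site = 0.0  # contribution of V1: linking pages on other sites
            on_site = 0.0   # contribution of V2: linking pages on the same site
            for q in pages:
                if p not in links[q]:
                    continue
                share = pr[q] / out_count[q]
                if site_of[q] != site_of[p]:
                    off_site += share
                else:
                    on_site += share
            # Assumed combination: off-site links weighted by a, on-site by 1 - a.
            new_pr[p] = (1 - d) + d * (a * off_site + (1 - a) * on_site)
        pr = new_pr
    return pr

# Hypothetical three-page graph spanning two sites.
links = {"A1": ["A2", "B1"], "A2": ["A1"], "B1": ["A1"]}
site_of = {"A1": "site-a", "A2": "site-a", "B1": "site-b"}
pr = site_aware_pagerank(links, site_of)
```

In this toy graph, page A1 receives an off-site link from B1 and ends up with the highest value, while A2, which is reached only through an on-site link, ends up lowest.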
In this step, once the time factor, place factor, and source factor have been selected as the multiple similarity factors, online topic detection is performed. The input is a set of news report documents and a similarity threshold $T_c$; the output is a number of topic classes. The specific flow is as follows:
(1) Input a news document d.
(2) Determine whether d is the first news report; if so, go to step (3), otherwise go to step (4).
(3) Create a new topic, add text d to it, and go to step (7).
(4) Preprocess text d and construct its vector space model.
(5) Compute the similarity between document d and each text of the existing topics, record the maximum similarity $S_{max}$, and find the corresponding topic class T.
(6) If the maximum similarity $S_{max}$ is greater than the preset threshold $T_c$, cluster document d into topic class T; otherwise go to step (3).
(7) End this clustering pass.
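The flow in steps (1) to (7) can be sketched as a short loop over incoming documents. The function name `single_pass`, the Jaccard stand-in for the similarity measure, and the threshold value are assumptions chosen to keep the example self-contained; in the method described here the similarity would instead be the weighted multi-factor score.

```python
def single_pass(documents, similarity, threshold):
    """Cluster documents in arrival order; returns a list of topics (lists)."""
    topics = []
    for doc in documents:
        # Step (5): compare with every text of every existing topic.
        best_score, best_topic = -1.0, None
        for topic in topics:
            for member in topic:
                score = similarity(doc, member)
                if score > best_score:
                    best_score, best_topic = score, topic
        # Step (6): join the best topic only if S_max exceeds T_c.
        if best_topic is not None and best_score > threshold:
            best_topic.append(doc)
        else:
            # Steps (2)/(3): first report, or nothing similar enough.
            topics.append([doc])
    return topics

def jaccard(a, b):
    """Stand-in similarity on token lists (an assumption, not the patent's)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical stream of tokenized news reports.
stream = [
    ["flood", "rescue", "city"],
    ["flood", "rescue", "team"],
    ["stock", "market"],
    ["stock", "market", "rally"],
]
topics = single_pass(stream, jaccard, threshold=0.3)
```

Because each document is compared only with what has already arrived, the result depends on input order, which is a known property of Single-Pass clustering.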
The similarity calculations for time, place name, and source are integrated, and the total similarity of two texts is obtained by assigning different weight factors and forming a weighted combination. FIG. 2 shows the missed-detection rate versus false-detection rate curves of the clustering method of the invention and the Single-Pass clustering method, and FIG. 3 compares the evaluation indexes of the two methods. The figures show that the method of the invention lowers the missed-detection rate, the false-detection rate, and the cost function value, and achieves a good detection result.
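A weighted combination of this kind might look like the sketch below. The weight values, the inclusion of a separate content score, and the assumption that every component has already been mapped to a pairwise score in [0, 1] are all hypothetical; the text does not state the weights or how the source factor is turned into a pairwise score.

```python
def combined_similarity(scores, weights):
    """Weighted average of component similarities that share the same keys."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must have the same factor names")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[k] * scores[k] for k in scores) / total_weight

# Hypothetical weights and component scores for one pair of documents.
weights = {"content": 0.5, "time": 0.2, "place": 0.2, "source": 0.1}
scores = {"content": 0.6, "time": 1.0, "place": 0.5, "source": 0.8}
total = combined_similarity(scores, weights)
```

Normalizing by the sum of the weights keeps the total in the same range as the components, so a single threshold $T_c$ remains meaningful if the weights are later retuned.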
It should be emphasized that the embodiments described here are illustrative rather than limiting. The invention is therefore not limited to the embodiments in the detailed description; other embodiments derived by those skilled in the art from the technical solution of the invention likewise fall within its scope of protection.
Claims (5)
1. A network online topic big data detection method, characterized by comprising the following steps:
Step 1: crawling big-data network text online;
Step 2: extracting text features and representing the text;
Step 3: detecting topics using a Single-Pass clustering algorithm with multiple similarity factors;
the similarity factors selected in step 3 include a time factor, a place factor, and a source factor;
the time factor is given by a formula in which $Sim_{time}(d_i, d_j)$ denotes the time similarity of documents $d_i$ and $d_j$, $t = |t_i - t_j|$, and $m$ is an automatically set time interval;
the place factor is given by a formula in which $Sim_{place}(p_i, p_j)$ is the similarity of two place names, $deep(p_i \cap p_j)$ is the common depth of place names $p_i$ and $p_j$ from the root node of the geographic tree, $deep(p_i)$ is the depth of place name $p_i$ from the root node, and $deep(p_j)$ is the depth of place name $p_j$ from the root node;
the source factor is given by a formula in which $PR(p)$ is the PR value of page $p$, $d$ is the damping coefficient, $a$ is the weight coefficient that distinguishes whether an out-linking page is an off-site link, $V_1$ is the set of out-linking pages that do not belong to the same site as page $p$, $C_i$ is the total number of out-links of page $i$, $V_2$ is the set of out-linking pages that belong to the same site as page $p$, and $C_j$ is the total number of out-links of page $j$.
2. The method for detecting online topic big data of a network according to claim 1, wherein step 1 is implemented as follows: a Hadoop distributed cluster is built, CentOS and a distributed topic crawler are installed on each machine, and a big data acquisition platform is thereby constructed.
3. The method for detecting online topic big data of a network according to claim 1, wherein step 2 is implemented through the following steps:
text preprocessing: feature items are selected using the document frequency method, choosing from the candidate set of feature words those with the stronger ability to characterize the text, and weights are assigned to the selected feature items using the classical TF-IDF weighting method;
building the text representation model: the text is represented by a vector space model, whose mathematical expression is $\mathrm{VSM}(d) = \langle (t_1, \omega_1); (t_2, \omega_2); (t_3, \omega_3); \ldots; (t_n, \omega_n) \rangle$, where $n$ is the number of feature items, $t_i$ ($1 \le i \le n$) is a text feature item, and $\omega_i$ ($1 \le i \le n$) is the weight of feature item $t_i$; building the VSM represents the text as a vector in an $n$-dimensional space.
4. The method for detecting online topic big data of a network according to claim 1, wherein: the damping coefficient d takes a value of 0.85, and the specific gravity coefficient a takes a value of 0.75.
5. The method for detecting online topic big data of a network according to claim 1, wherein topic detection in step 3 comprises the following steps:
(1) input a news document d;
(2) determine whether d is the first news report; if so, go to step (3), otherwise go to step (4);
(3) create a new topic, add text d to it, and go to step (7);
(4) preprocess text d and construct its vector space model;
(5) compute the similarity between document d and each text of the existing topics, record the maximum similarity $S_{max}$, and find the corresponding topic class T;
(6) if the maximum similarity $S_{max}$ is greater than the preset threshold $T_c$, cluster document d into topic class T; otherwise go to step (3);
(7) end this clustering pass.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711489608.5A CN108197259B (en) | 2017-12-30 | 2017-12-30 | Online topic big data detection method for network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108197259A CN108197259A (en) | 2018-06-22 |
CN108197259B CN108197259B (en) | 2024-03-05 |
Family
ID=62587443
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711489608.5A Active CN108197259B (en) | 2017-12-30 | 2017-12-30 | Online topic big data detection method for network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108197259B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111982393B (en) * | 2020-08-27 | 2021-11-19 | 天津科技大学 | Real-time monitoring vacuum instrument |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102937960A (en) * | 2012-09-06 | 2013-02-20 | 北京邮电大学 | Device and method for identifying and evaluating emergency hot topic |
CN105488092A (en) * | 2015-07-13 | 2016-04-13 | 中国科学院信息工程研究所 | Time-sensitive self-adaptive on-line subtopic detecting method and system |
CN105718598A (en) * | 2016-03-07 | 2016-06-29 | 天津大学 | AT based time model construction method and network emergency early warning method |
Also Published As
Publication number | Publication date |
---|---|
CN108197259A (en) | 2018-06-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111159395B (en) | Chart neural network-based rumor standpoint detection method and device and electronic equipment | |
CN103544255B (en) | Text semantic relativity based network public opinion information analysis method | |
CN108681557B (en) | Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint | |
CN105653518A (en) | Specific group discovery and expansion method based on microblog data | |
CN103544242A (en) | Microblog-oriented emotion entity searching system | |
Liu et al. | A fast method based on multiple clustering for name disambiguation in bibliographic citations | |
CN111767725B (en) | Data processing method and device based on emotion polarity analysis model | |
CN103455562A (en) | Text orientation analysis method and product review orientation discriminator on basis of same | |
CN111899089A (en) | Enterprise risk early warning method and system based on knowledge graph | |
CN110134788B (en) | Microblog release optimization method and system based on text mining | |
CN107577665B (en) | Text emotional tendency judging method | |
CN113422761B (en) | Malicious social user detection method based on counterstudy | |
CN106503256B (en) | A kind of hot information method for digging based on social networks document | |
CN107832467A (en) | A kind of microblog topic detecting method based on improved Single pass clustering algorithms | |
CN110990718A (en) | Social network model building module of company image improving system | |
CN109992784A (en) | A kind of heterogeneous network building and distance metric method for merging multi-modal information | |
CN108197259B (en) | Online topic big data detection method for network | |
Wang et al. | An improved clustering method for detection system of public security events based on genetic algorithm and semisupervised learning | |
CN103699568B (en) | A kind of from Wiki, extract the method for hyponymy between field term | |
CN110929509B (en) | Domain event trigger word clustering method based on louvain community discovery algorithm | |
Alfarra et al. | Graph-based technique for extracting keyphrases in a single-document (gtek) | |
Brochier et al. | New datasets and a benchmark of document network embedding methods for scientific expert finding | |
Wen et al. | OLMPT: research on online log parsing method based on prefix tree | |
Yang et al. | A topic-specific web crawler with concept similarity context graph based on FCA | |
Zhao et al. | Identifying and analyzing popular phrases multi-dimensionally in social media data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||