CN108197259B - Online topic big data detection method for network - Google Patents

Publication number
CN108197259B
Authority
CN
China
Prior art keywords
text
topic
similarity
big data
factors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711489608.5A
Other languages
Chinese (zh)
Other versions
CN108197259A (en)
Inventor
马永军
柴梦瑶
刘洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Science and Technology
Original Assignee
Tianjin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Science and Technology
Priority to CN201711489608.5A
Publication of CN108197259A
Application granted
Publication of CN108197259B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a network online topic big data detection method comprising the following steps: crawling big data network text online; extracting and representing text features; and detecting topics with a Single-Pass clustering algorithm that uses a plurality of similarity factors. Building on the existing Single-Pass algorithm, the method analyzes the characteristics of the text, computes several similarities, assigns each a weight factor, and combines them by weighting to obtain the overall text similarity. This reduces the miss rate, the false detection rate, and the cost function value, and noticeably improves the clustering effect.

Description

Online topic big data detection method for network
Technical Field
The invention belongs to the technical field of computer data modeling, and particularly relates to a network online topic big data detection method.
Background
Compared with traditional channels of information dissemination, the network is more open and more virtual. With the arrival of the media age, viewpoints, discussions, and emotions of all kinds are continuously fermented and amplified in network space, giving rise to network public opinion events. Against the background of China's current network development, the analysis of network public opinion is receiving close attention. Research in this field developed abroad from the mid-19th century, while research on internet public opinion in China started later. Internet public opinion refers to the expression and spread of differing emotions, attitudes, and opinions by the public over the internet.
Topic discovery is a key link in network public opinion analysis. Current research focuses mainly on the choice of text clustering algorithm, such as partition-based, hierarchical, density-based, and grid-based clustering algorithms, among which Single-Pass clustering is the most commonly used. Because the Single-Pass clustering algorithm uses a single similarity measure and does not consider the structural characteristics of the text, clustering accuracy suffers, and the miss rate and error rate are high.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a network online topic big data detection method that addresses the high miss rate and high error rate of existing clustering algorithms.
The invention solves the technical problems by adopting the following technical scheme:
a network online topic big data detection method comprises the following steps:
step 1, online crawling of a big data network text;
step 2, extracting text features and representations;
step 3, detecting topics with a Single-Pass clustering algorithm that uses a plurality of similarity factors.
Further, step 1 is implemented as follows: a Hadoop distributed cluster is built, CentOS and a distributed topic-focused crawler are installed on each machine, and a big data acquisition platform is thereby constructed.
Further, the specific implementation method of the step 2 includes the following steps:
text preprocessing: feature items of the text are selected by the document frequency method, words with a stronger ability to characterize the text are chosen from the candidate set of feature words as feature items, and weights are assigned to the selected feature items by the classical TF-IDF weighting method;
building a text representation model: the text is represented by a vector space model, whose mathematical expression is:
$$\mathrm{VSM}(d) = \langle (t_1, \omega_1); (t_2, \omega_2); (t_3, \omega_3); \ldots; (t_n, \omega_n) \rangle$$
where $n$ is the number of feature items, $t_i$ ($1 \le i \le n$) is a text feature item, and $\omega_i$ ($1 \le i \le n$) is the weight of feature item $t_i$; by building the VSM, the text is represented as a vector in an $n$-dimensional space.
Further, the similarity factors selected in step 3 include: a time factor, a place factor, and a source factor.
Further, the time factor is:
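In the formula, $\mathrm{Sim}_{time}(d_i, d_j)$ denotes the time similarity of documents $d_i$ and $d_j$, $t = |t_i - t_j|$ is the difference between the times of the two documents, and $m$ is an automatically set time interval.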
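The expression for the time factor is not reproduced in this text, so the following Python sketch substitutes a simple linear decay purely as a stand-in. The decay shape, the function name `time_similarity`, and the example interval are all assumptions made for illustration; only the quantities t = |t_i - t_j| and m come from the description.

```python
def time_similarity(t_i: float, t_j: float, m: float) -> float:
    """Hypothetical time factor: linear decay over the interval m (an assumption,
    not the formula of the patent)."""
    if m <= 0:
        raise ValueError("m must be a positive time interval")
    t = abs(t_i - t_j)  # t = |t_i - t_j|, the gap between the two document times
    return max(0.0, 1.0 - t / m)


# Timestamps in hours; m = 48 hours is an arbitrary illustrative interval.
same_day = time_similarity(10.0, 22.0, m=48.0)
far_apart = time_similarity(10.0, 130.0, m=48.0)
```

Any monotonically decreasing function of t that is normalized by m could be swapped in without changing how the factor is later combined with the others.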
Further, the location factor is:
In the formula, $\mathrm{Sim}_{place}(p_i, p_j)$ is the similarity of two place names, $\mathrm{deep}(p_i \cap p_j)$ is the common depth of place names $p_i$ and $p_j$ from the root node of the geographic tree, $\mathrm{deep}(p_i)$ is the depth of place name $p_i$ from the root node, and $\mathrm{deep}(p_j)$ is the depth of place name $p_j$ from the root node.
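The place-factor expression is likewise missing from this text. The sketch below assumes a common depth-ratio form, 2 · deep(p_i ∩ p_j) / (deep(p_i) + deep(p_j)), which uses exactly the three quantities defined above but may differ from the patented formula. The tiny tree, the `parent` mapping, and the function names are hypothetical.

```python
def path_from_root(node, parent):
    """Return the list of nodes from the root down to `node`."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path[::-1]


def place_similarity(p_i, p_j, parent):
    """Assumed depth-ratio similarity on a geographic tree (root has depth 0)."""
    path_i = path_from_root(p_i, parent)
    path_j = path_from_root(p_j, parent)
    # Count how many leading nodes the two root-to-node paths share.
    shared = 0
    for a, b in zip(path_i, path_j):
        if a != b:
            break
        shared += 1
    if shared == 0:
        return 0.0  # the two names are not in the same tree
    deep_common = shared - 1          # deep(p_i ∩ p_j)
    deep_i = len(path_i) - 1          # deep(p_i)
    deep_j = len(path_j) - 1          # deep(p_j)
    if deep_i + deep_j == 0:
        return 1.0  # both names are the root itself
    return 2.0 * deep_common / (deep_i + deep_j)


# Child -> parent links; a toy fragment of a tree rooted at China.
parent = {
    "Tianjin": "China",
    "Hebei": "China",
    "Binhai": "Tianjin",
    "Heping": "Tianjin",
    "Shijiazhuang": "Hebei",
}
same_city = place_similarity("Binhai", "Heping", parent)
other_province = place_similarity("Binhai", "Shijiazhuang", parent)
```

Under this assumed form, two districts of the same city score higher than two places that only share the root, which matches the intent of measuring common depth.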
Further, the source factors are:
In the formula, $PR(p)$ is the PR value of page $p$, $d$ is the damping coefficient, $a$ is a weighting coefficient that depends on whether a link-out page is an off-site link, $V_1$ is the set of link-out pages that do not belong to the same site as page $p$, $C_i$ is the total number of link-out pages of page $i$, $V_2$ is the set of link-out pages that belong to the same site as page $p$, and $C_j$ is the total number of link-out pages of page $j$.
Further, the damping coefficient $d$ is 0.85, and the weighting coefficient $a$ is 0.75.
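Since the source-factor formula is not reproduced here, the sketch below assumes the usual PageRank recurrence with the link contributions split by site: PR(p) = (1 - d) + d · (a · Σ_{i∈V1} PR(i)/C_i + (1 - a) · Σ_{j∈V2} PR(j)/C_j). It further assumes, as in standard PageRank, that V1 and V2 contain the pages that link to p. The function name, the `links` and `site_of` structures, and the iteration count are hypothetical.

```python
def site_weighted_pagerank(links, site_of, d=0.85, a=0.75, iterations=50):
    """Assumed site-aware PageRank variant.

    links:   dict mapping each page to the list of pages it links to.
    site_of: dict mapping every page to the site it belongs to.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    out_count = {p: len(links.get(p, [])) for p in pages}  # C_i for each page
    # Reverse index: for each page, the pages that link to it.
    inbound = {p: [] for p in pages}
    for src, targets in links.items():
        for dst in targets:
            inbound[dst].append(src)

    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # V1: linking pages from other sites; V2: linking pages from the same site.
            off_site = sum(pr[i] / out_count[i]
                           for i in inbound[p] if site_of[i] != site_of[p])
            on_site = sum(pr[j] / out_count[j]
                          for j in inbound[p] if site_of[j] == site_of[p])
            new_pr[p] = (1 - d) + d * (a * off_site + (1 - a) * on_site)
        pr = new_pr
    return pr


# Page "s2/t" receives one off-site link; page "s3/u" receives one on-site link.
links = {"s1/a": ["s2/t"], "s3/b": ["s3/u"]}
site_of = {"s1/a": "s1", "s2/t": "s2", "s3/b": "s3", "s3/u": "s3"}
pr = site_weighted_pagerank(links, site_of)
```

With a = 0.75, a link arriving from another site contributes three times as much as an equally ranked link from the same site, so `pr["s2/t"]` ends up larger than `pr["s3/u"]` in this toy graph.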
Further, the topic detection in step 3 comprises the following steps:
(1) inputting a news document d;
(2) judging whether d is the first news report; if so, going to step (3), otherwise going to step (4);
(3) creating a new topic, adding the text d to the new topic, and going to step (7);
(4) preprocessing the text d and constructing its vector space model;
(5) calculating the similarity between the document d and each text of the existing topics, recording the maximum similarity $S_{max}$, and finding the corresponding topic class T;
(6) if the maximum similarity $S_{max}$ is greater than the preset threshold $T_c$, clustering the document d into topic class T, otherwise going to step (3);
(7) ending this clustering pass.
The invention has the advantages and positive effects that:
Building on the existing Single-Pass algorithm, the method analyzes the characteristics of the text, computes several similarities, assigns each a weight factor, and combines them by weighting to obtain the text similarity. This reduces the miss rate, the false detection rate, and the cost function value, and noticeably improves the clustering effect.
Drawings
FIG. 1 is a flow chart of the detection of the present invention;
FIG. 2 is a plot of miss rate versus false detection rate for the invention and various detection methods;
FIG. 3 is a chart comparing the evaluation indexes of the invention with those of various detection methods.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
A network online topic big data detection method, as shown in figure 1, comprises the following steps:
Step 1, online crawling of the big data network text.
In this step, a Hadoop distributed cluster is built, 64-bit CentOS and a distributed topic-focused crawler are installed on each machine, and a big data acquisition platform is constructed.
Step 2, extracting text features and representations.
In this step, after feature selection and weighting are applied to the word-segmented data, a vector space model is constructed to convert the text into computable structured data. The specific method is as follows:
(1) Text preprocessing: feature items of the text are selected by the document frequency method, words with a stronger ability to characterize the text are chosen from the candidate set of feature words as feature items, and weights are assigned to the selected feature items by the classical TF-IDF weighting method.
(2) Constructing a text representation model: the text is represented using a Vector Space Model (VSM), whose mathematical expression is:
$$\mathrm{VSM}(d) = \langle (t_1, \omega_1); (t_2, \omega_2); (t_3, \omega_3); \ldots; (t_n, \omega_n) \rangle$$
where $n$ is the number of feature items, $t_i$ ($1 \le i \le n$) is a text feature item, and $\omega_i$ ($1 \le i \le n$) is the weight of feature item $t_i$; by building the VSM, the text is represented as a vector in an $n$-dimensional space.
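As a rough illustration of step 2, the following Python sketch selects feature items by document frequency and weights them with TF-IDF to produce one vector per text. The cutoff parameters `min_df` and `max_features`, the particular TF-IDF variant (relative term frequency times log inverse document frequency), and the sample tokens are assumptions; the description does not fix these details.

```python
import math
from collections import Counter


def build_vsm(docs, min_df=1, max_features=None):
    """Return (feature_items, vectors) for a list of tokenized documents.

    min_df and max_features are hypothetical knobs for document-frequency selection.
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Keep terms meeting the cutoff, ordered by descending document frequency.
    candidates = sorted((t for t in df if df[t] >= min_df),
                        key=lambda t: (-df[t], t))
    features = candidates[:max_features] if max_features else candidates

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1
        # omega_i = relative term frequency * log(N / df), one common TF-IDF form.
        vectors.append([(tf[t] / total) * math.log(n_docs / df[t])
                        for t in features])
    return features, vectors


docs = [
    ["flood", "city", "rescue"],
    ["flood", "rescue", "team"],
    ["stock", "market"],
]
features, vectors = build_vsm(docs)
```

Each row of `vectors` corresponds to the weights (ω_1, …, ω_n) of one document over the shared feature items (t_1, …, t_n), which is the structured form the clustering step consumes.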
Step 3, detecting topics with a Single-Pass clustering algorithm that uses a plurality of similarity factors.
The invention uses a Single-Pass clustering algorithm with time, place, and source selected as similarity factors, wherein:
(1) Time factor: the concept of time distance is introduced.
$\mathrm{Sim}_{time}(d_i, d_j)$ denotes the time similarity of documents $d_i$ and $d_j$, where $t = |t_i - t_j|$ is the difference between the times of the two documents and $m$ is the automatically set time interval.
(2) Location factor: a geographic tree with China as the root node is built from the national dictionary of place names and geographic features provided by Datatang, and each place name is represented as a node in the tree according to the subordination relationships among place names.
Calculating the similarity of two place names in the geographic tree requires considering that each child node is a branch of its parent node, together with the distance between the two nodes, their common depth, and the depth of each node from the root node. Taking the combined influence of these three factors on the similarity of two place names into account, the calculation formula is defined as follows:
$\mathrm{Sim}_{place}(p_i, p_j)$ is the similarity of two place names, where $\mathrm{deep}(p_i \cap p_j)$ is the common depth of place names $p_i$ and $p_j$ from the root node of the geographic tree, $\mathrm{deep}(p_i)$ is the depth of place name $p_i$ from the root node, and $\mathrm{deep}(p_j)$ is the depth of place name $p_j$ from the root node.
(3) Source factor: the PR value of a page is calculated with an improved PageRank algorithm, using the following formula:
$PR(p)$ is the PR value of page $p$, where $d$ is the damping coefficient, usually 0.85; $a$ is a weighting coefficient that depends on whether a link-out page is an off-site link and takes the value 0.75, because off-site pages reflect the importance of the site to which a page belongs better than on-site pages do; $V_1$ is the set of link-out pages that do not belong to the same site as page $p$; $C_i$ is the total number of link-out pages of page $i$; $V_2$ is the set of link-out pages that belong to the same site as page $p$; and $C_j$ is the total number of link-out pages of page $j$.
In this step, after the time factor, place factor, and source factor have been selected as the multiple similarity factors, online topic detection is performed. The input is a document set of news reports and a similarity threshold $T_c$; the output is a number of topic classes. The specific flow is as follows:
(1) Input a news document d.
(2) Judge whether d is the first news report; if so, go to step (3), otherwise go to step (4).
(3) Create a new topic, add the text d to the new topic, and go to step (7).
(4) Preprocess the text d and construct its vector space model.
(5) Calculate the similarity between the document d and each text of the existing topics, record the maximum similarity $S_{max}$, and find the corresponding topic class T.
(6) If the maximum similarity $S_{max}$ is greater than the preset threshold $T_c$, cluster the document d into topic class T; otherwise go to step (3).
(7) End this clustering pass.
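The flow in steps (1) to (7) can be sketched in Python as a single loop over incoming documents. The similarity function is passed in as a parameter; the Jaccard measure over token sets used in the example is only a placeholder assumption standing in for the multi-factor similarity, and the threshold value is arbitrary.

```python
def single_pass(documents, similarity, threshold):
    """Cluster documents in arrival order; returns a list of topics (lists of documents)."""
    topics = []
    for d in documents:                      # step (1): next incoming document
        best_score, best_topic = -1.0, None
        for topic in topics:                 # step (5): compare with every text of every topic
            for member in topic:
                score = similarity(d, member)
                if score > best_score:
                    best_score, best_topic = score, topic
        if best_topic is not None and best_score > threshold:
            best_topic.append(d)             # step (6): S_max > T_c, join topic class T
        else:
            topics.append([d])               # step (3): first report or no match, new topic
    return topics


def jaccard(x, y):
    """Placeholder similarity on token sets (an assumption, not the patented measure)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


docs = [
    {"flood", "rescue"},
    {"flood", "rescue", "team"},
    {"stock", "market"},
]
topics = single_pass(docs, jaccard, threshold=0.5)
```

Because each document is compared only against what has already been clustered and is never revisited, the result depends on arrival order, which is the usual trade-off of a single-pass scheme.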
The multi-similarity calculation integrates time, place name, and source, and the total similarity of two texts is obtained by assigning a weight factor to each similarity and combining them by weighting. FIG. 2 and FIG. 3 show, respectively, the miss rate versus false detection rate curves and a comparison of the evaluation indexes for the method of the invention and the Single-Pass clustering method. They show that the method lowers the miss rate, the false detection rate, and the cost function value and achieves a good detection effect.
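The weighted combination can be illustrated with a short Python sketch. The weight values, the inclusion of a separate "text" score, and the normalization by the weight sum are assumptions made for the example; the description states only that different weight factors are assigned and combined by weighting.

```python
def total_similarity(scores, weights):
    """Weighted combination of per-factor similarity scores, normalized by the weight sum."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same factors")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[k] * scores[k] for k in scores) / total_weight


# Hypothetical weights and per-factor scores for one pair of texts.
weights = {"text": 0.5, "time": 0.2, "place": 0.2, "source": 0.1}
scores = {"text": 0.8, "time": 0.5, "place": 1.0, "source": 0.3}
combined = total_similarity(scores, weights)
```

The combined value would then be compared against the threshold T_c inside the Single-Pass loop; in practice the weights would be tuned against the evaluation indexes rather than fixed by hand.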
It should be emphasized that the examples described herein are illustrative rather than limiting, and therefore the invention includes, but is not limited to, the examples described in the detailed description, as other embodiments derived from the technical solutions of the invention by a person skilled in the art are equally within the scope of the invention.

Claims (5)

1. A network online topic big data detection method, characterized by comprising the following steps:
step 1, online crawling of a big data network text;
step 2, extracting text features and representations;
step 3, detecting topics with a Single-Pass clustering algorithm that uses a plurality of similarity factors;
the plurality of similarity factors selected in step 3 include: a time factor, a place factor, and a source factor;
the time factor is:
in the formula, $\mathrm{Sim}_{time}(d_i, d_j)$ denotes the time similarity of documents $d_i$ and $d_j$, $t = |t_i - t_j|$, and $m$ is an automatically set time interval;
the location factor is:
in the formula, $\mathrm{Sim}_{place}(p_i, p_j)$ is the similarity of two place names, $\mathrm{deep}(p_i \cap p_j)$ is the common depth of place names $p_i$ and $p_j$ from the root node of the geographic tree, $\mathrm{deep}(p_i)$ is the depth of place name $p_i$ from the root node, and $\mathrm{deep}(p_j)$ is the depth of place name $p_j$ from the root node;
the source factor is:
in the formula, $PR(p)$ is the PR value of page $p$, $d$ is the damping coefficient, $a$ is a weighting coefficient that depends on whether a link-out page is an off-site link, $V_1$ is the set of link-out pages that do not belong to the same site as page $p$, $C_i$ is the total number of link-out pages of page $i$, $V_2$ is the set of link-out pages that belong to the same site as page $p$, and $C_j$ is the total number of link-out pages of page $j$.
2. The network online topic big data detection method according to claim 1, wherein step 1 is implemented as follows: a Hadoop distributed cluster is built, CentOS and a distributed topic-focused crawler are installed on each machine, and a big data acquisition platform is constructed.
3. The network online topic big data detection method according to claim 1, wherein step 2 is implemented by the following steps:
text preprocessing: feature items of the text are selected by the document frequency method, words with a stronger ability to characterize the text are chosen from the candidate set of feature words as feature items, and weights are assigned to the selected feature items by the classical TF-IDF weighting method;
building a text representation model: the text is represented by a vector space model, whose mathematical expression is $\mathrm{VSM}(d) = \langle (t_1, \omega_1); (t_2, \omega_2); (t_3, \omega_3); \ldots; (t_n, \omega_n) \rangle$, where $n$ is the number of feature items, $t_i$ ($1 \le i \le n$) is a text feature item, and $\omega_i$ ($1 \le i \le n$) is the weight of feature item $t_i$; by building the VSM, the text is represented as a vector in an $n$-dimensional space.
4. The network online topic big data detection method according to claim 1, wherein the damping coefficient $d$ takes a value of 0.85, and the weighting coefficient $a$ takes a value of 0.75.
5. The network online topic big data detection method according to claim 1, wherein the topic detection in step 3 comprises the following steps:
(1) inputting a news document d;
(2) judging whether d is the first news report; if so, going to step (3), otherwise going to step (4);
(3) creating a new topic, adding the text d to the new topic, and going to step (7);
(4) preprocessing the text d and constructing its vector space model;
(5) calculating the similarity between the document d and each text of the existing topics, recording the maximum similarity $S_{max}$, and finding the corresponding topic class T;
(6) if the maximum similarity $S_{max}$ is greater than the preset threshold $T_c$, clustering the document d into topic class T, otherwise going to step (3);
(7) ending this clustering pass.
CN201711489608.5A 2017-12-30 2017-12-30 Online topic big data detection method for network Active CN108197259B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711489608.5A CN108197259B (en) 2017-12-30 2017-12-30 Online topic big data detection method for network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711489608.5A CN108197259B (en) 2017-12-30 2017-12-30 Online topic big data detection method for network

Publications (2)

Publication Number Publication Date
CN108197259A CN108197259A (en) 2018-06-22
CN108197259B CN108197259B (en) 2024-03-05

Family

ID=62587443

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711489608.5A Active CN108197259B (en) 2017-12-30 2017-12-30 Online topic big data detection method for network

Country Status (1)

Country Link
CN (1) CN108197259B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111982393B (en) * 2020-08-27 2021-11-19 天津科技大学 Real-time monitoring vacuum instrument

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102937960A (en) * 2012-09-06 2013-02-20 北京邮电大学 Device and method for identifying and evaluating emergency hot topic
CN105488092A (en) * 2015-07-13 2016-04-13 中国科学院信息工程研究所 Time-sensitive self-adaptive on-line subtopic detecting method and system
CN105718598A (en) * 2016-03-07 2016-06-29 天津大学 AT based time model construction method and network emergency early warning method

Also Published As

Publication number Publication date
CN108197259A (en) 2018-06-22

Similar Documents

Publication Publication Date Title
CN111159395B (en) Chart neural network-based rumor standpoint detection method and device and electronic equipment
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
CN108681557B (en) Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint
CN105653518A (en) Specific group discovery and expansion method based on microblog data
CN103544242A (en) Microblog-oriented emotion entity searching system
Liu et al. A fast method based on multiple clustering for name disambiguation in bibliographic citations
CN111767725B (en) Data processing method and device based on emotion polarity analysis model
CN103455562A (en) Text orientation analysis method and product review orientation discriminator on basis of same
CN111899089A (en) Enterprise risk early warning method and system based on knowledge graph
CN110134788B (en) Microblog release optimization method and system based on text mining
CN107577665B (en) Text emotional tendency judging method
CN113422761B (en) Malicious social user detection method based on counterstudy
CN106503256B (en) A kind of hot information method for digging based on social networks document
CN107832467A (en) A kind of microblog topic detecting method based on improved Single pass clustering algorithms
CN110990718A (en) Social network model building module of company image improving system
CN109992784A (en) A kind of heterogeneous network building and distance metric method for merging multi-modal information
CN108197259B (en) Online topic big data detection method for network
Wang et al. An improved clustering method for detection system of public security events based on genetic algorithm and semisupervised learning
CN103699568B (en) A kind of from Wiki, extract the method for hyponymy between field term
CN110929509B (en) Domain event trigger word clustering method based on louvain community discovery algorithm
Alfarra et al. Graph-based technique for extracting keyphrases in a single-document (gtek)
Brochier et al. New datasets and a benchmark of document network embedding methods for scientific expert finding
Wen et al. OLMPT: research on online log parsing method based on prefix tree
Yang et al. A topic-specific web crawler with concept similarity context graph based on FCA
Zhao et al. Identifying and analyzing popular phrases multi-dimensionally in social media data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant