CN112241492B

CN112241492B - Early identification method for multi-source heterogeneous online network topics

Info

Publication number: CN112241492B
Application number: CN202011141881.0A
Authority: CN
Inventors: 徐小艳; 周帅鹏; 张贝贝; 吕伟
Original assignee: Xian Shiyou University
Current assignee: Xian Shiyou University
Priority date: 2020-10-22
Filing date: 2020-10-22
Publication date: 2023-04-07
Anticipated expiration: 2040-10-22
Also published as: CN112241492A

Abstract

The invention discloses a method for early identification of topics in multi-source heterogeneous online networks, comprising the following steps: 1) obtaining a short-text keyword set D₀; 2) constructing a complex network based on keyword overlap; 3) dividing the complex network constructed in step 2) into communities using a dynamic community division method: the time interval [t₀, t_end] is divided in steps of the time increment Δt, the complex network at time t₀ + Δt is constructed from the newly added short-text information crawled from the different source online social networks within Δt, and the complex network at time t₀ + Δt is then divided into communities using the dynamic community division method; 4) computing statistics over the community division result of the complex network and constructing the finally discovered topic keyword set from that result. The method enables early discovery and extraction of topics from short-text data crawled from multiple online social network platforms.

Description

Early identification method for multi-source heterogeneous online network topics

Technical Field

The invention belongs to the field of early identification of online network topics, and relates to a method for early identification of topics in multi-source heterogeneous online networks.

Background

On the one hand, with the rapid and deep development of the Internet, and of the mobile Internet in particular, the Internet has broken the spatial and temporal limits of traditional information exchange, overturned the traditional mode of information propagation, and changed the role of Internet users in the propagation and diffusion of information from information consumers to information diffusers and even information producers. Information has gradually begun to spread between different online social network systems. The production, propagation, and interaction of information across multi-source heterogeneous online networks is becoming increasingly complex, which makes the early discovery of topics more difficult. At present, most topic discovery methods study the discovery and propagation patterns of hot topics, which leaves considerable room for research on early topic discovery methods.

On the other hand, network information sources and propagation channels are increasing rapidly, and the scale and influence of online public opinion keep growing. Identifying early topics in heterogeneous online networks helps governments and regulatory authorities carry out timely and effective supervision and prevention.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides an early identification method of multi-source heterogeneous online network topics, which can be used for early discovering and extracting topics from short text information data crawled from a plurality of online social network platforms.

In order to achieve the purpose, the method for early identifying the multi-source heterogeneous online network topics comprises the following steps:

1) Analyzing the structural characteristics of different online social networks and designing a distributed parallel crawler engine tailored to those characteristics; crawling the original short-text information published by the online social networks with the distributed parallel crawler engine; then preprocessing the original short-text information by Chinese word segmentation and text feature extraction to obtain the short-text keyword set D₀;

2) At the initial time t₀, using the short-text keyword set D₀, constructing a keyword-overlap complex network according to the behavioral relationships between network users represented by the online social network text information;

3) Dividing the complex network constructed in step 2) into communities using the dynamic community division method: the time interval [t₀, t_end] is divided in steps of the time increment Δt; the complex network at time t₀ + Δt is constructed from the newly added short-text information crawled from the different source online social networks within Δt; and the complex network at time t₀ + Δt is then divided into communities using the dynamic community division method;

4) Counting, for each community in the community division result of the complex network, the total number of participating users of the short texts represented by all of its nodes, then sorting the communities by this total to obtain the top N communities;

5) Collecting the keyword sets corresponding to the short texts in the top N communities obtained in step 4), ranking the keywords in these sets by TF-IDF, and constructing the finally discovered topic keyword set from the top N keywords of the ranking.
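Steps 4) and 5) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `discover_topics`, the input dictionaries, and the choice to sum a keyword's TF-IDF scores across the short texts of a community are all assumptions made for the example.

```python
from collections import defaultdict

def discover_topics(communities, participants, tfidf, top_n=1, top_k=5):
    """Rank communities by total participating users and extract topic keywords.

    communities:  dict community id -> iterable of short-text ids (nodes).
    participants: dict short-text id -> number of participating users.
    tfidf:        dict short-text id -> {keyword: TF-IDF score}.
    Summing a keyword's scores over a community is an illustrative assumption.
    """
    # Step 4: total participating users per community, then keep the top N.
    totals = {c: sum(participants[v] for v in nodes)
              for c, nodes in communities.items()}
    ranked = sorted(totals, key=lambda c: (-totals[c], str(c)))[:top_n]

    # Step 5: rank the keywords of each selected community by TF-IDF.
    topics = {}
    for c in ranked:
        scores = defaultdict(float)
        for v in communities[c]:
            for kw, s in tfidf[v].items():
                scores[kw] += s
        topics[c] = sorted(scores, key=lambda k: (-scores[k], k))[:top_k]
    return topics
```

Calling `discover_topics(communities, participants, tfidf, top_n=1, top_k=5)` returns a dictionary that maps the highest-ranked community to its five highest-scoring keywords.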

In step 1), the crawled original short-text information published by the online social networks includes news headlines from news websites and microblog posts from microblog platforms; the short-text keyword set is constructed from the crawled original short-text information by Chinese word segmentation and TF-IDF text feature extraction.
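A minimal sketch of the TF-IDF keyword extraction is given below. It assumes the short texts have already been word-segmented (a Chinese segmenter would run first and is not shown); the function name `top_keywords`, the parameter `k`, and the smoothed IDF formula log((1 + n) / (1 + df)) + 1 are illustrative choices and are not taken from the patent.

```python
import math
from collections import Counter

def top_keywords(segmented_texts, k=3):
    """Return, for each pre-segmented short text, its k highest TF-IDF tokens.

    segmented_texts: list of token lists (word segmentation is assumed done).
    The smoothed IDF used here is an illustrative assumption.
    """
    n_docs = len(segmented_texts)
    # Document frequency: number of texts that contain each token.
    doc_freq = Counter(tok for toks in segmented_texts for tok in set(toks))
    keyword_sets = []
    for toks in segmented_texts:
        counts = Counter(toks)
        scores = {
            tok: (cnt / len(toks))
            * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, cnt in counts.items()
        }
        # Sort by descending score, then alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keyword_sets.append(set(ranked[:k]))
    return keyword_sets
```

Each returned set plays the role of a keyword set C_i for one short text.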

Each short text is a node of the complex network, and the edges between nodes represent the association between short texts.

In the complex network, N_ij = |C_i ∩ C_j|, where i and j denote microblog posts and news headlines crawled before time t₀; C_i is the keyword set of short text i; N_ij is the number of keywords shared by the short-text keyword sets C_i and C_j; V_i is the network node representing short text i; and E_ij is the association between short text i and short text j. N_ij = 0 means there is no edge between short texts i and j; N_ij > 0 means there is an edge between short texts i and j, and the weight of edge E_ij is N_ij.
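The edge rule above can be sketched in a few lines of Python. The function name `build_overlap_network` and the dictionary-based edge representation are assumptions made for illustration only.

```python
from itertools import combinations

def build_overlap_network(keyword_sets):
    """Build the keyword-overlap network as a dict {(i, j): N_ij} with i < j.

    keyword_sets: dict mapping short-text id -> set of keywords (C_i).
    An edge exists only when N_ij = |C_i & C_j| > 0; its weight is N_ij.
    """
    edges = {}
    for i, j in combinations(sorted(keyword_sets), 2):
        n_ij = len(keyword_sets[i] & keyword_sets[j])
        if n_ij > 0:
            edges[(i, j)] = n_ij
    return edges
```

This pairwise comparison is quadratic in the number of short texts; for large crawls an inverted index from keyword to short texts would avoid comparing texts that share no keyword.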

In step 3), a static community discovery method is used to divide the complex network at time t₀ into communities.

The short texts and connection information newly added within the time increment Δt are added to the complex network at time t₀ to form the new complex network at time t₀ + Δt.

The short texts and connection information newly added within the time increment Δt are divided into two categories according to their relationship to the communities of the complex network at time t₀: the first category is the set of newly added text nodes closely related to the communities of that network, and the second category is the set of newly added text nodes only loosely related to those communities.

The community membership of the first set of newly added text nodes in the complex network at time t₀ is determined by the modularity gain index ΔQ; the second set of newly added text nodes is divided into communities using the static community division method to determine newly added communities, thereby realizing dynamic community division of the complex network at time t₀ + Δt.
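The split of newly added nodes into the two categories can be sketched as follows. The source does not state the exact criterion for "closely" versus "loosely" related, so this example assumes one: a new node is treated as closely related if it shares at least one edge with a node of the existing network. The function name `split_new_nodes` is likewise hypothetical.

```python
def split_new_nodes(existing_nodes, new_nodes, edges):
    """Split newly added text nodes into 'tight' and 'loose' sets.

    edges: dict {(u, v): weight} over old and new nodes.
    Assumption (not from the patent): a new node is 'tight' if it shares at
    least one edge with a node of the existing network, otherwise 'loose'.
    """
    existing = set(existing_nodes)
    tight = set()
    for (u, v), w in edges.items():
        if w <= 0:
            continue  # zero weight means no edge
        if u in existing and v not in existing:
            tight.add(v)
        elif v in existing and u not in existing:
            tight.add(u)
    new = set(new_nodes)
    tight &= new
    return tight, new - tight
```

Under this assumption, the tight set would be placed into existing communities by modularity gain, and the loose set would be handed to a static community detection method to form new communities.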

The invention has the following beneficial effects:

In operation, the method for early identification of multi-source heterogeneous online network topics uses the distributed parallel crawler engine to crawl the original short-text information published by online social networks and constructs the short-text keyword set D₀ from it. The keyword set D₀ is then used to construct a keyword-overlap complex network, which is divided into communities using the dynamic community division method. The complex network at time t₀ + Δt is constructed from the newly added short-text information crawled from the various source online social networks within the time increment Δt, and community division is performed on the complex network at time t₀ + Δt, realizing community division based on a time-varying dynamic network. Finally, the topic keyword set is extracted from the final community division result of the complex network, achieving effective and objective discovery of multi-source online network topics.
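The incremental construction over [t₀, t_end] in steps of Δt can be sketched as a generator that yields, for each step, the short texts that arrived in that window. The function name `snapshots`, the numeric timestamps, and the `(timestamp, text_id)` record shape are assumptions for illustration.

```python
def snapshots(records, t0, t_end, dt):
    """Yield (t, new_ids) for t = t0 + dt, t0 + 2*dt, ... while t <= t_end.

    records: iterable of (timestamp, text_id) pairs; timestamps are numbers.
    new_ids holds the ids whose timestamp satisfies t - dt < timestamp <= t.
    dt is assumed to be positive.
    """
    records = sorted(records)
    idx = 0
    # Skip everything already included in the initial network at t0.
    while idx < len(records) and records[idx][0] <= t0:
        idx += 1
    t = t0 + dt
    while t <= t_end:
        batch = []
        while idx < len(records) and records[idx][0] <= t:
            batch.append(records[idx][1])
            idx += 1
        yield t, batch
        t += dt
```

Each yielded batch corresponds to the newly added short texts that would be merged into the previous network before the dynamic community division is rerun.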

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a flowchart of a first embodiment.

Detailed Description

The invention is described in further detail below with reference to the accompanying drawings:

Referring to FIG. 1, the method for early identification of multi-source heterogeneous online network topics provided by the invention comprises the following steps:

1) Analyzing the structural characteristics of different online social networks and designing a distributed parallel crawler engine tailored to those characteristics; crawling the original short-text information published by the online social networks with the distributed parallel crawler engine; and preprocessing the original short-text information by Chinese word segmentation and text feature extraction to obtain the short-text keyword set D₀;

The crawled original short-text information published by the online social networks includes news headlines from news websites and microblog posts from microblog platforms; the short-text keyword set is constructed from the crawled original short-text information by Chinese word segmentation and TF-IDF text feature extraction.

Here, each short text is a node of the complex network, and the edges between nodes represent the association between short texts. In the complex network, N_ij = |C_i ∩ C_j|, where i and j denote microblog posts and news headlines crawled before time t₀; C_i is the keyword set of short text i; N_ij is the number of keywords shared by the short-text keyword sets C_i and C_j; V_i is the network node representing short text i; and E_ij is the association between short text i and short text j. N_ij = 0 means there is no edge between short texts i and j; N_ij > 0 means there is an edge between short texts i and j, and the weight of edge E_ij is N_ij.

3) Performing community division, using the dynamic community division method, on the complex network constructed in step 2) and then on the complex network at time t₀ + Δt;

Here, a static community discovery method is used to divide the complex network at time t₀ into communities.

The short texts and connection information newly added within the time increment Δt are added to the complex network at time t₀ to form the new complex network at time t₀ + Δt.

The short texts and connection information newly added within the time increment Δt are divided into two categories according to their relationship to the communities of the complex network at time t₀: the first category is the set of newly added text nodes closely related to those communities, and the second category is the set of newly added text nodes only loosely related to those communities. The community membership of the first set in the complex network at time t₀ is determined by the modularity gain index ΔQ, and the second set is divided into communities to determine newly added communities, thereby realizing dynamic community division of the complex network at time t₀ + Δt.

The modularity gain index ΔQ is calculated as follows:

Each node i in the newly added text node set is assigned to the community of an adjacent node j, and the modularity gain ΔQ of the complex network at that moment is calculated. All nodes i and j are traversed, the maximum modularity gain max ΔQ is extracted, the corresponding i_max and j_max are output, and the community structure of the complex network at time t₀ + Δt is finally determined.
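The traversal described above can be sketched as a greedy loop. The patent's own ΔQ formula was lost in extraction, so this example assumes the standard weighted Newman modularity and computes ΔQ as the difference in Q before and after a trial move. The function names, the rule that each new node starts in its own singleton community, and the restriction of j to already-placed neighbours are all assumptions.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Weighted Newman modularity Q for edges {(u, v): w} and a node->community map."""
    m = sum(edges.values())  # total edge weight; assumed to be positive
    inside = defaultdict(float)   # edge weight inside each community
    degree = defaultdict(float)   # summed weighted degree of each community
    for (u, v), w in edges.items():
        degree[community_of[u]] += w
        degree[community_of[v]] += w
        if community_of[u] == community_of[v]:
            inside[community_of[u]] += w
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

def assign_by_max_gain(edges, community_of, new_nodes):
    """Greedily place new nodes: repeatedly apply the (i, j) move with maximal gain."""
    community_of = dict(community_of)
    pending = set(new_nodes)
    for i in pending:
        community_of[i] = ("new", i)  # each new node starts alone (assumption)
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    while pending:
        base = modularity(edges, community_of)
        best = None
        for i in sorted(pending):
            # Only neighbours that already belong to a settled community.
            for j in sorted(neighbours[i] - pending):
                trial = dict(community_of)
                trial[i] = community_of[j]
                gain = modularity(edges, trial) - base
                if best is None or gain > best[0]:
                    best = (gain, i, j)
        if best is None or best[0] <= 0:
            break  # no remaining move improves modularity
        _, i_max, j_max = best
        community_of[i_max] = community_of[j_max]
        pending.remove(i_max)
    return community_of
```

Recomputing Q from scratch for every trial move keeps the sketch short but is costly; an incremental ΔQ formula would be preferable for networks of realistic size.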

4) Counting, for each community in the community division result of the complex network, the total number of participating users of the short texts represented by all of its nodes, and sorting the communities by this total to obtain the top N communities;

Example one

Referring to FIG. 2, the specific operation process of this embodiment is:

The original short-text information published by the online social networks includes news headlines from news websites and microblog posts from microblog platforms; the short-text keyword set is constructed from this original short-text information by Chinese word segmentation and TF-IDF text feature extraction.

2) At the initial time t₀, using the short-text keyword set D₀, constructing a keyword-overlap complex network according to the behavioral relationships between network users represented by the online social network text information;

Here, each short text serves as a node of the complex network.

3) Performing community division on the complex network constructed in step 2), and then on the complex network at time t₀ + Δt;

Here, a static community discovery method is used to divide the complex network at time t₀ into communities.

The short texts and connection information newly added within the time increment Δt are added to the complex network at time t₀ to form the new complex network at time t₀ + Δt.

The newly added nodes are divided into two categories according to their relationship to the communities of the complex network at time t₀: the first category is the set of newly added text nodes closely related to those communities, and the second category is the set of newly added text nodes only loosely related to those communities.

The community membership of the first set of newly added text nodes in the complex network at time t₀ is determined by the modularity gain index ΔQ, and the second set of newly added text nodes is handled with the static community division method, which together realize dynamic community division of the complex network at time t₀ + Δt.

For the newly added text node set, all nodes i and j are traversed, the maximum modularity gain max ΔQ is extracted, the corresponding i_max and j_max are output, and the community structure of the complex network at time t₀ + Δt is finally determined.

4) For the final community division result of the complex network obtained in step 3), counting the total number of participating users of the short texts represented by all nodes of each community, and ranking the communities by this total to output the top 1 community: C1: 391238;

5) Collecting the keyword sets corresponding to the short texts in the top 1 community, and taking the top 5 keywords by TF-IDF from the corresponding keyword set;

The top-5 keyword set of community C1 is {men's basketball, suo mosaic, Iran, Chinese team, Asia};

6) Taking the top n keywords of each community as the keyword set of the finally discovered topic;

The top-5 keyword set of community C1 is {men's basketball, sunday, Iran, Chinese team, Asia}, and the resulting topic is "Chinese men's basketball Sunday".

Claims

1. A multi-source heterogeneous online network topic early identification method is characterized by comprising the following steps:

3) Dividing the complex network constructed in step 2) into communities using the dynamic community division method: the time interval [t₀, t_end] is divided in steps of the time increment Δt, the complex network at time t₀ + Δt is constructed from the newly added short-text information crawled from the different source online social networks within Δt, and the complex network at time t₀ + Δt is then divided into communities;

4) Computing statistics on the complex network.

2. The method for early identification of multi-source heterogeneous online network topics according to claim 1, wherein in step 1), the crawled original short-text information published by the online social networks includes news headlines from news websites and microblog posts from microblog platforms, and the short-text keyword set is constructed from the crawled original short-text information by Chinese word segmentation and TF-IDF text feature extraction.

3. The method for early identification of multi-source heterogeneous online network topics according to claim 1, wherein each short text is a node of the complex network, and the edges between nodes represent the association between short texts;

in the complex network, N_ij = |C_i ∩ C_j|, where i and j denote microblog posts and news headlines crawled before time t₀; C_i is the keyword set of short text i; and N_ij is the number of keywords shared by the short-text keyword sets C_i and C_j.

4. The method for early identification of multi-source heterogeneous online network topics according to claim 1, wherein in step 3), a static community discovery method is used to divide the complex network at time t₀ into communities.

5. The method for early identification of multi-source heterogeneous online network topics according to claim 1, wherein the short texts and connection information newly added within the time increment Δt are added to the complex network at time t₀ to form the new complex network at time t₀ + Δt.

6. The method for early identification of multi-source heterogeneous online network topics according to claim 1, wherein the short texts and connection information newly added within the time increment Δt are divided, according to their relationship to the communities of the complex network at time t₀, into a first category, the set of newly added text nodes closely related to those communities, and a second category, the set of newly added text nodes loosely related to those communities;

the community membership of the first set of newly added text nodes in the complex network at time t₀ is determined by the modularity gain index ΔQ; and the second set of newly added text nodes is divided into communities to determine newly added communities, thereby realizing dynamic community division of the complex network at time t₀ + Δt.