CN114707044A

CN114707044A - Extraction method and system of collective social behaviors based on community discovery

Info

Publication number: CN114707044A
Application number: CN202111638174.7A
Authority: CN
Inventors: 杨海陆; 刘乾; 张建林; 张金; 陈晨; 王莉莉; 丁晓宇
Original assignee: Harbin University of Science and Technology
Current assignee: Harbin University of Science and Technology
Priority date: 2021-12-29
Filing date: 2021-12-29
Publication date: 2022-07-05
Anticipated expiration: 2041-12-29
Also published as: CN114707044B

Abstract

The invention discloses a method and system for extracting collective social behaviors based on community discovery. The method comprises the following steps: capturing posts published by a plurality of users in a social network as an initial data set, and preprocessing the initial data set to obtain a data set; processing the data set with an LDA model to generate topic distributions; constructing a similarity calculation function based on sparse representation to compute the similarity between the topic distributions of the posts, yielding an affinity matrix; constructing a community discovery algorithm based on an adaptive loss function and determining an objective function; iteratively optimizing the objective function with an alternating iteration method to obtain the connected components among posts under the same topic in the affinity matrix, thereby constructing a target similarity matrix that determines the community structure; and introducing a node2vec model to visualize the community structure and extracting collective social behaviors according to the distribution of nodes in the community structure. The method can accurately extract collective social behaviors that differ markedly from individual semantic behavior characteristics, and it is highly robust.

Description

Extraction method and system of collective social behaviors based on community discovery

Technical Field

The invention relates to the technical field of social network analysis, in particular to a community discovery-based method and system for extracting collective social behaviors in an online social network.

Background

A social network is a network structure made up of participants and the relationships among them; it can be represented as a set of nodes and a set of links representing the connections between the nodes. The nodes are individuals, groups, organizations, and related systems that are connected through shared values, environments, and ideas, or through events such as social contact, disputes, financial transactions, and business, forming one or more groups across many aspects of interpersonal relationships. Once these relationships are formed, social networks can leverage broader social processes by capturing human, social, natural, material, and financial capital and the related information content. In development work, they can influence policies, strategies, plans, and projects, as well as the partnerships that underlie them. Because of these characteristics, online social network analysis has become an effective tool for addressing many problems.

Social network analysis is commonly referred to as analytical research, with the goal of revealing relevant information about nodes and connections between nodes in a social network. By treating these relationships as information for social network analysis, a better understanding of the network structure may be ensured. Social network analysis is now used in almost all areas, such as detection of personal and social group structure and behavior (component decomposition, clustering, relationship determination), e-commerce online advertising (customer profiling and trend analysis, personalized advertising and proposal submission), large data set analysis (media tracking, academic publication analysis, genetic research), etc. Researchers may employ a variety of data mining techniques to achieve goals in social network analysis.

Community discovery is a class of algorithms based on network topology. According to the research focus, these algorithms can be divided into the following types. Hierarchical clustering algorithms divide communities according to the similarity or connection strength between nodes; the most common include the Newman fast algorithm, the Newman greedy algorithm, and spectrum-based clustering algorithms. Spectral clustering algorithms find communities in the network by analyzing the eigenvalues and eigenvectors of the Laplacian matrix, or of a normalized matrix, formed from the adjacency matrix. Modularity-based algorithms comprise modularity optimization algorithms and improved modularity algorithms. Modularity optimization algorithms detect communities by taking the modularity function as the optimization objective; common examples include greedy algorithms, simulated annealing, and the Louvain algorithm. Improved modularity algorithms adopt a modified modularity function and apply modularity to different types of networks to achieve community discovery.

Research on collective social behavior is key to analyzing communities and the underlying network in a social network, and accurately extracting collective social behaviors in an online social network is of great significance. For example, the crowd psychology of online shopping has been studied through repurchase rates, sales volumes, and the regional origins of buyers; models of collective behavior characteristics in social communities have been established to reveal the relationship between collective behaviors and community topics; and analyses of collective behavior in social data have found that users can convey their preferences to other connected users, so that these users gradually come to share the same or similar subjective feelings.

Existing methods have the following problem: when extracting social behaviors, they consider only the structural features of communities in the social network and ignore the semantic information of the nodes, so collective social behaviors that differ markedly from individual semantic behavior characteristics are difficult to extract accurately. It is therefore necessary to extract the semantic information in the social network and to group users with similar behaviors into communities through community discovery, so that the collective social behaviors in the social network can be extracted accurately.

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.

Therefore, one object of the present invention is to provide a method for extracting collective social behaviors based on community discovery. The method addresses a technical problem in the prior art: collective social behaviors that differ markedly from individual semantic behavior characteristics are difficult to extract accurately, so the extracted behaviors represent the online social network with low accuracy and insufficient robustness.

Another object of the present invention is to provide a system for extracting collective social behaviors based on community discovery.

In order to achieve the above object, an embodiment of the present invention provides a method for extracting collective social behaviors based on community discovery, including the following steps: step S1, capturing posts published by a plurality of users in a social network as an initial data set, and performing cleaning and word segmentation on the initial data set to obtain a data set; step S2, processing the data set with an LDA model to generate a plurality of topics and a topic distribution for each post; step S3, constructing a similarity calculation function based on sparse representation to compute the similarity between the topic distributions of the posts, obtaining an affinity matrix; step S4, constructing a community discovery algorithm based on an adaptive loss function and the affinity matrix to determine an objective function; step S5, iteratively optimizing the objective function with an alternating iteration method to obtain the connected components among posts under the same topic in the affinity matrix, thereby constructing a target similarity matrix that determines the community structure in the community network; and step S6, introducing a node2vec model to visualize the community structure, and extracting collective social behaviors according to the distribution of nodes in the community structure.

According to the method for extracting collective social behaviors based on community discovery provided by the embodiment of the invention, a similarity matrix is learned with an adaptive loss function so that the initial data of the social network are processed with high quality, and the reconstruction of the social network and community discovery are completed. This ensures that the output community structure has high cohesion and stability, realizes the extraction of collective social behaviors in the online social network, and gives results with high accuracy and robustness.

In addition, the method for extracting collective social behaviors based on community discovery according to the above embodiment of the present invention may further have the following additional technical features:

Further, in one embodiment of the present invention, the affinity matrix is:

where $c_{i,j}$ is the entry in row $i$, column $j$ of the affinity matrix, $m$ is the adaptively chosen number of neighbors of a user, and the remaining term is the l2-norm computed from the topic distributions of nodes $i$ and $j$.

Further, in one embodiment of the present invention, the objective function is:

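$$\min_{S,F}\ \lVert C^{(v)} - S \rVert_{\sigma} + \varepsilon\,\operatorname{Tr}(F^{T} L F)$$

$$\text{s.t.}\quad \mathbf{1}^{T} s_i = 1,\quad s_{i,j} \ge 0,\quad F^{T}F = I$$

where $S$ is the target variable, $C$ is the affinity matrix, $\sigma$ is the adaptive parameter, $\varepsilon$ is the balance factor, $F$ is the clustering indicator matrix, $L$ is the Laplacian matrix of the target variable, $\operatorname{Tr}(\cdot)$ is the trace, $\mathbf{1}^{T}s_i$ is the sum of all entries of $s_i$ (the $i$-th row of $S$ written as a column vector), $s_{i,j}$ is the entry in row $i$, column $j$ of $S$, and $I$ is the identity matrix.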

Further, in an embodiment of the present invention, step S5 specifically includes: using an alternating iteration method, first fixing the clustering indicator matrix and solving for the target variable, then fixing the target variable and solving for the clustering indicator matrix, and repeating until the relative change of the target variable is less than $10^{-3}$ or the number of iterations exceeds 150; the connected components of the posts under the same topic are thereby obtained, and the target similarity matrix is constructed to determine the community structure in the community network.

Further, in an embodiment of the present invention, the collective social behavior is extracted in step S6 as follows: if the nodes within a community are sparsely distributed, all nodes in the community are covered by a minimum enclosing circle, and the node closest to the center of the circle is taken as the collective social behavior; if the nodes within a community are densely distributed, the collective social behavior is extracted using centrality.

In order to achieve the above object, another embodiment of the present invention provides a system for extracting collective social behaviors based on community discovery, including: an acquisition and preprocessing module for capturing posts published by a plurality of users in a social network as an initial data set, and performing cleaning and word segmentation on the initial data set to obtain a data set; a topic distribution generation module for processing the data set with an LDA model to generate a plurality of topics and a topic distribution for each post; an affinity matrix construction module for constructing a similarity calculation function based on sparse representation to compute the similarity between the topic distributions of the posts, obtaining an affinity matrix; an objective function determination module for constructing a community discovery algorithm based on an adaptive loss function and the affinity matrix to determine an objective function; an iterative learning module for iteratively optimizing the objective function with an alternating iteration method to obtain the connected components among posts under the same topic in the affinity matrix, thereby constructing a target similarity matrix that determines the community structure in the community network; and a collective social behavior extraction module for introducing a node2vec model to visualize the community structure and extracting collective social behaviors according to the distribution of nodes in the community structure.

According to the system for extracting collective social behaviors based on community discovery provided by the embodiment of the invention, a similarity matrix is learned with an adaptive loss function so that the initial data of the social network are processed with high quality, and the reconstruction of the social network and community discovery are completed. This ensures that the output community structure has high cohesion and stability, realizes the extraction of collective social behaviors in the online social network, and gives results with high accuracy and robustness.

In addition, the extraction system of collective social behaviors based on community discovery according to the above embodiment of the present invention may also have the following additional technical features:

Further, in one embodiment of the present invention, the affinity matrix is the same as that defined for the method above, in which the l2-norm term is computed from the topic distributions of nodes $i$ and $j$.

Further, in one embodiment of the present invention, the objective function is:

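$$\min_{S,F}\ \lVert C^{(v)} - S \rVert_{\sigma} + \varepsilon\,\operatorname{Tr}(F^{T} L F)$$

$$\text{s.t.}\quad \mathbf{1}^{T} s_i = 1,\quad s_{i,j} \ge 0,\quad F^{T}F = I$$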

Further, in an embodiment of the present invention, the iterative learning module is specifically configured to: use an alternating iteration method, first fixing the clustering indicator matrix and solving for the target variable, then fixing the target variable and solving for the clustering indicator matrix, and repeating until the relative change of the target variable is less than $10^{-3}$ or the number of iterations exceeds 150; the connected components of the posts under the same topic are thereby obtained, and the target similarity matrix is constructed to determine the community structure in the community network.

Further, in an embodiment of the present invention, the collective social behavior extraction module extracts the collective social behavior as follows: if the nodes within a community are sparsely distributed, all nodes in the community are covered by a minimum enclosing circle, and the node closest to the center of the circle is taken as the collective social behavior; if the nodes within a community are densely distributed, the collective social behavior is extracted using centrality.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flow diagram of a method for extracting collective social behaviors based on community discovery according to an embodiment of the invention;

FIG. 2 is a graphical representation of the results of modularity as a function of the number of topics for one embodiment of the present invention;

FIG. 3 is a diagram of the visualization result of the node2vec model on the similarity matrix according to one embodiment of the present invention;

FIG. 4 is a diagram of collective social behavior extraction results, according to an embodiment of the invention;

FIG. 5 is a diagram comparing the modularity of the existing Ncut, Louvain, and CAN algorithms with that of the method of the present application, according to one embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an extraction system of collective social behaviors based on community discovery according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be illustrative of the invention and are not to be construed as limiting the invention.

The method and system for extracting collective social behaviors based on community discovery provided by the embodiments of the invention are described below with reference to the accompanying drawings. The method is described first.

FIG. 1 is a flowchart of a method for extracting collective social behaviors based on community discovery according to an embodiment of the present invention.

As shown in FIG. 1, the method for extracting collective social behaviors based on community discovery includes the following steps:

In step S1, posts published by a plurality of users in the social network are captured as an initial data set, and the initial data set is cleaned and word-segmented to obtain a data set.

Specifically, the online social network data may be obtained with a crawler written in Python that crawls posts from a social web page, for example a microblog, a social media platform based on user relationships. After the data are obtained, to ensure the accuracy of the experimental results, the initial data set is cleaned (for example, advertisements, duplicate posts, and overly short posts are removed) and segmented into words (jieba word segmentation) to obtain the data set.
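The cleaning and segmentation of step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `preprocess`, the advertisement pattern, and the minimum-length threshold are hypothetical choices, and the tokenizer is pluggable so that `jieba.lcut` (from the third-party jieba package, if installed) can be passed in for Chinese text.

```python
import re


def preprocess(posts, tokenize=str.split, min_tokens=3,
               ad_pattern=r"(广告|推广|promo)"):
    """Clean raw posts and split them into tokens.

    tokenize is pluggable: pass jieba.lcut for Chinese text (assumes the
    third-party jieba package is installed); str.split is only a stand-in.
    ad_pattern and min_tokens are illustrative assumptions.
    """
    ad_re = re.compile(ad_pattern)
    seen = set()
    dataset = []
    for text in posts:
        text = text.strip()
        if not text or ad_re.search(text):    # drop empty posts and advertisements
            continue
        if text in seen:                      # drop exact duplicates
            continue
        seen.add(text)
        tokens = [t for t in tokenize(text) if t.strip()]
        if len(tokens) < min_tokens:          # drop overly short posts
            continue
        dataset.append(tokens)
    return dataset


raw_posts = [
    "vaccine transport by sea land and air",
    "vaccine transport by sea land and air",   # duplicate
    "promo buy now",                           # advertisement
    "ok",                                      # too short
    "road icing warning issued this evening",
]
clean = preprocess(raw_posts)
```

Only the first and last posts survive the three filters; each is returned as a list of tokens ready for topic modeling.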

In step S2, the data set is processed using the LDA model to generate a plurality of topics and a topic distribution for each post.

Specifically, the data set is processed with an LDA model to generate $T$ topics. Any node $v_i$ may belong to one topic or to several topics (that is, it has a topic distribution), expressed as floating-point numbers giving the probability that node $v_i$ belongs to each topic. The generative process of the LDA model corresponds to the following joint distribution:

where $\theta_d \sim \mathrm{Dirichlet}(\alpha)$ is the topic distribution and $\beta_k \sim \mathrm{Dirichlet}(\eta)$ is the word distribution; $z_{d,n}$ is the topic index and $w_{d,n}$ is the corresponding word; the parameters $\alpha$ and $\eta$ are hyperparameter vectors; $d \in D$; the topic $z_{d,n}$ depends on the topic distribution $\theta_d$ of the text published by the user; and the word $w_{d,n}$ depends on the word distributions $\beta_{1:K}$ of all topics and on the topic $z_{d,n}$. The data are stored as a matrix, denoted $X$, in which each row corresponds to a topic and each column is the feature vector of a node $v_i$.
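For reference, the standard LDA joint distribution that matches the symbols defined above can be written as follows. This is the textbook form, given here as an assumption about what the original formula expresses; $K$ denotes the number of topics and $N_d$ the number of words in document $d$.

$$p(\beta_{1:K}, \theta_{1:D}, z, w \mid \alpha, \eta) = \prod_{k=1}^{K} p(\beta_k \mid \eta)\ \prod_{d=1}^{D} p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n})$$

Each factor corresponds to one dependency named in the text: topics are drawn from $\theta_d$, and words are drawn from the word distribution of the selected topic.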

In step S3, a similarity calculation function based on sparse expression is constructed to solve the similarity between each post and the distribution of the topic, and an affinity matrix is obtained.

Specifically, the embodiment of the present invention obtains the affinity matrix by calculating the similarity between feature vectors. Users with a high degree of association in the social network (that is, users whose published semantic information has feature vectors that are close together) receive a high similarity value, while users with a low degree of association receive a small or even zero similarity value. This yields the affinity matrix and completes the reconstruction of the social network. The affinity matrix can be obtained by solving the following problem:

where $X \in \mathbb{R}^{d \times n}$ is the data matrix obtained in step S2, $d$ is the dimension of the semantic features in the social network (the number of topics), and $n$ is the number of data points (the number of users in the social network); the $j$-th column vector of $X$ is denoted $x_j$ and its $(i,j)$-th element is denoted $x_{i,j}$; $\alpha$ is a sparsity adjustment factor. Solving this problem gives the following result:

where

is sorted in ascending order, so that $c_{i,j}$ satisfies

and

where $m$ is the adaptively chosen number of neighbors of a user. The affinity matrix $C$ of the social network is obtained according to formula (3), and the existing connections between users are obtained from the affinity matrix. Compared with fixed graph structures such as a fully connected graph or a k-nearest-neighbor graph (obtained by calculating cosine similarity, Gaussian kernel similarity, and the like), the calculation in formula (3) adapts to the number $m$ of neighbors of each user. An affinity matrix constructed in this way accurately reflects the relationships among users in the social network and compensates for the strong dependence of spectral clustering on the quality of the node similarities, so that the subsequent community discovery performs better.
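Formula (3) is not legible in this text, so the following sketch assumes the closed form commonly used for adaptive-neighbor graphs: for each node $i$, with squared Euclidean distances $d_{i,1} \le d_{i,2} \le \dots$ to the other nodes, $c_{i,j} = (d_{i,m+1} - d_{i,j}) / (m\,d_{i,m+1} - \sum_{h=1}^{m} d_{i,h})$ for the $m$ nearest neighbors and $0$ otherwise. The function name `affinity_matrix` and the use of squared distances are assumptions made for illustration.

```python
def affinity_matrix(X, m):
    """Sparse affinity matrix with m nonzero entries per row.

    X: list of n feature vectors (one topic distribution per node), n >= m + 2.
    Assumed closed form (not confirmed by the source text):
        c_ij = (d_{i,m+1} - d_ij) / (m * d_{i,m+1} - sum_{h<=m} d_ih)
    """
    n = len(X)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Squared Euclidean distance from node i to every other node, ascending.
        dist = sorted(
            (sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
            for j in range(n) if j != i
        )
        d_next = dist[m][0]                    # distance to the (m+1)-th neighbor
        denom = m * d_next - sum(d for d, _ in dist[:m])
        for d, j in dist[:m]:
            # Closer neighbors get larger weights; fall back to uniform weights
            # when the m+1 nearest distances are all equal.
            C[i][j] = (d_next - d) / denom if denom > 1e-12 else 1.0 / m
    return C


X_demo = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
C_demo = affinity_matrix(X_demo, m=2)
```

In the demo, node 0 is connected to its two nearest nodes (1 and 3) with a larger weight on the closer one, and its entry for the farthest node 2 is exactly zero; every row sums to 1.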

In step S4, a community discovery algorithm is constructed based on the adaptive loss function and the affinity matrix to determine an objective function.

Specifically, the embodiment of the invention combines the l1-norm and the l2-norm to construct the loss function. A loss function built on the l1-norm is insensitive to large outliers but sensitive to small ones; the l2-norm behaves in exactly the opposite way. The adaptive loss function balances the two and mitigates both problems. The function is defined as follows:
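The definition itself is not legible in this text. One widely used adaptive loss with the described behavior is $(1+\sigma)\,x^2 / (|x| + \sigma)$, which the sketch below assumes; the row-wise matrix version and both function names are likewise illustrative assumptions rather than the patent's exact definition.

```python
import math


def adaptive_loss(x, sigma):
    """Scalar adaptive loss (1 + sigma) * x^2 / (|x| + sigma), for sigma > 0.

    As sigma -> 0 it approaches |x| (l1-like); as sigma -> infinity it
    approaches x^2 (l2-like). Assumed form, not taken from the source text.
    """
    return (1.0 + sigma) * x * x / (abs(x) + sigma)


def matrix_adaptive_loss(M, sigma):
    """Row-wise version: apply the scalar loss to the l2-norm of each row of M."""
    return sum(
        adaptive_loss(math.sqrt(sum(v * v for v in row)), sigma) for row in M
    )


small_sigma = adaptive_loss(3.0, 1e-9)   # close to |3| = 3
large_sigma = adaptive_loss(3.0, 1e9)    # close to 3^2 = 9
```

The parameter $\sigma$ therefore interpolates between the two norms, which is the balancing behavior the paragraph above describes.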

After the affinity matrix $C$ of the social network is reconstructed by formula (3), the following objective function is proposed in order to learn the optimal similarity matrix $S$:

where $L$ is the Laplacian matrix of $S$, and $\operatorname{rank}(L) = n - k$ is the rank constraint, which makes the similarity matrix $S$ have exactly $k$ connected components. To avoid isolated nodes (nodes without any neighbors), the constraint $\mathbf{1}^{T}s_i = 1$ is imposed so that each row of $S$ sums to 1.

However, $L$ depends on the target variable $S$ and the rank constraint is nonlinear, which makes equation (5) difficult to solve. Let $\lambda_i(L)$ denote the $i$-th smallest eigenvalue of $L$. Because $L$ is positive semidefinite, the rank constraint is satisfied if the $k$ smallest eigenvalues of $L$ satisfy $\sum_{i=1}^{k}\lambda_i(L) = 0$. Given the balance factor $\varepsilon$, equation (5) can be expressed as:

According to Ky Fan's theorem,

where $F = \{f_1, f_2, \ldots, f_k\}$ is the clustering indicator matrix. Substituting equation (7) into equation (6) yields:

$$\min_{S,F}\ \lVert C^{(v)} - S \rVert_{\sigma} + \varepsilon\,\operatorname{Tr}(F^{T} L F) \qquad \text{s.t.}\quad \mathbf{1}^{T} s_i = 1,\quad s_{i,j} \ge 0,\quad F^{T}F = I$$

Formula (8) is the final objective function. The target variable $S$ has $k$ connected components, so the final community discovery result can be obtained directly from the algorithm. Here $S$ is the target variable, $C$ is the affinity matrix, $\sigma$ is the adaptive parameter, $\varepsilon$ is the balance factor, $F$ is the clustering indicator matrix, $L$ is the Laplacian matrix of the target variable, $\operatorname{Tr}(\cdot)$ is the trace, $\mathbf{1}^{T}s_i$ is the sum of all entries of $s_i$ (the $i$-th row of $S$ written as a column vector), $s_{i,j}$ is the entry in row $i$, column $j$ of $S$, and $I$ is the identity matrix.
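The chain of reformulations described in step S4 can be summarized as follows. Equations (5) to (7) are not legible in this text, so the forms below are a reconstruction from the surrounding prose and from the final objective function; constant factors that may appear in the original are assumed to be absorbed into $\varepsilon$.

$$\text{(5)}\quad \min_{S}\ \lVert C^{(v)} - S \rVert_{\sigma} \quad \text{s.t.}\ \ \mathbf{1}^{T}s_i = 1,\ s_{i,j} \ge 0,\ \operatorname{rank}(L) = n - k$$

$$\text{(6)}\quad \min_{S}\ \lVert C^{(v)} - S \rVert_{\sigma} + \varepsilon \sum_{i=1}^{k} \lambda_i(L) \quad \text{s.t.}\ \ \mathbf{1}^{T}s_i = 1,\ s_{i,j} \ge 0$$

$$\text{(7)}\quad \sum_{i=1}^{k} \lambda_i(L) = \min_{F \in \mathbb{R}^{n \times k},\ F^{T}F = I} \operatorname{Tr}(F^{T} L F)$$

Replacing the eigenvalue sum in (6) by the right-hand side of (7) turns the penalty into a trace term over $F$, which gives the final objective function with the additional constraint $F^{T}F = I$.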

In step S5, the objective function is iteratively optimized with an alternating iteration method to obtain the connected components among posts under the same topic in the affinity matrix, thereby constructing a target similarity matrix that determines the community structure in the community network.

Further, in an embodiment of the present invention, step S5 specifically includes: using an alternating iteration method, first fixing the clustering indicator matrix and solving for the target variable, then fixing the target variable and solving for the clustering indicator matrix, and repeating until the relative change of the target variable is less than $10^{-3}$ or the number of iterations exceeds 150; the connected components of the posts under the same topic are thereby obtained, and the target similarity matrix is constructed to determine the community structure in the community network.
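Once a target similarity matrix with $k$ connected components has been learned, the communities can be read off as the connected components of the graph whose edges are the nonzero similarities. The sketch below uses a union-find structure for this; the function name `connected_components` and the tolerance `tol` are illustrative assumptions.

```python
def connected_components(S, tol=1e-10):
    """Label nodes by connected component of the graph with edges S[i][j] > tol.

    Returns a list of community labels numbered in order of first appearance.
    """
    n = len(S)
    parent = list(range(n))

    def find(a):
        # Find the root of a, halving the path along the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in range(n):
        for j in range(n):
            # Treat the similarity graph as undirected.
            if i != j and (S[i][j] > tol or S[j][i] > tol):
                parent[find(i)] = find(j)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


S_demo = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
labels_demo = connected_components(S_demo)
```

For the block-diagonal demo matrix the two blocks become two communities, so no separate clustering step such as k-means is needed after the similarity matrix is learned.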

Specifically, the objective function is solved by an alternating iteration method, in which one variable is updated while the others are held fixed, as follows:

(1) Fix the clustering indicator matrix $F$ and solve for the target variable $S$.

After the clustering indicator matrix $F$ is fixed, using the property of the Laplacian matrix

equation (8) becomes:

Define the matrix

where

is the $i$-th column of $E$, whose $j$-th element is

Because the rows of $S$ are independent of one another, equation (9) can be written in vector form:

where $s_i$ is the column vector formed from the elements of the $i$-th row of the target similarity matrix $S$, and $c_i$ is the column vector formed from the elements of the $i$-th row of the affinity matrix; the values of $u_i$ are as follows:

Equation (10) simplifies to:

Let

Using the Lagrange multiplier method, we have

where $\eta$ and $\xi$ are Lagrange multipliers, the former a scalar and the latter a vector. According to the KKT conditions:

Since $\mathbf{1}^{T}s_i = 1$, equation (13) gives

Substituting this into equation (13) gives the optimal solution

as follows:

Let

For any $j$,

According to equation (13):

Therefore, it is only necessary to determine

to obtain the optimal solution

From equation (15) and equation (13), we obtain:

Since

the optimal solution of

is

Define the following function of $\xi^*$:

When $f(\xi^*) = 0$, the following can be determined:

Since $\xi^* \ge 0$, $f'(\xi^*) \le 0$, and $f$ is a piecewise linear convex function, the root of $f(\xi^*) = 0$ can be found by Newton's method, that is, by iterating $\xi_{t+1} = \xi_t - f(\xi_t)/f'(\xi_t)$.
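The constraints $\mathbf{1}^{T}s_i = 1$ and $s_{i,j} \ge 0$ mean that each row update is a constrained least-squares problem over the probability simplex. Assuming the row subproblem reduces to the Euclidean projection of some vector $v$ onto that simplex (an assumption, since the intermediate equations are not legible here), the sketch below computes the projection with the well-known sort-based algorithm. This is a swapped-in alternative to the Newton iteration described in the text, and `project_to_simplex` is a hypothetical helper name.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {s : sum(s) = 1, s >= 0}.

    Sort-based method: find the threshold theta such that
    sum(max(v_j - theta, 0)) = 1, then clip.
    """
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:          # the condition holds for a prefix of the sorted list
            theta = t            # so the last update corresponds to the largest valid j
    return [max(x - theta, 0.0) for x in v]


row_a = project_to_simplex([2.0, 0.0])
row_b = project_to_simplex([0.2, 0.2, 0.2])
```

The threshold `theta` plays the role of the scalar multiplier: entries below it are set to zero, which is what makes the learned rows sparse.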

(2) Fix the target variable $S$ and solve for the clustering indicator matrix $F$.

When the target variable $S$ is fixed, the first term of equation (8) no longer depends on $F$, so the problem is equivalent to:

$$\min_{F^{T}F = I}\ \operatorname{Tr}(F^{T} L F)$$

The optimal solution of the clustering indicator matrix $F$ then consists of the eigenvectors corresponding to the $k$ smallest eigenvalues of the Laplacian matrix $L$.

The two steps are iterated until the relative change of the target variable $S$ is less than $10^{-3}$ or the number of iterations exceeds 150, at which point the iterative learning is complete.
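The outer loop and its stopping rule can be sketched as follows. The two update steps are passed in as functions because their exact formulas are not reproduced here; `update_S`, `update_F`, and the use of the Frobenius norm for the relative change are assumptions made for illustration.

```python
import math


def relative_change(S_new, S_old):
    """Frobenius-norm relative change ||S_new - S_old|| / ||S_old||."""
    num = math.sqrt(sum((a - b) ** 2
                        for ra, rb in zip(S_new, S_old)
                        for a, b in zip(ra, rb)))
    den = math.sqrt(sum(b * b for rb in S_old for b in rb))
    return num / den if den > 0 else float("inf")


def alternate(S0, update_S, update_F, tol=1e-3, max_iter=150):
    """Alternate between the S-step (F fixed) and the F-step (S fixed)."""
    S = S0
    F = update_F(S)
    for iteration in range(1, max_iter + 1):
        S_new = update_S(F, S)          # step (1): F fixed, solve for S
        F = update_F(S_new)             # step (2): S fixed, solve for F
        change = relative_change(S_new, S)
        S = S_new
        if change < tol:                # stop once S has stabilized
            break
    return S, F, iteration


# Toy stand-ins: S moves halfway toward a fixed target at every step.
target = [[0.5, 0.5], [0.5, 0.5]]
toy_update_S = lambda F, S: [[(s + t) / 2 for s, t in zip(rs, rt)]
                             for rs, rt in zip(S, target)]
toy_update_F = lambda S: None
S_final, F_final, n_iter = alternate([[1.0, 0.0], [0.0, 1.0]],
                                     toy_update_S, toy_update_F)
```

With the toy updates the loop stops well before the iteration cap because the relative change falls below the tolerance; in the real algorithm the two lambdas would be replaced by the row-wise S update and the eigenvector computation for F.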

In step S6, a node2vec model is introduced to visualize the community structure, and a collective social behavior is extracted according to the distribution of nodes in the community structure.

Specifically, to make the collective social behavior of the social network easier to understand and analyze, the embodiment of the present invention presents the community discovery result visually. To this end, the node2vec graph embedding model is introduced. node2vec is a node vectorization model: it obtains local information from truncated random walks, treats nodes as words and walks as sentences to learn latent representations, and extends the DeepWalk algorithm by changing how the random walk sequences are generated.
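The way node2vec changes the walk generation can be illustrated with a minimal second-order biased random walk on an unweighted graph. The return parameter `p` and in-out parameter `q` follow the usual node2vec description; the function name, the adjacency-dictionary input, and the omission of the subsequent skip-gram training are simplifications made for this sketch.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Generate one biased random walk of at most `length` nodes.

    adj: dict mapping each node to the set of its neighbors (unweighted graph).
    Unnormalized weight of moving from cur to nxt, given the previous node prev:
        1/p if nxt == prev, 1 if nxt is a neighbor of prev, 1/q otherwise.
    """
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(adj[cur])
        if not neighbors:                       # dead end: stop early
            break
        if len(walk) == 1:                      # first step is unbiased
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)         # return to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)             # stay close to the previous node
            else:
                weights.append(1.0 / q)         # move farther away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


adj_demo = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
walk_demo = node2vec_walk(adj_demo, start=0, length=10, p=0.5, q=2.0)
```

With `p = q = 1` the walk reduces to the uniform walk used by DeepWalk; other values trade off local against outward exploration. The resulting walks would then be fed to a word-embedding model as sentences.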

The collective social behavior is then extracted. If the nodes within a community are sparsely distributed, all nodes in the community are covered by a minimum enclosing circle, and the node closest to the center of the circle is taken as the collective social behavior; if the nodes within a community are densely distributed, the collective social behavior is extracted using centrality.
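Both extraction rules can be sketched on 2-D embedding coordinates. The source does not say how the minimum enclosing circle is computed, which centrality measure is used, or how sparse and dense communities are told apart; the sketch below therefore assumes a simple iterative approximation of the circle center and weighted degree centrality, and all function names are hypothetical.

```python
import math


def approx_enclosing_center(points, iterations=1000):
    """Approximate the center of the minimum enclosing circle.

    Repeatedly moves the current center a shrinking step toward the
    farthest point (a Badoiu-Clarkson style iteration).
    """
    cx, cy = points[0]
    for t in range(1, iterations + 1):
        fx, fy = max(points, key=lambda pt: (pt[0] - cx) ** 2 + (pt[1] - cy) ** 2)
        cx += (fx - cx) / (t + 1)
        cy += (fy - cy) / (t + 1)
    return cx, cy


def representative_sparse(points):
    """Index of the node closest to the approximate enclosing-circle center."""
    cx, cy = approx_enclosing_center(points)
    return min(range(len(points)),
               key=lambda i: math.hypot(points[i][0] - cx, points[i][1] - cy))


def representative_dense(S, members):
    """Member with the highest weighted degree inside the community."""
    return max(members,
               key=lambda i: sum(S[i][j] + S[j][i] for j in members if j != i))


points_demo = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.1)]
sparse_pick = representative_sparse(points_demo)

S_dense = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.1], [0.5, 0.1, 0.0]]
dense_pick = representative_dense(S_dense, [0, 1, 2])
```

In the sparse demo the enclosing circle is centered near (1, 0), so the third point is selected; in the dense demo node 0 has the largest total similarity to the other members.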

The method for extracting collective social behaviors based on community discovery provided by the embodiment of the invention is further explained by two specific embodiments.

Specific Embodiment 1

A crawler was used to collect 10176 posts published by users on Sina Weibo from 3/1/2021 to 3/5/2021. The data set was cleaned (advertisements, duplicate posts, overly short posts, and the like were removed), leaving 1584 posts as the initial data set. Applying jieba word segmentation to the semantic information gives the following data set:

1: national asset investment and stock control group … … general cooperation construction semiconductor nondestructive testing intelligence of formal science and technology strain city

2: curima auto-arrival compression ignition spark plug mechanical pressurization … … stability factor lending ten thousand legal routes

3: median safety of second-stage research of main-ren new crown epidemic … … candidate vaccine of Russian vector science center

4: china successfully develops a plurality of long-distance transport modes of vaccines triphibian storage and transportation … …, namely sea, land and air in Yangzhou base

……

1584: new rule and affair education … … of early warning caution road icing in evening news and yellow early warning is more and more matched with dish market

An LDA model is used to generate $T$ topics and the topic distribution of each post. The number of topics is determined by the evaluation index modularity $Q$; as shown in FIG. 2, 30 topics are selected, which finally gives the following data matrix (30 × 1584):

The affinity matrix (1584 × 1584) of the microblog data is obtained according to formula (3):

Using the community discovery algorithm based on the adaptive loss function, a similarity matrix (1584 × 1584) with the specified number of connected components is learned:

A node2vec graph embedding model is then introduced to visualize the similarity matrix; the result is shown in FIG. 3. Finally, collective social behaviors are extracted from the social network using the two different methods and marked with two different symbols: method 1 with diamonds and method 2 with five-pointed stars, as shown in FIG. 4.

Specific Embodiment 2

The existing Ncut, Louvain, and CAN algorithms are compared with the extraction method provided by the invention. The WebKB, BBC news report, and 20NGs news document data sets are used as verification data, and the stability and cohesion of the community discovery results are measured with the modularity $Q$. The verification results are shown in FIG. 5; the embodiment of the invention achieves the leading performance among the compared methods.
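The modularity $Q$ used as the evaluation index can be computed from a symmetric adjacency (or similarity) matrix and a community assignment with the standard Newman definition, $Q = \frac{1}{2m}\sum_{i,j}\bigl(A_{i,j} - \frac{k_i k_j}{2m}\bigr)\,\delta(c_i, c_j)$. The sketch below is a direct, unoptimized implementation; the function name `modularity` and the toy graph are illustrative.

```python
def modularity(A, labels):
    """Newman modularity Q of a partition.

    A: symmetric adjacency/weight matrix as a list of lists.
    labels: community label of each node.
    """
    n = len(A)
    degree = [sum(row) for row in A]
    two_m = sum(degree)                      # twice the total edge weight
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                # Observed weight minus the weight expected at random.
                q += A[i][j] - degree[i] * degree[j] / two_m
    return q / two_m


# Two disconnected edges: 0-1 and 2-3.
A_demo = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
q_split = modularity(A_demo, [0, 0, 1, 1])   # communities match the edges
q_single = modularity(A_demo, [0, 0, 0, 0])  # everything in one community
```

The partition that matches the two edges scores 0.5, while placing all nodes in a single community scores 0, which illustrates why a higher $Q$ indicates a more cohesive community structure.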

According to the method for extracting collective social behaviors based on community discovery provided by the embodiment of the invention, a similarity matrix is learned with an adaptive loss function so that the initial data of the social network are processed with high quality, and the reconstruction of the social network and community discovery are completed. This ensures that the output community structure has high cohesion and stability, realizes the extraction of collective social behaviors in the online social network, and gives results with high accuracy and robustness.

Next, a system for extracting collective social behaviors based on community discovery according to an embodiment of the present invention will be described with reference to the drawings.

FIG. 6 shows a system for extracting collective social behaviors based on community discovery according to an embodiment of the present invention.

As shown in FIG. 6, the system 10 includes: an acquisition and preprocessing module 100, a topic distribution generation module 200, an affinity matrix construction module 300, an objective function determination module 400, an iterative learning module 500 and a collective social behavior extraction module 600.

The acquisition and preprocessing module 100 is configured to capture posts published by a plurality of users in a social network as an initial data set, and to perform cleaning and word segmentation on the initial data set to obtain a data set. The topic distribution generation module 200 is configured to process the data set using an LDA model to generate a plurality of topics and a topic distribution for each post. The affinity matrix construction module 300 is configured to construct a similarity calculation function based on sparse representation to compute the similarity between each post and the topic distribution, thereby obtaining an affinity matrix. The objective function determination module 400 is configured to construct a community discovery algorithm based on the adaptive loss function and the affinity matrix to determine the objective function. The iterative learning module 500 is configured to continuously learn the objective function using an alternating iteration method to obtain the connected components between posts under the same topic in the affinity matrix, so as to construct a target similarity matrix and determine the community structure in the community network. The collective social behavior extraction module 600 is configured to introduce a node2vec model to visualize the community structure, and to extract collective social behaviors according to the distribution of nodes in the community structure.

Further, in one embodiment of the present invention, the affinity matrix is:

l2-norm of the topic distribution for nodes i and j.

Further, in one embodiment of the present invention, the objective function is:

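$$\min_{S,F}\ \left\| C^{(v)} - S \right\|_{\sigma} + \varepsilon\,\mathrm{Tr}\left(F^{T} L F\right)$$

$$\text{s.t.}\quad \mathbf{1}^{T} s_i = 1,\quad s_{i,j} \ge 0,\quad F^{T} F = I$$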
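The trace term Tr(F^T L F) in the objective can be related directly to the entries of S. The following minimal Python sketch assumes, since the definition is not repeated in this passage, that L is the Laplacian of the symmetrized similarity matrix (S + S^T)/2; under that assumption Tr(F^T L F) equals half the sum of s_i,j ||f_i - f_j||^2 over all pairs (i, j), where f_i is the i-th row of F. The function names are illustrative only.

```python
def laplacian(S):
    """Laplacian D - W of the symmetrized matrix W = (S + S^T) / 2."""
    n = len(S)
    W = [[(S[i][j] + S[j][i]) / 2 for j in range(n)] for i in range(n)]
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


def trace_term(S, F):
    """Tr(F^T L F), computed entry by entry."""
    L = laplacian(S)
    n, k = len(F), len(F[0])
    return sum(F[i][c] * L[i][j] * F[j][c]
               for c in range(k) for i in range(n) for j in range(n))


def pairwise_term(S, F):
    """0.5 * sum_ij s_ij * ||f_i - f_j||^2, where f_i is row i of F."""
    n = len(F)
    return 0.5 * sum(
        S[i][j] * sum((a - b) ** 2 for a, b in zip(F[i], F[j]))
        for i in range(n) for j in range(n)
    )
```

Under the stated assumption the two functions return the same value for any S and F. The term is zero when posts linked by a nonzero similarity share the same row of F, which is why minimizing it pushes S toward a structure with separate connected components.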

Further, in an embodiment of the present invention, the iterative learning module 500 is specifically configured to:

Using an alternating iteration method, the clustering indication matrix is first fixed to solve for the target variable, and the target variable is then fixed to solve for the clustering indication matrix, until the relative change of the target variable is less than 10^-3 or the number of iterations exceeds 150; the connected components of all posts under the same topic are thereby obtained, and a target similarity matrix is then constructed to determine the community structure in the community network.
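The control flow of this alternating iteration can be sketched as follows. This is a minimal illustration, not the embodiment's solver: the two sub-problem solvers `solve_S` and `solve_F` are hypothetical callables supplied by the caller, the relative change is measured with the Frobenius norm (an assumption), and connected components are read off the learned matrix by treating any entry above a threshold `eps` as a link (also an assumption).

```python
from collections import deque
from math import sqrt


def frobenius(A):
    """Frobenius norm of a matrix given as a list of rows."""
    return sqrt(sum(x * x for row in A for x in row))


def alternate(S, F, solve_S, solve_F, tol=1e-3, max_iter=150):
    """Alternate the two sub-problems until S stabilizes or max_iter is hit."""
    iterations = 0
    while iterations < max_iter:
        iterations += 1
        S_new = solve_S(F)      # fix F, solve for the target variable S
        F = solve_F(S_new)      # fix S, solve for the indication matrix F
        diff = [[a - b for a, b in zip(new, old)]
                for new, old in zip(S_new, S)]
        rel_change = frobenius(diff) / max(frobenius(S), 1e-12)
        S = S_new
        if rel_change < tol:
            break
    return S, F, iterations


def connected_components(S, eps=0.0):
    """Label nodes by connected component; i and j are linked if S exceeds eps."""
    n = len(S)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            i = queue.popleft()
            for j in range(n):
                if labels[j] == -1 and (S[i][j] > eps or S[j][i] > eps):
                    labels[j] = current
                    queue.append(j)
        current += 1
    return labels
```

Each label returned by `connected_components` corresponds to one community of posts; the defaults `tol=1e-3` and `max_iter=150` mirror the stopping conditions stated above.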

Further, in an embodiment of the present invention, the collective social behavior extraction module 600 extracts collective social behaviors as follows: if the nodes in the community structure are sparsely distributed, all nodes in the community are covered by a minimum enclosing circle, and the node closest to the center of the circle is taken as the collective social behavior; if the nodes in the community structure are densely distributed, the collective social behavior is extracted using centrality.
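The two extraction rules can be sketched in Python as follows. This is a minimal illustration under stated assumptions: nodes are given as two-dimensional embedding coordinates, the covering circle is found by brute force over pairs and triples of points (adequate only for small communities), and degree centrality stands in for "centrality", whose exact type is not specified here. How a community is judged sparse or dense is not covered, and all function names are hypothetical.

```python
from itertools import combinations
from math import dist


def _circle_two(a, b):
    """Circle having segment ab as its diameter."""
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2), dist(a, b) / 2


def _circle_three(a, b, c):
    """Circumcircle of a, b, c, or None if the points are collinear."""
    d = 2 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = (p[0] ** 2 + p[1] ** 2 for p in (a, b, c))
    ux = (a2 * (b[1] - c[1]) + b2 * (c[1] - a[1]) + c2 * (a[1] - b[1])) / d
    uy = (a2 * (c[0] - b[0]) + b2 * (a[0] - c[0]) + c2 * (b[0] - a[0])) / d
    return (ux, uy), dist((ux, uy), a)


def min_enclosing_circle(points, tol=1e-9):
    """Smallest circle covering all points (non-empty list of (x, y))."""
    if len(points) == 1:
        return points[0], 0.0
    candidates = [_circle_two(a, b) for a, b in combinations(points, 2)]
    candidates += [c for t in combinations(points, 3) if (c := _circle_three(*t))]
    best = None
    for center, radius in candidates:
        covers_all = all(dist(center, p) <= radius + tol for p in points)
        if covers_all and (best is None or radius < best[1]):
            best = (center, radius)
    return best


def representative_by_circle(embedding):
    """Sparse case: the node closest to the center of the covering circle."""
    center, _ = min_enclosing_circle(list(embedding.values()))
    return min(embedding, key=lambda node: dist(embedding[node], center))


def representative_by_degree(adjacency):
    """Dense case: the node with the highest degree (assumed centrality)."""
    return max(adjacency, key=lambda node: len(adjacency[node]))
```

The brute-force search relies on the fact that the smallest covering circle is always determined either by two points forming a diameter or by three points on its boundary. For larger communities a linear-time method such as Welzl's algorithm would be the usual replacement.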

It should be noted that the explanation of the embodiment of the method for extracting a collective social behavior based on community discovery is also applicable to the system of the embodiment, and is not repeated here.

According to the system for extracting collective social behaviors based on community discovery provided by the embodiment of the invention, a similarity matrix is learned with the adaptive loss function so that the initial data of the social network is processed with high quality, the reconstruction of the social network and the community discovery are completed, the output community structure is ensured to have higher cohesion and stability, and the collective social behaviors of the online social network are extracted with excellent accuracy and robustness.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A community discovery based collective social behavior extraction method is characterized by comprising the following steps:

step S1, capturing posts published by a plurality of users in a social network as an initial data set, and performing cleaning and word segmentation on the initial data set to obtain a data set;

step S2, processing the data set by using an LDA model to generate a plurality of topics and a topic distribution of each post;

step S3, constructing a similarity calculation function based on sparse representation to compute the similarity between each post and the topic distribution, thereby obtaining an affinity matrix;

step S4, constructing a community discovery algorithm based on the adaptive loss function and the affinity matrix to determine an objective function;

step S5, continuously learning the objective function by using an alternating iteration method to obtain the connected components between posts under the same topic in the affinity matrix, so as to construct a target similarity matrix and determine a community structure in a community network;

and step S6, introducing a node2vec model to visualize the community structure, and extracting collective social behaviors according to the distribution of nodes in the community structure.

2. The method for extracting collective social behavior based on community discovery as claimed in claim 1, wherein the affinity matrix is:

l2-norm of the topic distribution for nodes i and j.

3. The method for extracting collective social behaviors based on community discovery according to claim 1, wherein the objective function is:

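$$\min_{S,F}\ \left\| C^{(v)} - S \right\|_{\sigma} + \varepsilon\,\mathrm{Tr}\left(F^{T} L F\right)$$

$$\text{s.t.}\quad \mathbf{1}^{T} s_i = 1,\quad s_{i,j} \ge 0,\quad F^{T} F = I$$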

wherein S is the target variable, C is the affinity matrix, σ is the adaptive parameter, ε is the balance factor, F is the clustering indication matrix, L is the Laplacian matrix of the target variable, Tr(·) denotes the trace, 1^T s_i is the sum of all values in the i-th column, s_i,j is the value in the i-th row and j-th column of S, and I is the identity matrix.

4. The method for extracting collective social behaviors based on community discovery according to claim 1, wherein the step S5 is specifically as follows:

using an alternating iteration method, first fixing the clustering indication matrix to solve for the target variable, and then fixing the target variable to solve for the clustering indication matrix, until the relative change of the target variable is less than 10^-3 or the number of iterations exceeds 150, thereby obtaining the connected components of all posts under the same topic, and then constructing the target similarity matrix to determine the community structure in the community network.

5. The method for extracting collective social behavior based on community discovery as claimed in claim 1, wherein the method for extracting collective social behavior in step S6 is as follows:

if the nodes in the community structure are sparsely distributed, all nodes in the community are covered by a minimum enclosing circle, and the node closest to the center of the circle is taken as the collective social behavior;

and if the nodes in the community structure are densely distributed, the collective social behavior is extracted using centrality.

6. An extraction system of collective social behaviors based on community discovery, comprising:

an acquisition and preprocessing module for capturing posts published by a plurality of users in a social network as an initial data set, and performing cleaning and word segmentation on the initial data set to obtain a data set;

a topic distribution generation module for processing the data set using an LDA model to generate a plurality of topics and a topic distribution for each post;

an affinity matrix construction module for constructing a similarity calculation function based on sparse representation to compute the similarity between each post and the topic distribution, thereby obtaining an affinity matrix;

an objective function determination module for constructing a community discovery algorithm based on an adaptive loss function and the affinity matrix to determine an objective function;

an iterative learning module for continuously learning the objective function by using an alternating iteration method to obtain the connected components between posts under the same topic in the affinity matrix, so as to construct a target similarity matrix and determine a community structure in a community network;

and a collective social behavior extraction module for introducing a node2vec model to visualize the community structure and extracting collective social behaviors according to the distribution of nodes in the community structure.

7. The community discovery based extraction system of collective social behaviors of claim 6, wherein the affinity matrix is:

l2-norm of the topic distribution for nodes i and j.

8. The system for extracting collective social behavior based on community discovery of claim 6, wherein the objective function is:

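$$\min_{S,F}\ \left\| C^{(v)} - S \right\|_{\sigma} + \varepsilon\,\mathrm{Tr}\left(F^{T} L F\right)$$

$$\text{s.t.}\quad \mathbf{1}^{T} s_i = 1,\quad s_{i,j} \ge 0,\quad F^{T} F = I$$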

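9. The system for extracting collective social behavior based on community discovery of claim 6, wherein the iterative learning module is specifically configured to:

using an alternating iteration method, first fix the clustering indication matrix to solve for the target variable, and then fix the target variable to solve for the clustering indication matrix, until the relative change of the target variable is less than 10^-3 or the number of iterations exceeds 150, thereby obtaining the connected components of all posts under the same topic, and then construct the target similarity matrix to determine the community structure in the community network.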

10. The community discovery-based collective social behavior extraction system according to claim 6, wherein the method for extracting collective social behaviors in the collective social behavior extraction module is as follows:

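if the nodes in the community structure are sparsely distributed, all nodes in the community are covered by a minimum enclosing circle, and the node closest to the center of the circle is taken as the collective social behavior;

and if the nodes in the community structure are densely distributed, the collective social behavior is extracted using centrality.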