CN114707044B - Method and system for extracting collective social behavior based on community discovery - Google Patents

Method and system for extracting collective social behavior based on community discovery

Info

Publication number
CN114707044B
Authority
CN
China
Prior art keywords
matrix
community
community structure
data set
affinity matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111638174.7A
Other languages
Chinese (zh)
Other versions
CN114707044A (en
Inventor
杨海陆
刘乾
张建林
张金
陈晨
王莉莉
丁晓宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology
Priority to CN202111638174.7A
Publication of CN114707044A
Application granted
Publication of CN114707044B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for extracting collective social behavior based on community discovery. The method comprises the following steps: capturing posts published by a plurality of users in a social network as an initial data set, and preprocessing the posts to obtain a data set; processing the data set with an LDA model to generate topic distributions; constructing a similarity calculation function based on sparse representation to compute the similarity between each post and the topic distribution, obtaining an affinity matrix; constructing a community discovery algorithm based on an adaptive loss function and determining an objective function; iteratively optimizing the objective function with an alternating iteration method to obtain the connected components among posts under the same topic in the affinity matrix, thereby constructing a target similarity matrix that determines the community structure; and introducing a node2vec model to visualize the community structure, and extracting the collective social behavior according to the distribution of nodes in the community structure. The method can accurately extract collective social behavior that differs markedly from the semantic behavior characteristics of individuals, and is highly robust.

Description

Method and system for extracting collective social behavior based on community discovery
Technical Field
The invention relates to the technical field of social network analysis, in particular to a method and a system for extracting collective social behaviors in an online social network based on community discovery.
Background
A social network consists of participants and the relationships among them, and can be represented as a set of nodes together with a set of links that represent the connections between them. The nodes, which may be individuals, communities, organizations, and related systems, are connected through shared values, environments, and ideas; events such as social contacts, disputes, financial and securities exchanges, and business dealings may also bind them together as one or more groups across many aspects of personal relationships. Once such relationships have formed, social networks can influence wider social processes by drawing on human, social, natural, physical, and financial capital and the related information content. In development work, they can affect policies, strategies, plans, and projects, as well as the partnerships on which these are based. Because of these characteristics, online social network analysis has become an effective entry point for addressing many problems.
Social network analysis is analytical research whose purpose is to reveal information about the nodes of a social network and the connections between them. Treating these relationships as input to the analysis leads to a better understanding of the network structure. Social network analysis is now used in many areas, such as detection of the structure and behavior of individuals and social groups (component decomposition, clustering, relationship determination), online advertising in e-commerce (customer profile and trend analysis, personalized advertising and offers), and analysis of large data sets (media tracking, analysis of academic publications, genetic research). Researchers may use a variety of data mining techniques to achieve the goals of social network analysis.
Community discovery is a class of algorithms based on network topology, which can be divided into the following categories according to research focus. Hierarchical clustering algorithms divide communities based on the similarity or connection strength between nodes; the most common are the Newman fast algorithm, the Newman greedy algorithm, and spectrum-based clustering algorithms. Spectral clustering algorithms find communities in a network by analyzing the eigenvalues and eigenvectors of the Laplacian matrix, or of a normalized matrix derived from the adjacency matrix. Modularity-based algorithms include modularity optimization algorithms and improved modularity algorithms. Modularity optimization algorithms detect communities by taking the modularity function as the optimization target; common examples include greedy algorithms, simulated annealing, and the Louvain algorithm. Improved modularity algorithms use a modified modularity function so that modularity can be applied to different types of networks.
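As a point of reference for the modularity-based algorithms mentioned above, the following minimal sketch computes the standard Newman modularity of a given partition. It assumes an undirected, unweighted graph without self-loops; the function name `modularity` and the example graph are illustrative and do not come from the patent.

```python
def modularity(edges, community):
    """Newman modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    degree = {}
    internal = {}  # number of edges with both endpoints in the same community
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community[u] == community[v]:
            internal[community[u]] = internal.get(community[u], 0) + 1
    degree_sum = {}  # total degree of the nodes in each community
    for node, deg in degree.items():
        label = community[node]
        degree_sum[label] = degree_sum.get(label, 0) + deg
    # Q = sum over communities of (internal edge share - squared degree share)
    return sum(internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Two disconnected triangles, each treated as one community.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, community)
```

Higher values indicate that more edges fall inside communities than would be expected at random, which is the quantity the modularity optimization algorithms try to maximize.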
Research on collective social behavior is key to analyzing communities and the underlying network in a social network, and accurately extracting collective social behavior from online social networks is of great significance. Examples include studying the mass psychology of online shopping through return rates, sales volumes, and regional sources; building a feature model of collective behavior in social communities to reveal the relationship between collective behavior and community topics; and analyzing collective behavior in social data, which shows that users can convey their preferences to other connected users so that they gradually come to share the same or similar subjective experiences.
Existing methods have the following problem: when extracting social behavior, they consider only the structural features of communities in the social network and ignore the semantic information of its nodes, so collective social behavior whose semantic features differ markedly from those of individuals is difficult to extract accurately. Therefore, the semantic information in the social network is extracted, and users with similar behavior are grouped into communities through community discovery, so that the collective social behavior in the social network can be extracted accurately.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems in the related art to some extent.
Therefore, one object of the invention is to provide a method for extracting collective social behavior based on community discovery. It addresses the technical problem in the prior art that collective social behavior differing markedly from the semantic behavior characteristics of individuals is difficult to extract accurately, which leads to low accuracy and insufficient robustness of the extracted collective social behavior representing an online social network.
Another object of the present invention is to provide an extraction system for collective social behavior based on community discovery.
To achieve the above object, an embodiment of one aspect of the present invention provides a method for extracting collective social behavior based on community discovery, including the following steps: step S1, capturing posts published by a plurality of users in a social network as an initial data set, and cleaning and word-segmenting the initial data set to obtain a data set; step S2, processing the data set with an LDA model to generate a plurality of topics and the topic distribution of each post; step S3, constructing a similarity calculation function based on sparse representation to compute the similarity between each post and the topic distribution, obtaining an affinity matrix; step S4, constructing a community discovery algorithm based on the adaptive loss function and the affinity matrix to determine an objective function; step S5, iteratively optimizing the objective function with an alternating iteration method to obtain the connected components among posts under the same topic in the affinity matrix, so as to construct a target similarity matrix and determine the community structure in the community network; and step S6, introducing a node2vec model to visualize the community structure, and extracting the collective social behavior according to the distribution of nodes in the community structure.
According to the method for extracting collective social behavior based on community discovery of the embodiment of the invention, the similarity matrix is learned with the adaptive loss function, the initial data of the social network is processed with high quality, and the reconstruction of the social network and community discovery are completed. This ensures that the output community structure has high cohesion and stability, realizes extraction of the collective social behavior of the online social network, and gives results with excellent accuracy and robustness.
In addition, the method for extracting the collective social behavior based on community discovery according to the embodiment of the invention can also have the following additional technical characteristics:
Further, in one embodiment of the present invention, the affinity matrix is:
Figure GDA0003612201380000031
where $c_{i,j}$ is the value in the $i$-th row and $j$-th column of the affinity matrix, $m$ is the adaptive number of neighbors of a user, and
Figure GDA0003612201380000032
is the l2-norm between the topic distributions of nodes $i$ and $j$.
Further, in one embodiment of the present invention, the objective function is:
$$\min_{S,F}\ \|C^{(v)} - S\|_{\sigma} + \varepsilon\,\mathrm{Tr}(F^{T} L F)$$
$$\text{s.t.}\quad 1^{T} s_i = 1,\quad s_{i,j} \ge 0,\quad F^{T} F = I$$
where $S$ is the target variable, $C$ is the affinity matrix, $\sigma$ is the adaptive parameter, $\varepsilon$ is the balance factor, $F$ is the cluster indicator matrix, $L$ is the Laplacian matrix of the target variable, $\mathrm{Tr}(\cdot)$ is the trace, $1^{T} s_i$ is the sum of all values in the $i$-th row of $S$, $s_{i,j}$ is the value in the $i$-th row and $j$-th column of $S$, and $I$ is the identity matrix.
Further, in one embodiment of the present invention, step S5 specifically includes: using an alternating iteration method, first fixing the cluster indicator matrix and solving for the target variable, then fixing the target variable and solving for the cluster indicator matrix, until the relative change of the target variable is less than $10^{-3}$ or the number of iterations exceeds 150; the connected components among posts under the same topic are thereby obtained, and the target similarity matrix is constructed to determine the community structure in the community network.
Further, in one embodiment of the present invention, the collective social behavior is extracted in step S6 as follows: if the nodes in the community structure are sparsely distributed, a minimum enclosing circle is used to cover all nodes in the community, and the node closest to the center of the circle is taken as the collective social behavior; if the nodes in the community structure are densely distributed, the collective social behavior is extracted using centrality.
To achieve the above object, an embodiment of another aspect of the present invention provides a system for extracting collective social behavior based on community discovery, including: an acquisition and preprocessing module, configured to capture posts published by a plurality of users in a social network as an initial data set, and to clean and word-segment the initial data set to obtain a data set; a topic distribution generation module, configured to process the data set with an LDA model to generate a plurality of topics and the topic distribution of each post; an affinity matrix construction module, configured to construct a similarity calculation function based on sparse representation to compute the similarity between each post and the topic distribution, obtaining an affinity matrix; an objective function determination module, configured to construct a community discovery algorithm based on the adaptive loss function and the affinity matrix to determine an objective function; an iterative learning module, configured to iteratively optimize the objective function with an alternating iteration method to obtain the connected components among posts under the same topic in the affinity matrix, so as to construct a target similarity matrix and determine the community structure in the community network; and a collective social behavior extraction module, configured to introduce a node2vec model to visualize the community structure and to extract the collective social behavior according to the distribution of nodes in the community structure.
According to the system for extracting collective social behavior based on community discovery of the embodiment of the invention, the similarity matrix is learned with the adaptive loss function, the initial data of the social network is processed with high quality, and the reconstruction of the social network and community discovery are completed. This ensures that the output community structure has high cohesion and stability, realizes extraction of the collective social behavior of the online social network, and gives results with excellent accuracy and robustness.
In addition, the extraction system of collective social behavior based on community discovery according to the embodiment of the invention may further have the following additional technical features:
Further, in one embodiment of the present invention, the affinity matrix is:
Figure GDA0003612201380000041
where $c_{i,j}$ is the value in the $i$-th row and $j$-th column of the affinity matrix, $m$ is the adaptive number of neighbors of a user, and
Figure GDA0003612201380000042
is the l2-norm between the topic distributions of nodes $i$ and $j$.
Further, in one embodiment of the present invention, the objective function is:
$$\min_{S,F}\ \|C^{(v)} - S\|_{\sigma} + \varepsilon\,\mathrm{Tr}(F^{T} L F)$$
$$\text{s.t.}\quad 1^{T} s_i = 1,\quad s_{i,j} \ge 0,\quad F^{T} F = I$$
where $S$ is the target variable, $C$ is the affinity matrix, $\sigma$ is the adaptive parameter, $\varepsilon$ is the balance factor, $F$ is the cluster indicator matrix, $L$ is the Laplacian matrix of the target variable, $\mathrm{Tr}(\cdot)$ is the trace, $1^{T} s_i$ is the sum of all values in the $i$-th row of $S$, $s_{i,j}$ is the value in the $i$-th row and $j$-th column of $S$, and $I$ is the identity matrix.
Further, in one embodiment of the present invention, the iterative learning module is specifically configured to: using an alternating iteration method, first fix the cluster indicator matrix and solve for the target variable, then fix the target variable and solve for the cluster indicator matrix, until the relative change of the target variable is less than $10^{-3}$ or the number of iterations exceeds 150; the connected components among posts under the same topic are thereby obtained, and the target similarity matrix is constructed to determine the community structure in the community network.
Further, in one embodiment of the present invention, the collective social behavior extraction module extracts the collective social behavior as follows: if the nodes in the community structure are sparsely distributed, a minimum enclosing circle is used to cover all nodes in the community, and the node closest to the center of the circle is taken as the collective social behavior; if the nodes in the community structure are densely distributed, the collective social behavior is extracted using centrality.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a method of extracting collective social behavior based on community discovery in accordance with one embodiment of the present invention;
FIG. 2 is a graph of modularity versus number of topics according to one embodiment of the present invention;
FIG. 3 is a graph of the visualization results of the node2vec model for the similarity matrix according to one embodiment of the present invention;
FIG. 4 is a graph of collective social behavior extraction results according to one embodiment of the invention;
FIG. 5 is a chart comparing the modularity of the existing Ncut, Louvain, and CAN algorithms with that of the present application according to one embodiment of the invention;
FIG. 6 is a schematic diagram of a system for extracting collective social behavior based on community discovery according to one embodiment of the invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
The method and system for extracting the collective social behavior based on the community discovery according to the embodiment of the invention are described below with reference to the accompanying drawings, and the method for extracting the collective social behavior based on the community discovery according to the embodiment of the invention will be described first.
FIG. 1 is a flow chart of a method of extracting collective social behavior based on community discovery in accordance with one embodiment of the invention.
As shown in FIG. 1, the method for extracting collective social behavior based on community discovery comprises the following steps:
In step S1, posts published by a plurality of users in a social network are captured as an initial data set, and the initial data set is cleaned and word-segmented to obtain a data set.
Specifically, online social network data can be obtained by writing a crawler program in Python to capture posts from a social web page, for example from Weibo, a microblogging social media platform based on user relationships. After the data is obtained, to ensure the accuracy of the experimental results, the initial data set is cleaned (for example, advertisements, duplicate posts, and short posts are removed) and word-segmented (with jieba) to obtain the data set.
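The cleaning stage of step S1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the advertisement markers, the length threshold `min_chars`, and the function name `clean_posts` are hypothetical, and whitespace splitting stands in for the word segmenter so that the example runs without third-party packages.

```python
import re

# Hypothetical advertisement markers; a real list would be curated by hand.
AD_MARKERS = ("buy now", "discount", "http://", "https://")


def clean_posts(posts, min_chars=5, tokenize=str.split):
    """Drop short posts, advertisements and duplicates, then tokenize the rest."""
    seen = set()
    cleaned = []
    for post in posts:
        text = re.sub(r"\s+", " ", post).strip()  # normalize whitespace
        lowered = text.lower()
        if len(text) < min_chars:
            continue  # too short to carry a topic
        if any(marker in lowered for marker in AD_MARKERS):
            continue  # looks like an advertisement
        if lowered in seen:
            continue  # duplicate of an earlier post
        seen.add(lowered)
        cleaned.append(tokenize(text))
    return cleaned


posts = [
    "Community discovery finds groups of similar users",
    "community discovery finds groups   of similar users",
    "Buy now!!! Big discount",
    "ok",
]
tokens = clean_posts(posts)
```

For Chinese text, a segmenter such as `jieba.lcut` could be passed as the `tokenize` argument in place of `str.split`.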
In step S2, the data set is processed using the LDA model, generating a plurality of topics and a topic distribution for each post.
Specifically, the data set is processed with an LDA model to generate $T$ topics. Any node $v_i$ may belong to one topic or to several topics (i.e., a topic distribution), and a floating-point number represents the probability that node $v_i$ belongs to each topic. The generative process of the LDA model corresponds to the following joint distribution:
Figure GDA0003612201380000051
where $\theta_d = \mathrm{Dirichlet}(\alpha)$ is the topic distribution and $\beta_d = \mathrm{Dirichlet}(\eta)$ is the word distribution; $Z_{d,n}$ is the topic index and $w_{d,n}$ is the word probability; the parameters $\alpha$ and $\eta$ are hyperparameter vectors, and $d \in D$. The topic $Z_{d,n}$ depends on the topic distribution $\theta_d$ of the text published by the user; the word $w_{d,n}$ depends on the word distributions $\beta_{1:k}$ of all topics and on the topic $Z_{d,n}$. The data are stored as a matrix, denoted $X$, whose rows represent the nodes $v_i$ and topics $Z_{d,n}$ and whose columns represent the node feature vectors.
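To make the dependencies in the generative process concrete, the sketch below samples a tiny synthetic corpus: each topic's word distribution is drawn from a Dirichlet with parameter eta, each document's topic distribution from a Dirichlet with parameter alpha, each topic index from the document's topic distribution, and each word from the word distribution of the chosen topic. It illustrates the generative direction only, not the inference that fits an LDA model to real posts; symmetric hyperparameters and all function names are assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Return an index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_len, num_topics, vocab_size,
                    alpha=0.5, eta=0.5, seed=0):
    rng = random.Random(seed)
    # beta[k]: word distribution of topic k
    beta = [sample_dirichlet([eta] * vocab_size, rng) for _ in range(num_topics)]
    thetas, docs = [], []
    for _ in range(num_docs):
        theta = sample_dirichlet([alpha] * num_topics, rng)  # topic distribution
        words = []
        for _ in range(doc_len):
            z = sample_categorical(theta, rng)               # topic index
            words.append(sample_categorical(beta[z], rng))   # word id
        thetas.append(theta)
        docs.append(words)
    return thetas, docs


thetas, docs = generate_corpus(num_docs=4, doc_len=10, num_topics=3, vocab_size=20)
```

Each entry of `thetas` plays the role of one post's topic distribution, which is the feature vector the later steps compare.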
In step S3, a similarity calculation function based on sparse representation is constructed to solve the similarity between each post and the topic distribution, and an affinity matrix is obtained.
Specifically, the embodiment of the invention obtains the affinity matrix by computing the similarity between feature vectors, so that users with a higher degree of association in the social network (those whose published semantic information yields feature vectors that are closer together) receive larger similarity values, while users with a lower degree of association receive small or even zero similarity values. This yields the affinity matrix and completes the reconstruction of the social network. The affinity matrix can be obtained by solving the following problem:
Figure GDA0003612201380000061
where
Figure GDA0003612201380000062
is the data matrix obtained in step S2, $d$ is the feature dimension of the semantic information in the social network (the number of topics), and $n$ is the number of data points (the number of users in the social network); its $j$-th column vector is denoted $x_j$ and its $(i,j)$-th element is denoted $x_{i,j}$; $\alpha$ is a sparsity adjustment factor. Calculation and derivation give:
Figure GDA0003612201380000063
where
Figure GDA0003612201380000064
is sorted in ascending order, so that the learned $c_{i,j}$ satisfies
Figure GDA0003612201380000065
and
Figure GDA0003612201380000066
where $m$ is the adaptive number of neighbors of a user. Formula (3) gives the affinity matrix $C$ of the social network, from which the connections between users can be read. Compared with fixed graph structures such as the fully connected graph and the K-nearest-neighbor graph (computed with cosine similarity, Gaussian kernel similarity, and the like), the calculation in formula (3) adapts to the number of neighbors $m$ of each user. An affinity matrix constructed in this way accurately reflects the relationships among users in the social network and compensates for the high demands that spectral clustering places on node similarity, so that subsequent community discovery performs better.
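The following sketch shows one way a sparse, row-normalized affinity matrix with m neighbors per user can be computed. Because formula (3) is only available as a figure, the closed form used here is an assumption borrowed from the adaptive-neighbors clustering literature and may differ from the patent's exact expression; the function names are illustrative, and the code stores one feature vector per user.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def adaptive_neighbor_affinity(X, m):
    """X: list of n topic-distribution vectors; m: neighbors kept per user."""
    n = len(X)
    if not 1 <= m <= n - 2:
        raise ValueError("m must satisfy 1 <= m <= n - 2")
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Distances from user i to every other user, in ascending order.
        dists = sorted((squared_distance(X[i], X[j]), j) for j in range(n) if j != i)
        d_next = dists[m][0]  # distance to the (m+1)-th nearest neighbor
        denom = m * d_next - sum(d for d, _ in dists[:m])
        for d, j in dists[:m]:
            # Assumed closed form; uniform weights if the distances are all equal.
            C[i][j] = (d_next - d) / denom if denom > 1e-12 else 1.0 / m
    return C


X = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.2, 0.8]]
C = adaptive_neighbor_affinity(X, m=2)
```

Each row is non-negative, sums to 1, and has at most m non-zero entries, so users far apart in topic space receive an affinity of exactly zero.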
In step S4, a community finding algorithm is constructed based on the adaptive loss function and the affinity matrix to determine an objective function.
Specifically, the embodiment of the invention chooses an adaptive loss function to address the shortcomings of loss functions built on the l1-norm and the l2-norm: a loss function built on the l1-norm is insensitive to larger outliers but very sensitive to smaller ones, and the l2-norm is exactly the opposite, whereas the adaptive loss function balances the two. The function is defined as follows:
Figure GDA0003612201380000071
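The interpolation between the two norms can be illustrated numerically. The patent's formula (4) is only available as a figure, so the form below, the sum over rows of (1 + sigma) * ||x||^2 / (||x|| + sigma), is an assumption taken from the adaptive-loss literature; the function name `adaptive_loss` is illustrative.

```python
import math


def adaptive_loss(rows, sigma):
    """Assumed adaptive loss: sum over rows of (1 + sigma) * ||x||^2 / (||x|| + sigma)."""
    total = 0.0
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        if norm + sigma > 0:  # skip an all-zero row when sigma is zero
            total += (1.0 + sigma) * norm ** 2 / (norm + sigma)
    return total


residual = [[3.0, 4.0], [0.0, 0.1]]   # row norms 5 and 0.1
near_l1 = adaptive_loss(residual, 1e-9)  # approaches the sum of row norms
near_l2 = adaptive_loss(residual, 1e9)   # approaches the sum of squared row norms
```

Under this assumed form, a small sigma makes the loss behave like an l1-type norm and a large sigma like a squared l2-type norm, with intermediate values trading off the two.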
After the affinity matrix $C$ of the social network is reconstructed through formula (3), the following objective function is proposed in order to learn the optimal similarity matrix $S$:
Figure GDA0003612201380000072
where $L$ is the Laplacian matrix of $S$, and $\mathrm{rank}(L) = n - k$ is a rank constraint that makes the similarity matrix $S$ have $k$ connected components. To avoid abnormal nodes (nodes without any neighbors), the constraint $1^{T} s_i = 1$ is set so that each row of $S$ sums to 1.
However, $L$ depends on the target variable $S$ and the rank constraint is nonlinear, which makes equation (5) difficult to solve. Let $\lambda_i(L)$ denote the $i$-th smallest eigenvalue of $L$; if the $k$ smallest eigenvalues of $L$ satisfy
Figure GDA0003612201380000073
the rank constraint is satisfied. Given the balance factor $\varepsilon$, equation (5) can be expressed as:
Figure GDA0003612201380000074
According to Ky Fan's theorem,
Figure GDA0003612201380000075
where $F = \{f_1, f_2, \ldots, f_k\}$ is the cluster indicator matrix. Substituting formula (7) into formula (6) yields:
Figure GDA0003612201380000076
Equation (8) is the final objective function. The target variable $S$ has $k$ connected components, so the final community discovery result can be obtained directly from the algorithm. Here $S$ is the target variable, $C$ is the affinity matrix, $\sigma$ is the adaptive parameter, $\varepsilon$ is the balance factor, $F$ is the cluster indicator matrix, $L$ is the Laplacian matrix of the target variable, $\mathrm{Tr}(\cdot)$ is the trace, $1^{T} s_i$ is the sum of all values in the $i$-th row of $S$, $s_{i,j}$ is the value in the $i$-th row and $j$-th column of $S$, and $I$ is the identity matrix.
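Once the learned similarity matrix has exactly k connected components, communities can be read off by labeling those components. The sketch below does this with a depth-first search; treating S as undirected and using a small tolerance `tol` to decide which entries count as edges are assumptions, and the function name is illustrative.

```python
def communities_from_similarity(S, tol=1e-8):
    """Label the connected components of the graph whose edge weights are S."""
    n = len(S)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue  # already assigned to a community
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                # Treat S as undirected: an edge exists if either direction is non-zero.
                if labels[j] == -1 and (S[i][j] > tol or S[j][i] > tol):
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


S = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
labels = communities_from_similarity(S)
```

No separate clustering step such as k-means is needed in this setting, because the block structure of S already determines the partition.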
In step S5, the objective function is iteratively optimized with an alternating iteration method to obtain the connected components among posts under the same topic in the affinity matrix, so as to construct a target similarity matrix and determine the community structure in the community network.
Further, in one embodiment of the present invention, step S5 specifically includes: using an alternating iteration method, first fixing the cluster indicator matrix and solving for the target variable, then fixing the target variable and solving for the cluster indicator matrix, until the relative change of the target variable is less than $10^{-3}$ or the number of iterations exceeds 150; the connected components among posts under the same topic are thereby obtained, and the target similarity matrix is constructed to determine the community structure in the community network.
Specifically, the objective function is solved with an alternating iteration method, in which one variable is updated while the other variables are kept unchanged, as follows:
(1) Fix the cluster indicator matrix $F$ and solve for the target variable $S$.
When the cluster indicator matrix $F$ is fixed, using the Laplacian matrix property
Figure GDA0003612201380000081
equation (8) becomes:
Figure GDA0003612201380000082
Define the matrix
Figure GDA0003612201380000083
where
Figure GDA0003612201380000084
is the $i$-th column of $E$, whose $j$-th element is
Figure GDA0003612201380000085
Because the rows of $S$ are independent of one another, equation (9) can be written in vector form:
Figure GDA0003612201380000086
where $s_i$ is the column vector formed by the elements of the $i$-th row of the target similarity matrix $S$, and $c_i$ is the column vector formed by the elements of the $i$-th row of the affinity matrix; $u_i$ takes the following value:
Figure GDA0003612201380000087
Simplifying formula (10) gives:
Figure GDA0003612201380000088
Let
Figure GDA0003612201380000089
By the Lagrangian multiplier method, we have
Figure GDA00036122013800000810
where $\eta$ and $\xi$ are the Lagrangian multipliers, the former a scalar and the latter a vector. According to the KKT conditions:
Figure GDA0003612201380000091
Since $1^{T} s_i = 1$, the first equation in formula (13) gives
Figure GDA0003612201380000092
Substituting this into formula (13) yields the optimal solution
Figure GDA0003612201380000093
as follows:
Figure GDA0003612201380000094
Let
Figure GDA0003612201380000095
Then
Figure GDA0003612201380000096
According to formula (13):
Figure GDA0003612201380000097
Thus it is only necessary to determine
Figure GDA0003612201380000098
to obtain the optimal solution
Figure GDA0003612201380000099
From equation (15) and equation (13):
Figure GDA00036122013800000910
Since
Figure GDA00036122013800000911
it follows that
Figure GDA00036122013800000912
is
Figure GDA00036122013800000913
Define a function of $\xi^*$:
Figure GDA00036122013800000914
When $f(\xi^*) = 0$, we obtain
Figure GDA00036122013800000915
Since $\xi^* \ge 0$ and $f'(\xi^*) \le 0$, and $f$ is a piecewise linear convex function, the root of $f(\xi^*) = 0$ can be found with Newton's method, i.e.
Figure GDA00036122013800000916
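The idea of finding a multiplier as the root of a piecewise linear convex function with Newton's method can be illustrated on a closely related generic problem: projecting a vector v onto the probability simplex, i.e. minimizing ||s - v||^2 subject to the entries of s summing to 1 and being non-negative. This is a sketch of the technique only; the patent's own function of the multiplier is given in a figure and is not reproduced here, and the function name, tolerance, and starting point are assumptions.

```python
def project_to_simplex(v, tol=1e-12, max_iter=100):
    """Minimize ||s - v||^2 subject to sum(s) = 1 and s >= 0.

    The KKT conditions give s_j = max(v_j - eta, 0), where the scalar eta is the
    root of f(eta) = sum_j max(v_j - eta, 0) - 1, a piecewise linear convex function.
    """
    eta = min(v) - 1.0  # start left of the root, where f(eta) >= 0
    for _ in range(max_iter):
        f = sum(max(x - eta, 0.0) for x in v) - 1.0
        if abs(f) <= tol:
            break
        slope = -sum(1 for x in v if x > eta)  # derivative of f at eta
        eta -= f / slope                       # Newton update
    return [max(x - eta, 0.0) for x in v]


s = project_to_simplex([0.5, 0.2, 0.9])
```

Because the function is convex and piecewise linear, Newton iterates started to the left of the root increase monotonically and reach it after finitely many steps.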
(2) Fixing a target variable S, and solving a clustering indication matrix F
When the target variable S is fixed, this corresponds to solving the following problem:
Figure GDA00036122013800000917
In this case, the optimal solution of the clustering indication matrix F consists of the eigenvectors corresponding to the k smallest eigenvalues of the Laplacian matrix L.
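The F-step can be sketched with NumPy as follows. The sketch assumes the unnormalised Laplacian L = D − (S + Sᵀ)/2, which is a common choice but is not confirmed by the text here; the function name `cluster_indicator` is illustrative.

```python
import numpy as np


def cluster_indicator(S, k):
    """Return F (n x k) and the k smallest eigenvalues of the Laplacian of S."""
    W = (S + S.T) / 2.0                    # symmetrise S (assumption)
    L = np.diag(W.sum(axis=1)) - W         # unnormalised Laplacian L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)   # eigh returns ascending eigenvalues
    return eigvecs[:, :k], eigvals[:k]
```

The columns returned by `np.linalg.eigh` are orthonormal, so the constraint FᵀF = I holds by construction. The number of zero eigenvalues of L equals the number of connected components of the graph defined by S.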
The two steps are iterated alternately until the relative change of the target variable S is less than $10^{-3}$ or the number of iterations exceeds 150, at which point the iterative learning is complete.
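The stopping rule can be sketched as a small driver loop. The two update steps are passed in as callables (`update_S` and `update_F` are hypothetical placeholders for the S-step and F-step), and the relative change is assumed to be measured in the Frobenius norm, which the text does not specify.

```python
def relative_change(S_new, S_old):
    """Frobenius-norm relative change between two matrices (nested lists)."""
    num = sum((a - b) ** 2 for ra, rb in zip(S_new, S_old) for a, b in zip(ra, rb))
    den = sum(b ** 2 for rb in S_old for b in rb)
    return (num / den) ** 0.5 if den > 0 else float("inf")


def alternate(S, update_S, update_F, tol=1e-3, max_iter=150):
    """Alternate the two sub-problems until S stabilises or max_iter is hit."""
    F = update_F(S)                  # initial F from the starting S (assumption)
    iterations = 0
    for iterations in range(1, max_iter + 1):
        S_new = update_S(S, F)       # step (1): fix F, solve for S
        F = update_F(S_new)          # step (2): fix S, solve for F
        change = relative_change(S_new, S)
        S = S_new
        if change < tol:
            break
    return S, F, iterations
```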
In step S6, a node2vec model is introduced to visualize the community structure, and collective social behavior is extracted according to the distribution condition of nodes in the community structure.
Specifically, to facilitate understanding and analysis of the collective social behavior of a social network, the embodiment of the invention can present the community discovery result visually. To this end, a node2vec graph embedding model is introduced. It is a node vectorization model that obtains local information from truncated random walks, treats nodes as words and walks as sentences in order to learn latent representations, and extends the DeepWalk algorithm by changing the way random walk sequences are generated.
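The change to the walk generation can be sketched as a second-order biased random walk on an unweighted graph. The return parameter `p` and in-out parameter `q` come from the node2vec model itself and are not named in this text; the adjacency format and the function name are illustrative assumptions.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=random):
    """Generate one biased walk; adj maps each node to a set of neighbours."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))  # first step is uniform
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)    # return to the previous node
            elif x in adj[prev]:
                weights.append(1.0)        # stay near the previous node
            else:
                weights.append(1.0 / q)    # move away from the previous node
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk
```

With p = q = 1 the walk reduces to the uniform random walk used by DeepWalk. The generated walks would then be fed to a skip-gram model to obtain the node vectors used for visualization.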
The collective social behavior is then extracted. If the nodes in a community structure are sparsely distributed, a minimized circle (the smallest circle that covers all nodes in the community) is used, and the node closest to the center of the circle is taken as the collective social behavior; if the nodes in the community structure are densely distributed, centrality is used to extract the collective social behavior.
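Both extraction rules can be sketched in a few lines. The sketch makes several assumptions that the text does not fix: nodes are 2-D embedding coordinates, the circle center is approximated with a simple farthest-point iteration instead of an exact minimum-enclosing-circle algorithm, and "centrality" is taken to mean degree centrality. How a community is judged sparse or dense is not specified, so the two cases are kept as separate functions.

```python
import math


def enclosing_circle_center(points, iters=2000):
    """Approximate the centre of the smallest circle covering all points."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    for t in range(1, iters + 1):
        # Move a shrinking step towards the currently farthest point.
        fx, fy = max(points, key=lambda pt: math.hypot(pt[0] - cx, pt[1] - cy))
        cx += (fx - cx) / (t + 1)
        cy += (fy - cy) / (t + 1)
    return cx, cy


def representative_sparse(points):
    """Index of the node closest to the centre of the covering circle."""
    cx, cy = enclosing_circle_center(points)
    return min(range(len(points)),
               key=lambda i: math.hypot(points[i][0] - cx, points[i][1] - cy))


def representative_dense(adj):
    """Node with the highest degree; adj maps each node to its neighbour set."""
    return max(adj, key=lambda v: len(adj[v]))
```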
The method for extracting the collective social behavior based on community discovery provided by the embodiment of the invention is further described by two specific embodiments.
First embodiment
A crawler was used to capture 10176 posts published by users on Sina Weibo during the five days from March 1, 2021 to March 5, 2021. The data set was cleaned (advertisements, duplicates, very short posts, and the like were removed), leaving 1584 posts as the initial data set. After jieba word segmentation of the semantic information, the data set is obtained as follows:
1: intelligent nondestructive testing method for construction of semiconductors by using … … full-force cooperation of asset investment control groups in Zhengzhou scientific and technological, utility and national province
2: curie-mada compression ignition spark plug mechanical boost … … stability factor lender ten thousand legal routes
3: median safety in second-stage research of novel epidemic … … candidate vaccine of Russian vector science center
4: chinese successfully develops vaccine triphibian storage and transportation … … service Yangzhou base sea, land, air and space multiple long-distance transportation modes
……
1584: evening news yellow early warning careless road icing new law education … … increasingly matches the vegetable market
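The cleaning and segmentation step that produced the data set above can be sketched as follows. The minimum length, the advertisement keywords, and the function names are hypothetical values chosen for illustration; the patent does not state its actual filtering rules.

```python
def clean_posts(posts, min_length=10, ad_keywords=("advertisement", "promo")):
    """Drop very short posts, advertisement-like posts, and exact duplicates."""
    seen = set()
    kept = []
    for text in posts:
        text = text.strip()
        if len(text) < min_length:
            continue                       # too brief to carry a topic
        if any(k in text.lower() for k in ad_keywords):
            continue                       # looks like an advertisement
        if text in seen:
            continue                       # repeated post
        seen.add(text)
        kept.append(text)
    return kept


def tokenize(posts, tokenizer=str.split):
    """Split each post into tokens with the supplied tokenizer."""
    return [tokenizer(p) for p in posts]
```

For Chinese text, a word segmenter such as `jieba.lcut` would be passed as `tokenizer` in place of the whitespace split used here.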
T topics and the topic distribution of each post are generated using the LDA model. The number of topics is determined according to the evaluation index modularity Q; as shown in Fig. 2, the number of topics is set to 30, and the resulting data matrix (30×1584) is as follows:
Figure GDA0003612201380000111
The affinity matrix (1584×1584) of the microblog data is obtained according to formula (3):
Figure GDA0003612201380000112
According to the community discovery algorithm based on the adaptive loss function, the similarity matrix (1584×1584) with the specified number of connected components is learned as follows:
Figure GDA0003612201380000113
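Once a similarity matrix with the desired number of connected components has been learned, the communities can be read off directly as those components. A minimal sketch, assuming an entry counts as an edge when it exceeds a small threshold `eps` (a hypothetical tolerance) in either direction:

```python
from collections import deque


def communities(S, eps=1e-8):
    """Label each node with the index of its connected component in S."""
    n = len(S)
    label = [-1] * n
    count = 0
    for start in range(n):
        if label[start] != -1:
            continue
        label[start] = count
        queue = deque([start])
        while queue:                       # breadth-first search
            i = queue.popleft()
            for j in range(n):
                if label[j] == -1 and (S[i][j] > eps or S[j][i] > eps):
                    label[j] = count
                    queue.append(j)
        count += 1
    return label
```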
The similarity matrix is visualized by introducing the node2vec graph embedding model, and the result is shown in Fig. 3. Finally, the collective social behavior of the social network is extracted using two different methods, which are marked with two different symbols: method 1 with a diamond and method 2 with a five-pointed star, as shown in Fig. 4.
Second embodiment
The existing Ncut, Louvain, and CAN algorithms are selected for comparison with the extraction method proposed by the invention. The WebKB, BBC news report, and 20NGs news document data sets are used as verification data, and the modularity Q is used to measure the stability and cohesion of the community discovery results. The verification results are shown in Fig. 5, from which it can be seen that the embodiment of the invention leads in performance.
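For reference, the modularity Q of a partition can be computed as in the sketch below. It uses the standard definition for an unweighted, undirected graph, Q = Σ_c [l_c/m − (d_c/2m)²], where l_c is the number of edges inside community c, d_c is the total degree of its nodes, and m is the number of edges; whether the patent uses exactly this variant is an assumption.

```python
def modularity(edges, community):
    """Modularity Q of a partition; community maps each node to its label."""
    m = len(edges)
    degree = {}
    internal = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community[u] == community[v]:
            c = community[u]
            internal[c] = internal.get(c, 0) + 1   # edge inside community c
    total_degree = {}
    for node, d in degree.items():
        c = community[node]
        total_degree[c] = total_degree.get(c, 0) + d
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in total_degree.items())
```

Higher values of Q indicate that more edges fall inside communities than would be expected if edges were placed at random with the same node degrees.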
According to the method for extracting collective social behavior based on community discovery provided by the embodiment of the invention, a similarity matrix is learned using the adaptive loss function, so that the initial data of the social network is processed with high quality and the reconstruction of the social network and the community discovery are completed. This ensures that the output community structure has higher cohesion and stability, realizes the extraction of collective social behavior in online social networks, and gives the result excellent accuracy and robustness.
Next, an extraction system of collective social behavior based on community discovery according to an embodiment of the present invention is described with reference to the accompanying drawings.
FIG. 6 shows an extraction system of collective social behavior based on community discovery according to one embodiment of the invention.
As shown in Fig. 6, the system 10 includes: an acquisition and preprocessing module 100, a topic distribution generation module 200, an affinity matrix construction module 300, an objective function determination module 400, an iterative learning module 500, and a collective social behavior extraction module 600.
The acquisition and preprocessing module 100 is used for capturing posts published by a plurality of users in a social network as an initial data set, and for cleaning and segmenting the initial data set to obtain the data set. The topic distribution generation module 200 is used for processing the data set with an LDA model to generate a plurality of topics and the topic distribution of each post. The affinity matrix construction module 300 is used for constructing a similarity calculation function based on sparse representation to solve the similarity between each post and the topic distribution, so as to obtain an affinity matrix. The objective function determination module 400 is used for constructing a community discovery algorithm based on the adaptive loss function and the affinity matrix to determine an objective function. The iterative learning module 500 is used for continuously learning the objective function by an alternating iteration method to obtain the connected components between the posts under the same topic in the affinity matrix, so as to construct a target similarity matrix to determine a community structure in the community network. The collective social behavior extraction module 600 is used for introducing a node2vec model to visualize the community structure and for extracting collective social behavior according to the distribution of nodes in the community structure.
Further, in one embodiment of the invention, the affinity matrix is:
Figure GDA0003612201380000121
where $c_{i,j}$ is the value in the i-th row and j-th column of the affinity matrix, m is the adaptive number of neighbors of a user, and
Figure GDA0003612201380000122
is the L2-norm of the topic distributions of nodes i and j.
Further, in one embodiment of the invention, the objective function is:
$$\min_{S,F}\ \|C^{(v)} - S\|_{\sigma} + \varepsilon\,\mathrm{Tr}(F^{T} L F)$$
$$\text{s.t.}\ \mathbf{1}^{T} s_i = 1,\ s_{i,j} \ge 0,\ F^{T} F = I$$
where S is the target variable, C is the affinity matrix, $\sigma$ is the adaptive parameter, $\varepsilon$ is the balance factor, F is the clustering indication matrix, L is the Laplacian matrix of the target variable, $\mathrm{Tr}(\cdot)$ is the trace, $\mathbf{1}^{T} s_i$ is the sum of all values in the i-th row, $s_{i,j}$ is the value in the i-th row and j-th column of S, and I is the identity matrix.
Further, in one embodiment of the present invention, the iterative learning module 500 is specifically configured to:
Using the alternating iteration method, the clustering indication matrix is first fixed to solve the target variable, and then the target variable is fixed to solve the clustering indication matrix, until the relative change of the target variable is less than $10^{-3}$ or the number of iterations exceeds 150. The connected components between the posts under the same topic are thereby obtained, and the target similarity matrix is then constructed to determine the community structure in the community network.
Further, in one embodiment of the present invention, the collective social behavior extraction module 600 extracts the collective social behavior as follows: if the nodes in the community structure are sparsely distributed, a minimized circle is used to cover all nodes in the community, and the node closest to the center of the circle is taken as the collective social behavior; if the nodes in the community structure are densely distributed, centrality is used to extract the collective social behavior.
It should be noted that the foregoing explanation of the embodiment of the method for extracting the collective social behavior based on the community discovery is also applicable to the system of the embodiment, and will not be repeated herein.
According to the system for extracting collective social behavior based on community discovery provided by the embodiment of the invention, a similarity matrix is learned using the adaptive loss function, so that the initial data of the social network is processed with high quality and the reconstruction of the social network and the community discovery are completed. This ensures that the output community structure has higher cohesion and stability, realizes the extraction of collective social behavior in online social networks, and gives the result excellent accuracy and robustness.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of the different embodiments or examples, provided that they do not contradict one another.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the invention, and that changes, modifications, substitutions, and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (2)

1. The extraction method of the collective social behavior based on community discovery is characterized by comprising the following steps of:
step S1, capturing posts published by a plurality of users in a social network as an initial data set, and cleaning and word segmentation are carried out on the initial data set to obtain a data set;
s2, processing the data set by using an LDA model to generate a plurality of topics and topic distribution of each post;
step S3, constructing a similarity calculation function based on sparse expression, and solving the similarity between each post and the topic distribution to obtain an affinity matrix, wherein the affinity matrix is as follows:
Figure FDA0004227765970000011
wherein $c_{i,j}$ is the value in the i-th row and j-th column of the affinity matrix, m is the adaptive number of neighbors of a user, and
Figure FDA0004227765970000012
is the L2-norm of the topic distributions of nodes i and j;
step S4, constructing a community discovery algorithm based on the adaptive loss function and the affinity matrix to determine an objective function, wherein the objective function is:
$$\min_{S,F}\ \|C^{(v)} - S\|_{\sigma} + \varepsilon\,\mathrm{Tr}(F^{T} L F)$$
$$\text{s.t.}\ \mathbf{1}^{T} s_i = 1,\ s_{i,j} \ge 0,\ F^{T} F = I$$
wherein S is the target variable, C is the affinity matrix, $\sigma$ is the adaptive parameter, $\varepsilon$ is the balance factor, F is the clustering indication matrix, L is the Laplacian matrix of the target variable, $\mathrm{Tr}(\cdot)$ is the trace, $\mathbf{1}^{T} s_i$ is the sum of all values in the i-th row, $s_{i,j}$ is the value in the i-th row and j-th column of S, and I is the identity matrix;
step S5, continuously learning the objective function by using an alternating iteration method to obtain the connected components between the posts under the same topic in the affinity matrix, so as to construct a target similarity matrix to determine a community structure in a community network, which specifically comprises: using the alternating iteration method, first fixing the clustering indication matrix to solve the target variable, and then fixing the target variable to solve the clustering indication matrix, until the relative change of the target variable is less than $10^{-3}$ or the number of iterations exceeds 150, thereby obtaining the connected components between the posts under the same topic, and then constructing the target similarity matrix to determine the community structure in the community network;
step S6, a node2vec model is introduced to visualize the community structure, and collective social behavior is extracted according to the distribution condition of nodes in the community structure, wherein if the distribution of the nodes in the community structure is sparse, a minimized circle is adopted to cover all nodes in the community, and the node closest to the center of the circle is taken as the collective social behavior; and if the nodes in the community structure are densely distributed, extracting the collective social behavior by using centrality.
2. A community discovery-based extraction system for collective social behavior, comprising:
the acquisition and preprocessing module is used for capturing posts published by a plurality of users in the social network as an initial data set, and cleaning and word segmentation are carried out on the initial data set to obtain a data set;
the topic distribution generation module is used for processing the data set by utilizing an LDA model to generate a plurality of topics and topic distribution of each post;
an affinity matrix building module, configured to construct a similarity calculation function based on sparse representation, to solve the similarity between each post and the topic distribution, so as to obtain an affinity matrix, where the affinity matrix is:
Figure FDA0004227765970000021
wherein $c_{i,j}$ is the value in the i-th row and j-th column of the affinity matrix, m is the adaptive number of neighbors of a user, and
Figure FDA0004227765970000022
is the L2-norm of the topic distributions of nodes i and j;
the objective function determining module is used for constructing a community finding algorithm based on the adaptive loss function and the affinity matrix to determine an objective function, wherein the objective function is as follows:
$$\min_{S,F}\ \|C^{(v)} - S\|_{\sigma} + \varepsilon\,\mathrm{Tr}(F^{T} L F)$$
$$\text{s.t.}\ \mathbf{1}^{T} s_i = 1,\ s_{i,j} \ge 0,\ F^{T} F = I$$
wherein S is the target variable, C is the affinity matrix, $\sigma$ is the adaptive parameter, $\varepsilon$ is the balance factor, F is the clustering indication matrix, L is the Laplacian matrix of the target variable, $\mathrm{Tr}(\cdot)$ is the trace, $\mathbf{1}^{T} s_i$ is the sum of all values in the i-th row, $s_{i,j}$ is the value in the i-th row and j-th column of S, and I is the identity matrix;
the iterative learning module is used for continuously learning the objective function by using an alternating iteration method to obtain the connected components between the posts under the same topic in the affinity matrix, so as to construct a target similarity matrix to determine a community structure in a community network, and is specifically used for: using the alternating iteration method, first fixing the clustering indication matrix to solve the target variable, and then fixing the target variable to solve the clustering indication matrix, until the relative change of the target variable is less than $10^{-3}$ or the number of iterations exceeds 150, thereby obtaining the connected components between the posts under the same topic, and then constructing the target similarity matrix to determine the community structure in the community network;
the collective social behavior extraction module is used for introducing a node2vec model to visualize the community structure and for extracting collective social behavior according to the distribution of nodes in the community structure, wherein if the nodes in the community structure are sparsely distributed, a minimized circle is used to cover all nodes in the community, and the node closest to the center of the circle is taken as the collective social behavior; and if the nodes in the community structure are densely distributed, the collective social behavior is extracted by using centrality.
CN202111638174.7A 2021-12-29 2021-12-29 Method and system for extracting collective social behavior based on community discovery Active CN114707044B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111638174.7A CN114707044B (en) 2021-12-29 2021-12-29 Method and system for extracting collective social behavior based on community discovery

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111638174.7A CN114707044B (en) 2021-12-29 2021-12-29 Method and system for extracting collective social behavior based on community discovery

Publications (2)

Publication Number Publication Date
CN114707044A CN114707044A (en) 2022-07-05
CN114707044B true CN114707044B (en) 2023-06-23

Family

ID=82166741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111638174.7A Active CN114707044B (en) 2021-12-29 2021-12-29 Method and system for extracting collective social behavior based on community discovery

Country Status (1)

Country Link
CN (1) CN114707044B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112269922A (en) * 2020-10-14 2021-01-26 西华大学 Community public opinion key character discovery method based on network representation learning

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105760426B (en) * 2016-01-28 2018-12-21 仲恺农业工程学院 A kind of theme community's method for digging towards online social networks
US10498690B2 (en) * 2017-03-01 2019-12-03 Oath Inc. Latent user communities
CN107705213B (en) * 2017-07-17 2022-01-28 西安电子科技大学 Overlapped community discovery method of static social network
CN108776844B (en) * 2018-04-13 2021-09-14 中国科学院信息工程研究所 Social network user behavior prediction method based on context perception tensor decomposition
CN110457477A (en) * 2019-08-09 2019-11-15 东北大学 A kind of Interest Community discovery method towards social networks
CN111143704B (en) * 2019-12-20 2023-10-20 北京理工大学 Online community friend recommendation method and system integrating user influence relationship
CN111292197A (en) * 2020-01-17 2020-06-16 福州大学 Community discovery method based on convolutional neural network and self-encoder

Also Published As

Publication number Publication date
CN114707044A (en) 2022-07-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant