CN111428741A

CN111428741A - Network community discovery method and device, electronic equipment and readable storage medium

Info

Publication number: CN111428741A
Application number: CN201811565878.4A
Authority: CN
Inventors: 陈川; 钱慧; 林志伟; 凌国惠; 张宗一; 郑子彬
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-12-20
Filing date: 2018-12-20
Publication date: 2020-07-17
Anticipated expiration: 2038-12-20
Also published as: CN111428741B

Abstract

An embodiment of the invention provides a network community discovery method and device, electronic equipment and a readable storage medium, and belongs to the technical field of community discovery. The method comprises: obtaining multi-source social network data of social network users, wherein the multi-source social network data comprises data corresponding to at least two data sources; respectively determining the association relationship between each auxiliary data source and the main data source based on the user relationship between the social network users corresponding to each auxiliary data source and the social network users corresponding to the main data source; clustering the social network users corresponding to the main data source based on the data of the main data source, the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source to obtain a clustering result; and obtaining the network community division result of the social network users corresponding to the main data source according to the clustering result. The scheme of the embodiment of the invention can effectively improve the accuracy of community discovery.

Description

Network community discovery method and device, electronic equipment and readable storage medium

Technical Field

The invention relates to the technical field of community discovery, in particular to a network community discovery method and device, electronic equipment and a readable storage medium.

Background

Community discovery is a generalized clustering task: it discovers the community structure in a network by dividing and extracting sets of entities with similar attributes from the network structure. A network community corresponds to a cluster (class) in clustering.

In recent years, various community discovery algorithms have been proposed, but most of them are single-source algorithms. A single-source community discovery algorithm divides data instances into several communities according to the data features of a single data source, so that the similarity between data instances within a community is high and the similarity between data instances in different communities is low. Although existing single-source algorithms can perform community discovery, they rely on the data of a single data source and therefore suffer from a single perspective and low fault tolerance, so the accuracy of the community discovery result is low.

Disclosure of Invention

The object of the present invention is to solve at least one of the technical problems of the prior art. The scheme provided by the embodiment of the invention is as follows:

In a first aspect, the present invention provides a method for discovering a network community, the method including:

obtaining multi-source social network data of social network users, wherein the multi-source social network data comprises data corresponding to at least two data sources;

respectively determining the association relationship between each auxiliary data source and a main data source based on the user relationship between the social network users corresponding to each auxiliary data source and the social network users corresponding to the main data source, wherein the main data source is a designated one of the at least two data sources, and each auxiliary data source is a data source other than the main data source among the at least two data sources;

and performing community division on the social network users corresponding to the main data source based on the data of the main data source, the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source, to obtain the network community division result of the social network users corresponding to the main data source.

In an alternative of the first aspect, the respectively determining an association relationship between each auxiliary data source and a main data source based on a user relationship between a social network user corresponding to each auxiliary data source and a social network user corresponding to the main data source includes:

respectively constructing a relationship matrix between each auxiliary data source and the main data source based on the user relationship between the social network user corresponding to each auxiliary data source and the social network user corresponding to the main data source;

the relationship matrix corresponding to each auxiliary data source is used for representing the association relationship between that auxiliary data source and the main data source, and the elements in the relationship matrix are used for representing the user relationship between the social network users corresponding to that auxiliary data source and the social network users corresponding to the main data source.

In an alternative of the first aspect, the number of rows of the relationship matrix corresponding to each auxiliary data source is the number of social network users corresponding to the auxiliary data source, the number of columns is the number of social network users corresponding to the main data source, and the user relationship indicates whether the social network users corresponding to the row where the element is located in the relationship matrix and the social network users corresponding to the column where the element is located are the same user.
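The relationship matrix described above can be sketched as follows. This is a minimal illustration, not the method as claimed: it assumes that users in different data sources carry a shared identifier and that "same user" is encoded as 1 and "different user" as 0. The function name `build_relationship_matrix` and the sample user names are hypothetical.

```python
import numpy as np

def build_relationship_matrix(aux_user_ids, main_user_ids):
    """Return S with one row per auxiliary-source user and one column per
    main-source user. S[i, j] is 1 when row user i and column user j are the
    same user, otherwise 0 (ASSUMPTION: a binary encoding)."""
    main_index = {uid: j for j, uid in enumerate(main_user_ids)}
    S = np.zeros((len(aux_user_ids), len(main_user_ids)), dtype=int)
    for i, uid in enumerate(aux_user_ids):
        j = main_index.get(uid)
        if j is not None:  # auxiliary user also exists in the main source
            S[i, j] = 1
    return S

# Hypothetical example: the auxiliary source covers only some main-source
# users and contains one user ("erin") absent from the main source.
main_users = ["alice", "bob", "carol", "dave"]
aux_users = ["carol", "alice", "erin"]
S = build_relationship_matrix(aux_users, main_users)
print(S)
```

A row of zeros corresponds to an auxiliary-source user with no counterpart in the main data source, and a column of zeros to a main-source user with no data in that auxiliary source.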

In an alternative of the first aspect, based on the data of the main data source, the data of each auxiliary data source, and the association relationship between each auxiliary data source and the main data source, performing community division on the social network users corresponding to the main data source to obtain a division result of the network community of the social network users corresponding to the main data source, includes:

obtaining a first objective function through a first clustering algorithm based on data of the main data source, wherein the first objective function comprises a community indication matrix before solving corresponding to the main data source;

obtaining a second objective function through a second clustering algorithm based on the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source;

obtaining a final objective function based on the first objective function and the second objective function;

solving the final objective function to obtain a solved community indication matrix corresponding to the main data source;

and obtaining the network community division result of the social network user corresponding to the main data source based on the solved community indication matrix.

In an alternative of the first aspect, the number of rows of the solved community indication matrix is the number of social network users corresponding to the main data source, and the number of columns of the solved community indication matrix is the number of pre-divided network communities.

In an alternative of the first aspect, obtaining a second objective function through a second clustering algorithm based on data of each auxiliary data source and an association relationship between each auxiliary data source and a main data source includes:

obtaining a sub-objective function corresponding to each auxiliary data source through a second clustering algorithm based on the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source;

and obtaining a second objective function based on the sub-objective functions corresponding to each auxiliary data source.

In an alternative of the first aspect, obtaining the second objective function based on the sub-objective function corresponding to each auxiliary data source includes:

and obtaining the second objective function based on the sub-objective function corresponding to each auxiliary data source and the weight corresponding to each auxiliary data source.

In an alternative of the first aspect, obtaining a first objective function through a first clustering algorithm based on data of a main data source includes:

calculating a user similarity matrix corresponding to the main data source based on the data of the main data source;

obtaining a first objective function through a first clustering algorithm based on a user similarity matrix corresponding to a main data source;

obtaining the sub-objective function corresponding to each auxiliary data source through a second clustering algorithm based on the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source includes:

calculating a user similarity matrix corresponding to each auxiliary data source based on the data of each auxiliary data source;

obtaining the sub-objective function corresponding to each auxiliary data source through the second clustering algorithm based on the user similarity matrix corresponding to each auxiliary data source and the relationship matrix corresponding to each auxiliary data source;

the relationship matrix corresponding to each auxiliary data source is used for representing the association relationship between that auxiliary data source and the main data source, is a matrix constructed based on the user relationship between the social network users corresponding to that auxiliary data source and the social network users corresponding to the main data source, and its elements are used for representing the user relationship between the social network users corresponding to that auxiliary data source and the social network users corresponding to the main data source.

In an alternative of the first aspect, the second clustering algorithm is a spectral clustering algorithm, and the sub-objective functions are:

$\mathrm{Tr}(U^{T} L_{v',v} U)$

wherein $\mathrm{Tr}$ denotes the trace of a matrix; $U$ denotes the community indication matrix before solving corresponding to the main data source; $U^{T}$ denotes the transpose of $U$; $v$ denotes the main data source; $v'$ denotes an auxiliary data source; $S_{v',v}$ denotes the relationship matrix between the auxiliary data source $v'$ and the main data source $v$; $S_{v,v'}$ denotes the transpose of $S_{v',v}$; $A_{v'}$ denotes the user similarity matrix corresponding to the auxiliary data source $v'$; the degree matrix is that of the matrix $S_{v,v'} A_{v'} S_{v',v}$; and $\|\cdot\|_F$ denotes the F-norm (Frobenius norm).
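As a minimal sketch of how such a sub-objective could be evaluated, the following assumes that $L_{v',v}$ is the unnormalized graph Laplacian $D - W$ with $W = S_{v,v'} A_{v'} S_{v',v}$ and $D$ the degree matrix of $W$. The text above does not state whether a normalized or unnormalized Laplacian is intended, so this choice, the function name `sub_objective` and the sample matrices are assumptions for illustration only.

```python
import numpy as np

def sub_objective(U, S_aux_main, A_aux):
    """Evaluate Tr(U^T L U) for one auxiliary data source.

    U          : (n_main, k) community indication matrix of the main source
    S_aux_main : (n_aux, n_main) relationship matrix S_{v',v}
    A_aux      : (n_aux, n_aux) user similarity matrix A_{v'}
    """
    # Project the auxiliary similarities onto the main-source users.
    W = S_aux_main.T @ A_aux @ S_aux_main   # shape (n_main, n_main)
    D = np.diag(W.sum(axis=1))              # degree matrix of W
    L = D - W                               # ASSUMPTION: unnormalized Laplacian
    return float(np.trace(U.T @ L @ U))

# Hypothetical data: 3 main-source users; the auxiliary source knows only
# users 0 and 1 and considers them similar.
S = np.array([[1, 0, 0],
              [0, 1, 0]])
A = np.array([[1.0, 0.8],
              [0.8, 1.0]])
U_together = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # users 0, 1 together
U_apart = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])     # users 0, 1 split

cost_together = sub_objective(U_together, S, A)
cost_apart = sub_objective(U_apart, S, A)
print(cost_together, cost_apart)
```

Under this assumption the trace equals half the sum of $W_{ij}\|U_i - U_j\|^2$ over all user pairs, so an indication matrix that separates users the auxiliary source regards as similar incurs a larger cost.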

In an alternative of the first aspect, if the second objective function is a function obtained based on the sub-objective function corresponding to each auxiliary data source and the weight corresponding to each auxiliary data source, the second objective function is:

wherein $V$ denotes the total number of main and auxiliary data sources, and $\mu_{v'}$ denotes the weight of the auxiliary data source $v'$.

In an alternative of the first aspect, the method further comprises:

constructing a regular term of a weight vector, wherein the weight vector is a vector formed by weights corresponding to each auxiliary data source;

obtaining a final objective function based on the first objective function and the second objective function, including:

and obtaining a final objective function according to the first objective function, the second objective function and the regular terms of the weight vector, wherein the weight vector is a term to be solved in the final objective function.

In an alternative of the first aspect, the regularization term of the weight vector is:

where μ denotes the weight vector, β denotes a first regularization coefficient, and $\|\mu\|_2^2$ denotes the square of the 2-norm of μ.

In an alternative of the first aspect, the method further comprises:

acquiring Must-link supervision information, wherein the Must-link supervision information is used for identifying that two social network users belong to the same network community;

constructing a constraint function according to the Must-link supervision information;

and obtaining a final objective function based on the first objective function, the second objective function and the constraint function.

In an alternative of the first aspect, constructing the constraint function according to the Must-link supervision information includes:

constructing a constraint matrix according to the Must-link supervision information;

and obtaining a constraint function based on the constraint matrix and the community indication matrix before solving corresponding to the main data source.

In an alternative of the first aspect, each row of elements in the constraint matrix represents one piece of Must-link supervision information, the number of rows of the constraint matrix is the number of pieces of Must-link supervision information, and the number of columns of the constraint matrix is the number of social network users corresponding to the main data source.

In an alternative of the first aspect, the constraint function is:

$\gamma\|Z\|_1$

wherein $\gamma$ is a second regularization coefficient, $Z = MU$, $M$ denotes the constraint matrix, $U$ denotes the community indication matrix before solving corresponding to the main data source, and $\|Z\|_1$ denotes the 1-norm of $Z$.
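The constraint function can be illustrated with a short sketch. The text fixes only the shape of the constraint matrix (one row per piece of Must-link information, one column per main-source user). The entries used here, +1 and -1 at the two linked users so that each row of $Z = MU$ is the difference of their indication vectors, are an assumption, as is reading the 1-norm as the sum of absolute values of all entries. The names `build_must_link_matrix` and `constraint_value` are hypothetical.

```python
import numpy as np

def build_must_link_matrix(pairs, n_users):
    """One row per Must-link pair (i, j).
    ASSUMPTION: the row holds +1 at column i and -1 at column j."""
    M = np.zeros((len(pairs), n_users))
    for row, (i, j) in enumerate(pairs):
        M[row, i] = 1.0
        M[row, j] = -1.0
    return M

def constraint_value(M, U, gamma):
    """gamma * ||Z||_1 with Z = M U.
    ASSUMPTION: entrywise 1-norm (sum of absolute values)."""
    Z = M @ U
    return gamma * np.abs(Z).sum()

# Hypothetical example: users 0 and 1 already share a community,
# users 0 and 2 do not.
U = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
M = build_must_link_matrix([(0, 1), (0, 2)], n_users=3)
penalty = constraint_value(M, U, gamma=0.5)
print(penalty)
```

With these entries a satisfied Must-link pair contributes zero to the penalty and a violated pair contributes a positive amount, which is why minimizing the term pulls linked users toward the same community.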

In an alternative of the first aspect, the final objective function comprises a first objective function, a second objective function, regular terms of the weight vector and a constraint function.

In an alternative of the first aspect, solving the final objective function to obtain a solved community indication matrix includes:

and solving the final objective function by using the Alternating Direction Method of Multipliers (ADMM) and the Lagrange multiplier method to obtain a solved community indication matrix.

In an alternative of the first aspect, solving the final objective function by using ADMM and the Lagrange multiplier method to obtain a solved community indication matrix includes:

initializing the community indication matrix U before solving, the weight vector μ and the Lagrange multiplier;

and repeatedly executing the operations of fixing μ and iteratively updating U, and of fixing U and Z and iteratively updating μ, until the convergence condition is met, wherein U at the time the convergence condition is met is the solved community indication matrix.
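The update formulas of the alternating scheme are not given in the text above. As a hedged illustration of one building block only: when a term of the form $\gamma\|Z\|_1$ is handled with an alternating direction method, the update of $Z$ conventionally reduces to element-wise soft-thresholding, which solves $\min_Z \tau\|Z\|_1 + \tfrac{1}{2}\|Z - X\|_F^2$. The penalty parameter that determines $\tau$ and the matrix $X$ are assumptions here and do not come from the source.

```python
import numpy as np

def soft_threshold(X, tau):
    """Element-wise soft-thresholding: shrink every entry of X toward zero
    by tau and set entries with magnitude below tau to zero."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

# Hypothetical input for one Z-update.
X = np.array([[-2.0, 0.3],
              [0.5, 1.5]])
Z = soft_threshold(X, tau=0.5)
print(Z)
```

Small entries are set exactly to zero, which matches the intent of the 1-norm constraint: most rows of $Z = MU$ should vanish, meaning most Must-link pairs end up in the same community.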

In an alternative of the first aspect, obtaining a community division result of the network community based on the solved community indication matrix includes:

and clustering the solved community indication matrix by adopting a K-means algorithm to obtain a community division result of the network community.
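The final step can be sketched as a plain K-means (Lloyd iteration) over the rows of the solved community indication matrix, one row per main-source user. This is an illustrative implementation under assumptions: the random initialization, the iteration cap and the function name `kmeans_rows` are not specified by the source.

```python
import numpy as np

def kmeans_rows(U, k, n_iter=100, seed=0):
    """Cluster the rows of U into k groups with Lloyd's algorithm."""
    U = np.asarray(U, dtype=float)
    rng = np.random.default_rng(seed)
    # ASSUMPTION: initialize the centers with k distinct rows of U.
    centers = U[rng.choice(len(U), size=k, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every center.
        dists = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = U[labels == c]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return labels

# Hypothetical solved indication matrix for four users and two communities.
U_solved = np.array([[1.0, 0.0],
                     [0.9, 0.1],
                     [0.0, 1.0],
                     [0.1, 0.9]])
labels = kmeans_rows(U_solved, k=2)
print(labels)
```

Users whose rows fall in the same cluster are assigned to the same network community; the numeric label values themselves carry no meaning.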

In a second aspect, the present invention provides an apparatus for discovering a network community, the apparatus comprising:

the multi-source social network data acquisition module is used for acquiring multi-source social network data of a social network user, and the multi-source social network data comprises data corresponding to at least two data sources;

the data source relation determining module is used for respectively determining the association relation between each auxiliary data source and the main data source based on the user relation between the social network user corresponding to each auxiliary data source and the social network user corresponding to the main data source, wherein the main data source is one of at least two specified data sources, and the auxiliary data source is a data source except the main data source in the at least two data sources;

and the community division result determining module is used for clustering the social network users corresponding to the main data source based on the data of the main data source, the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source, to obtain the network community division result of the social network users corresponding to the main data source.

In an alternative of the second aspect, the data source relationship determining module is specifically configured to:

In an alternative of the second aspect, the number of rows of the relationship matrix corresponding to each auxiliary data source is the number of social network users corresponding to the auxiliary data source, the number of columns is the number of social network users corresponding to the main data source, and the user relationship indicates whether the social network users corresponding to the row where the element is located in the relationship matrix and the social network users corresponding to the column where the element is located are the same user.

In an alternative of the second aspect, the community division result determining module is specifically configured to:

In an alternative of the second aspect, the number of rows of the solved community indication matrix is the number of social network users corresponding to the main data source, and the number of columns of the solved community indication matrix is the number of pre-divided network communities.

In an alternative of the second aspect, the community division result determining module is specifically configured to, when obtaining the second objective function through the second clustering algorithm based on the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source:

In an alternative of the second aspect, when the community partition result determining module obtains the first objective function through the first clustering algorithm based on the data of the main data source, the community partition result determining module is specifically configured to:

when obtaining the sub-objective function corresponding to each auxiliary data source through the second clustering algorithm based on the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source, the community division result determining module is specifically configured to:

In an alternative of the second aspect, the second clustering algorithm is a spectral clustering algorithm, and the sub-objective functions are:

$\mathrm{Tr}(U^{T} L_{v',v} U)$

wherein $\mathrm{Tr}$ denotes the trace of a matrix; $U$ denotes the community indication matrix before solving corresponding to the main data source; $U^{T}$ denotes the transpose of $U$; $v$ denotes the main data source; $v'$ denotes an auxiliary data source; $S_{v',v}$ denotes the relationship matrix between the auxiliary data source $v'$ and the main data source $v$; $S_{v,v'}$ denotes the transpose of $S_{v',v}$; and $A_{v'}$ denotes the user similarity matrix corresponding to the auxiliary data source $v'$.

In an alternative of the second aspect, if the second objective function is a function obtained based on the sub-objective function corresponding to each auxiliary data source and the weight corresponding to each auxiliary data source, the second objective function is:

wherein:

In an alternative of the second aspect, the apparatus further comprises:

the first regular term building module is used for building regular terms of weight vectors, and the weight vectors are vectors formed by weights corresponding to each auxiliary data source;

the community division result determining module is specifically configured to, when obtaining the final objective function based on the first objective function and the second objective function:

In an alternative of the second aspect, the regularization term of the weight vector is:

wherein $\|\mu\|_2^2$ denotes the square of the 2-norm of μ.

In an alternative of the second aspect, the apparatus further comprises a constraint function construction module, the constraint function construction module being configured to:

acquiring Must-link supervision information, wherein the Must-link supervision information is used for identifying that two social network users belong to the same network community;

In an alternative of the second aspect, when the constraint function building module builds the constraint function according to the Must-link supervision information, the constraint function building module is specifically configured to:

In an alternative of the second aspect, each row of elements in the constraint matrix represents one piece of Must-link supervision information, the number of rows of the constraint matrix is the number of pieces of Must-link supervision information, and the number of columns of the constraint matrix is the number of social network users corresponding to the main data source.

In an alternative of the second aspect, the constraint function is:

$\gamma\|Z\|_1$

In an alternative of the second aspect, the final objective function comprises a first objective function, a second objective function, regular terms of the weight vector and a constraint function.

In an alternative of the second aspect, the community division result determining module is specifically configured to, when solving the final objective function to obtain a solved community indication matrix:

and solving the final objective function by using the Alternating Direction Method of Multipliers (ADMM) and the Lagrange multiplier method to obtain a solved community indication matrix.

In an alternative of the second aspect, the community division result determining module is specifically configured to, when solving the final objective function by using ADMM and the Lagrange multiplier method to obtain a solved community indication matrix:

In an alternative of the second aspect, when the community division result determining module obtains the community division result of the network community based on the solved community indication matrix, the community division result determining module is specifically configured to:

In a third aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes a processor and a memory; the memory has stored therein readable instructions which, when loaded and executed by the processor, implement a method of discovery of a network community as set forth in the first aspect or any of the alternatives to the first aspect above.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, in which readable instructions are stored, and when the readable instructions are loaded and executed by a processor, the method for discovering a network community as shown in the first aspect or any alternative of the first aspect is implemented.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

according to the scheme provided by the embodiment of the invention, one main data source can be selected according to actual requirements, and the information of a plurality of auxiliary data sources can be synthesized for community discovery.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly described below.

FIG. 1 is a schematic diagram illustrating a network community discovery method according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a network community discovery method according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating relationships between nodes in three data sources in an example of the invention;

FIG. 4 is a schematic structural diagram illustrating a network community discovery apparatus according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a server according to an embodiment of the present invention.

Detailed Description

In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.

As used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any one of, and all combinations of one or more of, the associated listed items.

For better illustration and understanding of the solutions of the embodiments of the present invention, the following briefly describes the technical solutions related to the solutions provided in the embodiments of the present invention.

(1) Spectral clustering

The spectral clustering algorithm converts the community discovery problem into a graph cut problem, so that nodes within a community (one node represents one user) have high similarity and nodes in different communities have low similarity. For a given data set (the data corresponding to the users to be partitioned), the algorithm first uses a similarity function to calculate the similarity between data instances and constructs an undirected weighted graph, then constructs a Laplacian matrix from the similarity matrix, computes the eigenvalues and eigenvectors of the Laplacian matrix, selects the eigenvectors corresponding to the K smallest eigenvalues to construct an indication matrix, and finally obtains the clustering result from the indication matrix.
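The steps above can be sketched as follows. This is a minimal illustration assuming the unnormalized Laplacian $L = D - A$. The text does not say which Laplacian variant is used, and the function name `spectral_embedding` and the sample similarity matrix are hypothetical. For brevity, the two-community example splits users by the sign of the second eigenvector instead of running a full clustering step on the indication matrix.

```python
import numpy as np

def spectral_embedding(A, k):
    """Return the eigenvectors of the graph Laplacian that belong to the
    k smallest eigenvalues, one row per user."""
    D = np.diag(A.sum(axis=1))             # degree matrix
    L = D - A                              # ASSUMPTION: unnormalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return eigvecs[:, :k]

# Hypothetical similarity matrix: users 0-1 and users 2-3 are strongly
# similar, with only a weak link between users 1 and 2.
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.01, 0.0],
              [0.0, 0.01, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
embedding = spectral_embedding(A, k=2)
labels = (embedding[:, 1] > 0).astype(int)
print(labels)
```

`numpy.linalg.eigh` is used because the Laplacian of an undirected weighted graph is symmetric, and it returns eigenvalues already sorted from smallest to largest.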

(2) Semi-supervised nonnegative matrix factorization

Like traditional semi-supervised methods, semi-supervised non-negative matrix factorization improves the clustering effect by adding cluster labels and pairwise constraint information (including Must-link and Cannot-link). The objective function of non-negative matrix factorization minimizes the factorization loss, while semi-supervised non-negative matrix factorization additionally uses the constraint information to guide the factorization process; using constraint information is an important way to improve the community discovery effect.

Although various community discovery technologies exist in the prior art, the existing community discovery technologies are generally community division based on data of a single data source, and the community division accuracy is low.

The invention provides a network community discovery method, aiming at solving the problems in the prior art and improving the accuracy of network community division. The invention aims to fuse multi-source information of a multi-source social network by using a multi-view learning mechanism, realize division of network communities based on multi-source data and improve accuracy of network community discovery. In addition, the embodiment of the invention can also effectively solve the problem of data loss of partial sources, and further improve the accuracy of community discovery by adding an automatic screening technology and supervision information Must-link on the basis.

FIG. 1 is a schematic diagram illustrating a network community discovery method in an alternative embodiment of the present invention. As shown in the figure, the method may be divided into two major parts. The first part selects a reasonable similarity calculation method according to the characteristics of each data source in the multi-source social network data (i.e., the multi-source information shown in the figure) to obtain, for each data source, a similarity matrix between users, i.e., a user similarity matrix (the similarity matrix shown in the figure). The second part selects a main data source (the weight of the main data source can be regarded as 1), adopts a multi-view learning mechanism to fuse the other multi-source information, adopts an automatic screening technology to learn the weights of the data sources (such as $W_1$, $W_2$ and $W_3$ shown in the figure, i.e., the weights $\mu_{v'}$ of the auxiliary data sources, which will be described in detail later), and guides community discovery by adding the Must-link supervision information to obtain the final community discovery result, such as the division into three network communities shown in the figure.
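As an illustration of the first part, the following sketch computes a user similarity matrix for a data source whose data are numeric attribute vectors. Cosine similarity is an assumed choice for the example. The description leaves the similarity calculation method to be selected per data source, and the function name and sample data are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Return the (n_users, n_users) matrix of cosine similarities between
    the rows of X, where each row is one user's feature vector."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # leave all-zero rows unchanged
    Xn = X / norms
    return Xn @ Xn.T

# Hypothetical attribute data for three users.
X = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [0.0, 3.0]])
A_sim = cosine_similarity_matrix(X)
print(A_sim)
```

For a topology data source such as friend relationships, the adjacency matrix itself could play the role of the similarity matrix. That, too, is a design choice rather than something the description prescribes.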

The following describes the technical solution of the present invention and how to solve the above technical problems with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.

FIG. 2 is a flowchart illustrating a network community discovery method according to the present invention. As shown in FIG. 2, the method may include the following steps:

Step S110: multi-source social network data of social network users is obtained.

The multi-source social network data is the social network data of at least two data sources, that is, the multi-source data of the social network. It means that a user of a social network (including but not limited to Facebook, Twitter, WeChat, QQ, Sina Weibo, and the like) has information from multiple sources, such as the user's attribute features and behavior features and the interactions between users. Specifically, the data may be topology information of the user (such as friend relationships), attribute information (such as the user's age and gender), behavior information (such as published posts, likes and forwards), and the like.

In practical applications, not all users have all of this multi-source information, so data from some sources may be missing: some data sources have corresponding social network data only for some users. For example, some users have behavior information (i.e. behavior data) and others do not; when behavior information is used as a data source, the social network data corresponding to that data source contains the behavior data of only some users. The number of social network users corresponding to different data sources is therefore likely to differ.

After the multi-source social network data is obtained, one data source needs to be designated as a main data source according to needs, namely, the data source plays a main role in dividing network communities, and the data sources except the main data source in the multiple data sources are called as auxiliary data sources. As can be seen, there is data from one primary data source and data from at least one secondary data source in the multi-source social network data.

The main data source is the data source that plays the main decisive role in the network community division result. In practical applications, which data source is selected as the main data source can be decided according to the community division requirement, that is, one main data source can be selected according to the goal of community discovery. For example, if advertisements need to be delivered, the interest data source of the users may be used as the main data source and the other data sources as auxiliary data sources. As another example, when the degree of intimacy between users needs to be analyzed or the users need to be classified, the direct friend relationship of the users can be used as the main data source and the other data sources as auxiliary data sources.

Step S120: the association relationship between each auxiliary data source and the main data source is determined based on the user relationship between the social network users corresponding to that auxiliary data source and the social network users corresponding to the main data source.

For one auxiliary data source, the user relationship, that is, the relationship between the social network user corresponding to the auxiliary data source and the social network user corresponding to the main data source, may be configured according to actual requirements in practical applications. For example, in an optional manner, the user relationship may refer to whether the social network user corresponding to the auxiliary data source and the social network user corresponding to the main data source are the same user, or may refer to whether the social network user corresponding to the auxiliary data source and the social network user corresponding to the main data source are in a friend relationship, or the like.

The user relationship is the relationship between the users of an auxiliary data source and the users of the main data source, so the association relationship between each auxiliary data source and the main data source can be obtained based on the user relationship corresponding to that auxiliary data source. The association relationship corresponding to an auxiliary data source reflects the relationship between the users of the main data source and the users of that auxiliary data source, and it is through this association relationship that the auxiliary data source is fused with the main data source. For example, when the user relationship is whether the social network user corresponding to the auxiliary data source and the social network user corresponding to the main data source are the same user, the corresponding association relationship reflects which users the auxiliary data source and the main data source have in common.

In addition, the association relationship is determined based on the user relationship between the social network user corresponding to each auxiliary data source and the social network user corresponding to the main data source, so that even if some data sources lack part of data, multi-source data can be effectively fused.

Step S130: community division is performed on the social network users corresponding to the main data source based on the data of the main data source, the data of each auxiliary data source, and the association relationship between each auxiliary data source and the main data source, to obtain the network community division result for the social network users corresponding to the main data source.

As can be seen from the above description, the association relationship corresponding to each type of auxiliary data source reflects the relationship between the main data source and the user corresponding to the auxiliary data source, and therefore, the data of each type of auxiliary data source and the association relationship between each type of auxiliary data source and the main data source can assist the community discovery for the user corresponding to the main data source.

According to the scheme provided by the embodiment of the invention, one main data source can be selected according to actual requirements, and the information of a plurality of auxiliary data sources can be synthesized for carrying out community discovery.

In an optional embodiment of the present invention, the determining, based on a user relationship between a social network user corresponding to each auxiliary data source and a social network user corresponding to a main data source, an association relationship between each auxiliary data source and the main data source includes:

based on the user relationships between the social networking users corresponding to each secondary data source and the social networking users corresponding to the primary data source,

In particular, as an alternative, the association relationship between an auxiliary data source and the main data source may be characterized by a relationship matrix, each element of which corresponds to the user relationship between one social network user in the auxiliary data source and one social network user in the main data source.

In an optional embodiment of the present invention, the number of rows of the relationship matrix corresponding to each auxiliary data source is the number of social network users corresponding to the auxiliary data source, the number of columns is the number of social network users corresponding to the main data source, and the user relationship indicates whether the social network user corresponding to the row where the element is located in the relationship matrix and the social network user corresponding to the column where the element is located are the same user.

As an alternative, the user relationship between the social network user corresponding to the auxiliary data source and the social network user corresponding to the main data source may be whether the social network user corresponding to the auxiliary data source and the social network user corresponding to the main data source are the same user. Specifically, if the social network user corresponding to the auxiliary data source and the social network user corresponding to the main data source are the same user, the value of the element at the corresponding position in the relationship matrix may be 1, and if the social network users are not the same user, the value of the element at the corresponding position in the relationship matrix may be 0.

In this scheme, a relationship matrix corresponding to an auxiliary data source, that is, a relationship matrix between the auxiliary data source and a main data source, reflects the conditions of the same user corresponding to the main data source and the auxiliary data source, and if the auxiliary data source and the main data source correspond to the same user, the relationship (which can be reflected by user similarity) of the same user based on the auxiliary data source can be used to assist the community discovery of the user corresponding to the main data source.

As an example, Fig. 3 shows a schematic diagram of the relationships between nodes corresponding to three data sources (one node corresponds to one social network user). To make nodes in different data sources easy to distinguish, the nodes corresponding to the same data source are drawn on the same plane, as shown in the figure. In this example, the direct friend relationship may be used as the main data source, i.e. source2, and source1 and source3 are the two auxiliary data sources. The lines between the nodes of each data source, i.e. the edges, represent the relationship between two nodes. The weight of an edge between two nodes in the same data source can be the similarity between the two nodes; when the similarity is zero, the two nodes need not be connected.

As can be seen in the figure, in this example the social network users corresponding to the main data source source2 include all users, specifically 7 users. The number of social network users corresponding to source1 is 6, and the number corresponding to source3 is also 6. The social network users corresponding to source1 are all the users corresponding to source2 except the user corresponding to node P₁ in the figure, and the social network users corresponding to source3 are all the users corresponding to source2 except the user corresponding to node P₂. It can be seen that source1 and source2 have 6 users in common, and source3 and source2 also have 6 users in common.

Taking the number of social network users corresponding to the auxiliary data source as the number of rows of the relationship matrix, and the number of social network users corresponding to the main data source as the number of columns, the relationship matrix S₁₂ between source1 and source2 and the relationship matrix S₃₂ between source3 and source2 are obtained as follows:

Taking S₁₂ as an illustration: the element in the first row and first column of S₁₂ indicates whether the user corresponding to the first row and the user corresponding to the first column are the same user, where the rows of S₁₂ correspond to the users of source1 and the columns correspond to the users of source2. A 1 in the matrix marks a user common to source1 and source2. The user corresponding to the third column of S₁₂ is the user corresponding to node P₁; since P₁ is not a user common to source1 and source2, all elements of the third column of S₁₂ are 0.
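The construction of such a relationship matrix can be sketched in a few lines. This is a minimal illustration, not the matrices of Fig. 3: the user IDs `u1` to `u7`, the function name `build_relation_matrix`, and the ordering of rows and columns are all assumptions made for the example.

```python
def build_relation_matrix(aux_users, main_users):
    """Return a len(aux_users) x len(main_users) matrix of 0/1 values.

    Entry [r][c] is 1 when the auxiliary-source user in row r and the
    main-source user in column c are the same user, otherwise 0.
    """
    return [[1 if a == m else 0 for m in main_users] for a in aux_users]


# Hypothetical user IDs: the main source has 7 users and the auxiliary
# source lacks the third one, so the third column is all zeros.
main_users = ["u1", "u2", "u3", "u4", "u5", "u6", "u7"]
aux_users = ["u1", "u2", "u4", "u5", "u6", "u7"]
S = build_relation_matrix(aux_users, main_users)

for row in S:
    print(row)
```

Each row contains exactly one 1 because every auxiliary-source user in this example also appears in the main data source.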

By constructing a relationship matrix between each auxiliary data source and the main data source, the data of different data sources can be fused, which effectively solves the difficulty of fusion caused by missing data in different data sources.

In an optional embodiment of the present invention, based on the data of the main data source, the data of each auxiliary data source, and the association relationship between each auxiliary data source and the main data source, performing community division on the social network users corresponding to the main data source to obtain a division result of the network community of the social network users corresponding to the main data source, including:

The community indication matrix is a matrix used to indicate the community division result, i.e. the target matrix indicating the clustering result. Specifically, the solved community indication matrix is the target matrix of the clustering corresponding to the main data source. Optimizing the final objective function iteratively optimizes the community indication matrix before solving, which yields the required target matrix, i.e. the cluster indication matrix of the clustering result.

Specifically, the number of rows of the solved community indication matrix may be the number of social network users corresponding to the main data source, the number of columns of the solved community indication matrix is the number of pre-partitioned communities, that is, the number of clustered clusters, and each column of the solved community indication matrix is a clustering result corresponding to one cluster.

Optimizing the final objective function means iterating on it under a pre-configured convergence condition (i.e. constraint condition) until the resulting value of the final objective function satisfies the convergence condition. Optionally, the convergence condition may be that the difference between the values of the final objective function after two successive iterations is smaller than a set value, or that the difference between the community indication matrix obtained by the current solution and the one obtained by the previous solution is smaller than a set threshold, and the like, and that each parameter to be solved in the final objective function satisfies its respective preset condition.
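The first convergence condition described above (difference between two successive objective values below a set value) can be sketched as a generic loop. The names `iterate_until_converged`, `step`, `objective`, `tol` and `max_iter` are assumptions for illustration; the toy update rule merely stands in for one optimization pass over the final objective function.

```python
def iterate_until_converged(step, state, objective, tol=1e-6, max_iter=1000):
    """Apply `step` repeatedly until the objective changes by less than `tol`.

    Returns the final state and the number of iterations performed.
    """
    prev = objective(state)
    for iteration in range(1, max_iter + 1):
        state = step(state)
        curr = objective(state)
        if abs(prev - curr) < tol:  # difference between successive values
            return state, iteration
        prev = curr
    return state, max_iter


# Toy stand-in for an optimization pass: halve x, objective is x squared.
final_state, iterations = iterate_until_converged(
    step=lambda x: x / 2.0,
    state=1.0,
    objective=lambda x: x * x,
)
print(final_state, iterations)
```

A cap on the number of iterations is included so that the loop terminates even if the set value is never reached.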

In practical applications, a clustering index may be set when optimizing the objective function, and whether the algorithm has converged may be judged using that index. For example, the clustering index may be NMI (Normalized Mutual Information), ACC (clustering accuracy), and the like. The convergence condition may differ between clustering indexes. For NMI and ACC, for example, the algorithm can be judged to have converged when the value of the objective function decreases slowly and stabilizes while the NMI or ACC index increases slowly and stabilizes.
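As an illustration of one such clustering index, the following sketch computes NMI between two label assignments. The normalization by the geometric mean of the two entropies is an assumption (other normalizations are also in use), and the function name `nmi` is hypothetical.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same users."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information from the joint and marginal label frequencies.
    mutual = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if h_a == 0 or h_b == 0:
        # Assumption: a single-cluster labeling only matches another one.
        return 1.0 if h_a == h_b else 0.0
    return mutual / math.sqrt(h_a * h_b)


print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, labels renamed
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # unrelated partitions
```

The index depends only on how users are grouped, not on the label values, which is why renaming the communities leaves it unchanged.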

In the scheme of the invention, the final objective function includes a first objective function obtained based on the data of the main data source and a second objective function obtained based on the data of the auxiliary data sources and the association relationship between each auxiliary data source and the main data source. Because the second objective function is obtained based on these association relationships, it represents the influence of each auxiliary data source on the clustering of the social network users corresponding to the main data source. The final objective function therefore fuses the multi-source data effectively, and determining the community division result from the community indication matrix obtained by solving the final objective function can greatly improve the accuracy of community discovery compared with existing community discovery techniques.

In an optional embodiment of the present invention, obtaining, by a second clustering algorithm, a second objective function based on data of each auxiliary data source and an association relationship between each auxiliary data source and a main data source includes:

The sub-objective function corresponding to each auxiliary data source represents the influence of that data source on the clustering of the main data source. After the sub-objective function corresponding to each auxiliary data source is determined, a second objective function representing the total influence of the auxiliary data sources on the clustering can therefore be obtained from these sub-objective functions. For example, one alternative is to add the sub-objective functions corresponding to the auxiliary data sources to obtain the second objective function.

In an optional embodiment of the present invention, obtaining the second objective function based on the sub-objective function corresponding to each auxiliary data source includes:

In practical applications, since each auxiliary data source may only provide partial information, and the information of each view angle has different effects on the clusters corresponding to the main data sources, each auxiliary data source may be assigned a weight parameter, i.e. a weight, where the weight corresponding to each auxiliary data source is used to indicate the importance degree of the auxiliary data source to the clusters, i.e. the importance degree of the data of each auxiliary data source, where the weight corresponding to each auxiliary data source is not negative, and the sum of the weights corresponding to all auxiliary data sources is 1.

In an optional embodiment of the present invention, obtaining a first objective function through a first clustering algorithm based on data of a main data source includes:

The user similarity matrix is a matrix used to represent the similarity between users. For each data source, the numbers of rows and columns of the user similarity matrix are generally both equal to the number of social network users corresponding to that data source, and each element is the similarity between the social network user corresponding to its row and the social network user corresponding to its column. For example, if the data source is the behavior information of social network users, the similarity between users can be calculated from the obtained behavior information.

For different types of data sources, different methods for calculating the similarity can be adopted, so that the calculated similarity can better reflect the relationship between users corresponding to a certain data source.

As an alternative, the following provides a way to compute similarity for several different types of data sources.

(1) Graph-based topological structure relationships: a friend relationship can generally be represented by an edge between nodes, and the more common friends two nodes have, the closer their relationship is. For this type of data source, the Jaccard coefficient (Jaccard similarity coefficient) can be used to calculate the similarity between two users. The Jaccard coefficient is calculated as:

$$\mathrm{Jaccard}(A, B) = \frac{|N(A) \cap N(B)|}{|N(A) \cup N(B)|}$$

where A and B represent two different users, N(A) represents the friends of user A, N(A) ∩ N(B) represents the common friends of users A and B, and |N(A) ∪ N(B)| represents the number of all friends of users A and B. The greater the Jaccard coefficient, i.e. Jaccard(A, B), the greater the similarity between users A and B.
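A minimal sketch of this calculation on friend sets follows. The function name `jaccard`, the sample friend lists, and the choice to return 0 when neither user has any friends are assumptions made for the example.

```python
def jaccard(friends_a, friends_b):
    """Jaccard coefficient of two users' friend sets."""
    a, b = set(friends_a), set(friends_b)
    union = a | b
    if not union:
        # Assumption: two users without any friends get similarity 0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical friend lists: two common friends out of four friends in total.
print(jaccard({"x", "y", "z"}, {"y", "z", "w"}))
```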

(2) Numerical attribute relationship: an object type can be determined by the value of the object attribute, and the similarity between two instances can be solved by using a kernel function mode aiming at the numerical attribute of the object.

The cosine kernel function measures the similarity of two objects by the cosine of the angle between their vectors (in the range 0 to 1 for non-negative attribute vectors). Geometrically, the smaller the angle between two vectors in a multi-dimensional space (the larger the cosine), the more the vectors point in the same direction and the greater the similarity. The Gaussian kernel function is used to represent the weight of the edge connecting two nodes in the data graph structure, i.e. the similarity.

For example, let X_i = (x_i1; x_i2; …; x_im) and X_j = (x_j1; x_j2; …; x_jm) be the attribute vectors of two samples, where m is the dimension of each sample vector. The similarity of the two samples can be obtained using a cosine kernel function and a Gaussian kernel function respectively, as follows:

cosine kernel function:

$$\cos(X_i, X_j) = \frac{\langle X_i, X_j \rangle}{\|X_i\| \cdot \|X_j\|}$$

Gaussian kernel function: an exponential function that decays with the squared Euclidean distance ‖X_i − X_j‖², with the rate of decay set by γ,

where ⟨X_i, X_j⟩ represents the dot product of the two vectors, ‖X_i‖·‖X_j‖ represents the product of the 2-norms of the two vectors, ‖X_i − X_j‖ represents the Euclidean distance, and γ is a scaling parameter used to control the abrupt changes in similarity caused by large differences in Euclidean distance, so that the rate at which the kernel output decreases as the distance increases can be adjusted.
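The two kernels can be sketched as follows. The cosine kernel follows the description above directly. For the Gaussian kernel, the parameterization exp(−‖X_i − X_j‖² / (2γ²)) is an assumption, since the exact formula is not legible in this text; other common forms place γ differently. The function names are hypothetical.

```python
import math


def cosine_kernel(x, y):
    """Cosine of the angle between two attribute vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian similarity; assumed form exp(-||x - y||^2 / (2 * gamma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * gamma ** 2))


print(cosine_kernel([1.0, 0.0], [1.0, 1.0]))
print(gaussian_kernel([0.0], [2.0], gamma=1.0))
print(gaussian_kernel([0.0], [2.0], gamma=2.0))  # larger gamma decays more slowly
```

Under this assumed form, increasing γ makes the similarity fall off more slowly with distance, which matches the role of the scaling parameter described above.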

(3) Document type: optionally, document similarity can be calculated using the bag-of-words model. The principle of the bag-of-words model is to obtain the attribute vector of a document by counting the occurrences of each keyword in the document, which converts the problem of measuring document similarity into a problem of vector similarity.
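A minimal bag-of-words sketch follows. It makes two assumptions not stated in the text: every whitespace-separated token is treated as a keyword, and the resulting count vectors are compared with cosine similarity. The function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count occurrences of each lower-cased whitespace-separated token."""
    return Counter(text.lower().split())


def document_similarity(doc_a, doc_b):
    """Cosine similarity of the two documents' word-count vectors."""
    counts_a, counts_b = bag_of_words(doc_a), bag_of_words(doc_b)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


print(document_similarity("red apple", "red car"))
```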

It should be noted that the three similarity calculation methods are only optional ways for the similarity between the data of the three types of data sources, and are not exclusive, and in practical applications, a scheme for calculating the similarity between the data of each data source by a user may be configured as required.

In practical application, when the objective function is obtained through a clustering algorithm, which clustering algorithm is specifically adopted can be determined according to actual requirements. The first clustering algorithm and the second clustering algorithm may be the same or different. In an alternative, the first clustering algorithm and the second clustering algorithm may be spectral clustering algorithms.

Spectral clustering is based on the principle of graph partitioning. The main idea is to treat all data points as nodes in a graph that can be connected by edges. The edge weight between two points that are far apart is low and the edge weight between two points that are close together is high. The graph formed by all data points is then cut so that the sum of edge weights between different subgraphs is as low as possible and the sum of edge weights within each subgraph is as high as possible, which achieves the purpose of clustering. Spectral clustering can cluster sample spaces of any shape and converges to an optimal solution of the partitioning objective function. The spectral clustering objective function can be expressed as:

$$\min_U \sum_{i,j=1}^{N} A_{ij} \, \|U_i - U_j\|^2$$

where U_i represents the indication vector of user i, i.e. a vector indicating which community user i belongs to, A_ij represents the similarity between user i and user j, i.e. the weight of the connecting edge in the network structure, and N represents the number of users to be divided into communities.

Spectral clustering minimizes this sum, that is, the indication vectors of nodes in the same network community are made as similar as possible and the indication vectors of nodes in different network communities as dissimilar as possible. The spectral clustering objective function can then be converted into:

$$\min_U \mathrm{Tr}(U^T L U)$$

where U ∈ R^(N×C) is the community indication matrix, each row of U is the indication vector of one user, U^T represents the transpose of U, N represents the number of users, C represents the number of network communities, L is the normalized Laplacian matrix, A is the user similarity matrix, D is the degree matrix of A, and Tr(·) represents the trace of a matrix.
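The trace form of the objective can be checked numerically on a toy graph. The sketch below assumes the symmetric normalized Laplacian L = I − D^(−1/2) A D^(−1/2), which is the usual definition but is not spelled out in this text, and it evaluates the objective for two hand-picked indication matrices rather than solving the optimization. All names and data are hypothetical.

```python
import math


def normalized_laplacian(A):
    """Assumed form: L = I - D^(-1/2) A D^(-1/2), with D the degree matrix of A."""
    n = len(A)
    degrees = [sum(row) for row in A]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    return [
        [(1.0 if i == j else 0.0) - inv_sqrt[i] * A[i][j] * inv_sqrt[j]
         for j in range(n)]
        for i in range(n)
    ]


def trace_objective(U, L):
    """Tr(U^T L U), computed as the sum over columns u of u^T L u."""
    n, c = len(U), len(U[0])
    return sum(
        U[i][k] * L[i][j] * U[j][k]
        for k in range(c) for i in range(n) for j in range(n)
    )


# Toy graph: users 0-1 are similar, users 2-3 are similar, no other edges.
A = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]
L = normalized_laplacian(A)
s = 1.0 / math.sqrt(2.0)
U_good = [[s, 0], [s, 0], [0, s], [0, s]]  # groups {0,1} and {2,3}
U_bad = [[s, 0], [0, s], [s, 0], [0, s]]   # groups {0,2} and {1,3}
print(trace_objective(U_good, L), trace_objective(U_bad, L))
```

The division that keeps similar users together gives the smaller objective value, which is the behaviour the minimization relies on.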

Therefore, when the first clustering algorithm is a spectral clustering algorithm, the user similarity matrix corresponding to the main data source may be used as the similarity matrix of the spectral clustering algorithm to obtain the first objective function. Specifically, the first objective function may be expressed as:

$$\mathrm{Tr}(U^T L_v U)$$

Here U represents the community indication matrix before solving corresponding to the main data source, i.e. the item to be solved in the final objective function; the U obtained after the final objective function is optimized and solved is the clustering result of the spectral clustering.

where A is the user similarity matrix corresponding to the main data source, from which the normalized Laplacian matrix L_v is computed, and Tr(U^T L_v U) represents the trace of U^T L_v U.

In an alternative of the present invention, when the second clustering algorithm is a spectral clustering algorithm, the sub-objective function may be:

$$\mathrm{Tr}(U^T L_{v',v} U)$$

where Tr represents the trace of a matrix, U represents the community indication matrix before solving corresponding to the main data source, U^T represents the transpose of U, v represents the main data source, v′ represents the auxiliary data source, S_v′,v represents the relationship matrix between the auxiliary data source v′ and the main data source v, S_v,v′ represents the transpose of S_v′,v, and A_v′ represents the user similarity matrix corresponding to the auxiliary data source.

Here, the F-norm of a matrix, i.e. the Frobenius norm, also called the Euclidean norm or E-norm, is denoted ‖·‖_F. For any matrix T, its F-norm ‖T‖_F is the square root of the sum of the squares of the elements of T, obtained by first summing the squares of the elements and then taking the square root.

For any auxiliary data source v ', when two nodes (i.e. community network users) belong to the same community in one auxiliary data source v', if the two nodes are also two nodes in the main data source, that is, the two nodes are common nodes of the main data source and the auxiliary data source, when the community division is performed based on the main data source, the indication vectors of the two nodes should be similar as much as possible, and the elements in the relationship matrix are used for identifying the user relationship (such as whether the two nodes are the same user) between the user corresponding to the main data source and the user corresponding to the auxiliary data source, so that the indication vector of the node in each auxiliary data source can be represented by the indication vector of the node in the main data source and the relationship matrix of each auxiliary data source and the main data source.

Still taking Fig. 3 as an example, node P₂ and node P₃ are two nodes common to source1 and source2. In the main data source source2, node P₂ and node P₃ are unrelated, i.e. their similarity is zero, whereas in source1 node P₂ and node P₃ are related. The relationship between common users in source1 can therefore be used in the clustering based on the main data source, that is, the influence of the auxiliary data source is fused into the clustering result based on the main data source, which improves the accuracy of the clustering result.

Specifically, if the community indication matrix corresponding to the main data source is U and the relationship matrix between the auxiliary data source v′ and the main data source v is S_v′,v, the indication matrix corresponding to the auxiliary data source v′ can be denoted S_v′,v U, i.e. the product of the relationship matrix and the community indication matrix corresponding to the main data source. When spectral clustering is performed based on the auxiliary data source v′, the aim is to make ∑_i,j A_v′(i,j)[(S_v′,v U)_i − (S_v′,v U)_j]², or in other words Tr(U^T L_v′,v U), as small as possible.

where A_v′(i, j) represents the similarity between user i and user j in the auxiliary data source v′, i.e. the value of the element corresponding to user i and user j in the similarity matrix of the auxiliary data source v′.
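The pairwise form of this cost can be sketched directly: the auxiliary source's indication matrix is obtained as the product of the relationship matrix and U, and the cost sums the similarity-weighted squared differences between its rows. The toy data, the 0/1 indication vectors and the function names are assumptions for illustration.

```python
def matmul(X, Y):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def auxiliary_cost(A_aux, S, U):
    """Sum over i, j of A_aux[i][j] * ||(S U)_i - (S U)_j||^2."""
    V = matmul(S, U)  # indication vectors of the auxiliary-source users
    n = len(V)
    return sum(
        A_aux[i][j] * sum((a - b) ** 2 for a, b in zip(V[i], V[j]))
        for i in range(n) for j in range(n)
    )


# Hypothetical setup: 3 main-source users; the auxiliary source only
# contains the first and the third, which it considers similar (0.5).
S = [[1, 0, 0],
     [0, 0, 1]]
A_aux = [[0.0, 0.5],
         [0.5, 0.0]]
U_split = [[1, 0], [1, 0], [0, 1]]   # users 1 and 3 in different communities
U_joined = [[1, 0], [1, 0], [1, 0]]  # users 1 and 3 in the same community
print(auxiliary_cost(A_aux, S, U_split), auxiliary_cost(A_aux, S, U_joined))
```

Placing the two users that the auxiliary source considers similar into different communities is penalized, while placing them together costs nothing.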

In an alternative aspect of the present invention, the second objective function may be:

$$\sum_{v'=1,\, v' \neq v}^{V} \mathrm{Tr}(U^T L_{v',v} U)$$

where V represents the total number of main and auxiliary data sources, i.e. the number of data sources. In an alternative of the present invention, if the second objective function is obtained based on the sub-objective function and the weight corresponding to each auxiliary data source, the second objective function may be:

$$\sum_{v'=1,\, v' \neq v}^{V} \mu_{v'} \, \mathrm{Tr}(U^T L_{v',v} U)$$

where V denotes the total number of main and auxiliary data sources, i.e. the number of data sources, and μ_v′ represents the weight corresponding to the auxiliary data source v′.
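Combining the sub-objective values into the second objective function can be sketched as a plain sum or, when weights are used, a weighted sum. The sketch assumes the weights enter linearly and enforces the conditions stated earlier (non-negative, summing to 1); the function name and the sample values are hypothetical.

```python
def second_objective(sub_values, weights=None):
    """Combine per-auxiliary-source sub-objective values.

    Without weights the values are simply added; with weights a weighted
    sum is returned after checking the weights are valid.
    """
    if weights is None:
        return sum(sub_values)
    if len(weights) != len(sub_values):
        raise ValueError("one weight per auxiliary data source is required")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * v for w, v in zip(weights, sub_values))


# Hypothetical sub-objective values for two auxiliary data sources.
print(second_objective([2.0, 4.0]))
print(second_objective([2.0, 4.0], weights=[0.5, 0.5]))
```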

In an alternative aspect of the present invention, the method may further comprise:

correspondingly, the obtaining of the final objective function based on the first objective function and the second objective function specifically includes:

That is to say, the final objective function may further include a regularization term of the weight vector, which is an item to be solved in the final objective function; when the final objective function including the regularization term is optimized, the weight corresponding to each auxiliary data source can be solved automatically. The regularization term makes the weight vector sparse, so that data sources irrelevant to the main data source (which contain a large amount of noise and irrelevant information) are removed, while the weights of the relevant auxiliary data sources, i.e. the degree of influence of each auxiliary data source on the main data source, are solved automatically. This scheme thus screens the auxiliary data sources automatically: auxiliary data sources related to the main data source are retained and those unrelated to it are removed. Specifically, the weight of a screened-out data source is 0, so its influence on the community division result is 0.

In an alternative of the present invention, the regularization term of the weight vector may be β‖μ‖₂², where β is a regularization coefficient, μ is the weight vector, and ‖μ‖₂² represents the square of the 2-norm of μ. The 2-norm of a vector is commonly used to calculate the length of the vector; specifically, it is the square root of the sum of the squares of the absolute values of the vector elements.

In practical applications, the weight vector can also be made sparse by an L₁ regularization term. Compared with the L₁ regularization term, the L₂ regularization term controls the degree of sparsity of the weight vector more reasonably and avoids removing too many auxiliary data sources because of overly strong sparsity (i.e. weights that are zero after the optimization). The regularization coefficient β can be configured as needed, for example as a value in the range 10⁻⁵ to 10⁵.
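How a squared 2-norm term together with the weight constraints can drive some weights to exactly zero is illustrated below. The sketch is an assumption about one way the weight subproblem could be solved and is not taken from the text: with U fixed and c_v′ denoting the sub-objective value of auxiliary source v′, minimizing Σ μ_v′ c_v′ + β‖μ‖² over non-negative weights summing to 1 is equivalent to projecting −c/(2β) onto the probability simplex. The function names and cost values are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1} (sort-based)."""
    ordered = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, value in enumerate(ordered, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / j
        if value - candidate > 0:
            theta = candidate  # keep the threshold of the last valid index
    return [max(x - theta, 0.0) for x in v]


def solve_weights(costs, beta):
    """Minimize sum(mu * cost) + beta * ||mu||^2 over the simplex."""
    return project_to_simplex([-c / (2.0 * beta) for c in costs])


# Hypothetical sub-objective values: the third source fits the main source badly.
costs = [1.0, 1.2, 5.0]
print(solve_weights(costs, beta=1.0))    # third weight becomes exactly 0
print(solve_weights(costs, beta=100.0))  # strong regularization keeps all sources
```

In this sketch a larger β spreads the weights more evenly, while a smaller β removes poorly matching sources, which is the trade-off that the choice of β controls.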

correspondingly, in step S150, obtaining a final objective function based on the first objective function and the second objective function may include:

The existing supervision information includes class labels and pairwise constraint information (Must-link and Cannot-link, which respectively indicate that two users necessarily belong to the same community and necessarily belong to different communities).

Specifically, while multi-source social network data is collected, some supervision information such as the Must-link information can be obtained, and the supervision information Must-link is used for community discovery as constraint information, so that the community discovery effect can be effectively improved.

In an alternative scheme of the present invention, constructing a constraint function according to the best-link supervision information may specifically include:

Optionally, elements of each row in the constraint matrix represent a piece of the best-link supervision information, the row number of the constraint matrix is the number of the pieces of the best-link supervision information, and the column number of the constraint matrix is the number of the community network users corresponding to the main data source.

In the embodiment of the invention, the Must-link supervision information is converted into a constraint matrix M ∈ R^(n×N), where n is the number of Must-links, that is, the number of pieces of Must-link supervision information, and N is the number of social network users corresponding to the main data source. One piece of Must-link information (corresponding to one row of the constraint matrix) is expressed as follows:

(1 -1 0 … 0)

the expression indicates that node 1 (the user corresponding to the first column in the constraint matrix) and node 2 (the user corresponding to the second column in the constraint matrix) necessarily belong to the same community.
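The construction of the constraint matrix described above can be sketched as follows. This is a minimal illustration, assuming 0-based user indices and a one-hot community indication matrix; the function name `build_must_link_matrix` and the sample data are hypothetical and do not come from the source.

```python
import numpy as np

def build_must_link_matrix(pairs, num_users):
    """Return M with one row per Must-link pair: +1 at user i, -1 at user j."""
    M = np.zeros((len(pairs), num_users))
    for row, (i, j) in enumerate(pairs):
        M[row, i] = 1.0
        M[row, j] = -1.0
    return M

# Hypothetical data: 4 users; users 0 and 1 must share a community.
M = build_must_link_matrix([(0, 1)], num_users=4)

# One-hot community indication matrices (rows: users, columns: communities).
U_same = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
U_split = np.array([[1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)

# A row of M @ U is all zeros exactly when the two linked users
# have identical rows in U, i.e. the Must-link is satisfied.
violation_same = np.abs(M @ U_same).sum()
violation_split = np.abs(M @ U_split).sum()
```

In this sketch the single row of `M` reproduces the expression (1 -1 0 … 0) for four users, and the product `M @ U` is non-zero only for the indication matrix that separates users 0 and 1.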

However, an equality constraint function has two defects: an overly strong constraint distorts the clustering result, and uncertain, inferred Must-link supervision information cannot be expressed. In practical applications, L₁ regularization can therefore be applied to relax the strength of the constraint function while controlling the number of Must-links that are satisfied.

Thus, in an alternative, γ‖Z‖₁ may be used as the constraint function, where γ is a second regularization coefficient, MU = Z, M denotes the constraint matrix, U denotes the pre-solution community indication matrix corresponding to the primary data source, and ‖Z‖₁ denotes the 1-norm of Z. Likewise, γ can be configured according to actual needs; for example, it can be set to a value in the range 10⁻⁵ to 10⁵.

Here, the 1-norm of a matrix, also called the column-sum norm, is the maximum over all columns of the sum of the absolute values of the elements in that column.
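The column-sum norm defined above, and the constraint function γ‖Z‖₁ built from it, can be computed as in the following sketch. The helper names `matrix_one_norm` and `constraint_term` and the sample matrix are illustrative assumptions only.

```python
import numpy as np

def matrix_one_norm(Z):
    """Column-sum norm: largest sum of absolute values over the columns of Z."""
    Z = np.asarray(Z, dtype=float)
    return float(np.abs(Z).sum(axis=0).max())

def constraint_term(M, U, gamma):
    """gamma * ||Z||_1 with Z = M U."""
    return gamma * matrix_one_norm(M @ U)

# Hypothetical matrix: column sums of absolute values are 4.0 and 2.5.
Z = np.array([[1.0, -2.0], [-3.0, 0.5]])
norm_value = matrix_one_norm(Z)
```

For a two-dimensional array, `np.linalg.norm(Z, 1)` returns the same column-sum norm, so the hand-written helper is only there to make the definition explicit.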

In an alternative of the present invention, the final objective function may include a first objective function, a second objective function, a regularization term of the weight vector, and a constraint function.

Specifically, the mathematical expression of the final objective function can be expressed as:

the first objective function + the second objective function + the regularization term of the weight vector + the constraint function.

Here, the third term can adopt the L₂ regularization term, that is, the above-mentioned β‖μ‖₂².

Adding the L₂ regularization term of the weight vector μ to the objective function is an important step of the automatic screening technique. β controls the sparsity of μ: when β is very small, only one data source has a non-zero weight, and when β is very large, all data sources tend to have non-zero weights.

When β lies between these two extremes, a sparse μ is obtained. For each auxiliary data source v′, denote the corresponding sub-objective function by w_{v′}. When the spectral clustering algorithm is selected, w_{v′} = Tr(U^T L_{v′,v} U) = ∑_{i,j} A_{v′}(i,j) [(S_{v′,v}U)_i − (S_{v′,v}U)_j]². As can be seen from the foregoing description, a larger w_{v′} means that data source v′ contains more noise and irrelevant information, and the larger w_{v′} is, the more likely the weight of data source v′ is to be set to 0. Therefore, based on this regularization term, data sources containing a large amount of noise and irrelevant information can be removed, while the weights of the related data sources are derived automatically: the smaller w_{v′} is, the larger the weight μ_{v′} of the corresponding auxiliary data source.
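The sub-objective function w_{v′} can be illustrated numerically with the sketch below. It assumes that L_{v′,v} is built from the unnormalized graph Laplacian as S_{v,v′}(D_{v′} − A_{v′})S_{v′,v}, with D_{v′} the degree matrix of A_{v′}; this definition is an assumption, since the exact formula is not reproduced in this text. Under this assumption the trace form equals the pairwise sum up to a constant factor of 1/2. All names and data are hypothetical.

```python
import numpy as np

def sub_objective(U, S, A):
    """w = Tr(U^T L U) with L = S^T (D - A) S (unnormalized Laplacian, assumed)."""
    D = np.diag(A.sum(axis=1))
    L = S.T @ (D - A) @ S
    return float(np.trace(U.T @ L @ U))

def pairwise_form(U, S, A):
    """0.5 * sum_ij A(i,j) * ||(SU)_i - (SU)_j||^2, the same quantity written pairwise."""
    X = S @ U
    diff = X[:, None, :] - X[None, :, :]
    return 0.5 * float((A * (diff ** 2).sum(axis=2)).sum())

# Hypothetical data: 4 main users; the auxiliary source covers main users 0, 1, 2.
S = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], dtype=float)
# Auxiliary similarity: only users 0 and 1 are similar.
A = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)

# U_agree puts users 0 and 1 together; U_disagree separates them.
U_agree = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
U_disagree = np.array([[1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)

w_agree = sub_objective(U_agree, S, A)
w_disagree = sub_objective(U_disagree, S, A)
```

The value is smaller when the community assignment agrees with the auxiliary similarity, which matches the reading that a large w_{v′} signals an auxiliary data source that conflicts with the division of the main data source.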

In an optional embodiment of the present invention, when both the first clustering algorithm and the second clustering algorithm are spectral clustering algorithms, the final objective function may be written as:

where s.t. denotes the constraints of the final objective function, and v′ ≠ v denotes that v′ is not the primary data source, that is, v′ is a secondary data source. The first term of the final objective function is the first objective function; the second term is an alternative form of the second objective function; the third term is the term corresponding to the automatic screening technique described above, namely the regularization term of the weight vector, which is used to determine the weight of each auxiliary data source automatically; and the fourth term is an alternative form of the constraint function obtained from the Must-link supervision information.

In an alternative scheme of the present invention, solving the final objective function to obtain a solved community indication matrix may include:

To solve the final objective function, the embodiment of the invention uses an iterative algorithm to find a relatively optimal solution. In theory, the target problem can be decomposed into two subproblems: predicting the indication matrix and solving the weights automatically. When the final objective function includes the weight vector μ, the pre-solution community indication matrix U, and the constraint variable Z (Z = MU), the first subproblem is solved by fixing μ and iteratively optimizing U and Z with the ADMM method (alternating direction method of multipliers). The second subproblem is solved by fixing U and Z and obtaining a closed-form solution of μ with the Lagrange multiplier method. These two processes are repeated until the preset convergence condition is met. Similarly, when the final objective function includes only the weight vector μ and the pre-solution community indication matrix U, μ may be fixed and U iteratively optimized with the ADMM method, then U fixed and a closed-form solution of μ obtained with the Lagrange multiplier method, and these two processes repeated until the preset convergence condition is met.

Taking as an example the case where the final objective function contains the weight vector μ, the pre-solution community indication matrix U, and the constraint variable Z, in an alternative of the present invention, solving the final objective function by using ADMM and the Lagrange multiplier method to obtain the solved community indication matrix may specifically include:

As can be seen from the foregoing description, the convergence condition can be configured according to actual requirements.

The following final objective function is taken as an example to explain a specific optimization solving process of the final objective function:

the specific optimization processing mode of the final objective function comprises the following steps:

Fix μ and iteratively update U and Z by using the ADMM method. The subproblem to be solved is:

s.t. MU = Z

The augmented Lagrangian form of the above problem is:

where ρ is the penalty term coefficient and Y is the Lagrange multiplier.

With μ fixed, the iterative update of U and Z proceeds as follows:

Y＝Y+ρ(MU-Z)

wherein, shrink represents a soft threshold function, and is defined as:

shrink(x,y)＝sign(x)⊙max{|x|-y,0}

It can be understood that x and y in shrink(x, y) are merely two placeholder parameters, and sign is the sign function, which takes the value 1 when x is greater than 0, 0 when x is equal to 0, and −1 when x is less than 0. If x and y are scalars, shrink(x, y) = sign(x)·max{|x| − y, 0}, that is, the product of the two values; if x is a vector or matrix, ⊙ denotes the element-wise product.

As an example, for a given x with y equal to 0.5, max takes, element by element, the larger of |x| − y and zero, and shrink(x, y) is then obtained by multiplying the result element-wise by sign(x).
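The soft threshold function and the multiplier update Y = Y + ρ(MU − Z) given above can be sketched as follows. The Z update shown here, Z = shrink(MU + Y/ρ, γ/ρ), is the standard ADMM step when ‖Z‖₁ is treated entrywise; it is an assumption of this sketch, because the update formulas for U and Z are not reproduced in this text. The function name `admm_z_y_step` and all sample values are hypothetical.

```python
import numpy as np

def shrink(x, y):
    """Soft threshold: sign(x) ⊙ max(|x| - y, 0), applied element by element."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - y, 0.0)

def admm_z_y_step(M, U, Y, gamma, rho):
    """One Z update (assumed form) followed by the multiplier update from the text."""
    MU = M @ U
    Z = shrink(MU + Y / rho, gamma / rho)   # assumed Z update
    Y_new = Y + rho * (MU - Z)              # Y = Y + rho (MU - Z)
    return Z, Y_new

# Hypothetical input with y = 0.5.
x = np.array([[1.2, -0.3], [0.5, -2.0]])
shrunk = shrink(x, 0.5)

# Hypothetical ADMM step: one Must-link between users 0 and 1 that U violates.
M = np.array([[1.0, -1.0, 0.0]])
U = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
Z, Y = admm_z_y_step(M, U, np.zeros((1, 2)), gamma=0.5, rho=1.0)
```

Entries of `x` whose absolute value does not exceed 0.5 are set to zero and the remaining entries are pulled toward zero by 0.5, which is how the relaxed Must-link constraint tolerates small violations.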

Fix U and Z, and update μ. The subproblem to be solved at this time is:

where w = [w₁, w₂, …, w_V] excludes w_v and has size (V−1)×1, V is the total number of main and auxiliary data sources, w^T is the transpose of w, and w_{v′} denotes Tr(U^T L_{v′,v} U), that is, the sub-objective function corresponding to the auxiliary data source v′. Assuming the elements of w are sorted in non-descending order, that is, w₁ ≤ w₂ ≤ … ≤ w_V, applying the Lagrange multiplier method gives the following closed-form solution of the subproblem:

where p denotes the largest value of v′ that satisfies θ − w_{v′} > 0.
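A closed-form weight update of this kind can be sketched as follows. The sketch assumes the subproblem is to minimize w^T μ + β‖μ‖₂² subject to μ ≥ 0 and the weights summing to 1; these constraints are an assumption, since the formula is not reproduced in this text. Under that assumption the Lagrange multiplier method gives μ_{v′} = max(θ − w_{v′}, 0) / (2β) with θ = (2β + w₁ + … + w_p) / p. The function name `solve_weights` and the sample values are hypothetical.

```python
import numpy as np

def solve_weights(w, beta):
    """Closed-form minimizer of w^T mu + beta * ||mu||^2 over the probability simplex."""
    w = np.asarray(w, dtype=float)
    w_sorted = np.sort(w)                      # non-descending order
    counts = np.arange(1, len(w) + 1)
    # Candidate theta when only the k smallest w values get non-zero weight.
    thetas = (2.0 * beta + np.cumsum(w_sorted)) / counts
    # p is the largest k for which theta - w_k > 0 (k = 1 always qualifies).
    p = counts[thetas - w_sorted > 0].max()
    theta = thetas[p - 1]
    return np.maximum(theta - w, 0.0) / (2.0 * beta)

# Hypothetical sub-objective values for three auxiliary data sources;
# the third is much larger, i.e. noisier.
w = [1.0, 2.0, 10.0]
mu_small_beta = solve_weights(w, beta=0.1)
mu_mid_beta = solve_weights(w, beta=1.0)
mu_large_beta = solve_weights(w, beta=100.0)
```

With these values a small β keeps only the data source with the smallest w, an intermediate β gives a sparse weight vector that drops the noisy third source, and a large β spreads weight over all three sources.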

The above optimization is repeated, alternating between fixing μ and iteratively updating U and Z, and fixing U and Z and updating μ, until convergence; the U obtained at convergence is the solved community indication matrix.

In the alternative of the present invention, the obtaining of the community division result of the network community based on the solved community indication matrix includes:

The community indication matrix obtained directly by optimizing the final objective function may not fully indicate the community to which each sample belongs. For example, when clustering is performed with a spectral clustering algorithm, the solved community indication matrix obtained after optimization generally cannot fully indicate the attribution of each sample. Therefore, after the solved community indication matrix is obtained, conventional clustering, for example K-Means clustering, needs to be performed on its rows to further improve the effect of community division.
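The re-clustering of the rows of the solved community indication matrix can be sketched with a plain K-Means loop as follows. This is a minimal illustration written directly in NumPy; the deterministic farthest-point initialization, the function name `kmeans_rows`, and the sample matrix are assumptions of this sketch rather than details from the source.

```python
import numpy as np

def kmeans_rows(U, k, n_iter=100):
    """Cluster the rows of U into k groups with plain Lloyd iterations."""
    U = np.asarray(U, dtype=float)
    # Deterministic farthest-point initialization (an illustrative choice).
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[dist.argmax()])
    centers = np.array(centers)
    labels = np.zeros(len(U), dtype=int)
    for _ in range(n_iter):
        # Assign each row to its nearest center.
        dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster became empty.
        updated = np.array([
            U[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels

# Hypothetical relaxed indication matrix: rows are users, columns are communities.
U_solved = np.array([[0.9, 0.1], [1.0, 0.0], [0.1, 0.8], [0.0, 1.0]])
labels = kmeans_rows(U_solved, k=2)
```

Each row of the relaxed matrix is treated as a point, and users whose rows are close together receive the same community label.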

In conclusion, the method provided by the embodiment of the invention can fuse data from multiple data sources well, can screen out data sources containing a large amount of noise and irrelevant information, and can incorporate Must-link supervision information to further improve the accuracy of community discovery. The method can be applied to various application scenarios that require classification. In practical applications, a main data source, that is, a target data source, can be selected according to the application requirements, and the other data sources are used as auxiliary data sources, so as to obtain a final clustering result guided by the main data source. For example, the method can be applied to the classification of QQ or WeChat users, so that better services can be provided to users based on the classification results, such as delivering advertisements that match the users' needs.

To better illustrate the scheme provided by the embodiments of the present invention, a further description is given below with reference to a specific example. The scheme of the embodiment of the invention can be applied to community division of social network users (hereinafter simply referred to as users) in instant messaging software (such as QQ and WeChat). This example takes WeChat for the description, and in this example both the first clustering algorithm and the second clustering algorithm are spectral clustering algorithms. Performing community division on WeChat users with the scheme provided by the embodiment of the invention specifically comprises the following steps:

First, multi-source social network data of WeChat users is obtained. In this example, the at least two data sources are three data sources: the friend relationships of the user, the age of the user, and the friend circle content published by the user. Correspondingly, the multi-source social network data includes data corresponding to the friend relationships, data corresponding to the age of the user, and data corresponding to the friend circle content published by the user. In this example, the friend relationships of the user are taken as the main data source, and the age of the user and the published friend circle content are taken as two auxiliary data sources. In addition, while the multi-source social network data is being acquired, some Must-link supervision information can be collected for constructing the constraint function.

After the multi-source social network data is obtained, for a friend relationship data source, a user similarity matrix corresponding to the data source is calculated based on data corresponding to the friend relationship, elements in the matrix represent the similarity between two users corresponding to the data source, and the similarity between the two users can be calculated by adopting a Jaccard coefficient. And taking the user similarity matrix corresponding to the friend relation data source as a similarity matrix of a spectral clustering algorithm to obtain a first objective function based on the main data source.
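The Jaccard-based user similarity matrix for the friend relationship data source can be sketched as follows. The sketch assumes that the friend relationships are given as a binary adjacency matrix and that the similarity of two users is the Jaccard coefficient of their friend sets; the function name `jaccard_similarity` and the sample friendships are hypothetical.

```python
import numpy as np

def jaccard_similarity(adjacency):
    """Similarity of users i and j = |friends(i) ∩ friends(j)| / |friends(i) ∪ friends(j)|."""
    A = (np.asarray(adjacency) > 0).astype(float)
    intersection = A @ A.T                      # shared friends per user pair
    degree = A.sum(axis=1)                      # number of friends per user
    union = degree[:, None] + degree[None, :] - intersection
    sim = np.zeros_like(intersection)
    # Pairs with an empty union keep similarity 0.
    np.divide(intersection, union, out=sim, where=union > 0)
    return sim

# Hypothetical friendships: 0-1, 0-2, 1-2, 2-3.
friends = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
similarity = jaccard_similarity(friends)
```

The resulting symmetric matrix can then play the role of the similarity matrix of the spectral clustering algorithm for the main data source.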

To obtain the second objective function, the relationship matrix between each auxiliary data source and the main data source and the user similarity matrix corresponding to each auxiliary data source need to be calculated. In this example, the relationship matrix corresponding to each auxiliary data source may be obtained in the manner shown in fig. 3 above. Specifically, taking the user age data source as an example, if a user corresponding to this data source and a user corresponding to the friend relationship data source are the same user, the corresponding element of the relationship matrix is 1; otherwise, the corresponding element is 0. For the user age data source, a cosine kernel function, for example, can be used to calculate the similarity between the users corresponding to the data source, giving the similarity matrix corresponding to that data source. For the friend circle content published by users, a scheme such as a bag-of-words model can be used to calculate the similarity between the content published by different users; this similarity represents the similarity between the users and gives the user similarity matrix corresponding to that data source.
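The two ingredients described above, the relationship matrix and a bag-of-words user similarity matrix, can be sketched as follows. The sketch assumes that users are identified by IDs, that rows of the relationship matrix correspond to users of the auxiliary data source and columns to users of the main data source, and that published content is compared by cosine similarity of word counts. The function names, user IDs, and sample texts are hypothetical.

```python
import numpy as np

def relation_matrix(aux_user_ids, main_user_ids):
    """S[i, j] = 1 if auxiliary user i and main user j are the same user, else 0."""
    column = {uid: j for j, uid in enumerate(main_user_ids)}
    S = np.zeros((len(aux_user_ids), len(main_user_ids)))
    for i, uid in enumerate(aux_user_ids):
        if uid in column:
            S[i, column[uid]] = 1.0
    return S

def bag_of_words_cosine(documents):
    """Cosine similarity between word-count vectors of the users' published text."""
    vocab = sorted({word for doc in documents for word in doc.lower().split()})
    index = {word: k for k, word in enumerate(vocab)}
    counts = np.zeros((len(documents), len(vocab)))
    for i, doc in enumerate(documents):
        for word in doc.lower().split():
            counts[i, index[word]] += 1.0
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                     # avoid dividing by zero for empty text
    unit = counts / norms
    return unit @ unit.T

# Hypothetical users: the auxiliary source knows users b, c and an extra user x.
S = relation_matrix(["b", "c", "x"], ["a", "b", "c"])

# Hypothetical published content, one string per auxiliary user.
posts = ["hiking trip photos", "hiking photos", "stock market"]
A_posts = bag_of_words_cosine(posts)
```

User x has no counterpart in the main data source, so the corresponding row of `S` is all zeros, while users with overlapping vocabulary receive a higher similarity than users with none.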

Then, the sub-objective function corresponding to the user age data source is obtained based on the relationship matrix and the user similarity matrix corresponding to that data source, and the sub-objective function corresponding to the friend circle data source is obtained based on the relationship matrix and the user similarity matrix corresponding to that data source. The two sub-objective functions are then summed, weighted by the weights corresponding to the two auxiliary data sources, to obtain the second objective function.

In this example, the influence of each auxiliary data source may be controlled by constructing a regularization term of the weight vector, and a constraint function may be constructed based on the collected Must-link supervision information. A final objective function to be optimized is then obtained based on the first objective function, the second objective function, the constructed regularization term of the weight vector, and the constraint function. Optimizing the final objective function yields the community indication matrix of the users corresponding to the friend relationship data source, and re-clustering the rows of the community indication matrix yields a clustering result that can be used as the final community division result. Community division is thereby achieved with the friend relationship data as the main data and the other two kinds of data as auxiliary data. It should be noted that, in practical applications, the execution order of the steps in the embodiment of the present invention is not fixed and may be changed. In the above example, the steps for determining the first objective function and the second objective function need not be executed in a particular order; for example, the first objective function may be determined as soon as the user similarity matrix corresponding to the main data source has been calculated. In practical applications, the execution order of the steps can be adjusted flexibly or the steps can be interleaved, which is clear to those skilled in the art and is not enumerated here.

It can be understood that, in the embodiment of the present invention, network community discovery means mining the social network interaction data of users to obtain the various "circles of people" formed by the users' network interaction behavior. A network community is a set of nodes in the network that have high similarity or closely related connections: the connections between nodes inside the same network community are relatively dense, while the connections between nodes of different network communities are relatively sparse. A network community can be regarded as a "group" or a "cluster".

It can be seen that the discovery of the network community is a discovery of a community structure based on social network data of a user, and a user community based on a certain relationship is determined by mining a certain mutual relationship (such as a user relationship, content published by the user, attention of the user in the social network or a friend relationship, and the like) among individuals in the social network. For example, based on the community discovery of the friend relationship of the users in the social network, the network community based on the user connection can be obtained, the connection among the users in the same network community is relatively close, and the connection among the users in different network communities is relatively sparse. For another example, based on the community discovery of the user interests, network communities divided based on the interests can be obtained, the interests of the users in the same network community are similar, and the interests of the users in different network communities are greatly different. Community discovery based on age data of users is also possible, where the age difference between users of the same network community is relatively small, and the age difference between users of different network communities is relatively large.

Based on the same principle as the network community discovery method provided by the embodiment of the present invention, the embodiment of the present invention further provides a network community discovery apparatus, as shown in fig. 4, the network community discovery apparatus 100 may include a multi-source social data obtaining module 110, a data source relationship determining module 120, and a community division result determining module 130. Specifically, the method comprises the following steps:

the multi-source social data obtaining module 110 is configured to acquire multi-source social network data of a social network user, where the multi-source social network data includes data corresponding to at least two data sources;

a data source relationship determining module 120, configured to determine, based on a user relationship between a social network user corresponding to each type of auxiliary data source and a social network user corresponding to a main data source, an association relationship between each type of auxiliary data source and the main data source, respectively, where the main data source is one of at least two specified data sources, and the auxiliary data source is a data source other than the main data source of the at least two data sources;

the community division result determining module 130 is configured to cluster the social network users corresponding to the main data source based on the data of the main data source, the data of each auxiliary data source, and the association relationship between each auxiliary data source and the main data source, so as to obtain a division result of the network community of the social network users corresponding to the main data source.

Optionally, the data source relationship determining module is specifically configured to:

Optionally, the number of rows of the relationship matrix corresponding to each auxiliary data source is the number of social network users corresponding to the auxiliary data source, the number of columns is the number of social network users corresponding to the main data source, and the user relationship indicates whether the social network user corresponding to the row where the element is located in the relationship matrix and the social network user corresponding to the column where the element is located are the same user.

Optionally, the community division result determining module is specifically configured to:

Optionally, the number of rows of the solved community indication matrix is the number of social network users corresponding to the main data source, and the number of columns of the solved community indication matrix is the number of pre-divided network communities.

Optionally, when the community division result determining module obtains the second objective function through the second clustering algorithm based on the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source, the community division result determining module is specifically configured to:

Optionally, when the community division result determining module obtains the first objective function through the first clustering algorithm based on the data of the main data source, the community division result determining module is specifically configured to:

Optionally, the second clustering algorithm is a spectral clustering algorithm, and the sub-objective functions are:

Tr(U^TL_v′,vU)

where Tr denotes the trace of a matrix, U denotes the pre-solution community indication matrix corresponding to the main data source, U^T denotes the transpose of U, v denotes the primary data source, v′ denotes a secondary data source, S_{v′,v} denotes the relationship matrix between the secondary data source v′ and the primary data source v, S_{v,v′} denotes the transpose of S_{v′,v}, and A_{v′} denotes the user similarity matrix corresponding to the secondary data source v′.

Optionally, if the second objective function is a function obtained based on the sub-objective function corresponding to each auxiliary data source and the weight corresponding to each auxiliary data source, the second objective function is:

Optionally, the apparatus further comprises:

Optionally, the regular term of the weight vector is:

β‖μ‖₂²

where μ denotes the weight vector, β denotes the first regularization coefficient, and ‖μ‖₂² denotes the square of the 2-norm of μ.

Optionally, the apparatus further includes a constraint function construction module, where the constraint function construction module is configured to:

Optionally, when the constraint function building module builds the constraint function according to the Must-link supervision information, the constraint function building module is specifically configured to:

Optionally, the constraint function is:

γ||Z||₁

Optionally, the final objective function includes a first objective function, a second objective function, a regular term of the weight vector, and a constraint function.

Optionally, the community division result determining module is specifically configured to, when solving the final objective function to obtain a solved community indication matrix:

Optionally, the community division result determining module is specifically configured to, when solving the final objective function by using ADMM and the Lagrange multiplier method to obtain a solved community indication matrix:

Optionally, when the community division result determining module obtains the community division result of the network community based on the solved community indication matrix, the community division result determining module is specifically configured to:

The device provided by the embodiment of the invention can be applied to various electronic devices, such as mobile terminal devices, fixed terminal devices and servers.

It is understood that the above modules of the apparatus in the embodiments of the present disclosure have the functions of implementing the corresponding steps of the method shown in any embodiment of the present disclosure. These functions may be implemented by hardware, or by hardware executing corresponding software, and the hardware or software includes one or more modules corresponding to the above functions. The modules can be implemented independently, or several modules can be integrated. For a specific functional description of the network community discovery apparatus, reference may be made to the corresponding description of the foregoing method, which is not repeated here.

Based on the same principle as the network community discovery method and the network community discovery apparatus provided by the embodiment of the present invention, an embodiment of the present invention also provides an electronic device, which may include a processor and a memory. The memory stores readable instructions which, when loaded and executed by the processor, implement the method shown in any of the embodiments of the present invention.

Embodiments of the present invention further provide a computer-readable storage medium, where the storage medium stores readable instructions, and when the readable instructions are loaded and executed by a processor, the method shown in any embodiment of the present invention is implemented.

Fig. 5 is a schematic structural diagram of an electronic device applicable to the embodiment of the present invention, and as shown in fig. 5, the electronic device may specifically be a server, and the server may be used to implement the method for discovering a network community shown in any embodiment of the present invention.

Specifically, as shown in fig. 5, the server 2000 may generally include at least one processor 2001, memory 2002, a network interface 2003, and input/output interfaces 2004, among other components. The components may communicate with each other via a bus 2005.

In particular, the memory 2002 may be used to store an operating system, application programs, etc., which may include program code or instructions that when invoked by the processor 2001 implement the methods illustrated in embodiments of the present invention, and may also include programs for implementing other functions or services.

The memory 2002 may be a ROM (Read-Only Memory) or another type of static storage device that can store static information and instructions; a RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions; an EEPROM (Electrically Erasable Programmable Read-Only Memory); a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like); a magnetic disk storage medium or another magnetic storage device; or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.

The processor 2001 is connected to the memory 2002 via the bus 2005 and realizes the corresponding functions by calling an application program stored in the memory 2002. The processor 2001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application-Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 2001 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.

The server 2000 may be connected to a network through a network interface 2003 to communicate with other devices (e.g., user terminal devices or other servers) through the network to realize data interaction. For example, the server 2000 communicates with a user terminal device through a network interface to obtain multi-source social network data of a user. The network interface 2003 may include a wired network interface and/or a wireless network interface, among others.

The server 2000 may be connected to required input/output devices, such as a keyboard and a display device, through the input/output interface 2004, and may also be connected through this interface to a storage device such as a hard disk, so that data in the server 2000 can be stored in the storage device or data in the storage device can be stored in the server 2000. It can be appreciated that the input/output interface 2004 can be a wired interface or a wireless interface. Depending on the actual application scenario, a device connected to the input/output interface 2004 may be a component of the server 2000, or may be an external device connected to the server 2000 as needed.

The bus 2005 for connecting the various components may include a path that carries information between the components. The bus 2005 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 2005 may be divided into an address bus, a data bus, a control bus, and the like according to function.

Optionally, for the solution provided by the embodiment of the present invention, the memory 2002 may be used to store the application program code for executing the solution of the present invention, with execution controlled by the processor 2001. The processor 2001 is used to execute the application program code stored in the memory 2002 to implement the actions of the method or apparatus provided by the embodiments of the present invention.

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of the steps is not strictly limited to the order shown, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A network community discovery method, comprising:

respectively determining the association relationship between each auxiliary data source and a main data source based on the user relationship between the social network user corresponding to each auxiliary data source and the social network user corresponding to the main data source, wherein the main data source is one of the at least two specified data sources, and the auxiliary data source is the data source except the main data source of the at least two data sources;

2. The method of claim 1, wherein the determining the association relationship between each auxiliary data source and the main data source based on the user relationship between the social network user corresponding to each auxiliary data source and the social network user corresponding to the main data source comprises:

3. The method according to claim 1 or 2, wherein the performing community division on the social network users corresponding to the primary data source based on the data of the primary data source, the data of each auxiliary data source, and the association relationship between each auxiliary data source and the primary data source to obtain the division result of the network community of the social network users corresponding to the primary data source comprises:

obtaining a first objective function through a first clustering algorithm based on the data of the main data source, wherein the first objective function comprises a community indication matrix corresponding to the main data source before solving;

and obtaining the division result of the network community of the social network user corresponding to the main data source based on the solved community indication matrix.
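By way of a non-limiting illustration, the last step of claim 3 (reading a community division out of the solved community indication matrix) might be sketched as follows. The row-wise arg-max rule and the function name `communities_from_indicator` are assumptions made for this sketch; the claim does not state how the division is read from the matrix, and clustering the rows (for example with k-means) is another common choice.

```python
def communities_from_indicator(U):
    """Assign each user (row of U) to the community (column) with the largest entry.

    U is the solved community indication matrix as a list of rows.
    The arg-max rule is an assumption of this sketch, not taken from the claims.
    """
    labels = [max(range(len(row)), key=lambda c: row[c]) for row in U]
    groups = {}
    for user, label in enumerate(labels):
        groups.setdefault(label, []).append(user)
    return labels, groups


# Hypothetical solved matrix for three users and two communities.
U_solved = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7]]
labels, groups = communities_from_indicator(U_solved)
print(labels)  # [0, 0, 1]
print(groups)  # {0: [0, 1], 1: [2]}
```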

4. The method of claim 3, wherein obtaining a second objective function through a second clustering algorithm based on the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source comprises:

obtaining sub-objective functions corresponding to each auxiliary data source through the second clustering algorithm based on the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source;

and obtaining the second objective function based on the sub-objective functions corresponding to each auxiliary data source.

5. The method of claim 4, wherein obtaining the second objective function based on the sub-objective function corresponding to each auxiliary data source comprises:

and obtaining the second objective function based on the sub-objective function corresponding to each auxiliary data source and the weight corresponding to each auxiliary data source.

6. The method according to claim 4 or 5, wherein the obtaining a first objective function through a first clustering algorithm based on the data of the main data source comprises:

obtaining the first objective function through the first clustering algorithm based on the user similarity matrix corresponding to the main data source;

the obtaining sub-objective functions corresponding to each auxiliary data source through the second clustering algorithm based on the data of each auxiliary data source and the association relationship between each auxiliary data source and the main data source comprises:

obtaining sub-objective functions corresponding to each auxiliary data source through the second clustering algorithm based on the user similarity matrix corresponding to each auxiliary data source and the relation matrix corresponding to each auxiliary data source;

7. The method of claim 6, wherein the second clustering algorithm is a spectral clustering algorithm, and the sub-objective function is:

$$\operatorname{Tr}\left(U^{T} L_{v',v}\, U\right)$$

wherein $\operatorname{Tr}$ represents the trace of a matrix; $U$ represents the community indication matrix, before solving, corresponding to the main data source; $U^{T}$ represents the transpose of $U$; $v$ represents the main data source; $v'$ represents the auxiliary data source; $S_{v',v}$ represents the relationship matrix between the auxiliary data source $v'$ and the main data source $v$; $S_{v,v'}$ represents the transpose of $S_{v',v}$; $A_{v'}$ represents the user similarity matrix corresponding to the auxiliary data source; the degree matrix is that of the matrix $S_{v,v'} A_{v'} S_{v',v}$; and $\|\cdot\|_{F}$ represents the F-norm.
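By way of a non-limiting illustration, the sub-objective of claim 7 can be evaluated numerically as sketched below. The sketch assumes the unnormalized Laplacian $L = D - W$ with $W = S_{v,v'} A_{v'} S_{v',v}$ and $D$ the degree matrix of $W$; that exact definition is an assumption, since it is not legible in the text above. The helper names `matmul`, `transpose`, and `sub_objective` are hypothetical.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*X)]


def sub_objective(U, S_aux_main, A_aux):
    """Evaluate Tr(U^T L U), assuming L = D - W with W = S_{v,v'} A_{v'} S_{v',v}."""
    S_main_aux = transpose(S_aux_main)                  # S_{v,v'}
    W = matmul(matmul(S_main_aux, A_aux), S_aux_main)   # projected similarity
    n = len(W)
    degrees = [sum(row) for row in W]                   # diagonal of D
    L = [[(degrees[i] if i == j else 0.0) - W[i][j] for j in range(n)]
         for i in range(n)]
    UtLU = matmul(matmul(transpose(U), L), U)
    return sum(UtLU[i][i] for i in range(len(UtLU)))


# Hypothetical data: 2 auxiliary-source users, 4 main-source users.
S = [[1, 1, 0, 0],   # auxiliary user 0 is related to main users 0 and 1
     [0, 0, 1, 1]]   # auxiliary user 1 is related to main users 2 and 3
A = [[1, 0],
     [0, 1]]
U_consistent = [[1, 0], [1, 0], [0, 1], [0, 1]]
U_inconsistent = [[1, 0], [0, 1], [1, 0], [0, 1]]
print(sub_objective(U_consistent, S, A))    # 0.0
print(sub_objective(U_inconsistent, S, A))  # 4.0
```

In this toy example, the division that agrees with the auxiliary source yields the smaller value, which is the behavior a minimized spectral-clustering term is expected to reward.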

8. The method of claim 5, further comprising:

the obtaining a final objective function based on the first objective function and the second objective function comprises:

and obtaining the final objective function according to the first objective function, the second objective function, and a regularization term of the weight vector, wherein the weight vector is a term to be solved in the final objective function.
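By way of a non-limiting illustration, a final objective of the kind described in claim 8 might be written as below. The symbols $J_1$ (first objective function), $J_{v'}$ (sub-objective function of auxiliary data source $v'$), $\alpha_{v'}$ (weight of auxiliary data source $v'$), and $\lambda$ (trade-off parameter), as well as the squared-norm form of the regularization term, are assumptions of this sketch; the claim does not specify them.

```latex
\min_{U,\;\alpha}\quad J_1(U) \;+\; \sum_{v'} \alpha_{v'}\, J_{v'}(U) \;+\; \lambda \,\lVert \alpha \rVert_2^2
```

In formulations of this kind, the weights are often constrained to be non-negative and to sum to one, and the regularization term then discourages a solution that places all of the weight on a single auxiliary data source.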

9. The method of any of claims 3 to 8, further comprising:

acquiring Must-link constraint supervision information (Must-link supervision information), wherein the Must-link supervision information is used for identifying that two social network users belong to the same network community;

and obtaining the final objective function based on the first objective function, the second objective function and the constraint function.
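By way of a non-limiting illustration, one common way to turn Must-link supervision information into a constraint function is a penalty on the squared distance between the indicator rows of each Must-link pair. This penalty form and the function name `must_link_penalty` are assumptions of the sketch; the construction actually claimed is not reproduced in the text here.

```python
def must_link_penalty(U, must_link_pairs):
    """Sum of squared distances between the indicator rows of each Must-link pair.

    U is a community indication matrix as a list of rows; must_link_pairs is a
    list of (i, j) user index pairs known to share a community. The value is
    zero when every linked pair has identical rows. This penalty form is an
    assumption of this sketch.
    """
    total = 0.0
    for i, j in must_link_pairs:
        total += sum((a - b) ** 2 for a, b in zip(U[i], U[j]))
    return total


# Hypothetical indication matrix for three users and two communities.
U = [[1, 0], [1, 0], [0, 1]]
print(must_link_penalty(U, [(0, 1)]))  # 0.0, the pair already shares a community
print(must_link_penalty(U, [(0, 2)]))  # 2.0, the pair is split across communities
```

Adding such a term to the final objective penalizes divisions that separate users known to belong together.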

10. The method of claim 9, wherein the constructing a constraint function according to the Must-link supervision information comprises:

11. The method of claim 9 or 10, wherein the final objective function comprises the first objective function, the second objective function, a regularization term of the weight vector, and the constraint function.

12. The method of claim 8, wherein solving the final objective function to obtain a solved community indication matrix comprises:

and solving the final objective function by using the alternating direction method of multipliers (ADMM) and a Lagrange multiplier method to obtain the solved community indication matrix.

13. An apparatus for discovering a network community, comprising:

a multi-source social data acquisition module, configured to acquire multi-source social network data of social network users, wherein the multi-source social network data comprises data corresponding to at least two data sources;

a data source relationship determining module, configured to respectively determine the association relationship between each auxiliary data source and a main data source based on the user relationship between the social network users corresponding to each auxiliary data source and the social network users corresponding to the main data source, wherein the main data source is a specified one of the at least two data sources, and each auxiliary data source is a data source other than the main data source among the at least two data sources;

14. An electronic device, comprising a processor and a memory;

the memory has stored therein readable instructions which, when loaded and executed by the processor, implement the method for discovering a network community according to any one of claims 1 to 12.

15. A computer-readable storage medium, having stored thereon readable instructions which, when loaded and executed by a processor, implement the method for discovering a network community according to any one of claims 1 to 12.