CN110851732A

CN110851732A - Attribute network semi-supervised community discovery method based on non-negative matrix three-factor decomposition

Info

Publication number: CN110851732A
Application number: CN201911033689.7A
Authority: CN
Inventors: 金弟; 何静
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2019-10-28
Filing date: 2019-10-28
Publication date: 2020-02-28

Abstract

The invention discloses an attribute network semi-supervised community discovery method based on non-negative matrix three-factor decomposition. First, a matrix decomposition model combining links and content is constructed. Then the model is optimized: the partial derivative with respect to each unknown variable in the model is computed and set to 0 to obtain the update rule of each variable. Data are collected and processed, and the required adjacency matrix, prior information and content matrix are extracted from the attribute network. Finally, the parameters and unknown variables are randomly initialized, the processed data set is fed into the model, and training proceeds by gradient descent using the update rules, iterating until the parameter updates converge.

Description

Attribute network semi-supervised community discovery method based on non-negative matrix three-factor decomposition

Technical Field

The invention belongs to the fields of machine learning, complex networks and natural language processing, and mainly concerns the fusion of network information. It provides a method that uses non-negative matrix factorization for dimensionality reduction and information fusion, and in particular relates to an attribute network semi-supervised community discovery method based on non-negative matrix three-factor decomposition.

Background

With the development of the internet, online social networks generate more and more data, and these data contain both links and semantic content, such as user blogs and research papers. Such data are typically modeled as an attribute network, in which the links form the topology of the graph and the content is modeled as attributes of the nodes. Discovering the semantic communities of these networks is of great significance. For example, in a paper citation network each node represents a paper, the papers cite one another, and each paper has its own content. Determining the community to which each paper belongs from the links and the content helps researchers follow the frontier of the current research field. Therefore, how to integrate links and content in a network to determine a more accurate semantic community structure is a challenging and meaningful problem.

Many community discovery methods for attribute networks have been proposed. Depending on the type of data used for clustering, they can be classified into four categories: 1) topology-based methods, 2) attribute-based methods, 3) integration methods, and 4) model-based methods. The first type translates community discovery on an attribute network into graph clustering on a reconstructed network, where the attributes of the nodes are also modeled as topological information. The second type converts community discovery on an attribute network into traditional vector data clustering, where links and content are merged to compute the similarity between pairs of nodes. The integration methods combine the results of different clusterings. In the model-based methods, links and content are modeled jointly by an NMF model or a probabilistic model. In particular, model-based methods can make full use of links and content and cast the clustering problem as a probabilistic inference process. Compared with the other methods they have a solid theoretical basis, so they are generally considered to perform well. However, when the network is very sparse and the community structure is too fuzzy, these methods often cannot accurately identify the community structure and its semantic information, because the information they rely on is too limited. Moreover, prior information and degree heterogeneity are two key factors that influence the network clustering result. In view of the limitations of current methods, the present invention studies a new method for fusing attribute network features based on non-negative matrix three-factor decomposition to overcome these defects.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a model for effectively combining attribute network links and contents, so that more accurate semantic communities and the interrelation between the communities are obtained.

To achieve this purpose, the technical scheme adopted by the invention uses non-negative matrix three-factor decomposition to fuse the links and node content of the attribute network, assisted by prior information, so as to improve the performance of community discovery. The method comprises the following steps:

1) constructing a matrix decomposition model combining links and content, wherein the model comprises three parts (a topology matrix, a prior matrix and a content matrix), and describing the meaning of each variable in the model in detail. The specific steps are as follows:

(1) construction of a non-negative matrix factorization model

A non-negative matrix three-factor decomposition model is constructed from the link information. It describes the membership of nodes in communities and the interrelations between communities, and is expressed as follows:

The meanings of the symbols in the formula are given in Table 1.

Table 1. Explanation of the symbols used in the matrix decomposition model
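As a hedged illustration of what a three-factor decomposition of link information looks like, the sketch below reconstructs a toy adjacency matrix as U S Uᵀ. The names U (node-to-community membership) and S (community-to-community relation), the toy values, and the squared-error loss are assumptions made for this example; they are not taken from the patent's formula or from Table 1.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def squared_error(A, B):
    """Squared Frobenius norm of A - B."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Toy adjacency matrix: nodes 0-2 are fully linked, nodes 3-5 are fully linked
# (self-links are kept for simplicity).
A = [[1.0 if (i < 3) == (j < 3) else 0.0 for j in range(6)] for i in range(6)]

# U (assumed name): n x k non-negative node-to-community membership.
U = [[1.0, 0.0] if i < 3 else [0.0, 1.0] for i in range(6)]
# S (assumed name): k x k non-negative community-to-community relation.
S = [[0.9, 0.1],
     [0.1, 0.8]]

A_hat = matmul(matmul(U, S), transpose(U))  # reconstruction U S U^T
loss = squared_error(A, A_hat)
```

The rows of U can be read as community memberships and the off-diagonal entries of S as the strength of links between different communities, which is why a three-factor model yields both kinds of information at once.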

(2) Semi-supervised model with embedded prior information

The invention adopts the must-link constraint as prior information to strengthen the community structure representation. Two nodes that belong to the same community in the high-dimensional space should also be close to each other in the low-dimensional space. The Euclidean distance is used to measure the distance between two nodes, and the semi-supervised community discovery model is constructed by combining this term with the link information:
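The idea of a must-link term can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's formula: U is a hypothetical membership matrix, `pairs` is a hypothetical list of must-link node pairs, and the penalty is the sum of squared Euclidean distances between the rows of U for each pair. The second function computes the same quantity in the graph Laplacian form tr(Uᵀ L U), a standard identity for such penalties.

```python
def must_link_penalty(U, pairs):
    """Sum of squared Euclidean distances between rows of U over must-link pairs."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(U[i], U[j]))
        for i, j in pairs
    )

def laplacian_penalty(U, pairs):
    """Same penalty written as tr(U^T L U), with L = D - M."""
    n, k = len(U), len(U[0])
    M = [[0.0] * n for _ in range(n)]          # symmetric must-link matrix
    for i, j in pairs:
        M[i][j] = M[j][i] = 1.0
    # D is the diagonal matrix of the row sums of M.
    L = [[(sum(M[i]) if i == j else 0.0) - M[i][j] for j in range(n)]
         for i in range(n)]
    return sum(U[i][c] * L[i][j] * U[j][c]
               for c in range(k) for i in range(n) for j in range(n))

pairs = [(0, 1)]                                # nodes 0 and 1 must share a community
U_close = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]  # nodes 0 and 1 have similar rows
U_far = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]    # nodes 0 and 1 disagree

penalty_close = must_link_penalty(U_close, pairs)
penalty_far = must_link_penalty(U_far, pairs)
```

Adding such a penalty to the decomposition loss pushes the low-dimensional representations of must-link nodes toward each other.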

(3) Semi-supervised model introducing node popularity

Since degree heterogeneity tends to increase the Euclidean distance between two nodes belonging to the same community, a node popularity matrix W is introduced to eliminate this influence. The semi-supervised model with node popularity is defined as:
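The effect described here can be illustrated with a small numerical sketch. It assumes, purely for illustration, that W is diagonal and that each node's raw representation is its community profile scaled by its popularity; how W actually enters the patent's model is not stated in this text.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two nodes with the same community profile but very different degrees.
profile = [0.8, 0.2]        # shared community profile (assumed values)
popularity = [1.0, 5.0]     # hypothetical diagonal entries of W

# Assumption: the raw representation is the profile scaled by popularity.
raw = [[w * p for p in profile] for w in popularity]
# Dividing each row by its popularity removes the degree effect.
corrected = [[v / w for v in row] for row, w in zip(raw, popularity)]

d_raw = euclidean(raw[0], raw[1])
d_corrected = euclidean(corrected[0], corrected[1])
```

Without the correction the two nodes look far apart even though they belong to the same community; after rescaling, their distance vanishes, so a must-link constraint between them no longer conflicts with their degrees.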

(4) semi-supervised model combining links with content

In order to better combine links and content, the invention uses the same latent space for the content as for the links between nodes, and defines a content matrix C by the bag-of-words method. The content matrix decomposition is defined as:

Therefore, the invention finally constructs a semi-supervised community semantic discovery model that combines links and content while taking both prior information and degree heterogeneity into account:
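A bag-of-words content matrix can be built as in the sketch below. The toy documents and the convention that rows are nodes and columns are vocabulary words are assumptions for illustration; the patent's own orientation of C is not given in this text.

```python
# Each node (for example a paper) carries a short text as its content.
docs = [
    "community detection in networks",
    "matrix factorization for community detection",
    "neural networks",
]

# Vocabulary: every distinct word, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

# C[i][j] counts how often vocabulary word j occurs in the content of node i.
C = [[doc.split().count(word) for word in vocab] for doc in docs]
```

Because C holds counts, it is non-negative, so it can be factorized with the same non-negative membership factor that is used for the links; sharing that factor is what ties the two sources of information to one latent space.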

2) optimizing the model: taking the partial derivative with respect to each unknown variable in the model and setting it to 0 to obtain the update rule of each variable;

3) collecting and processing data, and extracting the required adjacency matrix, prior information and content matrix from the attribute network;

4) randomly initializing the parameters and unknown variables, feeding the processed data set into the model of step 1), training by gradient descent with the update rules obtained in step 2), and iterating until the parameter updates converge;

5) recording the resulting parameters and variables, namely the membership between nodes and communities, the relationships between communities and the semantic interpretation of communities, in the relevant documents, and visualizing the experimental results.
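Steps 2) and 4) can be sketched as a training loop. The example below is a simplified stand-in, not the patent's update rules: it minimizes only a link term ||A - U S Uᵀ||² with projected gradient descent and a halving step size, and all names (`train`, `U`, `S`, `lr`, `tol`) are assumptions for illustration.

```python
import random

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def T(X):
    return [list(col) for col in zip(*X)]

def loss(A, U, S):
    R = matmul(matmul(U, S), T(U))
    return sum((a - r) ** 2 for ra, rr in zip(A, R) for a, r in zip(ra, rr))

def gradients(A, U, S):
    # E = U S U^T - A is the reconstruction error.
    R = matmul(matmul(U, S), T(U))
    E = [[r - a for r, a in zip(rr, ra)] for rr, ra in zip(R, A)]
    # dL/dU = 2 (E U S^T + E^T U S),  dL/dS = 2 U^T E U
    g1 = matmul(matmul(E, U), T(S))
    g2 = matmul(matmul(T(E), U), S)
    gU = [[2 * (a + b) for a, b in zip(r1, r2)] for r1, r2 in zip(g1, g2)]
    gS = [[2 * v for v in row] for row in matmul(matmul(T(U), E), U)]
    return gU, gS

def step(X, G, lr):
    # Gradient step followed by projection onto the non-negative values.
    return [[max(0.0, x - lr * g) for x, g in zip(rx, rg)] for rx, rg in zip(X, G)]

def train(A, k, iters=200, lr=0.1, tol=1e-9, seed=0):
    rng = random.Random(seed)
    n = len(A)
    U = [[rng.random() for _ in range(k)] for _ in range(n)]  # random initialization
    S = [[rng.random() for _ in range(k)] for _ in range(k)]
    history = [loss(A, U, S)]
    for _ in range(iters):
        gU, gS = gradients(A, U, S)
        # Halve the step size until the loss does not increase.
        while lr > 1e-12:
            U_new, S_new = step(U, gU, lr), step(S, gS, lr)
            new_loss = loss(A, U_new, S_new)
            if new_loss <= history[-1]:
                break
            lr *= 0.5
        else:
            break  # no acceptable step size was found
        U, S = U_new, S_new
        history.append(new_loss)
        if history[-2] - history[-1] < tol:  # updates have converged
            break
    return U, S, history

# Toy network with two groups of three mutually linked nodes.
A = [[1.0 if (i < 3) == (j < 3) and i != j else 0.0 for j in range(6)] for i in range(6)]
U, S, history = train(A, k=2)
```

In the full model the loss would also contain the prior, popularity and content terms, and each would contribute its own gradient; the loop structure (initialize, update, check convergence) stays the same.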

Advantageous effects:

1. By introducing prior information in the attribute network and taking the heterogeneity of node degrees into account, the community structure is strengthened and the community discovery capability is improved.

2. The method of the invention combines the link information and the content information in the same low-dimensional latent space. This effectively addresses the poor community discovery results and poor semantic interpretation of conventional community models caused by semantic ambiguity (such as ambiguous words), and makes the method more interpretable. Gradient descent makes the variable and parameter updates simple and fast with a short convergence time, so the method can be applied to large-scale networks.

3. The invention adopts a non-negative matrix three-factor decomposition method, which yields not only the membership between nodes and communities in the network but also the relationships between communities, and at the same time explains the semantic information of each community.

The characteristics are as follows:

a. in addition to traditional link information, the prior information and content information in the network are effectively and fully combined;

b. a more interpretable model is established by non-negative matrix factorization;

c. the update rules are simple and fast;

d. the method scales well.

Description of the drawings:

FIG. 1 is a diagram of a model framework established by the method of the present invention.

Detailed Description

The invention will be further illustrated by means of a specific example. The examples of the present invention are for better understanding of the present invention by those skilled in the art, and do not limit the present invention in any way.

In order to better address network sparsity and semantic ambiguity and to make full use of the network information to discover more about communities, the invention uses matrix decomposition to establish a highly interpretable model that effectively combines network links and content, and adopts gradient descent so that the method is easy to understand and fast to run. By training the model of the invention, the user can obtain more information about the communities in the network (node membership, relationships between communities and community semantic information). The method can be widely applied to fields such as text classification and clustering, commodity recommendation and information retrieval.

The modeling process established by the present invention (i.e., the required information and corresponding matrices, the combination process, and the effects that can be produced) is shown in FIG. 1.

The technical scheme adopted by the invention integrates the link and node contents of the attribute network and adds auxiliary prior information to improve the performance of community discovery, and comprises the following steps:

1. constructing a matrix decomposition model combining links and content, wherein the model comprises three parts (a topology matrix, a prior matrix and a content matrix), and describing the meaning of each variable in the model in detail;

2. optimizing the model: taking the partial derivative with respect to each unknown variable in the model and setting it to 0 to obtain the update rule of each variable;

3. collecting and processing data, taking the Cora data in Table 2 as an example, and extracting the required adjacency matrix, prior information and content matrix from the attribute network;

Table 2. Detailed description of the test data

4. randomly initializing the parameters and unknown variables, feeding the processed data set into the model of step 1, training by gradient descent with the update rules obtained in step 2, and iterating until the parameter updates converge;

5. recording the resulting parameters and variables, namely the membership between nodes and communities, the relationships between communities and the semantic interpretation of communities, in the relevant documents, and visualizing the experimental results.
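The data extraction in step 3 can be sketched as follows. The input shapes are assumptions for illustration (an edge list of node index pairs and a dictionary of known community labels for a small subset of nodes); the actual Cora preprocessing used by the inventors is not described in this text.

```python
from itertools import combinations

def adjacency_matrix(n, edges):
    """Build an n x n adjacency matrix from a list of (i, j) links."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1  # assumption: links are treated as undirected
    return A

def must_link_pairs(known_labels):
    """Derive must-link pairs from the nodes whose community is known.

    known_labels maps a node index to its community label.
    """
    by_label = {}
    for node, label in sorted(known_labels.items()):
        by_label.setdefault(label, []).append(node)
    return [pair for nodes in by_label.values() for pair in combinations(nodes, 2)]

edges = [(0, 1), (1, 2), (3, 4)]
A = adjacency_matrix(5, edges)
prior = must_link_pairs({0: "a", 2: "a", 3: "b", 1: "a"})
```

Every pair of labelled nodes in the same community becomes one must-link constraint, so even a few known labels produce a usable prior.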

Table 3. Accuracy (AC) and normalized mutual information (NMI) of the invention compared with other community discovery methods

Method	SSNMF	FSSNMF	PSSNMF	WSCDSM	The invention
AC	0.3027	0.6012	0.8224	0.7216	0.8470
NMI	0.4270	0.7615	0.9062	0.7531	0.9401
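For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them. The exact variants used for Table 3 are not stated; this example assumes NMI normalized by the geometric mean of the two entropies, and accuracy under the best one-to-one mapping of predicted to true labels (found by brute force, so it is only suitable for a handful of communities and assumes there are no more predicted labels than true labels).

```python
import math
from collections import Counter
from itertools import permutations

def nmi(true, pred):
    """Normalized mutual information, I(T;P) / sqrt(H(T) * H(P))."""
    n = len(true)
    ct, cp, joint = Counter(true), Counter(pred), Counter(zip(true, pred))
    mi = sum(c / n * math.log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items())
    ht = -sum(c / n * math.log(c / n) for c in ct.values())
    hp = -sum(c / n * math.log(c / n) for c in cp.values())
    return mi / math.sqrt(ht * hp) if ht > 0 and hp > 0 else 0.0

def clustering_accuracy(true, pred):
    """Fraction of nodes labelled correctly under the best label mapping."""
    t_labels, p_labels = sorted(set(true)), sorted(set(pred))
    best = 0
    for perm in permutations(t_labels, len(p_labels)):
        mapping = dict(zip(p_labels, perm))
        best = max(best, sum(mapping[p] == t for t, p in zip(true, pred)))
    return best / len(true)

true = [0, 0, 1, 1]
same_partition = [1, 1, 0, 0]   # identical grouping, labels swapped
one_mistake = [1, 1, 1, 0]      # one node placed in the wrong community
```

Both metrics ignore the names of the labels and score only the grouping, which is why a relabelled but identical partition scores 1.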

The invention provides an information fusion method based on non-negative matrix three-factor decomposition. The main contributions of the present invention include three aspects: 1. the method combines various information: the link information, the prior information and the content information are combined, and the information in the network is effectively and fully utilized; 2. the method adopts a non-negative matrix three-factor decomposition based method, so that the model has stronger interpretability and obtains more information about the community; 3. the method provided by the invention can solve the problems of network sparsity and semantic ambiguity more simply and quickly.

The experimental results on the sample data are shown in Table 3. It should be understood that the embodiments and examples discussed herein are illustrative only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application and scope of the appended claims.

The method performs better both in improving community discovery and in exploring more community information.

Claims

1. The attribute network semi-supervised community discovery method based on non-negative matrix three-factor decomposition is characterized by comprising the following steps of:

1) constructing a matrix decomposition model combining links and contents, wherein the matrix decomposition model comprises three parts of a topology matrix, a prior matrix and a content matrix, describing the meaning of each variable in the model in detail, optimizing the model, solving the partial derivative of an unknown variable in the model, and setting the partial derivative as 0 to obtain an update rule of each variable;

2) collecting and processing data, and extracting a required adjacency matrix, prior information and a content matrix from an attribute network;

3) randomly initializing parameters and unknown variables thereof, carrying out a training process by using the updating rules about the unknown variables obtained in the step 2) by adopting a gradient descent method, putting the processed data set into the model in the step 1) for training, and continuously iterating until the updating of the parameters is converged.

2. The method for discovering the attribute network semi-supervised community based on the non-negative matrix tri-factorization as recited in claim 1, wherein the step 1) is specifically as follows:

(1) construction of a non-negative matrix factorization model

(2) semi-supervised model with embedded prior information

According to the invention, a must-link constraint is used as prior information to strengthen the community structure representation;

two nodes that belong to the same community in the high-dimensional space should also be close to each other in the low-dimensional space, the Euclidean distance is adopted to measure the distance between two nodes, and the semi-supervised community discovery model is constructed by combining this term with the link information:

(3) semi-supervised model introducing node popularity

Since degree heterogeneity tends to increase the Euclidean distance between two nodes belonging to the same community, a node popularity matrix W is introduced to eliminate this influence, and the semi-supervised model with node popularity is defined as:

(4) semi-supervised model combining links with content

finally, a semi-supervised community semantic discovery model combining links and contents and considering heterogeneity of prior and degree is constructed:

3. The method for discovering the attribute network semi-supervised community based on the non-negative matrix tri-factorization as recited in claim 1, wherein the resulting parameters and variables, namely the membership between nodes and communities, the relationships between communities and the semantic interpretation of communities, are recorded in the relevant documents, and the experimental results are visualized.