CN115422445A - Network representation learning method based on hierarchical negative sampling - Google Patents


Info

Publication number
CN115422445A
CN115422445A (application CN202211007471.6A)
Authority
CN
China
Prior art keywords
vertex
community
negative
distribution
vertices
Prior art date
Legal status
Pending
Application number
CN202211007471.6A
Other languages
Chinese (zh)
Inventor
陈俊扬
伍楷舜
巩志国
戴志江
Current Assignee
Shenzhen University
Original Assignee
Shenzhen University
Priority date
Filing date
Publication date
Application filed by Shenzhen University
Priority to CN202211007471.6A
Publication of CN115422445A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G06F 16/9536 Search customisation based on social or collaborative filtering
    • G06F 16/9538 Presentation of query results
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models


Abstract

The invention discloses a network representation learning method based on hierarchical negative sampling. The method comprises the following steps: acquiring a group of random walk sequences for a graph network, wherein each walk sequence consists of a group of vertices; modeling the neighborhood information of the vertices for each random walk sequence to determine the potential community structure of a target vertex; computing, for each vertex, the probability that it is a negative sample of the target vertex based on the potential community structure, and sampling negative samples accordingly; and optimizing a set objective function based on the sampled negative samples to determine the vertex representation learning vectors. The method can adaptively discover the potential community structure of the vertices and learn more reasonable negative samples from the probability distribution of vertex relevance to the communities, thereby improving the performance of the vertex representation learning vectors.

Description

Network representation learning method based on hierarchical negative sampling
Technical Field
The invention relates to the technical field of data mining and machine learning, in particular to a network representation learning method based on hierarchical negative sampling.
Application scenarios
For example, in a social network each user may be regarded as a vertex, and the social relationship between two users may be represented by a connecting edge between their vertices; when a new user appears, recommending the most likely related users to the new user is one of the scenarios handled by the present invention. Similarly, in a citation network an article can be regarded as a vertex and the citation relationships between articles can be regarded as connecting edges; when a new article appears, retrieving the articles potentially related to it is also a scenario that can be handled by the invention.
The invention provides a network representation learning method based on hierarchical negative sampling, which learns vertex representation vectors with higher discrimination capability than prior methods, i.e., based on the network structure of the vertices, it maps the relationships between vertices into a low-dimensional vector space. After the representation vectors of the vertices are obtained, simple discriminative methods (such as SVM) or clustering methods (such as K-means) can be used to complete tasks such as vertex classification, link prediction and community detection.
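As an illustration of this downstream use (not part of the patented method; the embedding matrix, dimensions and class counts below are placeholder assumptions), the following sketch feeds learned vertex vectors into an SVM classifier for vertex classification and into K-means for community detection:

```python
# Illustrative sketch only: `embeddings` and `labels` stand in for the output of a
# network representation learning model and ground-truth vertex labels.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2405, 200))   # e.g. 2405 vertices, 200-dim vectors
labels = rng.integers(0, 17, size=2405)     # e.g. 17 classes, as in the Wiki dataset

# Vertex classification with a simple discriminative model (SVM).
X_tr, X_te, y_tr, y_te = train_test_split(embeddings, labels, train_size=0.1, random_state=0)
clf = LinearSVC().fit(X_tr, y_tr)
print("classification accuracy:", clf.score(X_te, y_te))

# Community detection with a simple clustering model (K-means).
clusters = KMeans(n_clusters=17, n_init=10, random_state=0).fit_predict(embeddings)
print("cluster sizes:", np.bincount(clusters))
```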
Background
With the rapid development of social networks, a large number of algorithmic models for network analysis have been proposed. In particular, network representation learning (NRL, also known as network embedding) has shown its effectiveness in network analysis tasks. Network representation learning aims to map vertices and edges into a low-dimensional vector space through encoding, thereby modeling graph networks. The learned vertex and edge representation vectors may be used for subsequent applications such as vertex classification, link prediction and community detection. Negative sampling (NS) is often used to improve the performance of network representation learning. However, current research generally samples negative samples at random based only on vertex frequency, i.e., vertices with higher frequency are more likely to be sampled, so the sampled vertices may not be true negative samples, and the learned vertex and edge representation vectors then fail to achieve satisfactory performance in subsequent applications. Furthermore, by exploiting the similarity between community distributions in a network and topic distributions in a document, the neighborhood information of vertices can be modeled as a potential topic-like structure, and thus the present invention is also related to topic modeling methods.
Related work on network representation learning, negative-sampling-based network representation learning, and topic modeling related to network representation learning is described separately below.
1) Network representation learning
Currently, there are many types of network representation learning schemes. For example, the DeepWalk method explores structural information in the network through random walks, and network embedding vectors can then be learned from the obtained walk sequences using the Skip-Gram model. As another example, a generic graph embedding method has been proposed that explicitly calibrates the distance between vertices to preserve vertex similarity in the embedding space. In summary, methods such as DeepWalk, LINE, node2vec, DRNE and AROPE have been proposed to explore the topology of the network; these methods optimize the corresponding objective function according to different tasks, but are limited to a negative sampling strategy, that is, sampling negative samples to compute the Skip-Gram-based objective function. Some scholars have proposed alternatives to negative sampling, such as hierarchical Softmax and noise contrastive estimation, but the existing literature indicates that negative sampling generally performs better than these alternatives on network representation learning tasks. Furthermore, all of these alternatives share a common limitation with negative sampling: negative samples are drawn from a fixed vertex distribution, so more reasonable negative samples cannot be chosen dynamically.
2) Network representation learning based on negative sampling
Negative sampling first appeared in the Skip-Gram model, which can be used as a neural network language model for learning word embeddings. Since the Skip-Gram model requires summing over the representation vectors of all vertices in each update, its objective function is not feasible to compute in a large-scale network. Negative sampling simplifies the Skip-Gram objective by sampling negative samples from a fixed distribution P_NS, thereby reducing the computational complexity. However, with negative samples drawn from the fixed distribution P_NS, vertices with higher frequency are more likely to be sampled as negative samples of each target vertex; this may lead to poor network representation learning performance, especially for vertices with large degrees, because a target vertex then has a higher probability of selecting irrelevant high-degree vertices as negative samples.
In order to generate high-quality negative samples, some scholars have proposed improved methods. For example, a Distance Negative Sampler has been defined, in which a vertex cannot select vertices in its 1st-order neighborhood as negative samples. In addition, highly correlated vertices in the N-th order neighborhood of the target vertex may still be sampled. When the objective function is optimized with gradient descent, suitable negative samples may also be obtained by selecting vertices with larger gradients, for example by applying a generative adversarial network (GAN) in a reinforcement learning manner to select negative samples. However, these GAN-based methods have high variance because they allow a vertex to select its neighbors as negative samples, resulting in erratic information propagation and unstable performance of the final network representation learning. In short, existing methods generally do not model the neighborhood information of vertices, and therefore do not help select reasonable vertices as negative samples of target vertices.
3) Topic modeling related to web representation learning
After performing random walks on the vertices of a network, sets of vertex sequences are obtained; a vertex may be regarded as a word and a vertex sequence obtained by a walk may be regarded as a sentence. Thus, the neighborhood information of vertices can be modeled with a topic model or its variants. Some existing work explores the potential topic structure in networks to enable network representation learning. However, these methods do not consider how to select more reasonable negative samples. Furthermore, these methods use a predefined number of topics in the modeling, which is usually unknown in the real world and may have a large impact on the final performance of the model. There is also some work in topic modeling that improves negative sampling by ranking negative samples according to their gradient magnitude. These methods still allow a word to select its more relevant neighbors as negative samples, which can reduce the performance of network representation learning.
Analysis shows that existing network representation learning models generally adopt Skip-Gram for parameter optimization; Skip-Gram is an optimization method that maximizes the co-occurrence probability of the content within a sliding window. After exploring the neighborhoods of vertices, these models typically use the Skip-Gram model to learn the vertex representation learning vectors, where the content in the sliding window corresponds to the neighbor vertices in the graph. However, due to the computational complexity of the Skip-Gram optimization function, most network representation learning methods adopt negative sampling to approximate the Skip-Gram objective function. Negative sampling can generally achieve better performance than other optimization methods on large-scale datasets; its main idea is to encourage the representation vector of the target vertex to be close to the representation vectors of its neighbors while staying far away from the representation vectors of its negative samples.
However, existing negative sampling methods sample negative samples based on vertex occurrence frequency without considering the network structure in which the vertices are located, and may therefore sample unreasonable negative samples. For example, for a certain target vertex, its neighboring vertices may be sampled as negative samples, which can result in poor network representation learning performance. To address this problem, some scholars define a Distance Negative Sampler that, for each target vertex u, selects negative samples from the set V \ ({u} ∪ N(u)), where V is the vertex set and N(u) is the neighborhood of u; see the different-order neighborhoods of target vertex 1 shown in FIG. 1. From the center outwards, the vertices in the first dashed circle are the 1st-order neighbors of vertex 1, and the vertices in the second and third dashed circles are the 2nd- and 3rd-order neighbors of vertex 1, respectively. This approach excludes the vertices in the 1st-order neighborhood of vertex 1, e.g., vertices 2, 3 and 4, from the selection of negative samples. However, none of the current solutions consider the neighborhood information of the target vertex when sampling negative samples; in short, for target vertex 1 of FIG. 1, the current solutions can still select vertex 11 as its negative sample.
Furthermore, it is difficult to determine the optimal order of the neighborhood of the target vertex that should be retained, i.e., whose vertices are not selected as negative samples of the target vertex. For example, if the optimal order is set to 1, the vertices in the 1st-order neighborhood of the target vertex will not be negative samples, while vertices in its 2nd- or 3rd-order neighborhoods may still be selected as negative samples; the optimal retained order varies from one dataset to the next, and whether it is set reasonably affects the network representation learning performance.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a network representation learning method based on hierarchical negative sampling, which comprises the following steps:
for a network G = (V, E), obtaining a set of random walk sequences S = {s_1, …, s_M}, where V denotes the vertex set, E ⊆ V × V denotes the edge set, N is the size of the vertex set, M denotes the number of random walks, and each walk sequence s consists of a set of vertices {v_1, …, v_N};
modeling the neighborhood information of the vertices for each random walk sequence to determine the potential community structure of the target vertex;
computing, for each vertex, the probability that it is a negative sample of the target vertex based on the potential community structure, and sampling negative samples accordingly;
and optimizing a set objective function based on the sampled negative samples to determine the vertex representation learning vectors.
Compared with the prior art, the method provides a novel negative sampling method, Hierarchical Negative Sampling (HNS), which can model the potential structure of the vertices and learn the interrelation between this potential structure and the vertices. During training, more reasonable negative samples can be sampled, so that better network representation learning performance is obtained. The method can be applied to network data of different scales, and its performance on the vertex classification task is superior to that of existing network representation learning models.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic diagram of the different-order neighborhoods of a target vertex in the prior art;
FIG. 2 is a schematic diagram of a probabilistic graphical model of hierarchical negative sampling according to one embodiment of the present invention;
FIG. 3 is a schematic diagram of a training framework for hierarchical negative sampling according to one embodiment of the present invention;
FIG. 4 is a graphical illustration of the correlation of the number of communities with the network representation learning performance in accordance with one embodiment of the present invention;
FIG. 5 is a graphical illustration of the effect of the number of communities found at different values of α, according to one embodiment of the present invention;
In the figures: Targeted vertex denotes the target vertex; Rank 1 neighbor denotes the 1st-order neighborhood; Rank 2 neighbor denotes the 2nd-order neighborhood; Rank 3 neighbor denotes the 3rd-order neighborhood; Number Setting of Community denotes the number of communities.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The invention provides a novel negative sampling method, called Hierarchical Negative Sampling (HNS), for modeling the N-th order neighborhood information of a target vertex. The invention does not require manually setting a threshold for the optimal order of the retained neighborhood of the target vertex; instead, it adaptively discovers the potential community structure of the vertices and then computes, for each vertex, the probability of being a negative sample of the target vertex according to this structure. Specifically, for a target vertex, HNS first assigns the target vertex to a certain community, and then samples negative samples from that community according to the probability distribution of vertex relevance to the community in order to optimize the objective function. By modeling the neighborhood information of the vertices in the network as a hierarchical community structure, more reasonable negative samples can be learned, thereby improving the performance of the vertex representation learning vectors.
In the following, key knowledge about network representation learning, as well as the main mathematical symbols and expressions, will be introduced.
1. Random walk (Random walks)
In one embodiment, G = (V, E) is used to represent a network, where V denotes the vertex set and E ⊆ V × V denotes the edge set. A random walk algorithm is used to obtain a set of sequences S = {s_1, …, s_M}, where each sequence s consists of a set of vertices {v_1, …, v_N}, N is the size of the vertex set, and M denotes the number of random walks.
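The following is a minimal illustrative sketch of such a uniform random walk, assuming the network is stored as an adjacency list; the function and variable names are illustrative and not taken from the patent:

```python
# Illustrative sketch of uniform random walks producing the sequence set S.
import random

def random_walks(adj, walks_per_vertex, walk_length, seed=0):
    """adj: dict mapping vertex -> list of neighbour vertices."""
    rng = random.Random(seed)
    sequences = []
    for _ in range(walks_per_vertex):
        for start in adj:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adj[walk[-1]]
                if not neighbours:          # dead end: stop the walk early
                    break
                walk.append(rng.choice(neighbours))
            sequences.append(walk)
    return sequences

# Tiny example graph.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
S = random_walks(adj, walks_per_vertex=2, walk_length=5)
print(S[:3])
```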
2. Skip-Gram model
Skip-Gram is a shallow neural network model that can be used in network representation learning algorithms. For each vertex v_i in a sequence s, the Skip-Gram objective is to maximize, within a sliding window of size t, the mean log probability of the context vertices {v_{i−t}, …, v_{i+t}} \ v_i:

J = (1/|s|) ∑_{i=1}^{|s|} ∑_{v_j ∈ {v_{i−t}, …, v_{i+t}} \ v_i} log P(v_j | v_i)    (1)

where v_j is a context vertex of v_i, |s| is the number of vertices in the sequence, and P(v_j | v_i) is the softmax function, defined as:

P(v_j | v_i) = exp(v′_j · v_i) / ∑_{v=1}^{N} exp(v′_v · v_i)    (2)

where v ∈ R^d denotes the learned representation vector of a vertex, d is the dimension of the vector, v′_j denotes the transpose of the vector of vertex v_j, and v′_v denotes the transpose of the vector of vertex v_v.
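For illustration, a possible computation of the Skip-Gram probability of equation (2) is sketched below; the arrays emb (input vectors) and ctx (context vectors) are placeholder assumptions:

```python
# Illustrative sketch of the softmax P(v_j | v_i) in equation (2).
import numpy as np

def skipgram_probability(emb, ctx, target, context):
    scores = ctx @ emb[target]                 # v'_v . v_i for every vertex v
    scores -= scores.max()                     # numerical stabilisation
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[context]                      # P(v_j | v_i)

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))
ctx = rng.normal(size=(5, 8))
print(skipgram_probability(emb, ctx, target=0, context=3))
```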
3. Negative sampling
As shown in equation (2), the denominator of the softmax function requires, at each iteration, the dot products of the representation vectors of all vertices with the representation vector of the target vertex, followed by a summation. Thus, in large-scale networks the Skip-Gram objective is computationally infeasible, while negative sampling simplifies this computation. Specifically, given a target vertex v_i and its context vertex v_j, the negative sampling objective is expressed as:

log σ(v′_j · v_i) + ∑_{k=1}^{K} E_{v_k ∼ P_NS(v)} [ log σ(−v′_k · v_i) ]    (3)

where σ(x) = 1/(1 + exp(−x)), v_k is a negative-sample vertex drawn from the probability distribution P_NS(v) over vertices (which may be set to random sampling), K is the number of negative samples, v′_k denotes the transpose of the vector of vertex v_k, and v′_j denotes the transpose of the vector of vertex v_j.
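For comparison with the hierarchical scheme introduced later, the following sketch illustrates conventional negative sampling: a fixed distribution P_NS(v) proportional to the 3/4 power of vertex frequency, from which K negatives are drawn per (target, context) pair. The names are illustrative, not from the patent:

```python
# Illustrative sketch of frequency-based negative sampling (the conventional NS baseline).
import numpy as np

def build_ns_distribution(sequences, num_vertices, power=0.75):
    freq = np.zeros(num_vertices)
    for seq in sequences:
        for v in seq:
            freq[v] += 1
    weights = freq ** power
    return weights / weights.sum()

def sample_negatives(p_ns, k, rng):
    return rng.choice(len(p_ns), size=k, p=p_ns)

rng = np.random.default_rng(0)
sequences = [[0, 1, 2, 1, 0], [2, 3, 2, 1, 2]]
p_ns = build_ns_distribution(sequences, num_vertices=4)
print("P_NS:", np.round(p_ns, 3))
print("negatives for one pair:", sample_negatives(p_ns, k=5, rng=rng))
```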
4. The invention provides a hierarchical negative sampling method
In the following, the problem with negative sampling for approximating the Skip-Gram function is first explained; this problem also exists in other models that use negative sampling, such as NSCaching, R-NS and SGA. The HNS model of the present invention is then introduced, together with how it solves these existing problems.
1. Problems in the prior art
Negative sampling may take unreasonable samples (i.e., vertices related to the target vertex) as negative samples. As described in Mikolov et al. (Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems, 2013, pp. 3111-3119), negative sampling defines P_NS(v) as a distribution proportional to d_v^λ, where d_v is the frequency of occurrence of vertex v in the sequences and λ is a manually set prior parameter, usually 3/4. In summary, for each vertex, negative samples are drawn from the fixed distribution P_NS(v), described by the formula:

P_NS(v) = d_v^λ / ∑_{u∈V} d_u^λ    (4)
as can be seen from equation (4), vertices with higher frequencies are more likely to be sampled as negative samples of the target vertex. As shown in fig. 1, when a high frequency vertex is located in the 2 nd order neighborhood or the 3 rd order neighborhood of the target vertex, the strategy may sample the high frequency vertex, thereby resulting in poor network representation learning performance. Furthermore, as can be seen from formula (4), P NS (v) Is from the sequence SA fixed distribution is obtained, and therefore it is not possible to model the underlying community structure with the information of the relevant vertices, and thus to select the negative examples more appropriately. The most advanced models available, for example based on a framework for the generation of the network GAN against, including the Kbgan, nscoching, SGA, etc., models, propose by σ (-v ') in equation (3)' k ·v i ) The method of negative sampling is improved by dynamically selecting vertices with larger gradients to obtain high quality negative samples. However, the performance of these methods typically has a high variance because it may be considered as a negative sample for the relevant vertex in the N-th order neighborhood of the target vertex. Therefore, these gradient-based methods may produce error propagation, resulting in unstable model performance. Alternatively, the prior art only considers the 1 st order neighborhood of the target vertices in FIG. 1, which is not sufficient to explore the correlation between vertices.
2. Modeling of neighborhood information by HNS model
After performing a random walk on the network, a walk sequence S can be obtained. First, neighborhood information of vertices can be modeled as a potential community structure. Then, the following assumptions were followed: (1) Each random walk sequence is a sample of the potential structure of the community; (2) each vertex has a probability distribution of its community preference.
The present invention considers how to select negative samples. In addition, the number of communities in the real world is often unknown, whereas existing algorithms usually need to set the number of communities in advance and therefore cannot handle the dynamic change of the number of communities in community discovery. In one embodiment of the invention, a hierarchical Dirichlet process is used to explore the potential community structure, which provides a non-parametric approach to inferring the number of communities. The probabilistic graphical model of HNS is shown in FIG. 2, and the generation process comprises the following steps:
Step S1, modeling the probability distribution over all known communities C: G_0 | γ, C ∼ Dir(γ/C), where G_0 denotes the base distribution of the Dirichlet process, Dir(γ/C) denotes the Dirichlet prior distribution of a sequence with respect to community relevance, γ is a hyperparameter, and C denotes the number of communities;
Step S2, for each community c ∈ {1, 2, …, C}:
(a) the probability of each vertex belonging to the community is φ_c | β ∼ Dir(β), where Dir(β) denotes the Dirichlet prior distribution of a vertex in a sequence with respect to community relevance, and β is a hyperparameter;
Step S3, for each sequence s obtained by the walks:
(a) sample the community weights of the sequence according to θ_s | α, G_0 ∼ DP(α, G_0), where θ_s denotes the community weight distribution of sequence s, α denotes the weight of a vertex belonging to a new community, and DP(α, G_0) denotes the Dirichlet-process modeling of the relevance of a sequence to a new community;
(b) for each vertex v ∈ {v_1, v_2, …, v_N} in the sequence:
i. sample a community assignment for the vertex according to z_{s,v} | θ_s ∼ Multinomial(θ_s), where z_{s,v} denotes the probability distribution of the relevance of vertex v in sequence s to all communities;
ii. generate the vertex from the assigned community: v | z_{s,v}, φ_{z_{s,v}} ∼ Multinomial(φ_{z_{s,v}}), where φ_c denotes the probability distribution of the vertices related to community c within community c, and z_{s,v} here serves as the weight of the relevance of vertex v in sequence s to the sampled community.
Here C denotes the discovered community set, with |C| = 1 at initialization; γ and β are Bayesian parameters of the Dirichlet distributions, and DP denotes the Dirichlet process. In summary, HNS is a model based on the hierarchical Dirichlet process (HDP); the size of the community set C can change dynamically, and the model of the present invention can automatically learn and determine the number of communities during training.
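For illustration, the generative story above can be written as ancestral sampling; the sketch below uses a fixed truncation level C for readability, whereas the patented model is nonparametric (Dirichlet-process based), so this truncated version is only an approximation under stated assumptions:

```python
# Illustrative, truncated sketch of the generative process (not the HDP inference itself).
import numpy as np

def generate_corpus(num_vertices, num_sequences, seq_len, C, gamma, alpha, beta, seed=0):
    rng = np.random.default_rng(seed)
    g0 = rng.dirichlet([gamma / C] * C)                  # base distribution over communities
    phi = rng.dirichlet([beta] * num_vertices, size=C)   # per-community vertex distributions
    corpus, assignments = [], []
    for _ in range(num_sequences):
        theta_s = rng.dirichlet(alpha * g0)              # sequence-level community weights
        z = rng.choice(C, size=seq_len, p=theta_s)       # community assignment per position
        vertices = [rng.choice(num_vertices, p=phi[c]) for c in z]
        corpus.append(vertices)
        assignments.append(z)
    return corpus, assignments, phi

corpus, z, phi = generate_corpus(num_vertices=50, num_sequences=3, seq_len=8,
                                 C=4, gamma=1.0, alpha=1.0, beta=0.01)
print(corpus[0], z[0])
```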
3. Training framework for HNS of the invention
In one embodiment, the proposed model training framework is shown in FIG. 3. Specifically, for each random walk sequence, a sliding window is first used to generate positive-sample vertex pairs; for example, if vertex 4 in FIG. 3 is the target vertex, the positive-sample vertex pairs are (4, 3) and (4, 5). Then, according to the generation process described for the neighborhood information modeling of FIG. 2, a community is sampled and assigned to target vertex 4, and the negative samples of the target vertex are sampled from the inverse distribution of the vertices' relevance to that community. Finally, the vertex representation vectors are learned in conjunction with the negative sampling methodology.
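A minimal sketch of the sliding-window generation of positive vertex pairs is given below; it reproduces the example above, where target vertex 4 with neighbors 3 and 5 yields the pairs (4, 3) and (4, 5). The function name is illustrative:

```python
# Illustrative sketch: positive (target, context) pairs from a walk with a sliding window.
def positive_pairs(walk, window):
    pairs = []
    for i, target in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, walk[j]))
    return pairs

print(positive_pairs([3, 4, 5], window=1))  # [(3, 4), (4, 3), (4, 5), (5, 4)]
```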
4. Model reasoning
In one embodiment, Gibbs sampling is used to infer the parameters of the probabilistic graphical model of FIG. 2. Specifically, given a sequence s, each vertex v in the sequence has a distribution z_{s,v} over the potential community weights. Using the property that the Dirichlet distribution is the conjugate prior of the multinomial distribution, the community-related parameters θ_s and φ_c can be integrated out. Thus, given the vertices of sequence s other than the current vertex v, the community assignment counts of the other vertices in sequence s, n_{s,c}^{¬v}, and the assignment counts of vertex v to a community c outside sequence s, n_{c,v}^{¬s}, can be obtained. Finally, the probability that vertex v selects an existing community can be obtained from the conditional distribution P(z_{s,v} = c | z_{¬(s,v)}, γ, β).
Based on the computed probability, a community c is sampled and assigned to vertex v. There are two cases when sampling a community: selecting an existing community or selecting a new community, with the corresponding probability formulas as follows:
1) Selecting an existing community
Given the community assignment information of the other sequences and vertices, the probability that the current vertex v selects an existing community c is:

P(z_{s,v} = c | z_{¬(s,v)}, γ, β) ∝ (n_{s,c}^{¬v} + γ) · (n_{c,v}^{¬s} + β) / (n_{c,·}^{¬s} + Nβ)    (5)

where n_{s,c}^{¬v} denotes the number of vertices in sequence s, other than the current vertex v, that are assigned to community c, n_{c,v}^{¬s} denotes the number of times vertex v is assigned to community c outside the current sequence s, and n_{c,·}^{¬s} denotes the total number of vertex assignments to community c outside the current sequence s. In summary, the left factor of the product in equation (5) means that vertex v tends to select communities with more vertex assignments in the same sequence, and the right factor means that vertex v tends to select communities in which it co-occurs more often with other vertices.
2) Selecting a new community
Let C denote the number of existing communities and C + 1 denote a new community; the conditional probability that vertex v belongs to a new community is:

P(z_{s,v} = C + 1 | z_{¬(s,v)}, α) ∝ α · (1/N)    (6)

where the left factor of the product represents the probability that vertex v belongs to the new community, the right factor is a normalization factor, and N denotes the number of vertices. α is a prior parameter: the larger α is, the more likely a vertex is to select a new community. In experiments, the performance of the HNS model was found to be robust to the parameter α.
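For illustration, the community-assignment step can be sketched as follows; the count statistics follow the notation above, but the normalization terms of equations (5) and (6) are approximated here, so this is an illustrative approximation rather than the patented formula:

```python
# Illustrative sketch of sampling a community for one vertex (existing vs. new community).
import numpy as np

def sample_community(n_sc, n_cv, n_c, v, alpha, beta, gamma, num_vertices, rng):
    """n_sc[c]: vertices of the current sequence assigned to community c (excluding v).
    n_cv[c, v]: assignments of vertex v to community c outside the current sequence.
    n_c[c]: total assignments to community c outside the current sequence."""
    C = len(n_sc)
    weights = np.empty(C + 1)
    for c in range(C):
        weights[c] = (n_sc[c] + gamma) * (n_cv[c, v] + beta) / (n_c[c] + num_vertices * beta)
    weights[C] = alpha / num_vertices          # probability mass for a brand-new community
    weights /= weights.sum()
    return rng.choice(C + 1, p=weights)        # returns C when a new community is opened

rng = np.random.default_rng(0)
n_sc = np.array([3.0, 1.0])
n_cv = np.array([[5.0, 2.0, 0.0], [1.0, 0.0, 4.0]])
n_c = n_cv.sum(axis=1)
print(sample_community(n_sc, n_cv, n_c, v=0, alpha=0.1, beta=0.01, gamma=1.0,
                       num_vertices=3, rng=rng))
```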
3) Sampling more reasonable negative samples:
referring back to the training framework shown in FIG. 3, first, the target vertex v is defined by equations (5) and (6) t A community c is sampled. Then, negative examples may be sampled from the community, and the probability that other vertices in the community are considered negative examples is calculated as follows:
Figure BDA0003809451050000105
among them, the principle behind equation (7) is that when a target vertex is assigned to a community c, the negative sample should be a vertex having a small correlation with the community, and the probability of irrelevance of the vertex is calculated as P neg (v) In that respect Then, with respect to the target vertex v, in the combination formula (7) t Distribution of negative sample vertices ψ neg Comprises the following steps:
Figure BDA0003809451050000111
wherein Dir (P) neg (v 1 ),...,P neg (v N ) A dirichlet prior distribution representing the vertices of the negative examples.
Finally, negative samples can be sampled from a multinomial distribution:

v | ψ_neg ∼ Multinomial(ψ_neg)    (9)

Compared with the NS method of equation (4), which samples negative samples from the fixed vertex-frequency distribution P_NS(v), the invention learns the community-level negative-sample vertex distribution ψ_neg, which takes the potential hierarchical community structure of the vertices into account and samples negative samples according to the community information of the vertices, i.e., vertices with low relevance to the community of the target vertex are sampled as negative samples.
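A minimal sketch of equations (7)-(9) is given below; the assumption P_neg(v) ∝ 1 − φ_{c,v} follows the description above, and all function names are illustrative:

```python
# Illustrative sketch of community-aware negative sampling (equations (7)-(9)).
import numpy as np

def community_negative_samples(phi, c, k, rng, eps=1e-6):
    p_neg = 1.0 - phi[c]                       # irrelevance of each vertex to community c
    p_neg = np.maximum(p_neg, eps)             # keep Dirichlet parameters strictly positive
    psi_neg = rng.dirichlet(p_neg)             # equation (8)
    return rng.choice(len(psi_neg), size=k, p=psi_neg)   # equation (9)

rng = np.random.default_rng(0)
phi = rng.dirichlet([0.1] * 20, size=3)        # toy per-community vertex distributions
print(community_negative_samples(phi, c=1, k=5, rng=rng))
```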
5) Objective function optimization analysis
In the following it will be described how the HNS of the present invention optimizes the Skip-Gram based objective function.
In one embodiment, stochastic gradient descent (SGD) is used to optimize the Skip-Gram-based objective function. Combining the NS method of equations (3) and (4), the objective function can be further written as:

L = ∑_{(v_i, v_j)} [ log σ(v′_j · v_i) + ∑_{k=1}^{K} E_{v_k ∼ P_NS(v)} log σ(−v′_k · v_i) ]    (10)

where v_i is the target vertex and v_j is its context vertex. In HNS, negative samples are drawn from the ψ_neg distribution, i.e., P_neg in equation (7) replaces P_NS in equation (10). The objective function of HNS is therefore obtained as:

L_HNS = ∑_{(v_i, v_j)} [ log σ(v′_j · v_i) + ∑_{k=1}^{K} E_{v_k ∼ ψ_neg} log σ(−v′_k · v_i) ]    (11)
to optimize the above objective function, its derivative is calculated as follows:
Figure BDA0003809451050000114
then let the derivative
Figure BDA0003809451050000115
Or → 0, the following expression can be further derived:
Figure BDA0003809451050000121
as can be seen from the above formula:
for each target vertex v i And positive samples thereof, e.g. v j Calculating the similarity between them, optimizing to make the value close to-log (K.P) neg (v j ) Log (K.P) of the formula neg (v j ) Is proportional to
Figure BDA0003809451050000122
While
Figure BDA0003809451050000123
Representing the vertex v j Belonging to a target vertex v i The probability of the community c;
for v i And negative examples thereof, e.g. v k Through 1-P neg (v k ) → 0 optimizes the similarity between them with the goal of having negative samples v k And is assigned to the target vertex v i Is less relevant. Finally, the vertex distribution ψ of the negative sample community is used neg To sample negative samples.
In conclusion, the hierarchical negative sampling method provided by the invention models the N-th order neighborhood information of the target vertex by designing a probabilistic graphical model, so as to select more reasonable negative samples. The probabilistic graphical model can learn the probability distribution of the negative samples related to the target vertex; meanwhile, the potential community structure in the network is explored with a hierarchical Dirichlet process, and non-parametric Bayesian inference allows the model to adaptively learn the optimal neighborhood order of the target vertex. Specifically, it can be observed from equation (13) that the method models the potential community structure of the target vertex using its neighborhood information, while formulating and optimizing an objective function combined with negative sampling. The detailed pseudocode is shown in Algorithm 1.
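The training loop implied by Algorithm 1 can be sketched as follows: alternating Gibbs-style community assignment with SGD updates of the Skip-Gram objective using community-aware negative samples. This is an illustrative sketch, not the patented pseudocode; the helper callables assign_community and negatives_from_community are hypothetical stand-ins for the routines sketched earlier:

```python
# Illustrative end-to-end training sketch (community assignment + SGD on the NS objective).
import numpy as np

def sgd_step(emb, ctx, target, context, negatives, lr=0.025):
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    grad_t = np.zeros_like(emb[target])
    # Positive pair: push the target towards its context vertex.
    g = 1.0 - sigmoid(ctx[context] @ emb[target])
    grad_t += g * ctx[context]
    ctx[context] += lr * g * emb[target]
    # Negative samples: push the target away from the sampled negatives.
    for neg in negatives:
        g = -sigmoid(ctx[neg] @ emb[target])
        grad_t += g * ctx[neg]
        ctx[neg] += lr * g * emb[target]
    emb[target] += lr * grad_t

def train(pairs, num_vertices, dim, assign_community, negatives_from_community,
          epochs=5, k=5, seed=0):
    rng = np.random.default_rng(seed)
    emb = (rng.random((num_vertices, dim)) - 0.5) / dim
    ctx = np.zeros((num_vertices, dim))
    for _ in range(epochs):
        for target, context in pairs:
            c = assign_community(target)              # Gibbs-style community assignment
            negs = negatives_from_community(c, k)     # community-aware negative samples
            sgd_step(emb, ctx, target, context, negs)
    return emb

# Toy usage with stand-in samplers: every vertex maps to community 0 and negatives are
# drawn uniformly at random, only to show the control flow.
rng = np.random.default_rng(1)
pairs = [(0, 1), (1, 2), (2, 3), (3, 0)]
emb = train(pairs, num_vertices=4, dim=8,
            assign_community=lambda v: 0,
            negatives_from_community=lambda c, k: rng.integers(0, 4, size=k))
print(emb.shape)
```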
6) HNS algorithm complexity analysis
In HNS, the training process consists of two parts: inferring the parameters of the probabilistic graphical model with Gibbs sampling, and learning the vertex representations with stochastic gradient descent. The complexity of the first part is O(I_1 · C · V · S · L), where I_1 is the number of iterations, C is the number of communities, V is the dimension of the vertex embedding vectors, S is the number of sequences, and L is the length of a sequence. The complexity of the second part is O(I_2 · V log V), where I_2 is the number of iterations. Thus, the overall complexity of HNS is O(V(I_1 · C · S · L + I_2 · log V)). In addition, each component of HNS can be accelerated during training by existing algorithms, such as the combined use of PLDA, PLDA+ and Skip-Gram implementations. In large-scale scenarios, the training time of these parallel algorithms scales only linearly with the data size.
To further verify the effect, the method was applied to real-world network datasets of different scales. In the experiments, three real-world network datasets were used to evaluate the performance of the proposed HNS model, and the robustness of HNS to variations of its parameters was also investigated. The experiments show that the method outperforms the current state-of-the-art models on the vertex classification task.
1. Data set
Experiments were performed on two citation networks and one language network. Table 1 lists the detailed statistics of the data set.
Table 1: statistical information of the Experimental data set
Dataset   Vertices   Edges    Classes
Cora      2211       5214     7
DBLP      17725      52914    4
Wiki      2405       17981    17
(1) Cora is a typical citation network dataset constructed in prior work. After filtering out unlabeled papers, the dataset contains 2211 machine learning paper abstracts from 7 classes and 5214 links between papers, which correspond to citation relationships.
(2) DBLP is a citation network dataset in the computer science domain, in which each paper cites or is cited by other papers. In the experiments, papers from conferences in 4 research fields were selected: 1) databases: SIGMOD, ICDE, VLDB, EDBT, PODS, ICDT, DASFAA, SSDBM, CIKM; 2) data mining: KDD, ICDM, SDM, PKDD, PAKDD; 3) artificial intelligence: IJCAI, AAAI, NIPS, ICML, ECML, ACML, IJCNN, UAI, ECAI, COLT, ACL, KR; 4) computer vision: CVPR, ICCV, ECCV, ACCV, MM, ICPR, ICIP, ICME. After preprocessing, the retained dataset contains the titles of 17725 papers from the research fields mentioned above and 52914 citation links between them.
(3) Wiki is a data set of a language network, containing 2405 web pages of 17 categories and 17981 links between them. This data set was first published in the LBC project and is widely used for the task of evaluating vertex classifications.
2. Model for performance comparison
To verify the performance of the present invention, the performance of HNS on the vertex classification task was compared with the state-of-the-art model.
1) NSCaching is a model based on a generative adversarial network that can select high-quality negative samples by calculating the gradient magnitude of the negative samples when training vertex embedding vectors with gradient descent.
2) NS is the most widely used model in vertex representation learning methods. It simplifies the objective function of the Skip-Gram model but retains its performance, which is very effective for handling large-scale networks.
3) The non-neighborhood negative sampling (NN-NS) method is a variant of NS. For each target vertex, NN-NS sets the probability of choosing its neighbors (corresponding to the 1st-order neighborhood of the target vertex mentioned in the shortcomings of the prior art) as negative samples to zero, because the target vertex is strongly correlated with its neighbors.
4) HNS is a hierarchical negative sampling model provided by the invention, and prevents the relevant vertex of the target vertex from being sampled as a negative sample by modeling the neighborhood information of the target vertex.
5) HNS (w/o DP) is a variant of the proposed HNS in which the number of communities is preset rather than automatically inferred using the Dirichlet process. By manually searching for the best value of the number of communities, this variant serves as a reference baseline for HNS, used to evaluate whether HNS can automatically approximate that best value and whether HNS achieves performance similar to HNS (w/o DP) on the vertex classification task.
3. Model parameter setting and operating environment
For fair comparison, the following parameters were set uniformly in all methods: the number of random walks per vertex is 80, the sliding window size is 10, the number of negative samples per target vertex is 5, the number of iterations is 5, and the vertex embedding vector dimension is 200. In NSCaching, the ComplEx mode, which gave the best performance, was chosen. In HNS, the following hyperparameters were selected using a grid search: γ ∈ {0.5, 1.0, 1.5}, α ∈ {0.02, 0.06, 0.1, 0.14, 0.18, 0.2} · (1e−2 · M), and β ∈ {0.01, 0.05, 0.1}, where M is the number of random walk sequences. The HNS results reported below use the parameters (γ = 1.0, α = 0.001 · M, β = 0.01). Furthermore, in HNS (w/o DP), the number of communities was searched for its optimal value over C ∈ [10, 100]. The models were implemented in Python, and all comparison models were run on a Linux server with an Intel Core i9-9900 3.10 GHz CPU and 16 GB of memory.
4. Evaluation index
To evaluate the classification performance of the models, Micro-F1 is used as the evaluation index:

Micro-F1 = 2 · P_micro · R_micro / (P_micro + R_micro), with
P_micro = ∑_{c=1}^{C} TP_c / ∑_{c=1}^{C} (TP_c + FP_c),  R_micro = ∑_{c=1}^{C} TP_c / ∑_{c=1}^{C} (TP_c + FN_c)

where C denotes the number of classes, and TP, FP and FN denote true positives, false positives and false negatives, respectively. The value of Micro-F1 lies between 0 and 1, with higher values indicating better classification performance.
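A minimal sketch of computing Micro-F1 by pooling the per-class counts, consistent with the definition above, is given below; sklearn's f1_score with average="micro" yields the same value:

```python
# Illustrative sketch of the Micro-F1 computation.
import numpy as np

def micro_f1(y_true, y_pred, num_classes):
    tp = fp = fn = 0
    for c in range(num_classes):
        tp += np.sum((y_pred == c) & (y_true == c))
        fp += np.sum((y_pred == c) & (y_true != c))
        fn += np.sum((y_pred != c) & (y_true == c))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 2, 1, 1])
print(round(micro_f1(y_true, y_pred, num_classes=3), 3))
```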
5. Evaluating performance of models on vertex classification tasks
For the vertex classification task, the learned vertex embedding vectors are used as features, and a classifier is trained using one-vs-rest logistic regression implemented with LibLinear. The vertex classification accuracy of all methods is then evaluated on each dataset under different training ratios. Specifically, following the settings used in DeepWalk, the performance of HNS in sparse scenarios was evaluated with training ratios from 1% to 10%. For each training ratio, a portion of the vertices is randomly selected as the training set and the remaining vertices are used as the test set; the vertex classification accuracy is finally evaluated with the Micro-F1 index.
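An illustrative sketch of this evaluation protocol, using synthetic placeholder embeddings and labels rather than the experimental data, is given below:

```python
# Illustrative sketch: one-vs-rest logistic regression over learned embeddings,
# scored with Micro-F1 at several training ratios.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 200))   # placeholder vertex vectors
labels = rng.integers(0, 7, size=1000)      # placeholder labels, e.g. 7 classes

for ratio in (0.01, 0.05, 0.10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, train_size=ratio, random_state=0)
    clf = OneVsRestClassifier(LogisticRegression(solver="liblinear")).fit(X_tr, y_tr)
    score = f1_score(y_te, clf.predict(X_te), average="micro")
    print(f"train ratio {ratio:.0%}: Micro-F1 = {score:.3f}")
```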
Referring to tables 2, 3 and 4, the Micro-F1 index was used to evaluate the vertex classification performance of different methods in different training ratios on the data sets of Cora, DBLP and Wiki. The highest score is highlighted in bold.
Table 2: micro-F1 values for vertex classification tasks on Cora datasets
Table 3: micro-F1 values for vertex classification tasks on DBLP datasets
Table 4: micro-F1 values for vertex classification tasks on Wiki datasets
From the above table data, the following conclusions can be drawn:
1) The proposed HNS model and its variant HNS (w/o DP) consistently outperform the other comparison models in vertex classification across the different datasets and training ratios, except at training ratios of 2%, 3% and 4% on the Wiki dataset, which demonstrates the effectiveness of the model. Furthermore, HNS achieves performance similar to or better than HNS (w/o DP).
2) On these tasks, and in particular on the Wiki dataset, NN-NS achieves higher Micro-F1 scores than the NS and NSCaching models in most cases. This is because the NN-NS model excludes the 1st-order neighborhood of each target vertex when selecting its negative samples. However, NN-NS considers only the 1st-order neighborhood of the target vertex; when vertices in the 2nd- or 3rd-order neighborhood, or even the N-th order neighborhood, of the target vertex are also related to it, the performance of the model inevitably decreases.
3) After the potential hierarchical structure of the vertex is considered, HNS and the variant thereof can learn the vertex embedding vector with higher discrimination capability. Furthermore, even though the training data set is small, the model proposed by the present invention is still superior to the baseline method, which indicates that sparse scenes can be better handled.
6. Robustness of model parameters
Further, the influence of parameters in HNS (w/o DP) and HNS on the performance of the vertex classification task, such as the community number C and the hyper-parameter alpha, is researched, and experiments show that the method provided by the invention has robustness.
First, the influence of the number of communities on the vertex classification performance of HNS (w/o DP) is studied to verify whether HNS can automatically find the best value of the number of communities. Specifically, HNS (w/o DP) was tested with the number of communities varied as C ∈ {10, 20, 40, 60, 80} on Cora, C ∈ {10, 20, 50, 80, 100} on DBLP, and C ∈ {10, 20, 30, 40, 50, 60} on Wiki.
Fig. 4 shows the vertex classification results of HNS (w/o DP) after setting different numbers of communities in three datasets, where 4 (a) corresponds to a Cora dataset, 4 (b) corresponds to a DBLP dataset, and 4 (c) corresponds to a Wiki dataset. Each data point represents the average Micro-F1 performance at a training rate of 1% to 10% for each data set at the currently set number of communities.
From fig. 4, the following observations can be made:
1) The Micro-F1 values of HNS (w/o DP) fluctuate with the number of communities, especially on the DBLP and Wiki datasets. Specifically, in FIG. 4(a), FIG. 4(b) and FIG. 4(c), the Micro-F1 curves of HNS (w/o DP) reach up to 60.14%, 64.39% and 49.85%, respectively. Furthermore, HNS (w/o DP) achieves relatively stable and effective vertex classification performance on Cora, DBLP and Wiki when the number of communities is 40, 50 and 30, respectively.
2) HNS can automatically approach the best performance of HNS (w/o DP). The Micro-F1 values achieved by HNS on the three datasets are 59.35%, 64.46% and 50.27%, respectively, and the numbers of communities found by HNS are 39, 56 and 34, respectively, which verifies that HNS can automatically approximate the optimal number of communities set in HNS (w/o DP).
In addition, the effect of the hyperparameter α on the number of communities found by HNS and on the Micro-F1 value was studied. The value of α was varied over the three datasets with α ∈ {0.02, 0.06, 0.1, 0.14, 0.18, 0.2} · (1e−2 · M), where M is 2211, 17725 and 2405 in Cora, DBLP and Wiki, respectively.
FIG. 5 shows the effect of the hyperparameter α on the number of communities found by HNS and on the Micro-F1 value, where FIG. 5(a) shows the effect of α on the number of communities found by HNS and FIG. 5(b) shows the effect of α on the Micro-F1 value; M denotes the total number of random walk sequences in each dataset, namely 2211, 17725 and 2405 in Cora, DBLP and Wiki, respectively.
The following results can be obtained from fig. 5:
1) FIG. 5(a) shows the number of communities found by HNS at different values of α. It can be seen that, over the three datasets, the number of communities found increases with increasing α. This is because α represents the probability of a vertex selecting a new community.
2) FIG. 5 (b) shows the Micro-F1 values of HNS at different α values. It should be mentioned that although the number of communities found by HNS increases with increasing α, the Micro-F1 values remained stable across all datasets, indicating that HNS is robust to changes in the hyper-parameter α.
In summary, the network representation learning method based on hierarchical negative sampling provided by the invention is the first to model the neighborhood information of vertices for sampling negative samples; it can explore the potential structure of the vertices when selecting negative samples and uses the N-th order neighborhood information of the target vertex to sample more reasonable negative samples. Specifically, negative samples are selected according to the N-th order neighborhood information of the target vertex in combination with a probabilistic graphical model; the learned community-level vertex distribution of negative samples can be regarded as the probability that each vertex is sampled as a negative sample of the target vertex, and the optimal order of the vertex neighborhood to be retained can be searched automatically. In addition, a hierarchical Dirichlet process is used to explore the potential community structure of the network vertices, providing a non-parametric method for inferring the number of communities in the network. Experimental results show that, on real networks of different scales, the method outperforms the current state-of-the-art models on the vertex classification task.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C + +, python, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary rather than exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

1. A network representation learning method based on hierarchical negative sampling comprises the following steps:
for a network G = (V, E), obtaining a set of random walk sequences S = {s_1, …, s_M}, wherein N is the size of the vertex set, M denotes the number of random walks, V denotes the vertex set, and E ⊆ V × V denotes the edge set; each walk sequence s consists of a set of vertices {v_1, …, v_N};
modeling the neighborhood information of the vertices for each random walk sequence to determine the latent community structure of the target vertex;
calculating, based on the latent community structure, the probability that each vertex is a negative sample of the target vertex, so as to sample negative samples;
and optimizing a preset objective function based on the sampled negative samples to determine the vertex representation learning vectors.
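For illustration, the pipeline of claim 1 can be sketched in Python; only the random walk generation step is implemented here (the community inference, negative sampling, and optimization steps are sketched after claims 2 to 7), and the helper names and toy graph are hypothetical, not part of the claimed method.

import random

def random_walks(adj, num_walks, walk_length, seed=0):
    """Generate the walk set S = {s_1, ..., s_M} by uniform random walks on G = (V, E)."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:                      # start one walk from every vertex
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy graph as an adjacency list (assumed input format).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
S = random_walks(adj, num_walks=2, walk_length=5)
print(len(S), "walk sequences, e.g.", S[0])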
2. The method of claim 1, wherein the latent community structure is determined by using a hierarchical Dirichlet process to build a probabilistic graphical model, comprising:
modeling, for all C known communities, the probability distribution G_0 | γ, C ~ Dir(γ/C), wherein G_0 denotes the base distribution of the Dirichlet process, Dir(γ/C) denotes the Dirichlet prior distribution of the random walk sequences with respect to community relevance, γ is a hyperparameter, and C denotes the number of communities;
for each community c ∈ {1, 2, …, C}, setting the probability that each vertex belongs to that community as
φ_c | β ~ Dir(β)
wherein Dir(β) denotes the Dirichlet prior distribution of a vertex in the random walk sequences with respect to community relevance, and β is a hyperparameter;
for each walk sequence s, sampling the community of the walk sequence according to the weights θ_s | α, G_0 ~ DP(α, G_0), and executing, for each vertex v ∈ {v_1, v_2, …, v_N} in the walk sequence:
sampling the community assignment of the vertex according to the weight z_{s,v} | θ_s ~ Multinomial(θ_s);
obtaining the generation probability of the vertex from its assigned community
v | z_{s,v}, φ ~ Multinomial(φ_{z_{s,v}})
wherein DP denotes the Dirichlet process, α denotes the weight with which vertex v is assigned to a new community, φ_c denotes the probability distribution over the vertices associated with community c, z_{s,v} denotes the weight of the relevance of vertex v in sequence s to its sampled community, and Multinomial denotes the multinomial distribution.
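As an explanatory aside, the generative process of claim 2 can be sketched in Python using a finite (truncated) stand-in for the Dirichlet process; the truncation level, hyperparameters, and vocabulary size below are illustrative assumptions rather than values from the patent.

import numpy as np

rng = np.random.default_rng(0)
num_vertices, C = 50, 5            # |V| and a truncation level for the communities
gamma, alpha, beta = 1.0, 1.0, 0.1

# Base distribution over communities: G_0 | gamma, C ~ Dir(gamma / C).
G0 = rng.dirichlet(np.full(C, gamma / C))
# Per-community vertex distributions: phi_c | beta ~ Dir(beta).
phi = rng.dirichlet(np.full(num_vertices, beta), size=C)

def generate_walk(length):
    # Sequence-level community weights: theta_s ~ Dir(alpha * C * G_0),
    # a common finite-dimensional approximation of DP(alpha, G_0).
    theta_s = rng.dirichlet(alpha * C * G0 + 1e-12)
    walk = []
    for _ in range(length):
        z = rng.choice(C, p=theta_s)             # z_{s,v} ~ Multinomial(theta_s)
        v = rng.choice(num_vertices, p=phi[z])   # v ~ Multinomial(phi_z)
        walk.append(v)
    return walk

print(generate_walk(10))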
3. The method of claim 2, wherein the parameters of the probabilistic graphical model are inferred by Gibbs sampling, comprising:
for a walk sequence s, each vertex v in the walk sequence has a latent community-weight assignment z_{s,v};
exploiting the property that the Dirichlet distribution is the conjugate prior of the multinomial distribution, integrating out the community-distribution parameters θ_s and φ_c, thereby obtaining the community assignment statistics n_{s,c}^{-v} of the vertices in the walk sequence s other than the current vertex v, and the statistics m_{c,v}^{-s} of the vertices assigned to community c outside the walk sequence s; and
selecting either an existing community or a new community for the vertex v.
4. The method of claim 3, wherein the probability of selecting an existing community c for a vertex v is set to:
p(z_{s,v} = c | ·) ∝ n_{s,c}^{-v} · (m_{c,v}^{-s} + β) / (Σ_{u∈V} m_{c,u}^{-s} + |V|·β)
wherein n_{s,c}^{-v} denotes the number of vertices assigned to community c in the walk sequence s other than the current vertex v, and m_{c,v}^{-s} denotes the number of times that vertex v is assigned to community c outside the current walk sequence s.
5. The method of claim 3, wherein the conditional probability that vertex v belongs to a new community is set as:
p(z_{s,v} = C + 1 | ·) ∝ α / |V|
wherein C + 1 represents a new community.
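For illustration, a collapsed Gibbs resampling step consistent with the descriptions in claims 3 to 5 can be sketched as follows. The patent's exact conditional probabilities are given as formula images; the proportionalities below (counts n and m as described in claim 4, and α/|V| for a new community C + 1) are a standard collapsed-Gibbs form and, like all names in the sketch, should be read as assumptions.

import numpy as np
from collections import defaultdict

def resample_assignment(s, pos, walks, z, n_sc, m_cv, alpha, beta, V, rng):
    """Resample z[s][pos], the community assignment of the vertex at position pos of walk s."""
    v = walks[s][pos]
    old_c = z[s][pos]
    # Remove the current assignment from the counts (the "excluding current vertex/sequence" statistics).
    n_sc[s][old_c] -= 1
    m_cv[old_c][v] -= 1

    existing = [c for c in n_sc[s] if n_sc[s][c] > 0]
    weights = []
    for c in existing:
        m_c_total = sum(m_cv[c].values())
        weights.append(n_sc[s][c] * (m_cv[c][v] + beta) / (m_c_total + V * beta))
    weights.append(alpha / V)                    # weight of opening a new community C + 1

    p = np.asarray(weights) / sum(weights)
    k = rng.choice(len(p), p=p)
    new_c = existing[k] if k < len(existing) else 1 + max(m_cv, default=-1)

    # Add the chosen assignment back into the counts.
    z[s][pos] = new_c
    n_sc[s][new_c] += 1
    m_cv[new_c][v] += 1
    return new_c

# Tiny demo: two walks over 5 vertices, all assignments initialised to community 0.
rng = np.random.default_rng(0)
walks = [[0, 1, 2, 1], [2, 3, 4, 3]]
z = [[0] * len(w) for w in walks]
n_sc = [defaultdict(int) for _ in walks]
m_cv = defaultdict(lambda: defaultdict(int))
for s, w in enumerate(walks):
    for pos, v in enumerate(w):
        n_sc[s][z[s][pos]] += 1
        m_cv[z[s][pos]][v] += 1
print("resampled community:", resample_assignment(0, 2, walks, z, n_sc, m_cv, 1.0, 0.1, 5, rng))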
6. The method of claim 4, wherein for a target vertex v_t, negative samples are sampled according to the following steps:
sampling a community c for the target vertex v_t;
sampling negative samples from the community c, wherein the probability that other vertices in the community c are considered as negative samples is set as:
P_neg(v) = 1 − φ_c(v)
for the target vertex v_t, the distribution ψ_neg of the negative-sample vertices is expressed as:
ψ_neg(v) = P_neg(v) / Σ_{v'} P_neg(v')
negative samples are sampled using a multinomial distribution:
v | ψ_neg ~ Multinomial(ψ_neg)
wherein the negative-sample vertex distribution ψ_neg takes the hierarchy of the latent communities of the vertices into account, so that vertices weakly correlated with the community of the target vertex are sampled as negative samples, and P_neg(v) is the probability that a vertex is uncorrelated.
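For illustration, the hierarchical negative sampling of claim 6 can be sketched as below. The patent's exact expressions for P_neg(v) and ψ_neg are given as formula images; here P_neg(v) is assumed, for demonstration only, to be one minus the relatedness of the vertex to the sampled community c, normalised into the multinomial ψ_neg from which K negatives are drawn.

import numpy as np

def sample_negatives(phi_c, target, K, rng):
    """phi_c: probability of each vertex under the community c sampled for the target vertex."""
    p_neg = 1.0 - phi_c                      # vertices weakly related to c get high weight
    p_neg[target] = 0.0                      # never sample the target vertex itself
    psi_neg = p_neg / p_neg.sum()            # psi_neg: multinomial over negative candidates
    return rng.choice(len(phi_c), size=K, replace=False, p=psi_neg)

rng = np.random.default_rng(0)
phi_c = np.array([0.50, 0.30, 0.15, 0.03, 0.02])   # toy community-vertex distribution
print(sample_negatives(phi_c, target=0, K=2, rng=rng))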
7. The method of claim 6, wherein the objective function is represented as:
O = log σ(v'_j v_i) + Σ_{k=1}^{K} E_{v_k ~ ψ_neg}[ log σ(−v'_k v_i) ], where σ(·) is the sigmoid function
wherein v_k is a sampled negative-sample vertex, K is the number of negative samples, v'_k denotes the transpose of the vector of vertex v_k, v'_j denotes the transpose of the vector of vertex v_j, v_i denotes the target vertex, and the vertex v_j denotes a positive sample of the target vertex v_i.
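As an explanatory note, one stochastic gradient step on an objective of the claim-7 form, read here as the standard skip-gram-with-negative-sampling objective (an assumption consistent with the symbols described in the claim, not the patent's exact formula), might look as follows; the learning rate and dimensions are illustrative.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(emb_in, emb_out, i, j, negatives, lr=0.025):
    """One gradient step for target vertex i, positive sample j, and sampled negatives."""
    vi = emb_in[i]
    grad_vi = np.zeros_like(vi)
    for k, label in [(j, 1.0)] + [(k, 0.0) for k in negatives]:
        score = sigmoid(vi @ emb_out[k])
        g = label - score                    # gradient of the log-likelihood wrt the score
        grad_vi += g * emb_out[k]
        emb_out[k] += lr * g * vi            # update the context/output vector
    emb_in[i] += lr * grad_vi                # update the target/input vector

rng = np.random.default_rng(0)
V, d = 5, 8
emb_in = rng.normal(scale=0.1, size=(V, d))
emb_out = rng.normal(scale=0.1, size=(V, d))
sgns_update(emb_in, emb_out, i=0, j=1, negatives=[3, 4])
print(emb_in[0][:4])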
8. The method of claim 1, wherein the modeling of the neighborhood information of the vertices follows the following assumptions:
each random walk sequence is a sample of the underlying structure of the community;
each vertex has a probability distribution of its community preference.
9. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8.
10. A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.
CN202211007471.6A 2022-08-22 2022-08-22 Network representation learning method based on hierarchical negative sampling Pending CN115422445A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211007471.6A CN115422445A (en) 2022-08-22 2022-08-22 Network representation learning method based on hierarchical negative sampling

Publications (1)

Publication Number Publication Date
CN115422445A true CN115422445A (en) 2022-12-02

Family

ID=84199235

Country Status (1)

Country Link
CN (1) CN115422445A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination