CN110019981B

CN110019981B - Directed super-edge propagation method integrating unsupervised learning and network out-degree

Info

Publication number: CN110019981B
Application number: CN201711208187.4A
Authority: CN
Inventors: 盛益强; 郝怡然
Original assignee: Institute of Acoustics CAS
Current assignee: Zhengzhou Xinrand Network Technology Co ltd
Priority date: 2017-11-27
Filing date: 2017-11-27
Publication date: 2021-05-04
Anticipated expiration: 2037-11-27
Also published as: CN110019981A

Abstract

The invention relates to a directed hyperedge propagation method integrating unsupervised learning and network out-degree, which comprises the following steps: discovering a series of undirected hyperedges connecting multiple vertices from the network node relationships using an unsupervised learning algorithm; for each undirected hyperedge, mining the directed hyperedge relations, including forward hyperedges and backward hyperedges, until all undirected hyperedges have been traversed; sorting the elements of the antecedent (front-piece) vertex set of each directed hyperedge relation by out-degree and selecting seed nodes in descending order; and, starting from the selected seed nodes, propagating network information using a linear threshold propagation algorithm for the directed hypergraph. While preserving scalability, the invention uses directed hyperedges to select seed nodes and, on that basis, selects the nodes with the largest out-degree for propagation, thereby improving coverage and propagation efficiency.

Description

Directed super-edge propagation method integrating unsupervised learning and network out-degree

Technical Field

The invention relates to the field of social computing and media mining, and in particular to a directed hyperedge propagation method integrating unsupervised learning and network out-degree.

Background

With the rapid development of Internet technology, more and more online social networks have emerged. Individuals in these networks interact with one another through the network, and information, opinions, and influence propagate as a result. As big-data research becomes more widespread, influence propagation in social networks has become one of the key problems in data mining and social network analysis. In the field of social networks, the influence maximization problem is: given a propagation model, select a set of seed nodes from which information is propagated so that the coverage of activated nodes on the network is ultimately maximized. The goal of influence maximization is to achieve the greatest propagation coverage in the shortest time with the fewest seed nodes. On today's Internet, influence propagation is mainly embodied in information propagation, so analyzing the factors that affect information propagation is important for improving influence propagation models. The factors that affect information propagation can be regarded as the factors that affect the node activation probability in the influence propagation model.

The most widely used propagation models at present are the independent cascade model and the linear threshold model. The independent cascade model treats an active node as a publisher and a node being activated as a recipient; the publisher activates the recipient. The independent cascade model is therefore publisher-centric: a node can only affect the nodes directly connected to it, and once a node is activated it attempts to activate all of its neighboring nodes. In the linear threshold model, each node has an influence threshold chosen uniformly at random in the range 0 to 1, which does not change once propagation has begun. As in the independent cascade model, at time t = 0 only the nodes in the seed set S_0 are active. At each subsequent time t ≥ 1, each inactive node v determines whether it is activated according to whether the linearly weighted sum of the influence of all its activated neighbors has reached its threshold: if so, node v is activated at time t; otherwise, node v remains inactive. The propagation process ends when no new node is activated at some time step.

For the whole propagation process, the selection of the seed node is the basis of propagation, because the selection result of the seed node directly affects the final effect of propagation, including the coverage rate and the propagation time. Currently, common seed node selection methods include node degree-based heuristic algorithms, greedy algorithms, distance-based heuristic algorithms, random algorithms and the like.

Let S denote the initial set of active nodes and f(S) the number of nodes finally activated when propagation starts from the nodes in S as seed nodes. Taking the greedy algorithm as an example, an empty set S is initialized first; then, each time a node is to be added, all nodes are traversed and the node v that maximizes f(S ∪ {v}) − f(S) is added to S. Because every node must be traversed each time a node is added, the time complexity of the greedy algorithm is high; moreover, the greedy algorithm does not take the topological structure of the graph into account. These are its limitations.
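The greedy selection described above can be sketched as follows. This is a minimal illustration only: the names `greedy_seeds` and `reach_count` are hypothetical, and the spread function f is replaced by a deterministic stand-in (the number of nodes reachable from the seed set), whereas a real propagation model would normally be stochastic.

```python
def reach_count(graph):
    """Build a stand-in spread function f(S): nodes reachable from S."""
    def f(seeds):
        seen, stack = set(seeds), list(seeds)
        while stack:
            node = stack.pop()
            for nxt in graph.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return len(seen)
    return f


def greedy_seeds(nodes, spread, k):
    """Add, k times, the node v with the largest gain f(S ∪ {v}) − f(S)."""
    seeds = set()
    for _ in range(k):
        base = spread(seeds)
        best, best_gain = None, -1
        for v in nodes:                      # every node is traversed each round
            if v in seeds:
                continue
            gain = spread(seeds | {v}) - base
            if gain > best_gain:             # first node wins on ties
                best, best_gain = v, gain
        if best is None:
            break
        seeds.add(best)
    return seeds


# Hypothetical toy graph: adjacency lists of a directed graph.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["e"], "e": []}
print(greedy_seeds(["a", "b", "c", "d", "e"], reach_count(toy_graph), k=2))
```

The nested loop makes the cost explicit: each of the k rounds evaluates f once per candidate node, which is the source of the high time complexity noted above.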

The node degree-based heuristic algorithm selects the k nodes with the highest degree as the initial active nodes. Its time complexity is far lower than that of the greedy algorithm, and when the algorithm is run multiple times, selecting k seed nodes each time, the seed nodes it selects are relatively fixed, so its propagation results fluctuate little, whereas those of the greedy algorithm fluctuate considerably. However, because the algorithm only ever selects nodes of high degree, the information carried by some nodes is ignored. The simpler random algorithm selects several nodes at random from the original node set as seed nodes; because it involves many uncertain factors and a high degree of randomness, it is generally not used to select seed nodes. When propagation is based on the degree heuristic, the seed node set S is chosen simply as the k nodes with the highest degree, without considering the topological structure of the graph; when most of the high-degree nodes belong to the same group, coverage is correspondingly reduced.
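For comparison, the degree heuristic reduces to a single sort. The sketch below assumes the graph is given as an adjacency mapping; `top_k_by_degree` is a hypothetical name introduced for illustration.

```python
def top_k_by_degree(adjacency, k):
    """Return the k nodes with the highest degree, highest first."""
    degree = {node: len(set(neighbors)) for node, neighbors in adjacency.items()}
    # Ties are broken by node name so that repeated runs give the same seeds.
    return sorted(degree, key=lambda node: (-degree[node], str(node)))[:k]


# Hypothetical toy graph.
adjacency = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}
print(top_k_by_degree(adjacency, 2))
```

Because the ranking depends only on degree, the selected seeds are the same on every run, and nothing prevents them from all lying in the same tightly connected group.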

To solve the above problems, the topological structure of the relationships between users must be introduced so as to improve the quality of the seed nodes, and thereby coverage and propagation efficiency, so that information is propagated over as wide a range as possible in as short a time as possible. To this end, a common seed node selection method is combined with association analysis: strongly associated nodes in a large-scale data set are found and formed into hypergraphs. Selecting nodes from each hypergraph as seed nodes can improve coverage. Association analysis algorithms in common use today include the brute-force method, the F_{k-1} × F_1 algorithm, and the Apriori algorithm. In the F_{k-1} × F_1 algorithm, each k-edge is formed by combining a frequent (k−1)-edge with a 1-edge; the k-edges whose support is below the minimum support are then pruned, and the process is repeated until no new edge is generated. The time complexity of this algorithm is slightly lower than that of the Apriori algorithm, but it is difficult to avoid generating duplicate candidate edges. The conventional Apriori algorithm, however, is not suited to directed graphs: it finds only undirected hyperedges in users' forwarding data, so the range of application of the conventional method needs to be widened.

Disclosure of Invention

The invention aims to solve the problem that, with existing network information propagation methods, propagation is not fast enough and does not reach far enough, while preserving scalability.

In order to achieve the above object, the present invention provides a directed hyper-edge propagation method integrating unsupervised learning and network out-degree, which comprises the following steps:

discovering a series of undirected hyperedges connecting multiple vertices from the network node relationships using an unsupervised learning algorithm; for each undirected hyperedge, mining the directed hyperedge relations, including forward hyperedges and backward hyperedges, until all undirected hyperedges have been traversed; sorting the elements of the antecedent (front-piece) vertex set of each directed hyperedge relation by out-degree and selecting seed nodes in descending order; and, starting from the selected seed nodes, propagating network information using a linear threshold propagation algorithm for the directed hypergraph.

Preferably, the step of sorting the elements of the antecedent (front-piece) vertex set of the directed hyperedge relation by out-degree and selecting seed nodes in descending order includes: within one directed hyperedge relation, preferentially selecting only the single node with the largest out-degree as a seed node.

Preferably, the step of discovering a series of undirected hyperedges connecting multiple vertices from the network node relationships using an unsupervised learning algorithm comprises: generating the edge list of forwarding pairs formed by all propagation content providers and the corresponding propagation content subscribers, pruning the forwarding pairs whose support is below a given minimum support, and recording the pruned hyperedge list; recombining, pairwise, the undirected hyperedges in the set remaining after pruning to generate new edges; pruning the new combinations whose support is below the given minimum support, and again recording the pruned hyperedge list; wherein the recombination rule is that the first 1/2 of the elements in the two sets are combined to form the first 1/2 of the elements in the first new set, and the last 1/2 of the elements are combined to form the last 1/2 of the elements in the second new set; and repeating the above steps until the set remaining after pruning is empty.

Preferably, the given minimum support has a value of 0.25.

Preferably, the step of mining the directed hyperedge relations from the undirected hyperedge list comprises: arbitrarily selecting an undirected hyperedge, denoted {trans_1, trans_2, …, trans_k}, and arbitrarily selecting one of its nodes (denoted trans_m) to create a relation list (denoted Rlist) of backward hyperedges (for which the consequent, or back-piece, vertex set has exactly one element); and computing the confidence of each hyperedge according to the following formula:

repeating the above steps until the undirected hyperedge is empty, and deleting the relations whose confidence is below the given minimum confidence; merging the relations that were not deleted according to a given rule to obtain a new relation list; the given merging rule is: compare the antecedents of the relations in Rlist; if there are two relations (denoted R_a and R_b) whose antecedents differ by only one node, and the element by which the antecedent of R_a differs from the antecedent of R_b is exactly the element by which the consequent of R_b differs from the consequent of R_a, merge the two relations; the antecedent of the new relation consists of the elements common to the antecedents of R_a and R_b, and its consequent is the union of the consequents of R_a and R_b; and continuing until the entire hyperedge list has been traversed.
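In association-rule mining, the confidence of a rule is conventionally the support of the whole item set divided by the support of its antecedent. Assuming that convention applies to the backward hyperedges above (an assumption; the symbols supp, conf and E are introduced here only for illustration), the confidence of the relation whose consequent is trans_m would read:

```latex
\mathrm{conf}\bigl(E \setminus \{trans_m\} \rightarrow \{trans_m\}\bigr)
  = \frac{\mathrm{supp}(E)}{\mathrm{supp}\bigl(E \setminus \{trans_m\}\bigr)},
\qquad E = \{trans_1, trans_2, \dots, trans_k\}
```

Under this reading, a relation is kept when its confidence is at least the given minimum confidence.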

Preferably, the step of propagating network information from the selected seed nodes using the linear threshold propagation algorithm for the directed hypergraph includes: directly activating the seed node set A_lready and randomly assigning a threshold $\theta_u$ to each remaining node $u$, where $\theta_u$ takes a value in $[0, 1]$; the larger $\theta_u$ is, the harder the node is to activate, and the smaller $\theta_u$ is, the easier the node is to activate; letting $N(v)$ be the set of neighbor nodes of the current node $v$ in the directed hypergraph and defining, for any node $w \in N(v)$, $b_{vw}$ as the degree of influence of node $w$ on node $v$, which satisfies

$$\sum_{w \in N(v)} b_{vw} \le 1$$

For any non-activated node $v$, when the combined influence of its activated neighbors reaches or exceeds the randomly assigned threshold, i.e.

$$\sum_{w \in N(v),\ w\ \text{active}} b_{vw} \ge \theta_v$$

the node is activated; during network information propagation, the above steps are repeated until no new active node appears, at which point propagation ends.

Preferably, the unsupervised learning algorithm includes K-means clustering, Apriori, and FP-growth.

Compared with the prior art, the invention, while preserving scalability, uses directed hyperedges to select seed nodes and, on that basis, selects the nodes with the largest out-degree for propagation, thereby improving coverage and propagation efficiency. By introducing an unsupervised algorithm, the invention discovers undirected hyperedges and, from them, directed hyperedges, extending the range of application from undirected graphs to directed graphs.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below.

Fig. 1 is a flowchart of a directed hyperedge propagation method integrating unsupervised learning and network out-degree according to an embodiment of the present invention;

Fig. 2 is an application diagram of the directed hyperedge propagation method integrating unsupervised learning and network out-degree shown in Fig. 1.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, concepts related to the present invention are described as follows.

A hypergraph is a generalization of a graph: a hyperedge in a hypergraph can connect any number of vertices, whereas an ordinary edge can connect only two vertices. A directed hypergraph is obtained by adding a direction to the hyperedges of an undirected hypergraph; the direction represents the order among the vertices of a hyperedge and divides its vertex set into an antecedent (front-piece) vertex set and a consequent (back-piece) vertex set. A forward hyperedge is a directed hyperedge whose antecedent vertex set has exactly one element; a backward hyperedge is a directed hyperedge whose consequent vertex set has exactly one element (as shown in Fig. 2).

Based on these concepts, in the embodiment of the invention the antecedent vertex set and the consequent vertex set of each directed hyperedge relation together form a directed hypergraph. The hypergraph is a vertex set in which antecedent vertices and consequent vertices are connected pairwise through directed hyperedges, including forward hyperedges and backward hyperedges; there is no hyperedge within the antecedent vertex set, nor within the consequent vertex set.

Fig. 1 is a flowchart of a directed hyper-edge propagation method fusing unsupervised learning and network out-degree according to an embodiment of the present invention, where the method includes steps S101 to S104:

In step S101, a series of undirected hyperedges connecting multiple vertices is discovered from the network node relationships using an unsupervised learning algorithm such as K-means clustering, Apriori, or FP-growth.

In step S102, for each undirected hyperedge, the directed hyperedge relations, including forward hyperedges and backward hyperedges, are mined until all undirected hyperedges have been traversed.

In step S103, the elements of the antecedent (front-piece) vertex set of each directed hyperedge relation are sorted by out-degree and seed nodes are selected in descending order; preferably, within one directed hyperedge relation, only the single node with the largest out-degree is selected as a seed node.

In step S104, starting from the selected seed nodes, network information is propagated using the linear threshold propagation algorithm for the directed hypergraph.
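The seed selection of step S103 can be sketched as follows, under stated assumptions: each directed hyperedge relation is represented as a pair (antecedent set, consequent set), out-degrees are supplied in a dictionary, and the preferred variant (one node per relation) is used. The name `select_seeds` and the data layout are hypothetical.

```python
def select_seeds(relations, out_degree, max_seeds=None):
    """Pick, per relation, the antecedent node with the largest out-degree.

    relations  -- iterable of (antecedent_set, consequent_set) pairs
    out_degree -- dict mapping node -> out-degree (missing nodes count as 0)
    """
    chosen = set()
    for antecedent, _consequent in relations:
        if not antecedent:
            continue
        # sorted() makes ties deterministic: the smallest name wins.
        best = max(sorted(antecedent), key=lambda v: out_degree.get(v, 0))
        chosen.add(best)
    # Seeds are then taken in descending order of out-degree.
    ranked = sorted(chosen, key=lambda v: (-out_degree.get(v, 0), v))
    return ranked if max_seeds is None else ranked[:max_seeds]


# Hypothetical relations and out-degrees.
relations = [
    ({"u1", "u2"}, {"u3"}),        # backward hyperedge
    ({"u4"}, {"u5", "u6"}),        # forward hyperedge
    ({"u2", "u7"}, {"u8"}),
]
out_degree = {"u1": 5, "u2": 9, "u4": 3, "u7": 1}
print(select_seeds(relations, out_degree))
```

Choosing at most one node per relation spreads the seeds across different hyperedges instead of concentrating them in a single high-degree group.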

First, in an embodiment of the present invention, the step of discovering the undirected hyperedges using an unsupervised algorithm includes:

1. Generate the edge list of forwarding pairs formed by all propagation content providers (denoted trans_ak) and the corresponding propagation content subscribers (denoted trans_bk), written [{trans_b1, trans_a1}, {trans_b2, trans_a2}, …, {trans_bn, trans_an}]; prune the forwarding pairs whose support is below the given minimum support, and record the pruned hyperedge list as L_1. Preferably, the given minimum support is 0.25.

2. Recombine, pairwise, the undirected hyperedges in the set remaining after pruning to generate new edges; prune the new combinations whose support is below the given minimum support, and record the pruned hyperedge list as L_k. The recombination rule is that the first 1/2 of the elements in the two sets are combined to form the first 1/2 of the elements in the first new set, and the last 1/2 of the elements are combined to form the last 1/2 of the elements in the second new set.

3. Repeat step 2 until the set remaining after pruning is empty.
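The level-wise discovery above can be sketched as follows. Several points are assumptions rather than part of the text: support is taken to be the fraction of transactions (here, the set of users involved in one piece of propagated content) that contain every member of a candidate; the recombination rule is replaced by a plain Apriori-style join (the union of two hyperedges that is exactly one element larger); and the names `support` and `discover_hyperedges` are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every member of itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def discover_hyperedges(pairs, transactions, min_support=0.25):
    """Return the pruned lists L_1, L_2, ... of undirected hyperedges."""
    # Step 1: forwarding pairs, pruned by minimum support.
    level = {frozenset(p) for p in pairs}
    level = {e for e in level if support(e, transactions) >= min_support}
    levels = []
    while level:                      # step 3: stop when nothing survives
        levels.append(level)
        size = len(next(iter(level))) + 1
        # Step 2: pairwise recombination (assumed Apriori-style join) ...
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == size}
        # ... followed by pruning of the new combinations.
        level = {c for c in candidates
                 if support(c, transactions) >= min_support}
    return levels


# Hypothetical forwarding data: one set of involved users per content item.
transactions = [frozenset("abc"), frozenset("abc"),
                frozenset("ab"), frozenset("cd")]
pairs = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("a", "d")]
for k, found in enumerate(discover_hyperedges(pairs, transactions), start=1):
    print(k, sorted("".join(sorted(e)) for e in found))
```

A candidate whose support equals the minimum support is kept here, since the text prunes only those below it.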

Secondly, in the embodiment of the present invention, the step of mining the directed hyperedge relations from the undirected hyperedge list includes:

1. Arbitrarily select an undirected hyperedge, denoted {trans_1, trans_2, …, trans_k}, and arbitrarily select one of its nodes (denoted trans_m) to create a relation list (denoted Rlist) of backward hyperedges (for which the consequent, or back-piece, vertex set has exactly one element); compute the confidence of each hyperedge according to the following formula:

2. Repeat the above step until the undirected hyperedge is empty, and delete the relations whose confidence is below the given minimum confidence.

3. Merge the relations that were not deleted according to a given rule to obtain a new relation list (denoted Rlist_new).

Specifically, the merging rule is: compare the antecedents of the relations in Rlist; if there are two relations (denoted R_a and R_b) whose antecedents differ by only one node, and the element by which the antecedent of R_a differs from the antecedent of R_b is exactly the element by which the consequent of R_b differs from the consequent of R_a, merge the two relations. The antecedent of the new relation consists of the elements common to the antecedents of R_a and R_b, and its consequent is the union of the consequents of R_a and R_b.

4. Repeat step 3 until no further merging is possible.

5. Return to step 1 until the entire hyperedge list has been traversed.
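The mining and merging steps above can be sketched as follows. The confidence is assumed to be the conventional association-rule ratio supp(E) / supp(antecedent), which is an assumption rather than the formula of the invention; relations are stored as (antecedent, consequent) pairs of frozensets; and all function names are hypothetical.

```python
from itertools import permutations


def support(itemset, transactions):
    """Fraction of transactions that contain every member of itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def backward_relations(hyperedge, transactions, min_confidence):
    """Steps 1-2: one backward relation per node, filtered by confidence."""
    edge = frozenset(hyperedge)
    relations = []
    for node in sorted(edge):
        antecedent = edge - {node}
        base = support(antecedent, transactions)
        # Assumed confidence: supp(edge) / supp(antecedent).
        confidence = support(edge, transactions) / base if base else 0.0
        if antecedent and confidence >= min_confidence:
            relations.append((antecedent, frozenset({node})))
    return relations


def try_merge(r_a, r_b):
    """Apply the merging rule to R_a and R_b; return None if it does not apply."""
    (x_a, y_a), (x_b, y_b) = r_a, r_b
    only_a, only_b = x_a - x_b, x_b - x_a
    # Antecedents differ by one node, and that node of R_a is exactly the
    # element by which R_b's consequent differs from R_a's consequent.
    if len(only_a) == 1 and len(only_b) == 1 and only_a == (y_b - y_a):
        common = x_a & x_b
        if common:  # assumption: an empty antecedent is not a valid relation
            return (common, y_a | y_b)
    return None


def merge_relations(relations):
    """Steps 3-4: merge repeatedly until no pair can be merged."""
    rels = list(relations)
    while True:
        for a, b in permutations(rels, 2):
            merged = try_merge(a, b)
            if merged is not None:
                rels = [r for r in rels if r not in (a, b)]
                if merged not in rels:
                    rels.append(merged)
                break
        else:
            return rels


# Hypothetical forwarding data, reusing one set of users per content item.
transactions = [frozenset("abc"), frozenset("abc"),
                frozenset("ab"), frozenset("cd")]
rlist = backward_relations("abc", transactions, min_confidence=0.7)
print(merge_relations(rlist))
```

In this toy run two backward hyperedges with single-node consequents merge into one forward hyperedge with a single-node antecedent, which is how both kinds of directed hyperedge can arise from the same undirected hyperedge.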

Third, in an embodiment of the present invention, a linear threshold propagation algorithm for a directed hypergraph includes:

1. Directly activate the seed node set A_lready and randomly assign a threshold $\theta_u$ to each remaining node $u$. The threshold $\theta_u$ takes a value in $[0, 1]$: the larger $\theta_u$ is, the harder the node is to activate, and the smaller $\theta_u$ is, the easier the node is to activate.

2. Let $N(v)$ be the set of neighbor nodes of the current node $v$ in the directed hypergraph and, for any node $w \in N(v)$, define $b_{vw}$ as the degree of influence of node $w$ on node $v$, which satisfies

$$\sum_{w \in N(v)} b_{vw} \le 1$$

For any non-activated node $v$, when the combined influence of its activated neighbors reaches or exceeds the randomly assigned threshold, i.e.

$$\sum_{w \in N(v),\ w\ \text{active}} b_{vw} \ge \theta_v$$

the node is activated.

3. During network information propagation, repeat the above steps until no new active node appears, at which point propagation ends.
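The propagation steps above can be sketched as follows. The sketch assumes that the neighbors N(v) of a node are the antecedent nodes of the directed hyperedge relations in which v appears as a consequent, and that every neighbor has equal weight 1/|N(v)| so that the weights sum to at most 1; both are illustrative choices, as are the names `uniform_in_weights` and `linear_threshold`.

```python
import random


def uniform_in_weights(relations):
    """Derive b_vw from relations, giving each in-neighbor weight 1/|N(v)|."""
    neighbors = {}
    for antecedent, consequent in relations:
        for v in consequent:
            neighbors.setdefault(v, set()).update(antecedent)
    return {v: {w: 1.0 / len(ws) for w in ws} for v, ws in neighbors.items()}


def linear_threshold(in_weights, seeds, thresholds=None, rng=None):
    """Run linear threshold propagation and return the final active set.

    in_weights[v][w] -- influence b_vw of neighbor w on node v
    thresholds       -- optional dict node -> theta; if given it must cover
                        every non-seed node, otherwise values are drawn at
                        random from [0, 1)
    """
    rng = rng or random.Random()
    nodes = set(in_weights) | set(seeds)
    for weights in in_weights.values():
        nodes |= set(weights)
    if thresholds is None:
        thresholds = {u: rng.random() for u in sorted(nodes, key=str)}
    active = set(seeds)               # step 1: seeds are activated directly
    while True:
        newly_active = set()
        for v in nodes - active:      # step 2: test each inactive node
            influence = sum(b for w, b in in_weights.get(v, {}).items()
                            if w in active)
            if influence >= thresholds[v]:
                newly_active.add(v)
        if not newly_active:          # step 3: stop when nothing changes
            return active
        active |= newly_active


# Hypothetical relations: a -> {b, c} and b -> {c}.
relations = [({"a"}, {"b", "c"}), ({"b"}, {"c"})]
weights = uniform_in_weights(relations)
print(linear_threshold(weights, {"a"}, thresholds={"a": 0.8, "b": 0.8, "c": 0.8}))
```

Fixed thresholds are passed in the usage example only to make the run repeatable; leaving `thresholds` unset draws them at random, as the text describes.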

Fig. 2 is an application diagram of the directed hyper-edge propagation method for merging unsupervised learning and network out-degree shown in fig. 1. As shown in fig. 2, the application is illustrated in a five-layer structure.

The circle node of the first layer is the empty set ∅, indicating that the initial set is empty;

the circle nodes of the second layer are the forwarding pairs formed by all propagation content providers and the corresponding propagation content subscribers; each circle node contains one forwarding pair, such as 13, 19, 36, 56 and 71; the support of each set in the second layer is computed and the sets whose support is below the given minimum support are pruned; the pruned sets are marked in gray, for example 71;

the circle nodes of the third layer are obtained by matching each forwarding pair in the second layer in turn with the remaining forwarding pairs in the second layer and combining the two forwarding pairs into a new set, which belongs to the third layer of the graph; for example, 13 and 19 generate 139, 13 and 36 generate 136, 13 and 71 generate 713, and 19 and 71 generate 719; since the support of any superset of a pruned set must also be below the minimum support, such supersets are pruned as well, i.e. marked in gray, such as 713 and 719;

similarly, the circle nodes of the fourth layer are new sets formed by matching the sets in the third layer with one another in turn, for example: 139 and 136 generate 1396, 136 and 713 generate 7136, 139 and 719 generate 7139, and 136 and 719 generate 71369; the supersets of pruned sets among them are likewise marked in gray, such as 7136, 7139 and 71369;

at the circle node of the fifth layer, a set 71396 containing all the elements is generated; it also belongs to the pruned sets and is therefore marked in gray.
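The combination and pruning pattern of Fig. 2 can be replayed in a few lines. The sketch assumes that each label is read as a set of single-digit node identifiers (so 13 stands for the set {1, 3}) and that combining two sets means taking their union; `combine` and `is_pruned` are hypothetical names.

```python
def combine(first, second):
    """Union of two sets written as digit strings, e.g. '13' and '19'."""
    return frozenset(first) | frozenset(second)


def is_pruned(candidate, pruned_sets):
    """A candidate is pruned if it is a superset of any pruned set."""
    return any(frozenset(p) <= frozenset(candidate) for p in pruned_sets)


pruned = ["71"]  # the forwarding pair marked gray in the second layer
for first, second in [("13", "19"), ("13", "36"), ("13", "71"), ("19", "71")]:
    candidate = combine(first, second)
    status = "pruned" if is_pruned(candidate, pruned) else "kept"
    print(first, second, "".join(sorted(candidate)), status)
```

The superset check is valid because support can only stay the same or decrease when elements are added to a set, so no superset of a pruned set can reach the minimum support.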

According to the embodiment of the invention, while preserving scalability, directed hyperedges are used to select the seed nodes and, on that basis, the nodes with the largest out-degree are selected for propagation, thereby improving coverage and propagation efficiency.

It will be further appreciated by those of ordinary skill in the art that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, the components and steps of the various examples having been described herein generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether these functions are performed in hardware or software depends on the particular application of the solution and design constraints. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A directed hyper-edge propagation method integrating unsupervised learning and network out-degree is characterized by comprising the following steps:

discovering a series of undirected hyperedges connecting multiple vertices from the network node relationships using an unsupervised learning algorithm; wherein the step of discovering a series of undirected hyperedges connecting multiple vertices from the network node relationships using an unsupervised learning algorithm comprises: generating the edge list of forwarding pairs formed by all propagation content providers and the corresponding propagation content subscribers, pruning the forwarding pairs whose support is below a given minimum support, and recording the pruned hyperedge list; recombining, pairwise, the undirected hyperedges in the set remaining after pruning to generate new edges; pruning the new combinations whose support is below the given minimum support, and again recording the pruned hyperedge list; wherein the recombination rule is that the first 1/2 of the elements in the two sets are combined to form the first 1/2 of the elements in the first new set, and the last 1/2 of the elements are combined to form the last 1/2 of the elements in the second new set; repeating the above steps until the set remaining after pruning is empty;

for each undirected hyperedge, mining the directed hyperedge relations, including forward hyperedges and backward hyperedges, until all undirected hyperedges have been traversed; wherein the step of mining the directed hyperedge relations including forward hyperedges and backward hyperedges comprises: arbitrarily selecting an undirected hyperedge, denoted {trans_1, trans_2, …, trans_k}, and arbitrarily selecting one of its nodes trans_m to create a relation list Rlist of backward hyperedges, wherein the consequent (back-piece) vertex set of a backward hyperedge has exactly one element, and computing the confidence of each hyperedge according to the following formula:

repeating the above steps until the undirected hyperedge is empty, and deleting the relations whose confidence is below the given minimum confidence; merging the relations that were not deleted according to a given rule to obtain a new relation list; the given merging rule is: compare the antecedents of the relations in Rlist; if there are two relations R_a and R_b whose antecedents differ by only one node, and the element by which the antecedent of R_a differs from the antecedent of R_b is exactly the element by which the consequent of R_b differs from the consequent of R_a, merge the two relations; the antecedent of the new relation consists of the elements common to the antecedents of R_a and R_b, and its consequent is the union of the consequents of R_a and R_b; continuing until the entire hyperedge list has been traversed;

sorting the elements of the antecedent (front-piece) vertex set of each directed hyperedge relation by out-degree and selecting seed nodes in descending order;

and, starting from the selected seed nodes, propagating network information using a linear threshold propagation algorithm for the directed hypergraph.

2. The method according to claim 1, wherein the step of sorting the elements of the antecedent (front-piece) vertex set of the directed hyperedge relation by out-degree and selecting seed nodes in descending order comprises:

within one directed hyperedge relation, preferentially selecting only the single node with the largest out-degree as a seed node.

3. The method of claim 1, wherein the given minimum support level has a value of 0.25.

4. The method according to claim 1, wherein the step of propagating the network information by using a linear threshold propagation algorithm for the directed hypergraph from the selected seed node comprises:

directly activating the seed node set A_lready and randomly assigning a threshold $\theta_u$ to each remaining node $u$; wherein the threshold $\theta_u$ takes a value in $[0, 1]$; the larger $\theta_u$ is, the harder the node is to activate, and the smaller $\theta_u$ is, the easier the node is to activate;

letting $N(v)$ be the set of neighbor nodes of the current node $v$ in the directed hypergraph and defining, for any node $w \in N(v)$, $b_{vw}$ as the degree of influence of node $w$ on node $v$, which satisfies

$$\sum_{w \in N(v)} b_{vw} \le 1$$

for any non-activated node $v$, when the combined influence of its activated neighbors reaches or exceeds the randomly assigned threshold, i.e.

$$\sum_{w \in N(v),\ w\ \text{active}} b_{vw} \ge \theta_v$$

the node is activated;

and, during network information propagation, repeating the above steps until no new active node appears, at which point propagation ends.

5. The method of claim 1, wherein the unsupervised learning algorithm includes K-means clustering, Apriori, and FP-growth.