CN111126437A

CN111126437A - Abnormal group detection method based on weighted dynamic network representation learning

Info

Publication number: CN111126437A
Application number: CN201911155412.1A
Authority: CN
Inventors: 冯昊; 刘琰; 周资乔; 钟凤喆; 王博
Original assignee: Information Engineering University of PLA Strategic Support Force
Current assignee: Information Engineering University of PLA Strategic Support Force
Priority date: 2019-11-22
Filing date: 2019-11-22
Publication date: 2020-05-08
Anticipated expiration: 2039-11-22
Also published as: CN111126437B

Abstract

The invention belongs to the technical field of dynamic network anomaly detection and discloses an abnormal group detection method based on weighted dynamic network representation learning, comprising the following steps: step 1: constructing a weighted dynamic network representation learning model based on a deep self-coding (autoencoder) neural network; step 2: performing abnormal link identification based on the constructed model to obtain an abnormal link set; and step 3: constructing a fully connected neural network model based on the abnormal link set, and detecting abnormal groups through that model. The invention combines abnormal links with a fully connected neural network anomaly detection model, which extends the range of application of detection based on abnormal links. Experimental verification on the Enron email dataset and an AS-level Internet dataset shows that the invention achieves a good abnormal group detection effect.

Description

Abnormal group detection method based on weighted dynamic network representation learning

Technical Field

The invention belongs to the technical field of dynamic network anomaly detection, and particularly relates to an anomaly group detection method based on weighted dynamic network representation learning.

Background

With the rapid development of network technology and the wide adoption of computers and mobile intelligent terminals, networks have greatly changed how people work and live; at the same time, networks keep growing in scale and in structural complexity. Anomaly detection on dynamic networks has therefore become increasingly difficult: existing methods based on statistics of graph structural features struggle to capture the structural features of a graph comprehensively, and how to effectively identify abnormal groups in a changing network is a current research focus.

The basic idea of network representation learning is to transform the nodes of a network into multi-dimensional vector representations through a series of conversions while retaining as much of the structural information of the original network as possible, so that tasks such as link prediction, multi-label node classification and community division can be carried out more conveniently with existing methods. In currently known dynamic network representation learning methods, when facing a weighted network, a random-walk-based method raises or lowers the probability of selecting a node as the next hop according to the edge weight. This effectively reduces the distance, after representation learning, between the nodes joined by high-weight edges. In an abnormal link detection task, however, whether the links between nodes in the next time slice network are normal must be judged from the node representations learned on the historical network, and such a method learns the structural information between nodes but not the weight information of the edges. Therefore, if a link exists between the nodes under test but its weight is markedly larger or smaller than in the past, the method cannot identify the weight anomaly of the link.

Meanwhile, abnormal events in a dynamic network vary in duration and are often difficult to capture with a single time slice network. An anomaly detection model based on a fully connected neural network is proposed in a paper (Miz V, Riccaud B, Benzi K, et al.). However, that paper defines a node anomaly as a sudden increase in the traffic of a node within a certain time and does not consider changes in the communication structure between nodes.

Therefore, the invention constructs a weighted dynamic network representation learning model, performs abnormal link detection on the whole network on the basis of the model, finally constructs a fully connected neural network based on the abnormal links, and detects and determines the abnormal node set.

Disclosure of Invention

The invention provides an abnormal group detection method based on weighted dynamic network representation learning, aiming at the problems that the existing network representation learning method cannot well learn the corresponding relation between edges and weights when facing a weighted dynamic network and cannot effectively identify weight abnormality when abnormal link detection is carried out.

In order to achieve the purpose, the invention adopts the following technical scheme:

an abnormal group detection method based on weighted dynamic network representation learning comprises the following steps:

step 1: constructing a weighted dynamic network representation learning model based on a deep self-coding neural network;

step 2: performing abnormal link identification based on the constructed weighted dynamic network representation learning model to obtain an abnormal link set;

step 3: constructing a fully connected neural network model based on the abnormal link set, and detecting abnormal groups through the fully connected neural network model.

Further, the step 1 comprises:

step 1.1: for each edge $e_i \in E$ in the dynamic network $G = \{G_1, G_2, \ldots, G_t, G_{t+1}, \ldots, G_n\}$, collecting its weight values in the different time slice networks, denoting the weight sequence of edge $e_i$ as $w_{e_i} = \{w_1, w_2, \ldots, w_m\}$, and discretizing the sequence $w_{e_i}$;

step 1.2: in each time slice network, constructing a random walk path set starting from each node; given a network $G = (V, E, W)$, for any $v_1 \in V$, constructing the random walk path set $\Omega_{v_1} = \{(v_1, v_2, \ldots, v_l, w_{12}, w_{23}, \ldots, w_{(l-1)l}), \ldots \mid (v_i, v_{i+1}) \in E \wedge w_{i(i+1)} \in W\}$, wherein $l$ is the length of the constructed random walk path and $w_{i(i+1)}$ is the weight of edge $(v_i, v_{i+1})$;

step 1.3: regarding the weights of the edges as special nodes, encoding each node in the random walk path by one-hot encoding as the input layer and output layer of the deep self-coding neural network, learning the network structure and the edge weight information in the intermediate layer by minimizing a loss function, and at the same time compressing the dimension of each node vector representation to a preset vector representation dimension $d$.

Further, the step 1.3 includes:

step 1.3.1: minimizing the difference between the input layer and the output layer by optimizing a first objective equation:

wherein $|\Omega|$ is the number of random walk paths and $l$ is the length of a random walk path; $f^{(nl)}(\cdot)$ is the output of the $nl$-th layer, i.e. the output layer; $W^{(nl-1)}$ is the weight of layer $nl-1$ and $b^{(nl-1)}$ is the bias of layer $nl-1$; the input of layer 0, i.e. the input layer, is the one-hot code of any node of the $i$-th random walk path; and the weight term is the weight of the edge $(v_{l-1}, v_l)$ of the $i$-th random walk path;

step 1.3.2: in the intermediate layer, for the random walk path $(v_1, v_2, \ldots, v_l, w_{12}, w_{23}, \ldots, w_{(l-1)l})$, minimizing the distance between the nodes of the first half $(v_1, v_2, \ldots, v_l)$ of the path by optimizing a second objective equation, the second objective equation being:

wherein the two terms are the one-hot codes of two adjacent nodes in the random walk path;

step 1.3.3: minimizing the distance between the edge and the weight node by optimizing a third objective equation, the third objective equation being:

wherein the edge term is the vector representation of any edge $e_{(j-1)j}$ among the edges $(e_{12}, e_{23}, \ldots, e_{(r-1)r})$ between the nodes $(v_1, v_2, \ldots, v_r)$;

step 1.3.4: sparsity of input-output vectors is limited by KL divergence:

wherein $d$ is the dimension of the vector representation, $p$ is the sparsity parameter, the remaining three quantities are, respectively, the mean activation of the layer-$\tau$ neurons, the activation of the $i$-th dimension neuron and the average activation of the $i$-th dimension neuron, and $\tau \in [1, nl]$;

step 1.3.5: combining equations 1, 2, 3 and 4 to construct a loss function, thereby completing the construction of the weighted dynamic network representation learning model:

wherein the last term is the weight decay term, $W^{(\tau)}$ is the weight of layer $\tau$, and $\|\cdot\|_F$ denotes the Frobenius norm.

Further, the step 2 comprises:

step 2.1: dynamically updating the vector representation of the nodes, and setting a sampling probability $s_i$ for the random walk paths of the 1st to $t$-th time slice networks:

wherein $i$ is the time value;

step 2.2: acquiring a random walk path set by integrating a current time slice network and a historical time slice network, sequentially sending the random walk paths into a constructed weighted dynamic network representation learning model, and obtaining low-dimensional vector representation of nodes by minimizing a loss function;

step 2.3: and after the vector representation of each node of the t-th time slice network is obtained, abnormal link detection is carried out on each link of the t + 1-th time slice network based on the vector representation of the current node, and an abnormal link set is obtained.

Further, the step 2.3 comprises:

step 2.3.1: link anomaly identification:

taking the average distance between all edge-connected node pairs in the 1st to $t$-th time slice networks as a reference, the closeness between nodes $v_i$ and $v_j$ is defined as:

wherein $d_{ij}$ is the Euclidean distance between nodes $v_i$ and $v_j$;

setting an abnormal link judgment threshold $k$; when two nodes whose closeness is smaller than $k$ are linked in time slice $t+1$, the node pair is determined to have a link anomaly in time slice $t+1$, and the link is an abnormal link;

step 2.3.2: weight anomaly identification:

obtaining the vector representation of edge $e_{ij}$ by performing a Hadamard product operation on the vector representations of nodes $v_i$ and $v_j$, and predicting the weight of the edge in time slice $t+1$ by computing the Euclidean distance in $d$-dimensional space between edge $e_{ij}$ and each weight node; if the predicted weight does not match the actual weight, determining that edge $e_{ij}$ has a weight anomaly in time slice $t+1$ and that the link is an abnormal link;

step 2.3.3: the abnormal link set is obtained through the step 2.3.1 and the step 2.3.2.

Further, the step 3 comprises:

step 3.1: regarding all abnormal links in the abnormal link set as edges among nodes, and constructing a full-connection neural network model based on the abnormal link set, thereby outputting a plurality of abnormal subgraphs and obtaining an abnormal subgraph set;

step 3.2: taking the maximum connected subgraph in the abnormal subgraph set and outputting it as the final abnormal group.

Compared with the prior art, the invention has the following beneficial effects:

1. according to the invention, the vector representation of the nodes and the edges is obtained by learning the structural information and the weight information of the edges in the dynamic network, and the abnormal node set is obtained by using the full-connection neural network model on the basis of abnormal link detection.

2. The invention designs a weighted dynamic network representation learning model, which learns the dynamic network structure information more comprehensively, considers the weight as a special node, synthesizes the node representation to obtain the vector representation of the edge, and minimizes the distance between the edge and the 'weight node' thereof, thereby learning the weight information in the network. After the node vector representation is obtained, the real dynamic network data set is used for carrying out abnormal link detection, and the effectiveness of the method is verified through experiments.

3. The invention combines abnormal links with a fully connected neural network anomaly detection model, which extends the range of application of detection based on abnormal links. Experimental verification on the Enron email dataset and an AS-level Internet dataset shows that the invention achieves a good abnormal group detection effect.

Drawings

FIG. 1 is a basic flowchart of an abnormal group detection method based on weighted dynamic network representation learning according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a weighted dynamic network representation learning model architecture of an abnormal group detection method based on weighted dynamic network representation learning according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the positions of the edges and the weighted nodes of a t-time slice network of an abnormal group detection method based on weighted dynamic network representation learning according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a dynamic network link structure change of an abnormal group detection method based on weighted dynamic network representation learning according to an embodiment of the present invention;

FIG. 5 is a second basic flowchart of an abnormal group detection method based on weighted dynamic network representation learning according to an embodiment of the present invention;

FIG. 6 is a diagram of the abnormal group detection results on the Enron email dataset for the abnormal group detection method based on weighted dynamic network representation learning according to an embodiment of the present invention;

FIG. 7 is a diagram of the AS-level Internet abnormal group detection results for Lebanon and Venezuela for the abnormal group detection method based on weighted dynamic network representation learning according to an embodiment of the present invention;

FIG. 8 is a diagram of the statistical results of the abnormal link counts of the Lebanon abnormal node set for the abnormal group detection method based on weighted dynamic network representation learning according to an embodiment of the present invention;

FIG. 9 is a graph of the evolution of the abnormal link counts of the Lebanon abnormal node set for the abnormal group detection method based on weighted dynamic network representation learning according to an embodiment of the present invention;

FIG. 10 is a diagram of the statistical results of the abnormal link counts of the Venezuela abnormal node set for the abnormal group detection method based on weighted dynamic network representation learning according to an embodiment of the present invention;

FIG. 11 is a graph of the evolution of the abnormal link counts of the Venezuela abnormal node set for the abnormal group detection method based on weighted dynamic network representation learning according to an embodiment of the present invention.

Detailed Description

For a better understanding of the present invention, the meanings of some of the nouns appearing in the present invention are explained:

Weighted dynamic network: a weighted dynamic network is a time-varying weighted network. A dynamic network comprising $n$ time slices is denoted $G = \{G_1, G_2, \ldots, G_t, G_{t+1}, \ldots, G_n\}$; the $t$-th time slice network is $G_t = (V_t, E_t, W_t)$, where $V_t$ is the set of vertices in the network, $E_t$ is the set of edges representing relationships between vertices, and $W_t$ is the set of edge weights.

Weight anomaly: given a dynamic network $G = \{G_1, G_2, \ldots, G_t, G_{t+1}, \ldots, G_n\}$, for any time slice network $G_t = (V_t, E_t, W_t)$ and any edge $e_i \in E_t$, $e_i = \{frm, to, w_i\}$, where $frm$ and $to$ are the two endpoints of the edge and $w_i$ is the weight of the current edge, let $[w_l, w_h]$ be the normal weight range, over the $n$ time slices, of the edge with endpoints $frm$ and $to$; if $w_i < w_l$ or $w_i > w_h$, then $e_i$ is considered to have a weight anomaly in time slice $t$.

Link anomaly: link anomalies comprise abnormal link connection and abnormal link disconnection. Given a dynamic network $G = \{G_1, G_2, \ldots, G_t, G_{t+1}, \ldots, G_n\}$, after the vector representation of each node of the $(t-1)$-th time slice network is obtained, if two nodes $v_i$ and $v_j$ with a low link probability become linked at some time $t$, the behavior is called an abnormal link connection; similarly, if two nodes $v_i$ and $v_j$ with a high link probability are disconnected at some time $t$, it is called an abnormal link disconnection.

Synchronous abnormal linking: given a dynamic network $G = \{G_1, G_2, \ldots, G_t, G_{t+1}, \ldots, G_n\}$, when the abnormal links of several nodes in the dynamic network appear consistently from the $t$-th to the $s$-th time slice network, the node set is said to exhibit synchronous abnormal link behavior in time slices $t$ to $s$.

The invention is further illustrated by the following examples in conjunction with the accompanying drawings:

A network anomaly is defined as a group of nodes that synchronously exhibit abnormal link behavior over a period of time. Given a weighted dynamic network $G = \{G_1, G_2, \ldots, G_t, G_{t+1}, \ldots, G_n\}$, our goal is to obtain the set of nodes with synchronous abnormal link behavior over a specified time period of the network; to this end, the abnormal link set of the dynamic network is first identified through network representation learning. For any time slice $t$ of the dynamic network, the structural information of the 1st to $t$-th time slice networks is learned in order to detect abnormal links in the $(t+1)$-th time slice network, where abnormal links comprise link weight anomalies and link anomalies.

After the abnormal link set of the whole weighted dynamic network has been acquired, the aim is to obtain the set of nodes that synchronously exhibit abnormal behavior within a certain time period. Accordingly, node sets that are joined by edges and exhibit abnormal behavior synchronously are searched for on the basis of the abnormal link set: the weights between nodes are obtained by comparing the similarity of the abnormal behavior of the nodes within the period, low-weight edges are pruned by setting a weight threshold, and finally the maximum connected subgraph (the largest sub-network output) is taken as the abnormal node set of the current time period.

In order to effectively detect abnormal groups of a weighted dynamic network, the invention discloses an abnormal group detection method based on weighted dynamic network representation learning, as shown in figure 1, a weighted dynamic network representation learning model is firstly established based on a deep self-coding neural network, abnormal link identification is carried out on the current dynamic network, an abnormal link set is fused with a fully-connected neural network, and finally an abnormal group (abnormal node set) is obtained. The three sections are described in detail below.

Step S11: constructing a weighted dynamic network representation learning model, WeightWalk, based on a deep self-coding neural network. The WeightWalk model can effectively learn both the network structure information and the edge weight information; it is described below in three parts: weight discretization, weighted random walk path generation, and representation learning.

Step S11.1: weight discretization:

In a weighted dynamic network, the weights between nodes are continuous values; continuous values are not conducive to representation learning of the nodes and need to be discretized. For each edge $e_i \in E$ of the dynamic network $G = \{G_1, G_2, \ldots, G_t, G_{t+1}, \ldots, G_n\}$, its weight values in the different time slice networks are collected, and the weight sequence of edge $e_i$ is denoted $w_{e_i} = \{w_1, w_2, \ldots, w_m\}$. The sequence $w_{e_i}$ can be discretized by various methods, such as equal-frequency partitioning, equal-width partitioning or clustering. Here we assume that the sequence follows a normal distribution and calculate its mean $\mu$ and variance $\sigma^2$.

Given a threshold $\alpha$, for any $w_i \in w_{e_i}$, if the value of $w_i$ falls outside $[\mu - \alpha, \mu + \alpha]$ its weight is set to 1, and if it falls within $[\mu - \alpha, \mu + \alpha]$ its weight is set to 0. $\alpha$ is usually taken to be $3\sigma$, because if the sequence $w_{e_i}$ follows a normal distribution, the probability that a value falls outside this interval is only about 0.3%, which is a small-probability event; the value of $\alpha$ can also be chosen according to the actual situation.
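The discretization rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `discretize_weights` and the sample weight history are assumptions, and the population variance is used because the text does not say which estimator is intended.

```python
import math

def discretize_weights(weights, alpha=None):
    """Map each weight to 0 (inside [mu - alpha, mu + alpha]) or 1 (outside)."""
    mu = sum(weights) / len(weights)
    # Population variance of the weight sequence (an assumption; see lead-in).
    var = sum((w - mu) ** 2 for w in weights) / len(weights)
    sigma = math.sqrt(var)
    if alpha is None:
        alpha = 3 * sigma  # default threshold suggested in the text
    return [0 if mu - alpha <= w <= mu + alpha else 1 for w in weights]

# Hypothetical weight history of one edge over 20 time slices.
history = [10.0] * 19 + [100.0]
labels = discretize_weights(history)
```

With this history only the final, unusually large weight is labeled 1; passing an explicit `alpha` tightens or loosens the interval.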

Step S11.2: generating a weighted random walk path:

In each time slice network, a set of random walk paths is constructed starting from each node. Given a network $G = (V, E, W)$, for any $v_1 \in V$, the random walk path set $\Omega_{v_1} = \{(v_1, v_2, \ldots, v_l, w_{12}, w_{23}, \ldots, w_{(l-1)l}), \ldots \mid (v_i, v_{i+1}) \in E \wedge w_{i(i+1)} \in W\}$ is constructed, where $l$ is the length of the constructed random walk path and $w_{i(i+1)}$ is the weight of edge $(v_i, v_{i+1})$. In order to learn the correspondence between edges and weights in the network, the edge weights must be passed into the model together with the nodes, and the nodes and the edge weights must be separated in the model learning stage.
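A minimal sketch of generating one such path follows. It assumes the next hop is chosen uniformly among neighbors, which the text does not specify; the adjacency layout and the name `weighted_random_walk` are illustrative only.

```python
import random

def weighted_random_walk(adj, start, length, rng):
    """Return (v_1, ..., v_l, w_12, ..., w_(l-1)l) for one walk from `start`.

    `adj` maps node -> {neighbor: discretized edge weight}.
    """
    nodes = [start]
    weights = []
    while len(nodes) < length:
        neighbors = sorted(adj[nodes[-1]])
        if not neighbors:
            break  # dead end: return a shorter path
        nxt = rng.choice(neighbors)  # uniform choice (assumption)
        weights.append(adj[nodes[-1]][nxt])
        nodes.append(nxt)
    # Nodes first, then the weights of the traversed edges, as in the text.
    return tuple(nodes) + tuple(weights)

# Hypothetical three-node time slice with discretized weights.
adj = {
    "a": {"b": 0, "c": 1},
    "b": {"a": 0, "c": 0},
    "c": {"a": 1, "b": 0},
}
path = weighted_random_walk(adj, "a", 4, random.Random(0))
```

The first `length` entries of `path` are nodes and the remaining entries are the weights of the edges between consecutive nodes, so the two halves can be separated again at learning time.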

Step S11.3: deep self-coding neural network representation learning:

The purpose of network representation learning is to learn a mapping function $f: v \to \mathbb{R}^d$ that maps each node in the network to a low-dimensional vector, where $d$ is the dimension of the vector representation. The existing NetWalk dynamic network representation learning algorithm uses a self-coding neural network to perform representation learning on each time slice network, but it has two problems. First, the decay of the importance of historical paths is not considered in dynamic network representation learning; for example, when learning the node representations of the $n$-th time slice, the node links of the $(n-100)$-th time slice are obviously far less important than those of the $n$-th time slice. Second, the edge weights are not learned when a weighted network is processed, so in an abnormal link detection task, if a link exists between the nodes under test but its weight is markedly larger or smaller than usual, the method cannot identify the weight anomaly. To solve these two problems, the invention proposes WeightWalk, a representation learning model for weighted dynamic networks; the model framework is shown in FIG. 2. The model input is a weighted random walk path. In the model, the edge weights are regarded as special nodes; each node in the random walk path is encoded by one-hot encoding as the input layer and output layer of the self-coding neural network; the network structure and the edge weight information are learned in the intermediate layer by minimizing a loss function; and at the same time the dimension of each node vector representation is compressed to a preset vector representation dimension $d$.
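One way to picture "weights as special nodes" is a shared one-hot vocabulary in which ordinary nodes come first and each discretized weight level gets its own index. The sketch below assumes that layout; the function names and the vocabulary ordering are illustrative and not taken from the source.

```python
def build_vocabulary(nodes, weight_levels):
    """Index real nodes first, then one 'weight node' per discretized level."""
    vocab = {("node", v): i for i, v in enumerate(nodes)}
    for w in weight_levels:
        vocab[("weight", w)] = len(vocab)
    return vocab

def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec

def encode_path(path_nodes, path_weights, vocab):
    """One-hot encode the node half and the weight half of a walk separately."""
    size = len(vocab)
    node_codes = [one_hot(vocab[("node", v)], size) for v in path_nodes]
    weight_codes = [one_hot(vocab[("weight", w)], size) for w in path_weights]
    return node_codes, weight_codes

vocab = build_vocabulary(["a", "b", "c"], [0, 1])
node_codes, weight_codes = encode_path(["a", "b"], [1], vocab)
```

Keeping the two halves in separate lists mirrors the requirement that nodes and edge weights be fed in together but treated separately during learning.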

Assume the model has $nl$ layers in total; the input layer is denoted $layer_0$, the output layer $layer_{nl}$, and the intermediate layers are collectively called $layer_{ml}$. Given the $i$-th random walk path, each node on it is represented by its one-hot code, and the whole random walk path is recorded as the sequence of these codes. Given the layer-$\tau$ weight matrix $W^{(\tau)}$ and the layer-$\tau$ bias matrix $b^{(\tau)}$, $\tau \in [1, nl]$, $f^{(\tau)}(\cdot)$ denotes the output of layer $\tau$ of the model; the input of layer 0 is the one-hot code of a node, and the output of layer $nl$ is $f^{(nl)}(\cdot)$ evaluated on that input.

For a self-coding neural network, it is desirable to minimize the difference between the input and the output of the model. Using the $\ell_2$ norm to measure this difference, the objective equation is written as:

wherein $|\Omega|$ is the number of random walk paths, $l$ is the length of a random walk path, $W^{(nl-1)}$ is the weight of layer $nl-1$, and $b^{(nl-1)}$ is the bias of layer $nl-1$.
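The objective equation referred to above is not reproduced in this text. As a hedged reconstruction only, assuming the notation $x_p^{(i)}$ (not from the source) for the one-hot code of the $p$-th entry of the $i$-th random walk path and $f^{(nl)}(\cdot)$ for the output of the output layer, a standard autoencoder reconstruction objective consistent with the surrounding definitions would be:

```latex
J_{1} = \frac{1}{2} \sum_{i=1}^{|\Omega|} \sum_{p=1}^{l}
        \left\| f^{(nl)}\!\left(x_p^{(i)}\right) - x_p^{(i)} \right\|_2^2
```

The original equation may additionally sum over the weight nodes $w_{12}, \ldots, w_{(l-1)l}$ of each path; that detail cannot be recovered from the text.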

In the intermediate layer $layer_{ml}$, for a random walk path $(v_1, v_2, \ldots, v_l, w_{12}, w_{23}, \ldots, w_{(l-1)l})$, it is desirable to minimize the distance between the nodes of the first half $(v_1, v_2, \ldots, v_l)$ of the path; the objective equation is:

wherein the two terms are the one-hot codes of two adjacent nodes in the $i$-th random walk path.
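This second objective is likewise missing from the text. A hedged sketch, assuming $x_p^{(i)}$ denotes the one-hot code of the $p$-th node of the $i$-th path and $f^{(ml)}(\cdot)$ the output of the intermediate layer (both symbols are assumptions), is a sum of squared distances between the embeddings of adjacent nodes:

```latex
J_{2} = \sum_{i=1}^{|\Omega|} \sum_{p=1}^{l-1}
        \left\| f^{(ml)}\!\left(x_p^{(i)}\right)
              - f^{(ml)}\!\left(x_{p+1}^{(i)}\right) \right\|_2^2
```

Whether the original sums over adjacent pairs only, or over all pairs of nodes in the path as some related models do, cannot be determined from the surviving text.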

A vector representation of each edge is obtained by merging the vector representations of its nodes. The edges between the nodes $(v_1, v_2, \ldots, v_r)$ can be written as $(e_{12}, e_{23}, \ldots, e_{(r-1)r})$, where the vector representation of any edge $e_{(j-1)j}$ is obtained by taking the Hadamard product of the vector representations of its two end nodes. In order to learn the weights of the edges, the distance between each edge and its weight node must be minimized, and the objective equation is expressed as:
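The equation itself is not reproduced in this text. The edge construction and the quantity being minimized can nevertheless be sketched as below, assuming a squared Euclidean distance and the hypothetical names `hadamard` and `edge_weight_loss`; the vectors are made-up values.

```python
def hadamard(u, v):
    """Element-wise product of two node vectors, used as the edge vector."""
    return [a * b for a, b in zip(u, v)]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def edge_weight_loss(path_vectors, weight_vectors):
    """Sum of squared distances between each edge e_(j-1)j and its weight node."""
    total = 0.0
    for j in range(1, len(path_vectors)):
        edge = hadamard(path_vectors[j - 1], path_vectors[j])
        total += squared_distance(edge, weight_vectors[j - 1])
    return total

# Hypothetical 2-dimensional embeddings of three nodes and two weight nodes.
path_vectors = [[1.0, 2.0], [3.0, 0.5], [2.0, 2.0]]
weight_vectors = [[3.0, 1.0], [5.0, 1.0]]
loss = edge_weight_loss(path_vectors, weight_vectors)
```

Here the first edge vector coincides with its weight node and contributes nothing, while the second is one unit away, so the loss is 1.0.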

To guarantee the sparsity of the input and output vectors, a KL divergence constraint is used:

wherein the three quantities are, respectively, the mean activation of the layer-$\tau$ neurons, the activation of the $i$-th dimension neuron, and the average activation of the $i$-th dimension neuron.
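The KL constraint is not reproduced in this text. A common form for sparse autoencoders, assumed here rather than taken from the source, is the KL divergence between a Bernoulli distribution with the target sparsity $p$ and one with each neuron's average activation, summed over the $d$ dimensions of a layer:

```python
import math

def kl_sparsity(p, mean_activations):
    """Sum over dimensions of KL(p || rho_hat_i) between Bernoulli distributions.

    `p` is the sparsity parameter; each average activation must lie
    strictly between 0 and 1.
    """
    total = 0.0
    for rho_hat in mean_activations:
        total += p * math.log(p / rho_hat)
        total += (1 - p) * math.log((1 - p) / (1 - rho_hat))
    return total

on_target = kl_sparsity(0.1, [0.1, 0.1])   # activations match the target
too_active = kl_sparsity(0.1, [0.5, 0.5])  # activations far above the target
```

The penalty is zero when every average activation equals $p$ and grows as activations drift away from it.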

To prevent overfitting, weight decay is added. In summary, the final loss function is defined as:

wherein the last term is the weight decay term.

The weighted dynamic network representation learning model construction is completed by steps S11.1 to S11.3.
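The final loss function is not reproduced in this text. As a hedged sketch, writing $J_1$ to $J_4$ for the four objectives above (reconstruction, node distance, edge-to-weight distance and KL sparsity) and assuming hypothetical trade-off coefficients $\beta_2, \beta_3, \beta_4$ and $\lambda$ that do not appear in the source, a combined loss with Frobenius-norm weight decay would take the form:

```latex
J = J_{1} + \beta_{2} J_{2} + \beta_{3} J_{3} + \beta_{4} J_{4}
    + \frac{\lambda}{2} \sum_{\tau=1}^{nl} \left\| W^{(\tau)} \right\|_F^2
```

Only the presence of the four objectives and of a weight decay term over the layer weights $W^{(\tau)}$ is stated in the text; how the terms are weighted is an assumption.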

Step S12: performing abnormal link identification based on the constructed weighted dynamic network representation learning model to obtain an abnormal link set; after vector representations of nodes of a t-th time slice network are obtained, abnormal link detection is carried out on all links of the t + 1-th time slice network based on the vector representations of the current nodes, and the method comprises a link abnormality and weight abnormality identification method.

Step S12.1: dynamically updating the vector representation of the nodes by adopting the reservoir sampling strategy of the NetWalk model. Considering the decay of the importance of historical paths, the farther a path is from the current time $t$, the smaller its influence on the current time slice network; accordingly, a sampling probability $s_i$ is set for the random walk paths of the 1st to $t$-th time slice networks:

wherein $i$ is the time value.
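The formula for $s_i$ is not reproduced in this text. The sketch below therefore only shows how per-slice sampling probabilities could be applied once they are known; the geometric schedule `0.5 ** (t - i)` is purely hypothetical and stands in for the missing formula.

```python
import random

def sample_history(paths_by_slice, probs, rng):
    """Keep each path from time slice i with probability probs[i]."""
    kept = []
    for i, paths in sorted(paths_by_slice.items()):
        for path in paths:
            if rng.random() < probs[i]:
                kept.append((i, path))
    return kept

t = 3
# Hypothetical schedule: older slices are sampled less often.
probs = {i: 0.5 ** (t - i) for i in range(1, t + 1)}
paths_by_slice = {1: ["p1"], 2: ["p2"], 3: ["p3a", "p3b"]}
kept = sample_history(paths_by_slice, probs, random.Random(1))
```

Under this schedule every path of the current slice is kept, while paths from older slices survive with decreasing probability.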

Step S12.2: acquiring a random walk path set by integrating a current time slice network and a historical time slice network, sequentially sending the random walk paths into a constructed weighted dynamic network representation learning model, and obtaining low-dimensional vector representation of nodes by minimizing a loss function;

step S12.3: and after the vector representation of each node of the t-th time slice network is obtained, abnormal link detection is carried out on each link of the t + 1-th time slice network based on the vector representation of the current node, and an abnormal link set is obtained.

Further, said step S12.3 comprises:

Step S12.3.1: link anomaly identification:

The Euclidean distance between nodes $v_i$ and $v_j$ in $d$-dimensional space is taken as the distance between the two nodes. The representation learned for the $t$-th time slice network is in fact the vector representation of all nodes that appeared in the 1st to $t$-th time slice networks; therefore, taking the average distance between all edge-connected node pairs appearing in the 1st to $t$-th time slice networks as a reference, the closeness between nodes $v_i$ and $v_j$ is defined as:

wherein $d_{ij}$ is the Euclidean distance between nodes $v_i$ and $v_j$.

After the closeness of each node pair of the $t$-th time slice network has been obtained, an abnormal link judgment threshold $k$ is set for the links of the $(t+1)$-th time slice network: when two nodes whose closeness is smaller than $k$ are linked in time slice $t+1$, the node pair is considered to have a link anomaly in time slice $t+1$, and the link is considered an abnormal link. Optionally, an abnormal disconnection judgment threshold $h$ is set as well: when two nodes whose closeness is greater than $h$ have no link in time slice $t+1$, the node pair is considered to have an abnormal link disconnection in time slice $t+1$. In general, abnormal disconnection need not be considered; it is only applicable to dynamic networks whose nodes and links are highly consistent across time slices, such as routing networks and road traffic networks.
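The closeness formula is not reproduced in this text. The sketch below assumes, purely for illustration, that closeness is the ratio of the reference average distance to the pair distance $d_{ij}$, so that closer pairs score higher; the embeddings, edge lists and function names are hypothetical.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean_edge_distance(emb, edges):
    """Average distance over all edge-connected pairs seen in slices 1..t."""
    return sum(euclidean(emb[a], emb[b]) for a, b in edges) / len(edges)

def closeness(emb, a, b, d_mean):
    # ASSUMPTION: reference distance divided by pair distance; the source
    # formula is not available in the text.
    d = euclidean(emb[a], emb[b])
    return float("inf") if d == 0 else d_mean / d

def abnormal_links(emb, history_edges, new_edges, k):
    """Links of slice t+1 whose end nodes have closeness below threshold k."""
    d_mean = mean_edge_distance(emb, history_edges)
    return [(a, b) for a, b in new_edges if closeness(emb, a, b, d_mean) < k]

emb = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0], "z": [10.0, 0.0]}
history_edges = [("a", "b"), ("a", "c")]
new_edges = [("a", "b"), ("a", "z")]
flagged = abnormal_links(emb, history_edges, new_edges, k=0.5)
```

In this toy example the link between `a` and the distant node `z` is flagged, while the familiar link `a`-`b` is not.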

Step S12.3.2: weight anomaly identification:

The vector representation of edge $e_{ij}$ is obtained by taking the Hadamard product of the vector representations of nodes $v_i$ and $v_j$, and the weight of the edge in time slice $t+1$ is predicted by computing the Euclidean distance in $d$-dimensional space between edge $e_{ij}$ and each weight node. Assume the weights are simply divided into two classes, 0 and 1. After representation learning on the dynamic network of time slices $[0, t]$, the edges effectively form clusters around the weight nodes, with the weight nodes 0 and 1 lying at the centers of the two clusters. The distances from edge $e_{ij}$ to the two weight centers are calculated to predict the weight of edge $e_{ij}$ in time slice $t+1$; if the predicted weight does not match the actual weight, edge $e_{ij}$ is determined to have a weight anomaly in time slice $t+1$. As shown in FIG. 3, each point represents the position of an edge relative to the weight nodes in the $t$-th time slice network, and the weight of edge $e_{ij}$ is determined by the closest weight node.
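The nearest-weight-node rule can be sketched as follows. The embeddings and the names `predict_weight` and `is_weight_anomaly` are illustrative assumptions; squared distance is used because it selects the same nearest weight node as the Euclidean distance.

```python
def hadamard(u, v):
    return [a * b for a, b in zip(u, v)]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def predict_weight(emb, weight_nodes, a, b):
    """Return the label of the weight node nearest to edge vector e_ab."""
    edge = hadamard(emb[a], emb[b])
    return min(weight_nodes, key=lambda w: sq_dist(edge, weight_nodes[w]))

def is_weight_anomaly(emb, weight_nodes, a, b, observed):
    """True when the observed discretized weight differs from the prediction."""
    return predict_weight(emb, weight_nodes, a, b) != observed

# Hypothetical learned vectors for two nodes and the weight nodes 0 and 1.
emb = {"a": [1.0, 1.0], "b": [0.1, 0.2]}
weight_nodes = {0: [0.0, 0.0], 1: [1.0, 1.0]}
predicted = predict_weight(emb, weight_nodes, "a", "b")
```

An edge observed with weight class 1 in time slice $t+1$ would be flagged here, because its vector lies nearest to weight node 0.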

Step S12.3.3: the set of abnormal links is obtained through steps S12.3.1 and S12.3.2.

Step S13: and constructing a fully-connected neural network based on the abnormal link set, and detecting abnormal groups through the fully-connected neural network.

When an abnormal event occurs in a dynamic network, abnormal behaviors often occur among a series of node sets, the number of communication among nodes is suddenly reduced, and abnormal links appear or disappear. The traditional anomaly detection method usually focuses on an anomaly time point, and then searches for an anomaly node after the anomaly time point is determined, and if the duration of an anomaly event is longer, the method cannot completely detect the anomaly. The dynamic network can be converted into a static network containing dynamic network structure information and time information by using a fully-connected neural network, and the abnormal detection of the dynamic network is converted into searching a connected subgraph on the static graph, wherein the connected subgraph contains the structure information and the time information. The method is based on maximizing the weight of edges among interconnected nodes with abnormal synchronization, the connection among the nodes with similar activities is enhanced, then the edges with low weight are cut off, and the fully-connected neural network is converted into one or a plurality of sub-network sets (sub-network sets) with similar behaviors. The sub-networks may be isolated from each other or connected into a whole, and the nodes of the sub-networks reserved after detection are output as a final abnormal node set.

However, that method defines an anomaly in a dynamic network as a sudden increase in the traffic of a node; it does not consider a sudden decrease in traffic between nodes, and because it considers only the traffic of each node, it does not consider anomalies in the link structure between nodes. As shown in FIG. 4, the traffic of each node in the graph is 2 at both time T0 and time T1, so nothing changes in terms of traffic, yet the link structure between the nodes changes considerably.
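The point can be checked on a toy example. The edge lists below are hypothetical (FIG. 4 itself is not reproduced here); they are chosen only so that every node has two links at both times while v1-v4 and v2-v3 are new at T1.

```python
from collections import Counter

def degrees(edges):
    """Number of links per node, used here as a stand-in for traffic."""
    count = Counter()
    for a, b in edges:
        count[a] += 1
        count[b] += 1
    return count

def normalize(edges):
    """Treat edges as undirected so (a, b) and (b, a) compare equal."""
    return {frozenset(e) for e in edges}

t0 = [("v1", "v2"), ("v2", "v4"), ("v4", "v3"), ("v3", "v1")]
t1 = [("v1", "v2"), ("v3", "v4"), ("v1", "v4"), ("v2", "v3")]

same_degrees = degrees(t0) == degrees(t1)
new_links = normalize(t1) - normalize(t0)
```

A detector that looks only at per-node counts sees no change between the two snapshots, whereas comparing the edge sets exposes the two new links.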

Building on abnormal link detection, the method of the invention can effectively detect the link structure anomaly shown in fig. 4: the edges v1-v4 and v2-v3 at time T1 in fig. 4 can be regarded as abnormal links, and the set of link (structure) anomalies and link weight anomalies, namely the abnormal link set, can be effectively detected on the basis of the abnormal links. The invention combines abnormal links with the fully-connected neural network (anomaly detection) model for the first time; the flow chart of the method is shown in fig. 5. The abnormal link set obtained from the dynamic network is denoted as ο = {(t₁, v₁, v₂), (t₁, v₁, v₄), ..., (t_n, v_x, v_y)}, where for any (t_i, v_x, v_y) ∈ ο, t_i is the time at which the abnormal link occurs, v_x ∈ V, v_y ∈ V, and V is the set of nodes in the dynamic network. The abnormal links in the dynamic network are regarded as edges between nodes, and the fully-connected neural network is constructed from the abnormal link set. The constructed fully-connected neural network model has N nodes in total, corresponding to all nodes that appear in the abnormal links of the dynamic network (nodes in V that do not appear in any abnormal link are discarded); if an abnormal link exists between any two of the N nodes, a connecting edge exists between them. After the fully-connected neural network has been trained (measuring node similarity, increasing the weights of edges between nodes that are abnormal synchronously, and pruning edges with low weight), the abnormal subgraph set (abnormal sub-network set) is obtained.
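The construction described above can be illustrated with a minimal sketch. It is not the learning procedure of the invention: as a simplifying assumption, the weight of an edge is taken to be the number of times an abnormal link was observed between two nodes, edges below a hypothetical threshold `min_weight` are pruned, and the remaining connected subgraphs are returned as candidate abnormal groups. The function names and the sample data are illustrative only.

```python
from collections import defaultdict

def build_anomaly_graph(abnormal_links):
    """Aggregate abnormal links (t, u, v) into undirected weighted edges."""
    weights = defaultdict(int)
    for _t, u, v in abnormal_links:
        edge = tuple(sorted((u, v)))
        weights[edge] += 1  # assumed weight: number of abnormal occurrences
    return dict(weights)

def abnormal_subgraphs(weights, min_weight=1):
    """Drop edges lighter than min_weight and return the connected node sets."""
    adjacency = defaultdict(set)
    for (u, v), w in weights.items():
        if w >= min_weight:
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components

# Illustrative abnormal link set: (time slice, node, node)
links = [(1, "v1", "v2"), (1, "v1", "v4"), (2, "v1", "v4"), (3, "v5", "v6")]
graph = build_anomaly_graph(links)
groups = abnormal_subgraphs(graph, min_weight=2)
```

Nodes of V that never appear in an abnormal link are never added to the graph, which matches the discarding rule stated above.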

In order to verify the effect of the invention, the following experiments were set up:

To verify the effectiveness of the weighted dynamic network representation learning model WeightWalk in weight learning, an abnormal link detection experiment is carried out.

(a) Baseline method:

To verify the validity of the model, 5 recent baseline methods were used:

DeepWalk: the method generates node sequences through a random walk strategy and then uses a skip-gram model to learn vector representations of the nodes.

node2vec: the method balances depth-first and breadth-first traversal in the random walk, so that the network structure can be learned more flexibly.

LINE: the method optimizes the node representation by considering the first-order and second-order similarities of nodes; the second-order similarity is used for representation learning in the comparison experiment.

SDNE: a deep learning-based network representation model that uses autoencoders and locality-preserving constraints to learn node representations.

NetWalk: the method uses random walks and a reservoir sampling algorithm to dynamically update the random walk paths, and is a dynamic network representation learning model based on a deep self-coding neural network.

Experimental data:

UCI (UC Irvine messages): a network of messages exchanged among users of an online student community. The nodes represent users and the edges represent messages sent.

DNC: the DNC data set is a leaked email network; nodes in the network correspond to users, and edges are emails sent between users.

Subreddit: the data contain discussions by 25000 reddit users on different topics; nodes in the network correspond to reddit users or topics, and an edge represents one post by a user on a topic.

(b) The experimental steps are as follows:

The WeightWalk model sets the length of a random walk path to 3, the number of paths starting from each node to 20, the number of layers of the self-coding neural network to 5, and the dimensions of the intermediate-layer vector representations to 100 and 20 respectively. In the experiment, each data set is sliced by day and converted into a weighted dynamic network. On each data set, 10000 edges are randomly selected as positive samples, 5000 link negative sample edges are taken (that is, the two nodes have no edge between them in the data set), and 5000 weight negative sample edges are taken (that is, the two nodes are linked in the data set, but the weight is different). After the vector representation of each node has been obtained by each method, the 20000 samples are classified using a logistic regression model for training and prediction, which finally gives the Macro F1-score results listed in Table 1. The F1-score takes both the precision and the recall of the classification model into account and can be regarded as their harmonic mean. It is calculated as follows:

F1 = 2 × Precision × Recall / (Precision + Recall)

where Precision is the precision and Recall is the recall.
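As an illustration of the evaluation metric, the following sketch computes a Macro F1-score as the unweighted mean of the per-class F1-scores. The function name `macro_f1` and the sample labels are illustrative assumptions and are not taken from the experiments.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2 * P * R / (P + R)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)  # class never predicted correctly
    return sum(scores) / len(scores)

# 1 = positive sample edge, 0 = negative sample edge (illustrative labels)
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
score = macro_f1(y_true, y_pred)
```

In this toy example each class has precision 2/3 and recall 2/3, so the Macro F1-score is 2/3.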

TABLE 1 Abnormal link detection (Macro F1-score)

Method	UCI	DNC	Subreddit
LINE	0.581	0.516	0.597
DeepWalk	0.567	0.52	0.495
node2vec	0.57	0.523	0.582
SDNE	0.691	0.776	0.604
NetWalk	0.609	0.665	0.576
WeightWalk	0.776	0.8	0.789

As can be seen from Table 1, the WeightWalk model performs best on all three data sets, and abnormal links can be effectively detected through the node vector representations learned by the method. By contrast, the other methods cannot effectively detect weight anomalies, which shows that the WeightWalk model is better suited to abnormal link detection on weighted dynamic networks.

In order to verify the abnormal group detection effect of the invention, the accuracy of anomaly detection is evaluated by injecting anomalies into real data sets, and the method is also applied to AS-level Internet data sets for experimental verification.

The invention is compared with the source method (see Miz V, Ricaud B, Benzi K, et al. Anomaly detection in the dynamics of web and social networks using associative memory [C]// The World Wide Web Conference. ACM, 2019: 1290-).

The experiment uses the UCI and DNC data sets; each data set is sliced by day and converted into a weighted dynamic network. On each data set, one time slice network is drawn at random; 25% of the nodes are selected and their traffic in the current time slice network is increased, then 25% of the nodes are selected and their communication structure is changed without changing the traffic of the current time slice network. These nodes are taken as the abnormal node set to be detected. The WeightWalk model sets the length of a random walk path to 3, the number of paths starting from each node to 20, the number of layers of the self-coding neural network to 5, and the dimensions of the intermediate-layer vector representations to 100 and 20 respectively. After the vector representations of the nodes have been obtained, the abnormal node set is detected based on abnormal links, which finally gives the Macro F1-score results listed in Table 2.
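The traffic part of the anomaly injection can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify by how much traffic is increased, so the multiplier `factor` is hypothetical, as are the function name and the sample time slice. The structure-changing injection is not covered by this sketch.

```python
import random

def inject_traffic_burst(edges, fraction=0.25, factor=3, seed=0):
    """Multiply the weight of every edge touching a randomly chosen node subset.

    edges maps (u, v) to a weight for one time slice network.
    Returns the modified edges and the set of injected (anomalous) nodes.
    """
    rng = random.Random(seed)
    nodes = sorted({node for edge in edges for node in edge})
    count = max(1, int(len(nodes) * fraction))
    chosen = set(rng.sample(nodes, count))
    boosted = {
        (u, v): (w * factor if u in chosen or v in chosen else w)
        for (u, v), w in edges.items()
    }
    return boosted, chosen

# Illustrative time slice with 8 nodes
slice_edges = {("a", "b"): 2, ("b", "c"): 1, ("c", "d"): 4, ("d", "e"): 1,
               ("e", "f"): 3, ("f", "g"): 2, ("g", "h"): 1}
boosted, anomalous = inject_traffic_burst(slice_edges)
```

The returned node set serves as the ground truth against which a detected abnormal node set can be scored.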

TABLE 2 Comparison of abnormal group detection experiments (Macro F1-score)

Method	DNC	UCI
WeightWalk_Anomaly	0.652	0.550
Source method	0.584	0.316

As can be seen from Table 2, the method of the invention performs better on both data sets and can effectively identify both sudden increases in node traffic and anomalies in communication structure. Given the variability and complexity of dynamic networks, the method performs somewhat worse on the UCI data set, which is related to the loose structure and sparse connections of that network.

In order to further verify the detection effect of the abnormal group, experimental comparison is carried out on the real data set.

We use the Enron mail data set and the AS-level Internet dynamic network data sets to evaluate the detection effect of the invention. The abnormal events in the AS-level Internet dynamic network data sets (a submarine optical cable cut and a power outage) reduce the traffic of Internet operators in the countries concerned and change the communication structure. Therefore, the abnormal group detection method based on traffic "sudden increase" in the paper (see Miz V, Ricaud B, Benzi K, et al. Anomaly detection in the dynamics of web and social networks using associative memory [C]// The World Wide Web Conference. ACM, 2019: 1290-) is not suited to these events, whereas the weighted dynamic network abnormal group detection model based on abnormal links can detect them better. This also shows that our dynamic network abnormal group detection model based on the abnormal link set has wider applicability.

(c) Enron mail data set experiment

The Enron mail data set consists of several years of incoming and outgoing mail of hundreds of senior managers of a company and has been made public. The data set contains not only communication among members of the company but also a large amount of communication with people outside the company. Therefore, in the experiment, users who sent fewer than 3 mails in the last 5 years in the Enron mail network are first filtered out, the data are cleaned, and finally only the communication data among users who sent more than 3 mails are retained. The mailbox addresses of the sender and the receiver and the sending time of each mail record are extracted from the Enron mail data set and used to construct a mail network. Nodes in the network represent communicating members; if member a sends a mail to member b, an edge is added between a and b. The mail communication records from 1999/1/4 to 2001/12/31, three years in total, are divided into 1092 time slices with one day as the unit, and the number of mails sent between a and b within one day is used as the weight of edge ab.
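The construction of the daily mail network can be sketched as follows. The record layout `(sender, receiver, date)`, the function names and the sample records are illustrative assumptions. Edges are treated as undirected, and the filter keeps users who sent at least `min_sent` mails, which follows the "fewer than 3 are filtered out" reading of the text.

```python
from collections import defaultdict
from datetime import date

def filter_active_users(records, min_sent=3):
    """Keep only mail exchanged between users who each sent at least min_sent mails."""
    sent = defaultdict(int)
    for sender, _receiver, _day in records:
        sent[sender] += 1
    active = {user for user, count in sent.items() if count >= min_sent}
    return [r for r in records if r[0] in active and r[1] in active]

def build_daily_slices(records, start):
    """Return {slice_index: {(a, b): weight}} with one slice per day since start."""
    slices = defaultdict(lambda: defaultdict(int))
    for sender, receiver, day in records:
        index = (day - start).days
        edge = tuple(sorted((sender, receiver)))  # assumed undirected edge ab
        slices[index][edge] += 1  # weight: mails between a and b on that day
    return {index: dict(edges) for index, edges in slices.items()}

# Illustrative records: (sender, receiver, sending date)
records = [
    ("a", "b", date(1999, 1, 4)), ("a", "b", date(1999, 1, 4)),
    ("b", "a", date(1999, 1, 4)), ("a", "b", date(1999, 1, 5)),
    ("c", "a", date(1999, 1, 5)), ("b", "a", date(1999, 1, 6)),
    ("b", "a", date(1999, 1, 6)),
]
kept = filter_active_users(records)
daily = build_daily_slices(kept, start=date(1999, 1, 4))
```

In this example user c sent only one mail and is removed, leaving three daily slices for the pair a and b.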

In the experiment, the Enron data set is examined with December 1999 and April, May and August 2001 as the anomaly detection intervals; the maximum connected subgraphs obtained involve 23, 50, 92 and 12 nodes respectively. Each of these node sets is taken as a detected group, and the change in its number of abnormal links is compared over the three years from 1999 to 2001. As shown in fig. 6, the 4 months under examination are marked with wide black lines, and the number of abnormal links of each node set is normalized to the range 0-100. As can be seen from fig. 6, each node set reaches its maximum in its respective event month; in April, May and August 2001 in particular, the number of abnormal links is 50%-300% higher than in other months, which demonstrates the effectiveness of the method to a certain extent. The anomaly detected in May 2001 involves 92 nodes in total, and the number of abnormal links in that node set in May is increased by 2 to 3 times compared with the other months, which shows that the events occurring in May had a large impact on Enron.

(d) AS-level Internet data set experiment

At a specific time t, the AS-level Internet of a country refers to the network snapshot composed of all ASes directly connected to the ASes belonging to that country. The dynamic network is denoted as G = {G₁, G₂, …, G_t, G_(t+1), …, G_n}, where the t-th time slice network is G_t = (V_t, E_t, W_t), V_t is the set of AS autonomous domains of the country together with the AS autonomous domains of other countries directly connected to them, E_t is the set of edges between AS autonomous domains, and W_t is the set of edge weights. Over a period of time, the dynamic network G formed by the snapshots G_t = (V_t, E_t, W_t) reflects how the country's network connectivity evolves. In general, normal changes in G reflect the gradual evolution of the scale and topology of the AS-level Internet, whereas drastic changes of the large-scale Internet are usually caused by abnormal network events, such as router misconfiguration, physical link failures and network attacks, which change the topology of the country's AS-level Internet.

In this embodiment, the AS-level Internet of Lebanon and of Venezuela is selected for experimental verification. The AS-level Internet dynamic networks of Lebanon and Venezuela are obtained by parsing the public routing table data of the RouteViews project, with the number of times an AS pair appears in the routing tables of the relevant country used as the edge weight. The routing table sampling interval of the RouteViews project is 2 h, so the time interval between adjacent network snapshots in the dynamic network is also 2 h, and the time resolution of anomaly detection is likewise 2 h. The AS-level Internet data sets for Lebanon and Venezuela are shown in Table 3.
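The edge weighting described above can be illustrated with a short sketch. It assumes that each routing table entry has already been parsed into an AS path given as a list of AS numbers; the parsing itself, the function name `as_edge_weights` and the sample paths are hypothetical. Adjacent AS pairs are counted only when at least one of the two ASes belongs to the country of interest.

```python
from collections import Counter

def as_edge_weights(as_paths, country_asns):
    """Count how often each adjacent AS pair involving a country AS appears."""
    weights = Counter()
    for path in as_paths:
        for left, right in zip(path, path[1:]):
            if left == right:
                continue  # repeated AS number (path prepending), not a link
            if left in country_asns or right in country_asns:
                weights[tuple(sorted((left, right)))] += 1
    return weights

# Illustrative AS paths from one snapshot; AS42020 is the operator named in the text
paths = [[3356, 174, 42020], [3356, 174, 174, 42020], [1299, 3356]]
snapshot_weights = as_edge_weights(paths, country_asns={42020})
```

Running this once per 2 h routing table snapshot would give the sequence of weighted networks G_t used in the experiment.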

TABLE 3 Statistical information for the AS-level Internet data sets of Lebanon and Venezuela

Country	Start time	End time	Number of snapshots
Lebanon	2012/6/1 00:00	2012/7/31 22:00	727
Venezuela	2019/2/1 00:00	2019/3/31 22:00	706

According to a BGPMon report (see BGPMon [EB/OL]. https:// www.bgpmon.net/internet output-in-lebanon-continuees-for-days /), starting at 16:16 on July 4, 2012, Lebanon's submarine fiber-optic cable was cut and Internet service was interrupted for several days; the networks of operators such as Liban Telecom (AS42020), the largest Internet operator in Lebanon, were the most seriously affected. Since the routing table sampling interval of the RouteViews project is 2 h, the corresponding time point in the routing tables is 18:00 on July 4, 2012.
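The mapping from an event time to the first routing table snapshot that can reflect it can be sketched as follows. The sketch assumes that snapshots are aligned to midnight, as the start times in Table 3 suggest, and it ignores time zones; the function name `next_snapshot_time` is illustrative.

```python
from datetime import datetime, timedelta

def next_snapshot_time(event_time, interval_hours=2):
    """Round an event time up to the next snapshot boundary of the same day grid."""
    day_start = event_time.replace(hour=0, minute=0, second=0, microsecond=0)
    interval = timedelta(hours=interval_hours)
    elapsed = event_time - day_start
    steps = -(-elapsed // interval)  # ceiling division on timedeltas
    return day_start + steps * interval

cable_cut = datetime(2012, 7, 4, 16, 16)
first_affected_snapshot = next_snapshot_time(cable_cut)
```

For the cable cut at 16:16 on July 4, 2012, the sketch returns 18:00 on the same day, which agrees with the time point given above.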

The detection interval is selected as July 1 to July 9, 2012. The abnormal subgraph obtained by abnormal group detection is shown as graph G1 in fig. 7, where the nodes are the abnormal node set obtained by detection and the weight of an edge is the anomaly value of that edge; the larger the weight, the more abnormal the edge. Fig. 8 shows the evolution of the abnormal node set over the entire dynamic network. The abscissa in fig. 8 is the time slice of the dynamic network, 727 time slices in total, and the ordinate is the abnormal node set. The shade of the color indicates the number of abnormal links of a node in a time slice (the darker the color, the larger the number of abnormal links; pure white means no abnormal link), and the time at which the abnormal event occurred, 18:00 on July 4, 2012, is marked by a black straight line segment. As can be seen from fig. 8, after the time marked by the black line segment the number of abnormal links in the node set increases sharply, and some abnormal links in the node set persist until July 30, 2012, which indicates that the affected ASes had not returned to normal before July 30, 2012.

To further understand how the behavior of the abnormal node set changes when the abnormal event occurs, 7 nodes are selected from the node set and their behavior during the event is analyzed. In fig. 9, part a shows the total number of abnormal links of the node set between July 1 and July 9, and part b shows the variation in the number of abnormal links of the node set between July 1 and July 9. As can be seen from fig. 9, the abnormal link counts of the nodes all change significantly at 18:00 on July 4, 2012, and do not ease until after July 7. To a certain extent, the method thus provides a basis for determining when an abnormal event occurred and for analyzing its impact.

According to a report on March 8 (see CNN [EB/OL]. https:// edition. CNN. com/2019/03/08/americas/venezuelalackout-power-intl/index. htm), most of Venezuela was hit by a power outage on the evening of the 7th, and many places were still in the dark on the morning of the 8th. Although no official figure has been published for the number of cities affected, local media counted that 22 of the country's 23 states had lost power.

The detection interval is selected as March 3 to March 11, 2019, and the abnormal subgraph obtained by abnormal group detection is shown as graph G2 in fig. 7. Fig. 10 likewise shows the evolution of this abnormal node set over the entire dynamic network; the black straight line segment marks the time at which the abnormal event occurred, 22:00 on March 7, 2019 (UTC). As can be seen from fig. 10, the number of abnormal links of the node set increases sharply after the black line segment, remains high for several days, and then gradually decreases. To further understand how the behavior of the abnormal node set changes when the abnormal event occurs, 7 nodes are again selected from the node set for analysis. In fig. 11, part a shows the total number of abnormal links of the node set between March 3 and March 11, and part b shows the variation in the number of abnormal links of the node set between March 3 and March 11. As can be seen from fig. 11, the number of abnormal links of the nodes changes significantly at 22:00 on March 7, 2019, and does not ease until March 11. This indicates that Venezuela's AS-level Internet routing fluctuated significantly and did not recover completely until March 11.

The experimental result proves the effectiveness of the method in detecting the abnormal group, the method can reveal the occurrence time of the abnormal event to a certain extent, and meanwhile, the influence degree of the current event on the individual can be evaluated by analyzing the evolution of the abnormal link number of a single node, so that a certain reference is provided for the influence analysis of the abnormal event.

According to the invention, vector representations of nodes and edges are obtained by learning the structural information and the edge weight information of the dynamic network, and the abnormal node set is obtained by using the fully-connected neural network model on the basis of abnormal link detection. The invention designs a weighted dynamic network representation learning model that learns the structural information of the dynamic network more comprehensively: each weight is treated as a special node, the node representations are combined to obtain the vector representation of an edge, and the distance between the edge and its "weight node" is minimized, so that the weight information in the network is learned. After the node vector representations have been obtained, abnormal link detection is carried out on real dynamic network data sets, and the effectiveness of the method is verified by experiment. The invention combines abnormal links with the fully-connected neural network anomaly detection model, extends the application range of the invention on the basis of abnormal links, and is verified experimentally on the Enron mail data set and the AS-level Internet data sets.

The above shows only the preferred embodiments of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.

Claims

1. An abnormal group detection method based on weighted dynamic network representation learning is characterized by comprising the following steps:

2. The abnormal group detection method based on weighted dynamic network representation learning according to claim 1, wherein the step 1 comprises:

step 1.1: for each edge e_i ∈ E in the dynamic network G = {G₁, G₂, …, G_t, G_(t+1), …, G_n}, collecting its weight values in the different time slice networks, denoting the weight value sequence of edge e_i as w_ei = {w₁, w₂, ..., w_m}, and discretizing the sequence w_ei;

step 1.2: in each time slice network, constructing a set of random walk paths starting from each node of that time slice network: given a network G = (V, E, W), for any v₁ ∈ V, constructing a random walk path set Ω_v1 = {(v₁, v₂, ..., v_l, w₁₂, w₂₃, ..., w_(l-1)l), ... | (v_i, v_(i+1)) ∈ E ∩ w_i(i+1) ∈ W}, wherein l is the length of the constructed random walk path, and w_i(i+1) is the weight of edge (v_i, v_(i+1));

3. The abnormal group detection method based on weighted dynamic network representation learning according to claim 2, wherein the step 1.3 comprises:

is the output of the nl-th layer, i.e. the output layer,

W^(nl-1) is the weight of layer nl-1, and b^(nl-1) represents the bias of layer nl-1;

for the ith random walk path

Any node of

is the weight of edge (v_(l-1), v_l) of the i-th random walk path;

wherein ,

and

coding one-hot of two adjacent nodes in the random walk path;

wherein

step 1.3.4: sparsity of input-output vectors is limited by KL divergence:

is the mean degree of activation of the layer τ neurons,

is the degree of activation of the i-dimensional neuron,

wherein

4. The abnormal group detection method based on weighted dynamic network representation learning according to claim 3, wherein the step 2 comprises:

Wherein i is a time value;

5. The abnormal group detection method based on weighted dynamic network representation learning according to claim 1, wherein the step 2.3 comprises:

step 2.3.1: and link exception identification:

As a reference, the degree of closeness between nodes v_i and v_j is defined as:

wherein d_ij is the Euclidean distance between nodes v_i and v_j;

step 2.3.2: weight anomaly identification:

6. The abnormal group detection method based on weighted dynamic network representation learning according to claim 1, wherein the step 3 comprises: