CN112149835B - Network reconstruction method and device - Google Patents


Info

Publication number
CN112149835B
CN112149835B (application CN201910579330.3A)
Authority
CN
China
Prior art keywords: network, state, value, action, time
Prior art date
Legal status: Active
Application number
CN201910579330.3A
Other languages
Chinese (zh)
Other versions
CN112149835A (en)
Inventor
付尧
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201910579330.3A
Publication of CN112149835A
Application granted
Publication of CN112149835B


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques

Abstract

The application provides a network reconstruction method and device, wherein the method includes the following steps: representing the states and actions of an agent by nodes in an original network, and performing reinforcement learning on the agent in the original network based on environmental return information, so that the agent learns the structural topology information of the original network and the rules for removing noise information from the original network; acquiring the movement path formed when the reinforcement-learned agent transfers from a current node to a next node in the original network, taking the next node as the current node, and repeating the step of acquiring the movement path until the nodes in the original network have been traversed; and taking the network formed by connecting all the movement paths as a reconstructed network of the original network.

Description

Network reconstruction method and device
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a network reconstruction method and apparatus.
Background
A large number of complex systems that exist in nature can be described by a wide variety of networks. A typical network consists of a number of nodes, representing different individuals in the real system, and edges between the nodes, where an edge represents a relationship between individuals: two nodes are connected by an edge when a certain relationship exists between them and are not connected otherwise, and two nodes joined by an edge are considered adjacent in the network. For example, the nervous system can be seen as a network of numerous nerve cells interconnected by nerve fibers; a computer network can be considered a network of autonomous computers interconnected by communication media such as fiber optic cable, twisted pair, and coaxial cable; power networks, social networks, traffic networks, dispatch networks, and the like are similar.
At present, a network constructed from data may contain abnormal node relationships due to inaccurate relationship construction rules or to noise, anomalies, and other problems in the data, or some potential node relationships may not be constructed at all. For example, in the security field, a personnel relationship network can be constructed by combining data such as the basic attributes and behaviors of personnel with preset relationship construction rules; however, because of inaccurate construction rules or noise and anomalies in the extracted data, some node relationships are abnormal to a certain degree and some potential node relationships are missing. Using such an abnormal relationship network affects the analysis of personnel, cases, behaviors, and the like in the security field, and reduces the performance of downstream tasks.
Disclosure of Invention
In view of this, the present application provides a network reconstruction method and apparatus.
Specifically, the application is realized by the following technical scheme:
in a first aspect, an embodiment of the present application provides a network reconstruction method, where the method includes:
representing states and actions of an agent by nodes in an original network, and performing reinforcement learning on the agent in the original network based on environment return information so that the agent learns structural topology information in the original network and rules for removing noise information in the original network;
acquiring a moving path formed by the transition of the reinforcement-learned intelligent agent from a current node to a next node in the original network, taking the next node as the current node, and repeatedly executing the step of acquiring the moving path formed by the transition of the trained intelligent agent from the current node to the next node in the original network until the node in the original network is traversed;
and the network formed by connecting all the moving paths is used as a reconstruction network of the original network.
Optionally, the environmental return information includes an environmental instantaneous return and an environmental delayed return;
The method for representing the state and the action of the intelligent agent by the nodes in the original network, and performing reinforcement learning in the original network by adopting the intelligent agent based on the environment return information comprises the following steps:
selecting a node from an original network as the state of an intelligent agent at the time t, and recording the state at the time t;
inputting the state into a preset value network, and obtaining the action with the maximum value output by the value network as the action to be executed at time t;
executing the action to transfer to the state at time t+1, and recording the state at time t+1;
determining an environmental instantaneous return for the action, and updating the value network based on the environmental instantaneous return;
taking the state at time t+1 as the state at time t, and continuing to execute the step of inputting the state into the preset value network based on the updated value network, until the number of recorded states reaches a preset number threshold, thereby completing the current round of actions;
according to the states recorded in the current round of actions, connecting the nodes corresponding to adjacent states in the original network to obtain an intermediate network;
determining an environmental delayed return corresponding to the intermediate network;
and updating the value network based on the environmental delayed return, and starting the next round of actions based on the updated value network until a preset maximum number of action rounds is reached, thereby completing the training of the agent.
Optionally, inputting the state into a preset value network and obtaining the action with the maximum value output by the value network as the action to be executed at time t includes:
inputting state action pairs formed by the state at the time t and each alternative action respectively into the value network, and obtaining the value corresponding to each state action pair output by the value network; the alternative actions comprise other nodes in the original network except for the state at the time t;
and taking the corresponding alternative action of the state action pair with the highest value as the action to be executed at the moment t.
Optionally, the attribute information of the node includes a classification label;
the determining an environmental transient reward for the action comprises:
establishing a moving path between a node corresponding to the state at the time t and a node corresponding to the state at the time t+1;
judging whether the moving path exists in the original network or not, and determining a first environment instantaneous return according to the judging result;
judging whether the classification label corresponding to the state at the time t is consistent with the classification label corresponding to the state at the time t+1; determining a second environment instantaneous return according to the judgment result;
And fusing the first environmental instantaneous return and the second environmental instantaneous return to obtain environmental instantaneous return.
Optionally, the determining the first environmental instantaneous return according to the result of the judgment includes:
if the movement path does not exist in the original network, determining the first environmental instantaneous return to be -1;
if the movement path exists in the original network and has already occurred in this round of actions, determining the first environmental instantaneous return to be 0;
if the movement path exists in the original network and occurs for the first time in this round of actions, determining the first environmental instantaneous return to be 1.
Optionally, the determining the second environmental instantaneous return according to the result of the judgment includes:
if any one of the classification labels corresponding to the state at the time t and the classification labels corresponding to the state at the time t+1 fails to be acquired, determining the second environment instantaneous return as 0;
if the classification label corresponding to the state at the time t and the classification label corresponding to the state at the time t+1 are obtained successfully, judging whether the classification labels are consistent;
if yes, determining the second environment instantaneous return to be 1;
if not, determining the second environmental instantaneous return to be -1.
Optionally, the updating the value network based on the environmental transient rewards includes:
inputting a sample pair consisting of the state recorded at the time t, the action executed at the time t and the environment instantaneous return into the value network, so that the value network performs the following updating process:
determining the real value corresponding to the action at the moment t according to the environment instantaneous return, and taking the real value as a value tag corresponding to the state action pair at the moment t;
when the number of the received sample pairs reaches a preset sample pair threshold, training the value network by taking the received state, action and corresponding value labels at each moment as training samples to obtain an updated value network.
Optionally, the step of determining the real value corresponding to the action at the time t according to the environmental instantaneous return includes:
and after the maximum value of the state at time t+1 output by the value network is discounted by a preset discount rate, adding it to the environmental instantaneous return obtained at time t to obtain the real value corresponding to the action at time t.
Optionally, the determining the environment delay return corresponding to the intermediate network includes:
Acquiring the processing performance of the intermediate network for a downstream task and the processing performance of the original network for the downstream task;
and calculating the difference between the processing performance of the intermediate network for the downstream task and the processing performance of the original network for the downstream task as the environment delay return.
Optionally, the acquiring the processing performance of the intermediate network for the downstream task includes:
dividing nodes in the intermediate network into training set nodes and testing set nodes;
training a graph mining classification model based on the classification labels of the training set nodes;
predicting the classification labels of the nodes of each test set by adopting the graph mining classification model to obtain corresponding prediction results;
and determining a classification effect score as the processing performance of the intermediate network for the downstream task according to the comparison of the prediction result of each test set node and the real classification label.
Optionally, the determining the classification effect score according to the comparison of the prediction result of each test set node and the real classification label includes:
determining the number of test nodes with the predicted result consistent with the real classification labels;
calculating the precision rate and recall rate according to the number;
And calculating the classification effect score according to a preset calculation algorithm according to the precision rate and the recall rate.
Optionally, the calculating the difference between the processing performance of the intermediate network for the downstream task and the processing performance of the original network for the downstream task, as the environment delay return, includes:
calculating a difference value between the processing performance of the intermediate network for a downstream task and the processing performance of the original network for the downstream task;
and balancing the difference by adopting a preset balancing parameter to obtain the environment delay return, wherein the balancing parameter is used for balancing the environment delay return and the environment instantaneous return value, and the larger the value of the balancing parameter is, the larger the guiding effect of the environment delay return is.
Optionally, the updating the value network based on the environment delay rewards includes:
and inputting a sample pair consisting of all nodes in the intermediate network and the environment delay returns into the value network so as to learn the sample pair by the value network to update network parameters in the value network.
Optionally, the node is described by adopting node information, and the node information comprises attribute information and structure information of the node; the attribute information of the node comprises a feature vector of the node; the structure information is a structure characterization vector obtained through a graph characterization learning algorithm.
In a second aspect, an embodiment of the present application provides a network reconstruction device, where the device includes:
the intelligent agent training module is used for representing the state and action of an intelligent agent by using nodes in an original network, and performing reinforcement learning in the original network by adopting the intelligent agent based on environment return information so that the intelligent agent learns structural topology information in the original network and rules for removing noise information in the original network;
the mobile path acquisition module is used for acquiring a mobile path formed by the fact that the intelligent agent subjected to reinforcement learning migrates from a current node to a next node in the original network, and repeatedly calling the mobile path acquisition module by taking the next node as the current node until the nodes in the original network are traversed;
and the reconstruction network generation module is used for connecting all the mobile paths into a network serving as a reconstruction network of the original network.
The embodiment of the application has the following beneficial effects:
in this embodiment, the reinforcement learning framework is used to represent the states and actions of the agent by the nodes in the original network, and the agent performs interactive learning with the environment in the original network, so that the agent learns the structural topology information of the original network and the rules for removing noise information from it. The learned agent is then used to reconstruct the network, which can effectively reduce the noise and abnormal edges of the original network while retaining its important topological structure, and at the same time adds some potential relationship edges, so that the performance of downstream tasks can be effectively improved based on the reconstructed network.
Drawings
FIG. 1 is a flowchart illustrating the steps of an embodiment of a network reconstruction method according to an exemplary embodiment of the present application;
FIG. 2 is a flowchart illustrating the steps of another embodiment of a network reconstruction method according to an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of an agent training framework shown in an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of a reference network with better performance, shown in a verification scenario for verifying the reconstruction effect of a reconstructed network according to an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of a noise network obtained after adding noise edges to the reference network of FIG. 4, shown in the verification scenario for verifying the reconstruction effect of a reconstructed network according to an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of a network obtained by reconstructing the noise network of FIG. 5 using the prior art, shown in the verification scenario for verifying the reconstruction effect of a reconstructed network according to an exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of a network obtained by reconstructing the noise network of FIG. 5 using the method of the embodiment of the present application, shown in the verification scenario for verifying the reconstruction effect of a reconstructed network;
FIG. 8 is a hardware structure diagram of the device in which the apparatus of the present application is located;
Fig. 9 is a block diagram illustrating an embodiment of a network reconstruction device according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
Referring to FIG. 1, which is a flowchart illustrating the steps of an embodiment of a network reconstruction method according to an exemplary embodiment of the present application, the method may include the following steps:
and step 101, representing the state and action of the intelligent agent by using nodes in an original network, and performing reinforcement learning on the intelligent agent in the original network based on the environment return information so that the intelligent agent learns structural topology information in the original network and rules for removing noise information in the original network.
By way of example, the original network may be a network in which noise exists, for example a complex network such as a neural network, a computer network, a power network, a social relationship network (e.g., a personnel relationship network), a traffic network, or a dispatch network.
A plurality of nodes may be included in the original network. In this embodiment, the state and actions of the agent are represented by nodes in the original network. The network reconfiguration problem may be converted into a path generation problem between nodes in the network and thus may be formalized as a markov decision process.
According to the embodiment of the application, the framework of reinforcement learning is adopted, the original network is used as the execution environment of the intelligent agent, reinforcement learning is carried out on the intelligent agent based on the environment return information, and the intelligent agent can learn important structural topology information in the original network and learn rules for removing noise information in the original network.
Step 102, obtaining a moving path formed by the reinforcement-learned agent moving from a current node to a next node in the original network, taking the next node as the current node, and repeating the step of obtaining the moving path formed by the trained agent moving from the current node to the next node in the original network until the nodes in the original network are traversed.
And step 103, using the network formed by connecting all the moving paths as a reconstruction network of the original network.
In this step, network reconstruction means that, with respect to the original network, some noise edges that do not actually exist are deleted, and some reasonable or potential edges that are beneficial to downstream tasks are added.
After the reinforcement learning training is completed, the trained agent can be used to act among the nodes of the original network. For example, a node is randomly selected from the original network as the initial state; the action with the maximum value output by the value network for that state is then obtained, and the action is executed to migrate to the next node (i.e. the next state), and the edge formed between the initial node and the next node can be taken as a movement path. After migrating to the next node, that node is taken as the current state, the action with the maximum value in the current state is obtained and executed to migrate to the following node, and so on until all nodes in the original network have been traversed; the reconstructed network of the original network is then formed from all the acquired movement paths.
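Purely as an illustrative sketch of this traversal (the best_action helper and the termination guard are assumptions, not the patent's exact procedure), the reconstruction walk might look like:

```python
def reconstruct(nodes, value_net, best_action, max_steps=10000):
    """Walk the trained agent over the original network and collect the
    movement paths (edges) until every node has been visited."""
    current = nodes[0]                        # e.g. a randomly chosen start node
    visited, edges = {current}, set()
    all_nodes = set(nodes)
    for _ in range(max_steps):                # guard against non-terminating walks
        if visited == all_nodes:
            break
        nxt = best_action(value_net, current) # action with the maximum value in this state
        edges.add(frozenset((current, nxt)))  # movement path current -> nxt
        visited.add(nxt)
        current = nxt                         # the next node becomes the current node
    return edges                              # edges of the reconstructed network
```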
In this embodiment, the reinforcement learning framework is used to represent the states and actions of the agent by the nodes in the original network, and the agent performs interactive learning with the environment in the original network, so that the agent learns the structural topology information of the original network and the rules for removing noise information from it. The learned agent is then used to reconstruct the network, which can effectively reduce the noise and abnormal edges of the original network while retaining its important topological structure, and at the same time adds some potential relationship edges, so that the performance of downstream tasks can be effectively improved based on the reconstructed network.
Referring to FIG. 2, which is a flowchart illustrating the steps of another embodiment of a network reconstruction method according to an exemplary embodiment of the present application, this embodiment describes the process of training the agent in the embodiment of FIG. 1 in more detail and may include the following steps:
step 201, selecting a node from the original network as the state of the agent at the time t, and recording the state at the time t.
By way of example, the original network may be a network in which noise exists, which may be a complex network such as a neural network, a computer network, a power network, a social relationship network (e.g., a personnel relationship network), a traffic network, or a dispatch network.
The original network may include a plurality of nodes. For example, as shown in the agent training framework diagram of FIG. 3, the original network may be the noise network in FIG. 3, which may include nodes such as V1 to V6; two adjacent nodes are connected by an edge, which represents the relationship between the two connected nodes.
In this embodiment, the state of the agent may be characterized by the nodes in the original network, i.e. s_t = v_i, where v_i ∈ S and the set S is defined as all nodes in the original network. The state information of a state can be described by node information; as an example, the node information may include the attribute information of the node and the structure information of the node in the original network. In the above formula, the state information of s_t = v_i may include the attribute information of node v_i and the structure information of node v_i in the original network.
As one example, the attribute information of the node may include a feature vector of the node. The structural information of the nodes can be structural characterization vectors obtained through a graph characterization learning algorithm.
Specifically, for the attribute information, the feature vector x_vi of a node can be used directly to represent node v_i. The feature vector may include feature information such as the classification label of the node; for example, the classification label may be a person label (which may be further subdivided according to the task into a child label, a pupil label, a middle school student label, a college student label, etc.) or a motor vehicle label (which may be further subdivided according to the task into a car label, a truck label, a bus label, etc.).
For the structure information, since the number of nodes in the network is large, the whole state space and action space are very large. To address this problem, in one embodiment, the states and actions may be encoded so that they are mapped to a continuous low-dimensional space, and a graph representation learning algorithm (e.g., DeepWalk, Node2Vec, etc.) may then be used to derive a structural representation vector (i.e., the structure information) for each node in the original network.
For example, a series of node sequences obtained by random walks or other methods on the original network can be used as input and fed into a neural network, analogously to the way sentences are encoded in natural language processing, to obtain an embedded representation of each node (the representation dimension can be specified manually, for example 4); for instance, each of the nodes V1, V2, and so on is represented by a 4-dimensional vector. The neural network continuously updates its parameters so that the adjacency relationships between the 4-dimensional embedded representations of the nodes approximate the original input node sequences; in other words, a node representation of another dimension is found that still reflects the structural relationships between the nodes in the network.
Thus, the state at each time t in this embodiment can be obtained by the following formula:

s_t = [ e_vi || x_vi ]

where e_vi represents the structural representation vector of node v_i, x_vi represents the feature vector of node v_i, and || represents vector concatenation.
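As a minimal sketch of this state encoding, assuming the structural embeddings have already been produced by a graph representation learning step (all names and dimensions below are illustrative):

```python
import numpy as np

def build_state_vector(node, struct_emb, node_features):
    """State s_t for the agent at node v: concatenation of the node's
    structural embedding e_v (e.g. precomputed by DeepWalk/Node2Vec)
    and its attribute feature vector x_v."""
    return np.concatenate([struct_emb[node], node_features[node]])

# toy usage with made-up 4-dim embeddings and 3-dim attribute vectors
struct_emb = {"V3": np.array([0.1, -0.2, 0.4, 0.0]),
              "V6": np.array([0.3,  0.1, -0.1, 0.2])}
node_features = {"V3": np.array([1.0, 0.0, 23.0]),   # e.g. one-hot label + age
                 "V6": np.array([0.0, 1.0, 31.0])}
s_t = build_state_vector("V3", struct_emb, node_features)   # shape (7,)
```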
In one embodiment, a node may be randomly selected from the original network as the initial state at the initial time. For example, in fig. 3, the first step in the first round may be to select the V3 node as the initial state.
After determining the state at time t, this embodiment may also record the state at each time. In one embodiment, the state at each time may be recorded in a path sequence path; for example, when the initial state is determined to be node V3, the path sequence is path = (V3).
Step 202, inputting the state into a preset value network, and obtaining the action of the maximum value output by the value network as the action to be executed at the time t.
In this embodiment, the nodes in the original network may also be used to characterize the actions to be performed by the agent, i.e. a_t = v_j, where v_j ∈ A and the set A is defined as all nodes in the original network. If the action at the current time t is a_t = v_j, it indicates that the environment transfers to the state corresponding to node v_j.
Similarly to the state s_t above, the action to be performed at time t may be expressed as:

a_t = [ e_vj || x_vj ]
in this step, according to the value-based framework in reinforcement learning, after determining the state at the current time t, the state action pair obtained according to the state may be input into a preset value network, and according to the output of the value network, the action corresponding to the maximum value may be taken as the action to be executed at the time t. Wherein, the value network and the intelligent agent are part of the reinforcement learning framework.
In one possible implementation of this embodiment, step 202 may further include the following sub-steps:
s11, inputting state action pairs formed by the state at the moment t and each alternative action respectively into the value network, and obtaining the corresponding value of each state action pair output by the value network; the alternative actions include other nodes in the original network than the state at time t.
In one example, the value network may be a multi-layer neural network whose input is a state-action pair (i.e., state-action) and whose output is the corresponding value. In this step, the action in each state-action pair may be called an alternative action, and the alternative actions may include the other nodes in the original network besides the state at time t. For example, for the noise network in FIG. 3, if the state at the current time t is V3, the corresponding alternative actions may include V1, V2, V4, V5 and V6, and the corresponding state-action pairs may include V3-V1, V3-V2, V3-V4, V3-V5 and V3-V6, etc.
After each state action pair is obtained, each state action pair may be input into the value network, so that the value network calculates the transition value from the state where the current time t is located to the action to be executed in the next state, and outputs the value corresponding to the state action pair.
In substep S12, the alternative action corresponding to the state-action pair with the highest value is taken as the action to be executed at time t.
In this step, after the value of each state-action pair is obtained, that is, the value of every action selectable in the current state, the alternative action corresponding to the state-action pair with the highest value may be taken as the action to be executed at time t. For example, in FIG. 3, if V3-V6 has the largest value, V6 is taken as the action to be executed at time t.
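A small sketch of this greedy selection, where q_value stands in for the preset value network and to_action_vec builds the candidate action's representation (both are assumed helpers):

```python
import numpy as np

def select_action(state_vec, candidate_nodes, q_value, to_action_vec):
    """Feed each (state, candidate action) pair into the value network
    and return the candidate with the highest value."""
    values = [q_value(np.concatenate([state_vec, to_action_vec(v)]))
              for v in candidate_nodes]
    return candidate_nodes[int(np.argmax(values))]
```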
Step 203, executing the action to transition to the state at the time t+1, and recording the state at the time t+1.
In this step, the agent transfers to the state at time t+1 after executing the action to be executed at time t; in one example, the action to be executed at time t and the state at time t+1 may correspond to the same node. For example, with V6 as the action to be executed at time t, the agent executes the action and moves to V6, i.e. V6 becomes the state at time t+1.
At the same time, this embodiment may record the state at time t+1. In one embodiment, as in the example of FIG. 3, the state V6 may be appended to the path sequence, giving the latest path sequence path = (V3, V6).
In one possible embodiment, a transition probability P(s_t+1 | s_t, a_t) may also be calculated when making state transitions, representing the probability that the environment transfers from state s_t to state s_t+1 after action a_t is executed. When a_t = v_j, then s_t+1 = v_j, i.e. the transition is deterministic.
Step 204, determining an environmental instantaneous return for the action, and updating the value network based on the environmental instantaneous return.
In reinforcement learning, the environmental return information serves as feedback in the network reconstruction process to guide the exploration of abnormal edges and the discovery of potential relationships. As one example, the environmental return information may include an environmental instantaneous return related to the local characteristics of the current environment's network.
As an example, the environmental instantaneous return may further include a first environmental instantaneous return and a second environmental instantaneous return. The first environmental instantaneous return may also be called the topology return, which is determined according to the topology of the original network; the second environmental instantaneous return may also be called the homophily return, which reflects the homophily (homogeneity) of the network. In network theory, a network with better homophily is considered to have better quality, and information is also transmitted on such a network more easily; updating the value network based on both the topology and the homophily of the network therefore gives the value network better convergence.
In one possible implementation of this embodiment, step 204 may include the following sub-steps:
and a substep S21, wherein a moving path is established between the node corresponding to the state at the time t and the node corresponding to the state at the time t+1.
For example, in FIG. 3, the node corresponding to the state at time t is V3 and the node corresponding to the state at time t+1 is V6, so the movement path established between the two is V3-V6.
In a substep S22, it is determined whether the moving path exists in the original network, and a first environmental instantaneous return is determined according to the result of the determination.
In this step, after the moving path is established according to the front and rear states, the moving path may be matched in the original network, and the first environmental transient return is determined according to the matching result, so as to capture important topology information of the original network.
In one possible implementation, the substep S22 may further comprise the substeps of:
if the movement path does not exist in the original network, determining the first environmental instantaneous return to be -1; if the movement path exists in the original network and has already occurred in this round of actions, determining the first environmental instantaneous return to be 0; if the movement path exists in the original network and occurs for the first time in this round of actions, determining the first environmental instantaneous return to be 1.
For example, the following formula may be employed to determine the first environmental instantaneous return r1_im:

r1_im = -1, if the created movement path does not exist in the original network;
r1_im = 0, if the movement path exists in the original network but has already occurred in the set P of movement paths of the current round;
r1_im = +1, otherwise.

That is, a first environmental instantaneous return of -1 is given if the created movement path does not exist in the original network, a return of 0 is given when the movement path exists in the original network but has already occurred in the set P of the current round of movement paths, and a return of +1 is given otherwise.
For example, in FIG. 3, since the movement path V3-V6 does not exist in the noise network, its first environmental instantaneous return can be determined to be -1, which indicates that the movement path V3-V6 is structural information not provided in the original network, i.e. V3-V6 is newly added structural information.
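A sketch of the first (topology) return under the piecewise rule above; original_edges and visited_paths are assumed to be sets of frozenset node pairs:

```python
def topology_reward(edge, original_edges, visited_paths):
    """First environmental instantaneous return for a movement path such as ("V3", "V6")."""
    key = frozenset(edge)
    if key not in original_edges:    # edge absent from the original network
        return -1
    if key in visited_paths:         # already generated in this round of actions
        return 0
    return 1                         # exists and occurs for the first time in this round
```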
In a substep S23, it is judged whether the classification label corresponding to the state at time t is consistent with the classification label corresponding to the state at time t+1, and the second environmental instantaneous return is determined according to the result of the judgment.
In one possible implementation, the substep S23 may further comprise the substeps of:
if either the classification label corresponding to the state at time t or the classification label corresponding to the state at time t+1 fails to be acquired, determining the second environmental instantaneous return to be 0; if the classification label corresponding to the state at time t and the classification label corresponding to the state at time t+1 are both acquired successfully, judging whether the two labels are consistent; if yes, determining the second environmental instantaneous return to be 1; if not, determining the second environmental instantaneous return to be -1.
In one embodiment, class labels may be derived from the feature vectors of the nodes, and also have a relation to what the actual task is. For example, in a personal relationship network, where the node is a person, there may be some characteristics such as gender, age, etc., if the task is gender classification, the classification label is the gender in the node's characteristics, and if it is age prediction, the classification label is the age in the characteristics.
For example, the following formula may be employed to determine the second environmental instantaneous return r2_im:

r2_im = +1, if the classification label of the state at time t and the classification label of the state at time t+1 are the same;
r2_im = -1, if the two classification labels are not the same;
r2_im = 0, if either of the two classification labels is unknown.

That is, a second environmental instantaneous return of +1 is given if the two classification labels are the same, a return of -1 is given if they are not the same, and a return of 0 is given if either of them is unknown.
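And a corresponding sketch of the second (homophily) return, using None to mark a label that could not be acquired:

```python
def homophily_reward(label_t, label_t1):
    """Second environmental instantaneous return based on label consistency."""
    if label_t is None or label_t1 is None:
        return 0
    return 1 if label_t == label_t1 else -1
```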
In step S24, the first environmental instantaneous return and the second environmental instantaneous return are fused to obtain environmental instantaneous return.
In one embodiment, one of the ways of fusing may be: and carrying out weighted operation on the first environmental instantaneous return and the second environmental instantaneous return by adopting preset weight parameters to obtain the environmental instantaneous return.
For example, if the preset weight parameter is a hyperparameter α, the first environmental instantaneous return and the second environmental instantaneous return can be fused by a weighted operation with α to obtain the final environmental instantaneous return r_im.
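The text only states that the two returns are fused with a preset weight parameter; a convex combination is one natural reading (the form and the default weight here are assumptions):

```python
def fuse_rewards(r1, r2, alpha=0.5):
    """Fuse topology and homophily returns with a preset weight parameter alpha."""
    return alpha * r1 + (1.0 - alpha) * r2
```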
In practice, the role of the value network in reinforcement learning is to approximate the real value of the state-action pair at each moment. However, the real value changes as the agent interacts with the environment, so this embodiment may update the value network according to the environmental instantaneous return after that return is obtained.
In one possible implementation, the following steps may be taken to update the value network based on the environmental transient rewards:
inputting a sample pair consisting of the state recorded at the time t, the action executed at the time t and the environment instantaneous return into the value network, so that the value network performs the following updating process: determining the real value corresponding to the action at the moment t according to the environment instantaneous return, and taking the real value as a value tag corresponding to the state action pair at the moment t; when the number of the received sample pairs reaches a preset sample pair threshold, training the value network by taking the received state, action and corresponding value labels at each moment as training samples to obtain an updated value network, so that the maximum value output by the value network is more approximate to a true value.
In one example, after an action is executed at the current time, the executed action, the recorded state at the current time, and the environmental instantaneous return can be combined into a sample pair and fed into the value network for learning. For example, as shown in FIG. 3, assume that the node at the first time is V3, the action executed is V6, and the corresponding environmental instantaneous return is r1_im; then the sample pair input to the value network is (V3, V6, r1_im). Assume that the node at the second time is V6, the recorded state is (V3, V6), the action executed is V4, and the corresponding environmental instantaneous return is r2_im; then the sample pair input to the value network is ((V3, V6), V4, r2_im).
After the value network receives the sample pair, the real value corresponding to the action at time t can be determined according to the environmental instantaneous return. In one embodiment, the real value corresponding to the action at time t may be determined as follows: the maximum value of the state at time t+1 output by the value network is discounted by the preset discount rate and then added to the environmental instantaneous return obtained at time t, giving the real value corresponding to the action at time t.
For example, the following formula may be used to obtain the real value corresponding to the action at time t:

y_t = r_t^im + γ · max_a Q(s_t+1, a; θ)

where Q(s, a; θ) represents the value output by the value network, θ is the parameter of the value network, γ is a hyperparameter (i.e. the preset discount rate), and r_t^im is the environmental instantaneous return.
The formula expresses that the real value corresponding to the current value network output Q(s_t, a_t; θ) should be updated to the environmental instantaneous return obtained after taking the action at time t, plus the maximum value output by the current value network after transferring to state s_t+1, multiplied by the discount rate γ.
After obtaining the real value corresponding to the action at the time t, the real value can be used as the value label corresponding to the state action pair at the time t, and the value network is updated according to the state action pair and the corresponding value label.
In practical engineering implementation, the value network can accumulate the sample pairs received for multiple times, and when the number of the received sample pairs reaches a preset sample pair threshold, the state, the action and the corresponding value label at each time are used as training samples to train the value network so as to update the parameters of the value network and obtain the updated value network.
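Purely as a sketch of this update rule under a value-based (Q-learning style) reading (the network size, optimizer usage, and all names are assumptions, not the patent's implementation):

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Maps a concatenated (state, action) vector to a scalar value."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, sa):
        return self.mlp(sa).squeeze(-1)

def q_target(r_im, next_sa_batch, value_net, gamma=0.9):
    """Real-value label: instantaneous return + gamma * max value over the
    candidate actions available in the next state."""
    with torch.no_grad():
        return r_im + gamma * value_net(next_sa_batch).max()

def update(value_net, optimizer, sa_batch, targets):
    """One gradient step once enough sample pairs have been accumulated."""
    loss = nn.functional.mse_loss(value_net(sa_batch), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```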
Step 205, taking the state at time t+1 as the state at time t, and based on the updated value network, continuing to execute step 202 until the number of recorded states reaches a preset number threshold, and completing the current round of actions.
In this step, after the state at time t+1 is shifted, the time t+1 may be taken as the current time t, and then steps 202-204 are continuously executed according to the updated value network (if the value network in the previous step is not updated, the value network updated last time may be adopted) until the number of states recorded in the path sequence reaches the preset number threshold, and then the present round of actions is completed.
In one embodiment, a round of action may include a specified maximum number of steps. For example, in fig. 3, each round of action may include a first step, a second step, a third step … max step. Each step may record a status, and then it may be determined whether the current round of action is completed based on the number of status recorded. In one embodiment, the process from determining the current state to updating the value network based on the environmental transient rewards may be a one-step action, i.e., the process of one-step action may include steps 202-204.
After updating the value network according to the environment instantaneous return each time, whether the maximum step number is reached can be judged, in one implementation, step number statistics can be performed by using a counter, and the counter is incremented by 1 each time the value network is updated according to the environment instantaneous return. When the counter does not reach the preset count, starting the next action; when the counter reaches the preset count, the action of the round is completed. In another embodiment, after updating the value network according to the environmental instantaneous return, the number of the recorded states can be counted to determine whether the number reaches a preset number threshold, and if not, the next action is started; if so, the action of the round is completed.
And step 206, connecting nodes corresponding to the adjacent states in the original network according to the states recorded in the action of the round, and obtaining an intermediate network.
In this step, each time the agent completes a round of actions, the nodes corresponding to the adjacent states can be connected into one side (i.e. one moving path) in the original network according to each state recorded in the round of actions, and after all states recorded in the round of actions are traversed, an intermediate network formed by all nodes passed in the round of actions can be obtained.
For example, in FIG. 3, after the maximum number of steps is completed in the first round, the resulting intermediate network of the first round is (V3, V6, V3, V4, …, V2). From the intermediate network, it can be preliminarily seen that some edges have been added (e.g. V3-V6) and some noise edges have been removed (e.g. V3-V4).
As an example, the intermediate network may be represented in the form of an adjacency matrix.
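A minimal sketch of building such an adjacency matrix from the path recorded in one round (node names and indexing are illustrative):

```python
import numpy as np

def path_to_adjacency(path, node_index):
    """Connect nodes of adjacent states in the recorded path into an
    undirected adjacency matrix for the intermediate network."""
    n = len(node_index)
    adj = np.zeros((n, n), dtype=int)
    for u, v in zip(path[:-1], path[1:]):
        i, j = node_index[u], node_index[v]
        adj[i, j] = adj[j, i] = 1
    return adj

# e.g. the path recorded in one round of actions
node_index = {f"V{k}": k - 1 for k in range(1, 7)}
adj = path_to_adjacency(["V3", "V6", "V3", "V4"], node_index)
```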
Step 207, determining an environment delay return corresponding to the intermediate network.
In this step, for each round of action, an environmental delayed return for the round of action may be determined from the intermediate network, which, when implemented, may be obtained in connection with the downstream task.
In one possible implementation of this embodiment, step 207 may comprise the following sub-steps:
and a substep S41, acquiring the processing performance of the intermediate network for the downstream task and the processing performance of the original network for the downstream task.
In a possible implementation manner of this embodiment, the step of obtaining the processing performance of the intermediate network for the downstream task may include the following sub-steps:
dividing the nodes in the intermediate network into training set nodes and test set nodes; training a graph mining classification model based on the classification labels of the training set nodes; predicting the classification labels of the test set nodes with the trained graph mining classification model to obtain the corresponding prediction results; and determining a classification effect score, as the processing performance of the intermediate network for the downstream task, according to the comparison between the prediction result of each test set node and its real classification label.
In this step, an adjacency matrix of the intermediate network may be obtained, and then the nodes in the intermediate network may be divided into training set nodes (e.g., 80%) and test set nodes (e.g., 20%) on a scale. The classification labels of all nodes in the training set are used for participating in training of a graph mining classification model, after a trained model is obtained, the trained graph mining classification model is used for predicting the classification labels of all nodes in the test set, the predicted classification labels are compared with actual classification labels to determine the accuracy of a prediction result, and finally the classification effect score f1 score is obtained and is used as the processing performance of an intermediate network for downstream tasks.
In one embodiment, the classification effect score may be determined as follows:
determining the number of test nodes with the predicted result consistent with the real classification labels; calculating the precision rate and recall rate according to the number; and calculating the classification effect score according to a preset calculation algorithm according to the precision rate and the recall rate.
For example, the f1 score may be calculated using the following formula:

f1 score = 2 × precision × recall / (precision + recall)
illustratively, the precision and recall may be calculated by:
assume that the test set is TP + FP + FN + TN, wherein,
TP: the sample is positive, the prediction result is positive
FP: the sample is negative, the predicted result is positive
TN: the sample is negative and the predicted result is negative
FN: the sample is positive and the predicted result is negative
The precision (accuracy) is: (TP + TN) / (TP + TN + FP + FN);
the recall (recall ratio) is: TP / (TP + FN).
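A small sketch of the score computed exactly as the formulas above are written (note that the text defines precision with an accuracy-style ratio; the f1 combination is the standard harmonic mean):

```python
def classification_score(tp, fp, tn, fn):
    """Classification effect score from test-set counts, per the definitions above."""
    precision = (tp + tn) / (tp + tn + fp + fn)   # "precision (accuracy)" as given in the text
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```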
As one example, the graph mining classification model may include a deep network model such as a graph convolutional neural network (GCN, Graph Convolutional Network).
Similarly, the processing performance of the original network on the downstream task may also be represented by using the classification effect score of the original network, and the classification effect score of the original network may refer to the above-mentioned obtaining manner of the classification effect score of the intermediate network, and only the part of the intermediate network needs to be replaced by the part of the original network, which is not described herein again.
In a substep S42, a difference between the processing performance of the intermediate network for the downstream task and the processing performance of the original network for the downstream task is calculated as the environment delay report.
In one possible implementation of the present embodiment, the substep S42 may comprise the substeps of:
calculating a difference value between the processing performance of the intermediate network for a downstream task and the processing performance of the original network for the downstream task; and balancing the difference by adopting preset balance parameters to obtain the environment delay return.
For example, the environmental delayed return r_de may be calculated using the following formula:

r_de = scale × ( f1(Â, Y_test) - f1(A, Y_test) )

where Â and A represent the adjacency matrix of the intermediate network and the adjacency matrix of the original network, respectively; Y_test is the test set; f1(Â, Y_test) represents the classification effect score of the intermediate network; f1(A, Y_test) represents the classification effect score of the original network; and scale is a balance parameter used to balance the magnitudes of the classification-score-based environmental delayed return and the environmental instantaneous return. The larger scale is, the greater the guiding effect of the environmental delayed return, and the smaller scale is, the smaller that effect; scale can be a preset empirical value.
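As a sketch of this computation (the scale default below is an arbitrary placeholder for the preset empirical value):

```python
def delayed_return(f1_intermediate, f1_original, scale=10.0):
    """Environmental delayed return: scaled difference between the downstream
    classification score on the intermediate network and on the original network."""
    return scale * (f1_intermediate - f1_original)
```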
Step 208, updating the value network based on the environmental delayed return, and starting the next round of actions based on the updated value network until the preset maximum number of action rounds is reached, thereby completing the training of the agent.
In one possible implementation manner, the step of updating the value network based on the environment delay rewards may further include:
and inputting a sample pair consisting of all nodes in the intermediate network and the environment delay returns into the value network so as to learn the sample pair by the value network to update network parameters in the value network.
For example, as shown in FIG. 3, assume that the intermediate network is (V3, V6, V3, V4, …, V2) and the environmental delayed return corresponding to the intermediate network is r1_de; then the sample pair input to the value network is {(V3, V6, V3, V4, …, V2), r1_de}.
After the value network receives the sample pair, it may learn the sample pair to update the value network. The way of updating the value network according to the environmental delayed return is similar to the way of updating it according to the environmental instantaneous return: the real value of the intermediate network is calculated according to the environmental delayed return and used as the value label of the intermediate network, and the network then learns from the intermediate network and the corresponding value label to obtain the updated value network. For the specific procedure, reference may be made to the above-described way of updating the value network based on the environmental instantaneous return, which is not described in detail here.
After updating the value network according to the environmental delayed return, it can be judged whether the preset maximum number of rounds has been reached. In one embodiment, a counter may be used to count the rounds: each time the value network is updated according to the environmental delayed return, a round of actions is completed and the counter is incremented by 1. When the counter has not reached the preset maximum number of rounds, the next round of actions is started and steps 202-208 continue to be executed; when the counter reaches the maximum number of rounds, i.e. the maximum number of iteration rounds, training stops and the training of the agent is completed.
It should be noted that the maximum number of steps and the maximum number of rounds are both empirical values, and differ according to the actual task requirements.
Step 209, obtaining a movement path formed by the reinforcement-learned agent migrating from the current node to the next node in the original network, and taking the next node as the current node, and repeating the step of obtaining the movement path formed by the trained agent migrating from the current node to the next node in the original network until the nodes in the original network are traversed.
And step 210, using the network formed by connecting all the moving paths as a reconstruction network of the original network.
In this step, network reconstruction means that, with respect to the original network, some noise edges that do not actually exist are deleted, and some reasonable or potential edges that are beneficial to downstream tasks are added.
After the training of the reinforcement learning is completed, the trained agent can be utilized to act among nodes of the original network, two adjacent nodes in the moving path generate an edge, and no edge is adjacent, so that the network is reconstructed.
For example, a node is randomly selected from the original network as an initial state, then the action of maximum value under the state output by the value network is obtained, and the action is executed to migrate to the next node (i.e. the next state), so that an edge formed between the initial node and the next node can be used as a moving path. After the migration to the next node, taking the next node as the current state, continuing to acquire the action of the maximum value in the current state and executing the action, so as to migrate to the next node, and so on until all nodes in the original network are traversed, and then forming a reconstruction network of the original network by all the acquired moving paths.
Compared with a network reconstructed in the prior art, the reconstructed network obtained in this embodiment can effectively reduce the noise edges of the original network and add some potentially missing edges, so that the reconstructed network can more effectively improve the performance of downstream tasks. For example, in a verification scenario for verifying the reconstruction effect, FIG. 4 shows a reference network with better performance, and FIG. 5 shows the noise network obtained by adding 40% noise edges to that network (i.e. the original network referred to in this embodiment). FIG. 6 shows the network obtained by reconstructing the noise network of FIG. 5 in the prior art manner, and FIG. 7 shows the network obtained by reconstructing the noise network of FIG. 5 with the scheme of this embodiment. Comparing FIG. 6 with FIG. 4, and FIG. 7 with FIG. 4, it can be seen that the network reconstructed by the prior art still has many noise edges and fails to reconstruct some valuable edges, whereas the network reconstructed in this embodiment (FIG. 7) is close to the network of FIG. 4 and recovers a network with better performance.
To enable those skilled in the art to better understand the embodiments of the present application, the following exemplary description takes a scenario in which the original network is a personnel relationship network in the security field as an example. It should be understood that this example is an exemplary description of the present embodiment and should not be construed as limiting it:
Step1: and randomly selecting a node from the noise personnel relationship network as an initial state of the intelligent agent.
Step2: and the intelligent agent acquires the action of the maximum value output by the value network in the current state, and executes the action to transfer to the next node to be used as the next state.
Step3: the instantaneous return on the environment regarding the local nature of the network is obtained after transition to the new state as the value of the sample pair (last state-action taken with maximum value) to update the value network.
Step4: and judging whether the maximum Step number is reached, if not, returning to Step2, and if not, continuing to Step5.
Step5: a sub-network (i.e., an intermediate network) is obtained that is formed by the nodes through which all actions of the agent run.
Step6: the sub-network is fed into a graph mining classifier (e.g., graph convolutional neural network GCN) to obtain an environmental delayed return associated with the downstream task, and the value network is updated according to the environmental delayed return.
Step7: judging whether the maximum number of rounds has been reached; if not, returning to Step1 to start the next round of actions, otherwise ending the training of the agent.
Step8: the trained agent is made to act among the nodes: two nodes adjacent on the movement path are connected by an edge, and nodes not adjacent on any movement path have no edge, thereby reconstructing the network.
Through the process of reconstructing the network in this example, the noise edges of the original network can be effectively reduced and some potentially missing edges added, which effectively improves the performance of downstream tasks and helps to improve the analysis of personnel, cases, behaviors and the like in the security field.
It should be noted that the above flow is only a general description of the flow of reconstructing the network in this example; for specific implementation details of each step, reference may be made to the description in the above method embodiment. A minimal sketch of this training flow is given below.
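Purely for illustration, the following sketch mirrors the Step1 to Step8 flow; choose_action, instant_reward, delayed_reward and update_value_net are hypothetical placeholders for the value network and the return calculations described in the method embodiment, not functions defined by this application.

```python
import random

def train_agent(nodes, max_steps, max_rounds,
                choose_action, instant_reward, delayed_reward, update_value_net):
    """Skeleton of Steps 1-8; every helper passed in is a placeholder."""
    for _ in range(max_rounds):                        # Step 7: stop after the maximum number of rounds
        state = random.choice(nodes)                   # Step 1: random initial state
        path = [state]
        for _ in range(max_steps):                     # Step 4: bounded by the maximum step number
            action = choose_action(state)              # Step 2: maximum-value action from the value network
            reward = instant_reward(state, action)     # Step 3: environment instantaneous return
            update_value_net((state, action, reward))  #         update with the sample pair
            state = action                             # the chosen node becomes the next state
            path.append(state)
        # Steps 5-6: the visited nodes form an intermediate network; its downstream-task
        # performance yields the environment delayed return used for a second update.
        update_value_net(delayed_reward(path))
    # Step 8 (reconstruction with the trained agent) is sketched earlier in this description.
```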
In this embodiment, the attribute information and structure information of nodes in the original network are used to represent the states and actions of an agent: the node where the agent is located at each moment is taken as the state of the agent at that moment, the action with the maximum value output by the value network for that state is obtained as the action to be executed at the current moment, and the action is executed to transition to the next state. After each execution of an action, an environment instantaneous return for the action can be determined and the value network updated based on it, after which the next action continues using the updated value network until the current round of actions is completed. For each intermediate network obtained by a round of actions, the corresponding environment delayed return is determined and the value network is updated accordingly, and the next round of actions is started based on the updated value network, until the preset maximum number of rounds is reached and the training of the agent is completed; the trained agent then acts among the nodes of the original network to reconstruct it. In this process, the value network is updated with different kinds of return information, so that the values it outputs approach the true values more closely. Because the network reconstruction process considers not only the attribute features of the nodes but also their structural features, the noise edges of the original network can be effectively reduced, some potentially missing edges can be added, and the performance of downstream tasks is effectively improved.
Corresponding to the embodiment of the method, the application also provides an embodiment of the network reconstruction device.
The apparatus embodiments of the present application may be applied to an electronic device. The apparatus embodiments may be implemented by software, or by hardware or a combination of hardware and software. Taking a software implementation as an example, the apparatus in a logical sense is formed by the processor of the device where it is located reading corresponding computer program instructions from a nonvolatile memory into memory and running them. In terms of hardware, fig. 8 shows a hardware structure diagram of the device where the apparatus of the present application is located; in addition to the processor, memory, network interface, and nonvolatile memory shown in fig. 8, the device may generally include other hardware according to the actual function of the apparatus, which is not described again here.
Referring to fig. 9, a block diagram illustrating a network reconfiguration device according to an exemplary embodiment of the present application may specifically include the following modules:
the agent training module 901 is configured to characterize a state and an action of an agent with nodes in an original network, and perform reinforcement learning on the agent in the original network based on environmental return information, so that the agent learns structural topology information in the original network and rules for removing noise information in the original network;
A moving path obtaining module 902, configured to obtain a moving path formed by the reinforcement-learned agent migrating from a current node to a next node in the original network, and repeatedly invoke the moving path obtaining module with the next node as the current node until the node in the original network is traversed;
a reconstructed network generating module 903, configured to use a network formed by connecting all the moving paths as a reconstructed network of the original network.
In one possible implementation of this embodiment, the environment return information includes an environment instantaneous return and an environment delayed return; the agent training module 901 may include the following sub-modules:
the state determination submodule is used for selecting a node from the original network as the state of the intelligent agent at the time t and recording the state at the time t;
the action determining submodule is used for inputting the state into a preset value network, and acquiring the action of the maximum value output by the value network as an action to be executed at the moment t;
the state transition sub-module is used for executing the action to transition to the state at the time t+1 and recording the state at the time t+1;
An environmental instantaneous return determination sub-module for determining an environmental instantaneous return for the action;
a value network first update sub-module for updating the value network based on the environmental transient rewards;
a round of action determining submodule, configured to take the state at time t+1 as the state at time t, and continuously call the action determining submodule based on the updated value network until the number of the recorded states reaches a preset number threshold value, so as to complete the round of actions;
the intermediate network determining submodule is used for connecting nodes corresponding to adjacent states in the original network according to the states recorded in the action of the round to obtain an intermediate network;
an environmental delay return determining sub-module, configured to determine an environmental delay return corresponding to the intermediate network;
and the value network second updating sub-module is used for updating the value network based on the environment delayed return and starting the next round of actions based on the updated value network until the preset maximum number of action rounds is reached, so as to complete the training of the agent.
In a possible implementation manner of this embodiment, the action determining submodule is specifically configured to:
inputting state action pairs formed by the state at the time t and each alternative action respectively into the value network, and obtaining the value corresponding to each state action pair output by the value network; the alternative actions comprise other nodes in the original network except for the state at the time t;
And taking the corresponding alternative action of the state action pair with the highest value as the action to be executed at the moment t.
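A minimal sketch of this arg-max selection is given below, assuming value_net(state, action) returns the scalar value of a state-action pair; the interface and names are assumptions made for illustration, not the patented network itself.

```python
def select_action(state, all_nodes, value_net):
    """Score every (state, candidate action) pair and return the arg-max action."""
    candidates = [n for n in all_nodes if n != state]        # all nodes except the current state
    scores = {a: value_net(state, a) for a in candidates}    # value of each state-action pair
    return max(scores, key=scores.get)                       # action with the highest value
```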
In a possible implementation manner of this embodiment, the attribute information of the node includes a classification label; the environment instantaneous return determination submodule comprises:
a mobile path establishing unit, configured to establish a mobile path between a node corresponding to the state at the time t and a node corresponding to the state at the time t+1;
a first environmental instantaneous return determining unit, configured to determine whether the moving path exists in the original network, and determine a first environmental instantaneous return according to a result of the determination;
the second environment instantaneous return determining unit is used for judging whether the classification label corresponding to the state at the moment t is consistent with the classification label corresponding to the state at the moment t+1; determining a second environment instantaneous return according to the judgment result;
and the environment instantaneous return fusion unit is used for fusing the first environment instantaneous return and the second environment instantaneous return to obtain the environment instantaneous return.
In a possible implementation manner of this embodiment, the first environmental instantaneous return determining unit is specifically configured to:
if the movement path does not exist in the original network, determining the first environment instantaneous return to be -1;
if the movement path exists in the original network and has already occurred in this round of actions, determining the first environment instantaneous return to be 0;
if the movement path exists in the original network and occurs for the first time in this round of actions, determining the first environment instantaneous return to be 1.
In a possible implementation manner of this embodiment, the second environmental instantaneous return determining unit is specifically configured to:
if any one of the classification label corresponding to the state at time t and the classification label corresponding to the state at time t+1 fails to be acquired, determining the second environment instantaneous return to be 0;
if the classification label corresponding to the state at time t and the classification label corresponding to the state at time t+1 are both acquired successfully, judging whether the two classification labels are consistent;
if yes, determining the second environment instantaneous return to be 1;
if not, determining the second environment instantaneous return to be -1.
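The two instantaneous returns and their fusion could be sketched as below; representing an unavailable classification label as None and fusing the two returns by simple addition are assumptions made only to keep the example concrete.

```python
def first_instant_return(edge, original_edges, edges_seen_this_round):
    """Return for the movement path itself, following the -1 / 0 / 1 rule above."""
    u, v = edge
    if (u, v) not in original_edges and (v, u) not in original_edges:
        return -1                          # movement path absent from the original network
    if (u, v) in edges_seen_this_round or (v, u) in edges_seen_this_round:
        return 0                           # path exists but already occurred this round
    return 1                               # path exists and occurs for the first time

def second_instant_return(label_t, label_t1):
    """Return based on label consistency of the two endpoint states."""
    if label_t is None or label_t1 is None:
        return 0                           # a classification label could not be acquired
    return 1 if label_t == label_t1 else -1

def instant_return(edge, original_edges, edges_seen_this_round, label_t, label_t1):
    # Fusion by plain addition is an assumption; the embodiment only says "fusing".
    return (first_instant_return(edge, original_edges, edges_seen_this_round)
            + second_instant_return(label_t, label_t1))
```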
In one possible implementation manner of this embodiment, the value network first updating sub-module is specifically configured to:
inputting a sample pair consisting of the state recorded at the time t, the action executed at the time t and the environment instantaneous return into the value network, so that the value network performs the following updating process:
Determining the real value corresponding to the action at the moment t according to the environment instantaneous return, and taking the real value as a value tag corresponding to the state action pair at the moment t;
when the number of the received sample pairs reaches a preset sample pair threshold, training the value network by taking the received state, action and corresponding value labels at each moment as training samples to obtain an updated value network.
In a possible implementation manner of this embodiment, the value network first updating sub-module is further configured to:
and after the maximum value output by the value network for the state at time t+1 is discounted according to a preset discount rate, superposing it with the environment instantaneous return obtained at time t to obtain the real value corresponding to the action at time t.
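As a hedged illustration, the value label and the batched update might look like the sketch below; the discount rate gamma, the batch size of 32 and the function names are assumptions for the example, not values given in the embodiment.

```python
def value_label(instant_return_t, state_t1, candidate_actions, value_net, gamma=0.9):
    """Target value for (state_t, action_t): the instantaneous return plus the
    discounted maximum value of the next state (a Q-learning-style target)."""
    best_next = max(value_net(state_t1, a) for a in candidate_actions)
    return instant_return_t + gamma * best_next

def buffer_and_maybe_train(buffer, sample, fit_value_net, batch_size=32):
    """Accumulate (state, action, value label) sample pairs and train once enough arrive."""
    buffer.append(sample)
    if len(buffer) >= batch_size:          # preset sample-pair threshold
        fit_value_net(buffer)              # one supervised update of the value network
        buffer.clear()
```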
In one possible implementation manner of this embodiment, the environment deferred return determination submodule includes:
a processing performance obtaining unit, configured to obtain a processing performance of the intermediate network for a downstream task, and a processing performance of the original network for the downstream task;
and the difference calculation unit is used for calculating the difference between the processing performance of the intermediate network for the downstream task and the processing performance of the original network for the downstream task as the environment delay return.
In one possible implementation manner of the present embodiment, the processing performance obtaining unit includes:
the node dividing subunit is used for dividing the nodes in the intermediate network into a training set node and a testing set node;
the model training subunit is used for mining a classification model based on the classification label training graph of each training set node;
the classification prediction subunit is used for predicting the classification labels of the nodes of each test set by adopting the graph mining classification model to obtain corresponding prediction results;
and the classification effect score obtaining subunit is used for determining a classification effect score as the processing performance of the intermediate network for the downstream task according to the comparison of the prediction result of each test set node and the real classification label.
In a possible implementation manner of this embodiment, the classification effect score obtaining subunit is specifically configured to:
determining the number of test nodes with the predicted result consistent with the real classification labels;
calculating the precision rate and recall rate according to the number;
and calculating the classification effect score according to a preset calculation algorithm according to the precision rate and the recall rate.
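One way to realize this subunit is sketched below; treating the preset calculation algorithm as the F1 measure over a single assumed positive label is an illustrative choice, since the embodiment only states that the score is computed from the precision rate and the recall rate.

```python
def classification_effect_score(y_true, y_pred, positive_label):
    """Precision and recall over the test-set nodes, combined into an F1-style score."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive_label and t == positive_label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive_label and t != positive_label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive_label and t == positive_label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)   # F1 as one possible "preset algorithm"
```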
In a possible implementation manner of this embodiment, the difference calculating unit is specifically configured to:
Calculating a difference value between the processing performance of the intermediate network for a downstream task and the processing performance of the original network for the downstream task;
and balancing the difference by adopting a preset balancing parameter to obtain the environment delay return, wherein the balancing parameter is used for balancing the environment delay return and the environment instantaneous return value, and the larger the value of the balancing parameter is, the larger the guiding effect of the environment delay return is.
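A compact sketch of this calculation follows; scaling the performance difference multiplicatively by the balance parameter is an assumed form of the "balancing" step, with a larger balance value giving the delayed return more guiding weight. The default of 10.0 is purely illustrative.

```python
def delayed_return(perf_intermediate, perf_original, balance=10.0):
    """Environment delayed return: downstream-task performance gap scaled by a balance parameter."""
    return balance * (perf_intermediate - perf_original)
```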
In a possible implementation manner of this embodiment, the value network second updating sub-module is specifically configured to:
and inputting a sample pair consisting of all nodes in the intermediate network and the environment delay returns into the value network so as to learn the sample pair by the value network to update network parameters in the value network.
In a possible implementation manner of this embodiment, the attribute information of the node includes a feature vector of the node; the structure information is a structure characterization vector obtained through a graph characterization learning algorithm.
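For illustration, one plausible way to combine the two kinds of node information into a single descriptor for the value network is simple concatenation; this is an assumption, since the embodiment does not specify how the attribute feature vector and the structure characterization vector are combined.

```python
import numpy as np

def node_representation(attr_vec: np.ndarray, struct_vec: np.ndarray) -> np.ndarray:
    """Concatenate a node's attribute feature vector with its structure characterization
    vector (e.g. one produced by a graph representation learning algorithm)."""
    return np.concatenate([attr_vec, struct_vec])
```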
For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the solution of the present application. Those of ordinary skill in the art can understand and implement it without undue burden.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
The embodiment of the application also provides electronic equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the embodiment of the method when executing the program.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and structural equivalents thereof, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on a manually-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, general and/or special purpose microprocessors, or any other type of central processing unit. Typically, the central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The essential elements of a computer include a central processing unit for carrying out or executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to receive data from or transfer data to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices. Furthermore, the computer may be embedded in another device, such as a vehicle-mounted terminal, a mobile phone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disk or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features of specific embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Furthermore, the processes depicted in the accompanying drawings are not necessarily required to be in the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The foregoing description of the preferred embodiments of the present invention is not intended to limit the invention to the precise form disclosed, and any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention are intended to be included within the scope of the present invention.

Claims (14)

1. A method of network reconfiguration, the method comprising:
representing states and actions of an agent by nodes in an original network, and performing reinforcement learning on the agent in the original network based on environment return information so that the agent learns structural topology information in the original network and rules for removing noise information in the original network; the original network includes: a computer network or a power network;
acquiring a moving path formed by the transition of the reinforcement-learned intelligent agent from a current node to a next node in the original network, taking the next node as the current node, and repeatedly executing the step of acquiring the moving path formed by the transition of the trained intelligent agent from the current node to the next node in the original network until the node in the original network is traversed;
the network formed by connecting all the moving paths is used as a reconstruction network of the original network;
the environment return information comprises an environment instantaneous return and an environment delayed return;
the method for representing the state and the action of the intelligent agent by the nodes in the original network, and performing reinforcement learning in the original network by adopting the intelligent agent based on the environment return information comprises the following steps:
Selecting a node from an original network as the state of an intelligent agent at the time t, and recording the state at the time t;
inputting the state into a preset value network, and acquiring the action of the maximum value output by the value network as the action to be executed at the moment t;
the intelligent agent executes the action to transfer to a state at the time t+1, and records the state at the time t+1;
determining an environmental instantaneous return for the action and updating the value network based on the environmental instantaneous return;
taking the state at the time t+1 as the state at the time t, continuously executing the step of inputting the state into a preset value network based on the updated value network until the number of the recorded states reaches a preset number threshold value, and completing the action of the round;
according to the state recorded in the action of the round, connecting nodes corresponding to the adjacent states in the original network to obtain an intermediate network;
determining an environment delay return corresponding to the intermediate network;
and updating the value network based on the environment delay return, and starting the next round of actions based on the updated value network until the preset maximum number of action rounds is reached, so as to complete the training of the intelligent agent.
2. The method according to claim 1, wherein the act of inputting the state into a preset value network to obtain a maximum value output by the value network as an act to be performed at time t comprises:
inputting state action pairs formed by the state at the time t and each alternative action respectively into the value network, and obtaining the value corresponding to each state action pair output by the value network; the alternative actions comprise other nodes in the original network except for the state at the time t;
and taking the corresponding alternative action of the state action pair with the highest value as the action to be executed at the moment t.
3. The method according to claim 1 or 2, wherein the attribute information of the node comprises a classification tag;
the determining an environmental transient reward for the action comprises:
establishing a moving path between a node corresponding to the state at the time t and a node corresponding to the state at the time t+1;
judging whether the moving path exists in the original network or not, and determining a first environment instantaneous return according to the judging result;
judging whether the classification label corresponding to the state at the time t is consistent with the classification label corresponding to the state at the time t+1; determining a second environment instantaneous return according to the judgment result;
And fusing the first environmental instantaneous return and the second environmental instantaneous return to obtain environmental instantaneous return.
4. A method according to claim 3, wherein said determining a first environment instantaneous return according to the result of the judgment comprises:
if the movement path does not exist in the original network, determining the first environment instantaneous return to be -1;
if the movement path exists in the original network and has already occurred in this round of actions, determining the first environment instantaneous return to be 0;
if the movement path exists in the original network and occurs for the first time in this round of actions, determining the first environment instantaneous return to be 1.
5. A method according to claim 3, wherein said determining a second environment instantaneous return according to the result of the judgment comprises:
if any one of the classification label corresponding to the state at time t and the classification label corresponding to the state at time t+1 fails to be acquired, determining the second environment instantaneous return to be 0;
if the classification label corresponding to the state at time t and the classification label corresponding to the state at time t+1 are both acquired successfully, judging whether the two classification labels are consistent;
if yes, determining the second environment instantaneous return to be 1;
if not, determining the second environment instantaneous return to be -1.
6. The method according to claim 1 or 2, wherein the updating the value network based on the environment instantaneous return comprises:
inputting a sample pair consisting of the state recorded at the time t, the action executed at the time t and the environment instantaneous return into the value network, so that the value network performs the following updating process:
determining the real value corresponding to the action at the moment t according to the environment instantaneous return, and taking the real value as a value tag corresponding to the state action pair at the moment t;
when the number of the received sample pairs reaches a preset sample pair threshold, training the value network by taking the received state, action and corresponding value labels at each moment as training samples to obtain an updated value network.
7. The method of claim 6, wherein the step of determining the real value corresponding to the action at time t according to the environment instantaneous return comprises:
and after the maximum value of the state of the value network output at the time t+1 is discounted according to a preset discount rate, overlapping the maximum value with the environment instantaneous return obtained at the time t to obtain the real value corresponding to the action at the time t.
8. The method of claim 1, wherein said determining the corresponding environmental delay returns of the intermediate network comprises:
acquiring the processing performance of the intermediate network for a downstream task and the processing performance of the original network for the downstream task;
and calculating the difference between the processing performance of the intermediate network for the downstream task and the processing performance of the original network for the downstream task as the environment delay return.
9. The method of claim 8, wherein the obtaining the processing performance of the intermediate network for the downstream task comprises:
dividing nodes in the intermediate network into training set nodes and testing set nodes;
mining a classification model based on the classification label training graph of each training set node;
predicting the classification labels of the nodes of each test set by adopting the graph mining classification model to obtain corresponding prediction results;
and determining a classification effect score as the processing performance of the intermediate network for the downstream task according to the comparison of the prediction result of each test set node and the real classification label.
10. The method of claim 9, wherein determining the classification effect score based on the comparison of the predicted outcome for each test set node with the true classification label comprises:
Determining the number of test nodes with the predicted result consistent with the real classification labels;
calculating the precision rate and recall rate according to the number;
and calculating the classification effect score according to a preset calculation algorithm according to the precision rate and the recall rate.
11. The method according to claim 8 or 9 or 10, wherein said calculating, as the environment delay return, a difference between the processing performance of the intermediate network for a downstream task and the processing performance of the original network for the downstream task comprises:
calculating a difference value between the processing performance of the intermediate network for a downstream task and the processing performance of the original network for the downstream task;
and balancing the difference by adopting a preset balancing parameter to obtain the environment delay return, wherein the balancing parameter is used for balancing the environment delay return and the environment instantaneous return value, and the larger the value of the balancing parameter is, the larger the guiding effect of the environment delay return is.
12. The method according to claim 1 or 8 or 9 or 10, wherein said updating the value network based on the environment delay rewards comprises:
and inputting a sample pair consisting of all nodes in the intermediate network and the environment delay returns into the value network so as to learn the sample pair by the value network to update network parameters in the value network.
13. The method according to claim 1, wherein the node is described by node information, and the node information includes attribute information and structure information of the node; the attribute information of the node comprises a feature vector of the node; the structure information is a structure characterization vector obtained through a graph characterization learning algorithm.
14. A network reconfiguration device, the device comprising:
the intelligent agent training module is used for representing the state and action of an intelligent agent by using nodes in an original network, and performing reinforcement learning in the original network by adopting the intelligent agent based on environment return information so that the intelligent agent learns structural topology information in the original network and rules for removing noise information in the original network; the original network includes: a computer network or a power network;
the mobile path acquisition module is used for acquiring a mobile path formed by the fact that the intelligent agent subjected to reinforcement learning migrates from a current node to a next node in the original network, and repeatedly calling the mobile path acquisition module by taking the next node as the current node until the nodes in the original network are traversed;
A reconstruction network generation module, configured to use a network formed by connecting all the moving paths as a reconstruction network of the original network;
the environment return information comprises an environment instantaneous return and an environment delayed return;
the intelligent training module is specifically used for:
selecting a node from an original network as the state of an intelligent agent at the time t, and recording the state at the time t;
inputting the state into a preset value network, and acquiring the action of the maximum value output by the value network as the action to be executed at the moment t;
the intelligent agent executes the action to transfer to a state at the time t+1, and records the state at the time t+1;
determining an environmental instantaneous return for the action and updating the value network based on the environmental instantaneous return;
taking the state at the time t+1 as the state at the time t, continuously executing the step of inputting the state into a preset value network based on the updated value network until the number of the recorded states reaches a preset number threshold value, and completing the action of the round;
according to the state recorded in the action of the round, connecting nodes corresponding to the adjacent states in the original network to obtain an intermediate network;
Determining an environment delay return corresponding to the intermediate network;
and updating the value network based on the environment delay return, and starting the next round of actions based on the updated value network until the preset maximum number of action rounds is reached, so as to complete the training of the intelligent agent.
CN201910579330.3A 2019-06-28 2019-06-28 Network reconstruction method and device Active CN112149835B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910579330.3A CN112149835B (en) 2019-06-28 2019-06-28 Network reconstruction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910579330.3A CN112149835B (en) 2019-06-28 2019-06-28 Network reconstruction method and device

Publications (2)

Publication Number Publication Date
CN112149835A CN112149835A (en) 2020-12-29
CN112149835B true CN112149835B (en) 2024-03-05

Family

ID=73891526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910579330.3A Active CN112149835B (en) 2019-06-28 2019-06-28 Network reconstruction method and device

Country Status (1)

Country Link
CN (1) CN112149835B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204854B (en) * 2021-06-03 2022-11-29 广西师范大学 Power grid partitioning method based on generator node and network weighted topology

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019113308A1 (en) * 2017-12-05 2019-06-13 Franchitti Jean Claude Active adaptation of networked compute devices using vetted reusable software components

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE202016004628U1 (en) * 2016-07-27 2016-09-23 Google Inc. Traversing an environment state structure using neural networks
WO2018184666A1 (en) * 2017-04-04 2018-10-11 Telefonaktiebolaget Lm Ericsson (Publ) Training a software agent to control a communication network
CN108401254A (en) * 2018-02-27 2018-08-14 苏州经贸职业技术学院 A wireless network resource allocation method based on reinforcement learning
CN108762281A (en) * 2018-06-08 2018-11-06 哈尔滨工程大学 An embedded real-time intelligent decision-making method for underwater robots based on memory-association reinforcement learning
CN108983804A (en) * 2018-08-27 2018-12-11 燕山大学 A biped robot gait planning method based on deep reinforcement learning
CN109711040A (en) * 2018-12-25 2019-05-03 南京天洑软件有限公司 An intelligent industrial design reinforcement learning algorithm based on search direction learning
CN109919319A (en) * 2018-12-31 2019-06-21 中国科学院软件研究所 Deep reinforcement learning method and device based on multiple historical best Q networks
CN109905784A (en) * 2019-01-16 2019-06-18 国家电网有限公司 A traffic reconfiguration method and apparatus for wavelength assignment in optical networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
An overview on application of machine learning techniques in optical networks; Francesco Musumeci et al.; arXiv:1803.07976v4; full text *
A fast path planning method for space robots in a dynamic environment; Hu Xiaodong; Huang Xuexiang; Hu Tianjian; Wang Fenglin; Liang Shuli; Aerospace Control and Application (05); full text *
A reinforcement learning-based decision method for sensor network application reconstruction; Liu Qiang; Journal of Beijing Jiaotong University; full text *

Also Published As

Publication number Publication date
CN112149835A (en) 2020-12-29

Similar Documents

Publication Publication Date Title
US11521221B2 (en) Predictive modeling with entity representations computed from neural network models simultaneously trained on multiple tasks
CN109635990B (en) Training method, prediction method, device, electronic equipment and storage medium
KR20170008748A (en) Customized classifier over common features
CN111753076B (en) Dialogue method, dialogue device, electronic equipment and readable storage medium
CN112221159B (en) Virtual item recommendation method and device and computer readable storage medium
CN109976998B (en) Software defect prediction method and device and electronic equipment
KR102492205B1 (en) Apparatus and method for detecting delivery vehicle based on Inverse Reinforcement Learning
CN112465043A (en) Model training method, device and equipment
CN110705573A (en) Automatic modeling method and device of target detection model
CN114637911A (en) Next interest point recommendation method of attention fusion perception network
CN112149835B (en) Network reconstruction method and device
CN111552787B (en) Question-answering processing method, device, equipment and storage medium
Jakubovski Filho et al. Incorporating user preferences in a software product line testing hyper-heuristic approach
CN115618195A (en) Sensor circuit fault diagnosis method, system, medium, and apparatus
CN112486784A (en) Method, apparatus and medium for diagnosing and optimizing data analysis system
Hagedoorn et al. Massive open online courses temporal profiling for dropout prediction
Chang et al. Agent embeddings: a latent representation for pole-balancing networks
CN103886169A (en) Link prediction algorithm based on AdaBoost
CN116304969A (en) Vehicle track multi-mode prediction method considering road information based on LSTM-GNN
CN112348318B (en) Training and application method and device of supply chain risk prediction model
JP7230324B2 (en) Neural network learning method, computer program and computer device
CN113923099A (en) Root cause positioning method for communication network fault and related equipment
CN113360772A (en) Interpretable recommendation model training method and device
CN117078236B (en) Intelligent maintenance method and device for complex equipment, electronic equipment and storage medium
Hanneman Task-oriented Dialog Policy Learning via Deep Reinforcement Learning and Automatic Graph Neural Network Curriculum Learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant