CN110807230B - Method for autonomously learning and optimizing topological structure robustness of Internet of things
- Publication number
- CN110807230B (application CN201911036835.1A)
- Authority
- CN
- China
- Prior art keywords
- network
- action
- learning
- topological structure
- model
- Prior art date
- 2019-10-29
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention discloses a method for optimizing the robustness of an Internet of things topological structure through autonomous learning, which comprises the following steps. Step 1: initialize the Internet of things topological structure. Step 2: compress the topological structure. Step 3: initialize the autonomous learning model: according to the characteristics of deep learning and reinforcement learning, construct a deep deterministic learning strategy model to train the topological structure of the Internet of things. Step 4: train and test the model. Step 5: periodically repeat step 4 within one independent experiment, and periodically repeat steps 1, 2, 3 and 4 across multiple independent experiments, up to a maximum number of iterations. In this process, the maximum number of iterations is set in advance; within each independent experiment the optimal result is selected, the experiment is repeated several times, and the average value is taken as the final result. The invention can significantly improve the ability of the initial topological structure to resist attacks, and it optimizes the robustness of the network topology through autonomous learning, thereby ensuring highly reliable data transmission.
Description
Technical Field
The invention relates to the technical field of Internet of things networks, in particular to a method for optimizing topological structure robustness of the Internet of things.
Background
The Internet of things is an important component of smart city networks: it connects large-scale device nodes to provide high-quality services for people. However, the connected device nodes must tolerate failure threats, such as random device failures, man-made damage, natural disasters, and the failure of part of the network's nodes caused by energy exhaustion, any of which can paralyse the whole network. Given the wide range of Internet of things application scenarios, how to guarantee high-quality data communication in a large-scale network even when part of its nodes fail is of significant research importance.
In conventional optimization of the Internet of things network topology, nodes are typically deployed at fixed locations with certain communication-range limitations, and the network topology is initialized according to a scale-free network model. Most research on topology optimization strategies adopts greedy edge-rewiring strategies or evolutionary algorithms to optimize the robustness of the network topology so that the whole network gains a strong ability to resist attacks. For example, the journal paper "Robustness optimization scheme with multi-population co-evolution for scale-free wireless sensor networks" proposes a multi-population genetic algorithm to escape local optima and obtain a globally optimal network topology, but the time cost of optimizing a network topology is large, and the algorithm cannot accumulate optimization experience across runs, so it must be restarted every time. Other researchers use neural network models to represent the learning behaviour before and after network topology optimization, which reduces optimization time, but this approach requires labelled target data, and the labelled data caps the maximum attainable optimization. Therefore, in optimizing the topological structure of the Internet of things, an autonomously learning topology optimization strategy is used to improve network robustness, removing the upper limit imposed by labelled target values and accumulating the experience of each learning episode to guide subsequent optimization behaviour.
Disclosure of Invention
In order to solve the problems in the prior art, the invention aims to provide a method for optimizing the robustness of the Internet of things topological structure by autonomous learning. Based on the characteristics of reinforcement learning and deep learning, the topological structure of the Internet of things is used as the environment space, an action space is designed to explore that environment, the effect of each action on the environment space is evaluated, and the accumulated optimization effect is maximized, thereby improving the robustness of the Internet of things topological structure and strengthening the autonomous learning behaviour of the whole network topology.
The invention discloses a method for optimizing the topological structure robustness of the Internet of things through autonomous learning, which comprises the following steps:
step 1: initialize the Internet of things topological structure: randomly deploy nodes according to the rules of a scale-free network model and an edge density parameter M, and fix their geographic positions, where the edge density parameter is set to M = 2;
step 2: compress the topological structure: eliminate redundant information of nodes outside the communication range, retain in the adjacency matrix only the connection relations of nodes within the communication range, thereby compressing the storage space of the network topological structure, and take the compressed network topology as the environment space S, where S is a row vector that changes with the network topology state;
step 3: initialize the autonomous learning model: according to the characteristics of deep learning and reinforcement learning, construct a deep deterministic learning strategy model to train the topological structure of the Internet of things: adopt a deep deterministic Q-learning network model, model the action selection strategy π and the network optimization strategy Q, map the continuous action space into a discrete action space, and design the target optimization function O and the update rules of the whole training model; wherein:
the action selection policy π is defined by equation (1):

a_t = π(s_t | θ)    (1)

where a_t denotes the selected deterministic action, s_t denotes the current network topology state, and θ denotes the parameters of the action network;
the network optimization strategy Q is defined by equation (2):
Q(s_t, a_t) = E( r(s_t, a_t) + γ·Q(s_{t+1}, π(s_{t+1})) )    (2)

where r(s_t, a_t) denotes the immediate return of the current action a_t on the current network state s_t, γ denotes the discount factor that accumulates learning experience, and Q(s_{t+1}, π(s_{t+1})) denotes the future return of the action taken in the next network state; the effect Q(s_t, a_t) of the current action on the current network state is therefore composed of the immediate return and the future return, where E(·) denotes the expected value over the accumulated effect of the series of selected actions;
according to the above description, the objective function O of the autonomous learning model is defined by equation (3):

O(θ) = E( r_1 + γ·r_2 + γ²·r_3 + … | π(·, θ) )    (3)

where r_i denotes the return produced by each action on the environment, γ denotes the discount factor that accumulates learning experience, π(·, θ) denotes the action selection policy, θ denotes the parameters of the action policy network, and E denotes the average expected value;
the update rule of the network is defined by equation (4), the mean-squared error between the target expected value and the current estimate:

L(θ^Q) = (1/N)·Σ_i ( T_i − Q(s_i, a_i | θ^Q) )²    (4)

where T_i denotes the target expected value, defined by equation (5):

T_i = r_i + γ·Q'( s_{i+1}, π'(s_{i+1} | θ^{π'}) | θ^{Q'} )    (5)

where Q' and π' denote the target networks of the optimization strategy and the action selection strategy, respectively, and are used to calculate the error of the whole autonomous learning model;
step 4: train and test the model: in the training stage, a discrete action a is obtained through the action selection neural network model with random exploration, the effect of the action on the current environment is evaluated by the network optimization neural network model, previous learning experience is accumulated, and the whole network model is updated to finally obtain the optimal result; in the testing stage, sample data are tested to obtain the test result; wherein:
the output of the discrete action is defined by equation (6):
d = MAP(a)    (6)

where MAP denotes the mapping between the continuous action space and the discrete action space, and a is defined by equation (7):

a = π(s) = π(s | θ) + N    (7)

where N denotes random sampling noise used to explore more effective action behaviours in the action space;
wherein the action selection policy network is updated in the direction that maximizes the value given by the optimization strategy network, so that the selected action maximizes that value;
the updating rule of the target network is defined as formula (10);
θ^{Q'} ← τ·θ^Q + (1 − τ)·θ^{Q'}
θ^{π'} ← τ·θ^π + (1 − τ)·θ^{π'}    (10)
where τ represents the update rate of the target network;
step 5: periodically repeat step 4 within one independent experiment, and periodically repeat steps 1, 2, 3 and 4 across multiple independent experiments, up to the maximum number of iterations;

in this process, the maximum number of iterations is set in advance; within each independent experiment the optimal result is selected, the experiment is repeated several times, and the average value is taken as the result of the experiment.
The positive technical effects obtained by the invention include:
(1) The invention designs an autonomous learning optimization strategy for the robustness of the Internet of things topological structure using a deep reinforcement learning neural network model, and can significantly improve the ability of the initial topological structure to resist attacks;
(2) The invention uses the state representation of the Internet of things topological structure, the mapping between continuous and discrete action spaces, and the scale-free and compression characteristics of the network to autonomously learn and optimize the robustness of the network topology, thereby ensuring highly reliable data transmission.
Drawings
FIG. 1 is a flowchart of an overall method for optimizing the topological robustness of an Internet of things by autonomous learning;
FIG. 2 is a schematic diagram of a mapping relationship between continuous and discrete actions of an autonomous learning optimization model;
fig. 3 is a schematic diagram of a topological structure compression model of the internet of things.
Detailed Description
The following describes in detail, with reference to the accompanying drawings, the specific implementation, structure, features and effects of the method designed according to the present invention.
As shown in FIG. 1, the overall flow of the method for optimizing the robustness of the Internet of things topological structure by autonomous learning comprehensively considers the mapping between a large-scale continuous action space and a discrete action space, the compression of the network topological structure, and the connection relations of the nodes. It effectively improves network robustness, enhances the autonomous learning behaviour of the whole network, balances the node degree distribution, and ensures high-quality communication capability. The flow specifically comprises the following steps:
step 1: initializing an Internet of things topological structure. And randomly deploying nodes according to the rule of the scale-free network model and the edge density parameter M, and fixing the geographic position. Most nodes have few degrees, few nodes have large degrees, and the real world Internet of things topological structure is described to the greatest extent. Each node has the same attributes.
The edge density parameter is set to M = 2, indicating that the number of edges in the network is about twice the number of nodes.
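For illustration, a minimal initialization sketch in Python (assuming the networkx library; the network size, coordinate range and random seed are assumptions for illustration, not values specified by the invention):

```python
import random

import networkx as nx

M = 2            # edge density parameter of step 1
NUM_NODES = 100  # assumed network size, for illustration only

# Barabasi-Albert preferential attachment produces the scale-free degree
# distribution described above: most nodes have small degree, a few have
# large degree.
topology = nx.barabasi_albert_graph(n=NUM_NODES, m=M, seed=42)

# Fix a random geographic position for every node (assumed unit square).
for node in topology.nodes:
    topology.nodes[node]["pos"] = (random.random(), random.random())
```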
Step 2: compress the topology. Unlike the full adjacency-matrix representation of the network topology, the invention eliminates redundant information of node pairs outside the communication range, retains in the adjacency matrix only the connection relations of nodes within the communication range, compresses the storage space of the network topological structure, and takes the compressed network topology as the environment space S.
The environment space S is a row vector, which changes with the change of the network topology state.
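A minimal sketch of this compression, continuing the previous example (the communication radius R is an assumed value): node pairs outside communication range are dropped, and the remaining adjacency entries are flattened into the row vector S:

```python
import math

import numpy as np

R = 0.3  # assumed communication radius

def compress_topology(g):
    """Flatten the adjacency relations of node pairs within communication
    range into a row vector, the environment space S."""
    nodes = sorted(g.nodes)
    entries = []
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            ux, uy = g.nodes[u]["pos"]
            vx, vy = g.nodes[v]["pos"]
            # Only pairs within communication range are kept; entries for
            # out-of-range pairs are redundant and are eliminated.
            if math.hypot(ux - vx, uy - vy) <= R:
                entries.append(1.0 if g.has_edge(u, v) else 0.0)
    return np.array(entries, dtype=np.float32)

state = compress_topology(topology)  # row vector S; changes as edges change
```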
Step 3: initialize the autonomous learning model. According to the characteristics of deep learning and reinforcement learning, a deep deterministic learning strategy model is constructed to train the topological structure of the Internet of things. A deep deterministic Q-learning network model is adopted, modelling the action selection strategy π and the network optimization strategy Q; the continuous action space is mapped into a discrete action space, and the target optimization function O and the update rules of the whole training model are designed.
The action selection policy π is defined by equation (1):

a_t = π(s_t | θ)    (1)

where a_t denotes the selected deterministic action, s_t denotes the current network topology state, and θ denotes the parameters of the action network. The current network topology state s_t yields a deterministic action through the action policy function π, and this action directly operates on the current network topology structure.
The network optimization strategy Q, defined by equation (2), measures the effect of the selected action on the environment space:

Q(s_t, a_t) = E( r(s_t, a_t) + γ·Q(s_{t+1}, π(s_{t+1})) )    (2)

where r(s_t, a_t) denotes the immediate return of the current action a_t on the current network state s_t, and γ denotes the discount factor that accumulates learning experience. Q(s_{t+1}, π(s_{t+1})) denotes the future return of the action taken in the next network state; the effect Q(s_t, a_t) of the current action on the current network state therefore consists of the immediate return and the future return, where E(·) denotes the expected value over the accumulated effect of the series of selected actions.
According to the above description, the objective function O of the autonomous learning model is defined by equation (3):

O(θ) = E( r_1 + γ·r_2 + γ²·r_3 + … | π(·, θ) )    (3)

where r_i denotes the return produced by each action on the environment, γ denotes the discount factor that accumulates learning experience, and π(·, θ) denotes the action selection policy with parameters θ. E denotes the average expected value, serving as the objective function of the entire autonomous learning model.
The update rule of the network is defined by equation (4), the mean-squared error between the target expected value and the current estimate:

L(θ^Q) = (1/N)·Σ_i ( T_i − Q(s_i, a_i | θ^Q) )²    (4)

where T_i denotes the target expected value, defined by equation (5):

T_i = r_i + γ·Q'( s_{i+1}, π'(s_{i+1} | θ^{π'}) | θ^{Q'} )    (5)

where Q' and π' denote the target networks of the optimization strategy and the action selection strategy, respectively, and are used to calculate the error of the whole autonomous learning model.
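A minimal sketch of the four networks this step uses (assuming the PyTorch library; the hidden-layer sizes, state dimension and action dimension are assumptions, since the invention does not specify architectures): the action network π, the optimization network Q, and their target copies π' and Q':

```python
import copy

import torch
import torch.nn as nn

STATE_DIM = 64   # assumed length of the compressed row vector S
ACTION_DIM = 4   # assumed dimension of the continuous action space

class Actor(nn.Module):
    """Action selection strategy pi: maps state s_t to a deterministic
    continuous action a_t = pi(s_t | theta), equation (1)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 128), nn.ReLU(),
            nn.Linear(128, ACTION_DIM), nn.Tanh())

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Optimization strategy Q: scores the effect Q(s_t, a_t) of an action
    on the current network state, equation (2)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

actor, critic = Actor(), Critic()
# Target networks pi' and Q' start as copies of the online networks.
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
```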
Step 4: train and test the model. In the training stage, a discrete action a is obtained through the action selection neural network model with random exploration, the effect of the action on the current environment is evaluated by the network optimization neural network model, previous learning experience is accumulated, and the whole network model is updated to finally obtain the optimal result. In the testing stage, sample data are tested to obtain the test result.
Wherein the output of the discrete action is defined by equation (6).
d = MAP(a)    (6)

where MAP denotes the mapping between the continuous action space and the discrete action space, and a is defined by equation (7):

a = π(s) = π(s | θ) + N    (7)

where N denotes random sampling noise used to explore more effective action behaviours in the action space.
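Equations (6) and (7) as a sketch, continuing the previous one (the Gaussian noise scale and the interpretation of the discrete action as node indices for edge rewiring are illustrative assumptions):

```python
import torch

NUM_NODES = 100  # same assumed network size as before

def select_discrete_action(s, noise_std=0.1):
    """Equation (7): a = pi(s | theta) + N, then equation (6): d = MAP(a)."""
    with torch.no_grad():
        a = actor(torch.as_tensor(s))        # deterministic action pi(s|theta)
    a = a + noise_std * torch.randn_like(a)  # random sampling noise N
    # MAP: stretch each component from [-1, 1] onto node indices, so the
    # discrete action names the endpoints of edges to rewire.
    d = ((a.clamp(-1.0, 1.0) + 1.0) / 2.0 * (NUM_NODES - 1)).long()
    return a, d
```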
After the effect of the action on the current environment, i.e. the return value, is obtained, the transition is stored in a memory; the previous learning experience is reused in subsequent optimization learning to accelerate the convergence of the autonomous learning model. The rule is defined by equation (8).
(s_t, a_t, r_t, s_{t+1}) → D    (8)
where D denotes the memory in the network model, storing the current network state, the action, the immediate return value, and the next network state.
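A minimal sketch of the memory D in equation (8) (capacity and batch size are assumed values):

```python
import random
from collections import deque

memory = deque(maxlen=10_000)  # replay memory D, assumed capacity

def store(s, a, r, s_next):
    """Equation (8): push the transition (s_t, a_t, r_t, s_{t+1}) into D."""
    memory.append((s, a, r, s_next))

def sample_batch(batch_size=64):
    """Reuse earlier learning experience to accelerate convergence."""
    return random.sample(memory, batch_size)
```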
In the update stage of the autonomous learning optimization model, the update rule of the action selection policy network is defined by equation (9).
∇π = E_{π'}[ ∇_a Q(s, a) · ∇_θ π(s) ]    (9)
In equation (9), the action selection policy network is updated in the direction that maximizes the value given by the optimization strategy network, so that the selected action maximizes that value.
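A sketch of one update step, combining the critic update of equations (4)-(5) with the policy update of equation (9), continuing the previous sketches (learning rates and the discount factor are assumed values; the loss form follows the standard deep deterministic policy-gradient scheme this model is based on):

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor gamma, assumed value
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(batch):
    s, a, r, s_next = (torch.stack([torch.as_tensor(t[i]) for t in batch])
                       for i in range(4))
    r = r.float().unsqueeze(-1)

    # Equation (5): target value T_i from the target networks Q' and pi'.
    with torch.no_grad():
        t_i = r + GAMMA * target_critic(s_next, target_actor(s_next))

    # Equation (4): mean-squared error between T_i and Q(s_i, a_i).
    critic_loss = F.mse_loss(critic(s, a), t_i)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Equation (9): move pi in the direction that increases Q(s, pi(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```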
Wherein the update rule of the target network is defined as formula (10).
θ^{Q'} ← τ·θ^Q + (1 − τ)·θ^{Q'}
θ^{π'} ← τ·θ^π + (1 − τ)·θ^{π'}    (10)
Where τ represents the update rate of the target network.
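Equation (10) as a sketch (τ = 0.005 is an assumed value):

```python
TAU = 0.005  # assumed target-network update rate tau

def soft_update(target_net, net, tau=TAU):
    """Equation (10): theta' <- tau * theta + (1 - tau) * theta'."""
    for tp, p in zip(target_net.parameters(), net.parameters()):
        tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)

soft_update(target_critic, critic)
soft_update(target_actor, actor)
```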
Step 5: periodically repeat step 4 within one independent experiment, and periodically repeat steps 1, 2, 3 and 4 across multiple independent experiments, up to the maximum number of iterations. In this process, the maximum number of iterations is set in advance; within each independent experiment the optimal result is selected, the experiment is repeated several times, and the average value is taken as the final result.
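Step 5 sketched as an outer experiment loop (the robustness metric, iteration counts and repeat counts are assumptions; the invention leaves these choices open):

```python
import random

NUM_RUNS = 10         # independent repetitions of the whole experiment
MAX_ITERATIONS = 500  # maximum number of iterations within one run

def evaluate_robustness():
    # Hypothetical robustness metric (e.g. size of the largest connected
    # component after a targeted attack); stubbed here for illustration.
    return random.random()

def run_experiment():
    """One independent experiment: repeat step 4 up to MAX_ITERATIONS and
    keep the optimal result of this run (steps 1-3 would re-run here)."""
    best = float("-inf")
    for _ in range(MAX_ITERATIONS):
        # ... one step-4 cycle: select an action, apply it to the topology,
        # store the transition in D, update the networks ...
        best = max(best, evaluate_robustness())
    return best

# Steps 1-4 are repeated across independent experiments; the average of the
# per-run optima is taken as the final result.
results = [run_experiment() for _ in range(NUM_RUNS)]
final_result = sum(results) / len(results)
```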
Claims (1)
1. The method for autonomously learning and optimizing the topological structure robustness of the Internet of things is characterized by comprising the following steps of:
step 1: initialize the Internet of things topological structure: randomly deploy nodes according to the rules of a scale-free network model and an edge density parameter M, and fix their geographic positions, where the edge density parameter is set to M = 2;
step 2: compress the topological structure: eliminate redundant information of nodes outside the communication range, retain in the adjacency matrix only the connection relations of nodes within the communication range, thereby compressing the storage space of the network topological structure, and take the compressed network topology as the environment space S, where S is a row vector that changes with the network topology state;
step 3: initialize the autonomous learning model: according to the characteristics of deep learning and reinforcement learning, construct a deep deterministic learning strategy model to train the topological structure of the Internet of things: adopt a deep deterministic Q-learning network model, model the action selection strategy π and the network optimization strategy Q, map the continuous action space into a discrete action space, and design the target optimization function O and the update rules of the whole training model; wherein:
the action selection policy π is defined by equation (1):

a_t = π(s_t | θ)    (1)

where a_t denotes the selected deterministic action, s_t denotes the current network topology state, and θ denotes the parameters of the action network;
the network optimization strategy Q is defined by equation (2):
Q(s_t, a_t) = E( r(s_t, a_t) + γ·Q(s_{t+1}, π(s_{t+1})) )    (2)

where r(s_t, a_t) denotes the immediate return of the current action a_t on the current network state s_t, γ denotes the discount factor that accumulates learning experience, and Q(s_{t+1}, π(s_{t+1})) denotes the future return of the action taken in the next network state; the effect Q(s_t, a_t) of the current action on the current network state is therefore composed of the immediate return and the future return, where E(·) denotes the expected value over the accumulated effect of the series of selected actions;
according to the above description, the objective function O of the autonomous learning model is defined by equation (3):

O(θ) = E( r_1 + γ·r_2 + γ²·r_3 + … | π(·, θ) )    (3)

where r_i denotes the return produced by each action on the environment, γ denotes the discount factor that accumulates learning experience, π(·, θ) denotes the action selection policy, θ denotes the parameters of the action policy network, and E denotes the average expected value;
the update rule of the network is defined by equation (4), the mean-squared error between the target expected value and the current estimate:

L(θ^Q) = (1/N)·Σ_i ( T_i − Q(s_i, a_i | θ^Q) )²    (4)

where T_i denotes the target expected value, defined by equation (5):

T_i = r_i + γ·Q'( s_{i+1}, π'(s_{i+1} | θ^{π'}) | θ^{Q'} )    (5)

where Q' and π' denote the target networks of the optimization strategy and the action selection strategy, respectively, and are used to calculate the error of the whole autonomous learning model;
step 4: train and test the model: in the training stage, a discrete action a is obtained through the action selection neural network model with random exploration, the effect of the action on the current environment is evaluated by the network optimization neural network model, previous learning experience is accumulated, and the whole network model is updated to finally obtain the optimal result; in the testing stage, sample data are tested to obtain the test result; wherein:
the output of the discrete action is defined by equation (6):
d = MAP(a)    (6)

where MAP denotes the mapping between the continuous action space and the discrete action space, and a is defined by equation (7):

a = π(s) = π(s | θ) + N    (7)

where N denotes random sampling noise used to explore more effective action behaviours in the action space, and s denotes the current network state;
wherein the action selection policy network is updated in the direction that maximizes the value given by the optimization strategy network, so that the selected action maximizes that value;
the updating rule of the target network is defined as formula (10);
θ^{Q'} ← τ·θ^Q + (1 − τ)·θ^{Q'}
θ^{π'} ← τ·θ^π + (1 − τ)·θ^{π'}    (10)
where τ represents the update rate of the target network;
step 5: periodically repeat step 4 within one independent experiment, and periodically repeat steps 1, 2, 3 and 4 across multiple independent experiments, up to the maximum number of iterations;

in this process, the maximum number of iterations is set in advance; within each independent experiment the optimal result is selected, the experiment is repeated several times, and the average value is taken as the result of the experiment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911036835.1A CN110807230B (en) | 2019-10-29 | 2019-10-29 | Method for autonomously learning and optimizing topological structure robustness of Internet of things |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110807230A CN110807230A (en) | 2020-02-18 |
CN110807230B (en) | 2024-03-12
Family
ID=69489419
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911036835.1A Active CN110807230B (en) | 2019-10-29 | 2019-10-29 | Method for autonomously learning and optimizing topological structure robustness of Internet of things |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110807230B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111929641B (en) * | 2020-06-19 | 2022-08-09 | 天津大学 | Rapid indoor fingerprint positioning method based on width learning |
CN111935724B (en) * | 2020-07-06 | 2022-05-03 | 天津大学 | Wireless sensor network topology optimization method based on asynchronous deep reinforcement learning |
CN113422695B (en) * | 2021-06-17 | 2022-05-24 | 天津大学 | Optimization method for improving robustness of topological structure of Internet of things |
CN113435567B (en) * | 2021-06-25 | 2023-07-07 | 广东技术师范大学 | Intelligent topology reconstruction method based on flow prediction, electronic equipment and storage medium |
CN113923123B (en) * | 2021-09-24 | 2023-06-09 | 天津大学 | Underwater wireless sensor network topology control method based on deep reinforcement learning |
CN114567563B (en) * | 2022-03-31 | 2024-04-12 | 北京邮电大学 | Training method of network topology model, and reconstruction method and device of network topology |
CN115225509B (en) * | 2022-07-07 | 2023-09-22 | 天津大学 | Internet of things topological structure generation method based on neural evolution |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102789593A (en) * | 2012-06-18 | 2012-11-21 | 北京大学 | Intrusion detection method based on incremental GHSOM (Growing Hierarchical Self-organizing Maps) neural network |
CN102868972A (en) * | 2012-09-05 | 2013-01-09 | 河海大学常州校区 | Internet of things (IoT) error sensor node location method based on improved Q learning algorithm |
CN103490413A (en) * | 2013-09-27 | 2014-01-01 | 华南理工大学 | Intelligent electricity generation control method based on intelligent body equalization algorithm |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7130807B1 (en) * | 1999-11-22 | 2006-10-31 | Accenture Llp | Technology sharing during demand and supply planning in a network-based supply chain environment |
US10715395B2 (en) * | 2017-11-27 | 2020-07-14 | Massachusetts Institute Of Technology | Methods and apparatus for communication network |
Non-Patent Citations (3)
Title |
---|
Topological robustness analysis method for Bernoulli node network models; Feng Tao, Li Hongtao, Yuan Zhanting, Ma Jianfeng; Acta Electronica Sinica, Vol. 39, No. 7 *
Research on self-organization of wireless networks based on reinforcement learning; Wang Chao, Wang Zhiyang, Shen Cong; Journal of University of Science and Technology of China, No. 12 *
Controller placement algorithm for SDN network virtualization platforms; Dong Xiaodong, Guo Zhiqiang, Chen Sheng, Zhou Xiaobo, Qi Heng, Li Keqiu; Telecommunications Science, No. 4 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||