CN114168971A

CN114168971A - Internet of things coverage vulnerability repairing method based on reinforcement learning

Info

Publication number: CN114168971A
Application number: CN202111502028.1A
Authority: CN
Inventors: 邓贤君; 夏云芝; 易灵芝; 杨天若; 朱晨露; 杨静
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2021-12-09
Filing date: 2021-12-09
Publication date: 2022-03-11

Abstract

The invention discloses a reinforcement learning-based method for repairing coverage holes in the Internet of Things, comprising the following steps: (1) establishing a network model according to the monitoring coverage requirement of a target area; (2) establishing a network coverage model through the credible information coverage model, and calculating the coverage rate; (3) determining the directional repair nodes of the vulnerability area by a bidirectional selection method based on the minimum shortest coordinate distance between vulnerability reconstruction points and movable nodes; (4) training the directional repair nodes (M-nodes) by the Q-Learning method, and repairing the vulnerability sub-grids until the coverage rate meets the requirement or the number of iterations reaches a set upper limit. The method achieves a high overall coverage rate after repair: it exploits the spatial correlation of the monitored reconstruction points in the target area from the perspective of information cooperation, and it uses a reinforcement learning reward mechanism to guide the movement of the directional repair nodes, which completes the coverage hole repair while saving energy consumption and repair time.

Description

Internet of things coverage vulnerability repairing method based on reinforcement learning

Technical Field

The invention belongs to the technical field of Internet of things, and particularly relates to a reinforcement learning-based Internet of things coverage vulnerability repairing method.

Background

Coverage is a core technical problem in the development of the Internet of Things, and adequate coverage is the basic requirement for collecting monitoring target information accurately and in real time. Coverage holes caused by energy depletion, environmental factors, software defects and other factors can greatly reduce the coverage rate of the Internet of Things and affect the safety and reliability of the network. The task of coverage hole repair is to quickly identify a coverage hole that appears suddenly and to repair it as fast as possible with as little energy consumption as possible, so as to meet the coverage requirement. It is therefore important to repair Internet of Things coverage holes effectively and in real time.

The difficulty of repairing coverage holes in the Internet of Things is mainly reflected in three aspects. First, network coverage models characterize and define coverage differently for different practical scenarios, so the choice of model directly affects the applicability of the repair method and the scenarios in which it can be used. Second, a suitable sensor repair node must be selected for each coverage hole, so that the node reaches the repair position in the shortest time, which guarantees timely repair and reduces repair energy consumption. Third, when a sensor node repairs a coverage hole, an effective movement direction and movement path must be selected and unnecessary movement avoided, so as to save the energy of the sensor node.

Disclosure of Invention

In view of the above defects or improvement requirements of the prior art, the invention provides a reinforcement learning-based method for repairing coverage holes in the Internet of Things, which aims to repair coverage holes in an Internet of Things network quickly and effectively. In this method, the coverage range of the sensors is defined through the credible information coverage model, which fully exploits the spatial correlation of the vulnerability reconstruction points; the directional repair nodes of a vulnerability area are determined by a bidirectional selection method based on the minimum shortest coordinate distance between vulnerability reconstruction points and movable nodes; and the movement of the directional repair nodes is guided by a reward mechanism based on reinforcement learning, which saves energy consumption and repair time. The technical problem of Internet of Things coverage vulnerability repair is thereby solved.

In order to achieve the above object, according to an aspect of the present invention, there is provided a reinforcement learning-based method for repairing coverage vulnerabilities of the Internet of Things, including the following steps:

(1) establishing a network model according to the monitoring coverage requirement of a target area;

(1.1) setting a variation range and an estimated root mean square error threshold according to the spatial correlation of the monitored target, and dividing the target coverage area into sub-grids according to the variation range;

(1.2) recording the number i of each sensor node according to the distribution of the sensor nodes, and taking the center point of each divided coverage sub-grid as a reconstruction point, denoted p;

(2) establishing a network coverage model through the credible information coverage model, and calculating the coverage rate;

(3) determining the directional repair nodes of the vulnerability area by a bidirectional selection method based on the minimum shortest coordinate distance between vulnerability reconstruction points and movable nodes;

(4) training the directional repair nodes (M-nodes) by the Q-Learning method, and repairing the vulnerability sub-grids until the coverage rate meets the requirement or the number of iterations reaches a set upper limit.

In an embodiment of the present invention, the step (2) specifically includes the following sub-steps:

(2.1) in the credible information coverage model, for spatial points that are not sampled, the estimated value of the environment variable at the reconstruction point p is calculated with the ordinary kriging interpolation function, that is, as a weighted average of the measured values of the sensor nodes τ_i in the reconstruction neighborhood Z(p); the interpolation weight coefficients λ_i of the sensor nodes in the neighborhood satisfy

$$\sum_{i=1}^{n} \lambda_i = 1$$

where n is the number of sensor nodes τ_i in the reconstruction neighborhood Z(p);

(2.2) calculating the root mean square error Φ(p) of the reconstruction point p with the ordinary kriging interpolation function, the calculation expression being:

$$\Phi(p) = \sqrt{\sum_{i=1}^{n} \lambda_i \, \gamma(\tau_i, p) + \mu(p)}$$

wherein λ_i and μ(p) are solved by steps (2.1.1) and (2.1.2);

(2.3) according to the definition of the credible information coverage model, if Φ(p) ≤ ε₀, that is, the root mean square error does not exceed the set coverage threshold, the sub-grid is covered; otherwise it is not covered; the number of each covered reconstruction point is recorded as j', and the number of each vulnerability reconstruction point, whose root mean square error is greater than the threshold ε₀, is recorded as j'';

and (2.4) calculating the coverage rate of the target area.

In one embodiment of the present invention, the calculation formula of the coverage rate in the step (2.4) is as follows:

$$Per = \frac{\sum_{j'=1}^{m} S_{j'}}{S}$$

wherein S is the total area of the target area, j' is the number of a covered reconstruction point, S_{j'} is the area of the sub-grid where the covered reconstruction point j' is located, and m is the total number of covered reconstruction points.

In one embodiment of the invention, said step (2.1) comprises the following sub-steps:

(2.1.1) the interpolation weight coefficients λ_i are obtained as the optimal solution that minimizes the kriging variance; introducing a Lagrange multiplier μ(p) yields a linear kriging system of n+1 equations in n+1 unknowns, which is solved to obtain the interpolation weight coefficients λ_i:

$$\begin{cases} \sum_{j=1}^{n} \lambda_j \, \gamma(\tau_i, \tau_j) + \mu(p) = \gamma(\tau_i, p), & i = 1, \dots, n \\ \sum_{j=1}^{n} \lambda_j = 1 \end{cases}$$

wherein γ(τ_i, τ_j) and γ(τ_i, p) are calculated by the variation function;

(2.1.2) calculating γ(τ_i, τ_j) and γ(τ_i, p) in step (2.1.1); a Gaussian variation function is selected as the variation function of the environment variable, to describe the spatial correlation between the data collected by the sensor nodes τ_i; the Gaussian variation function is formulated as:

wherein d_τp is the Euclidean distance between the sensor node τ_i and the reconstruction point p, d_τiτj is the Euclidean distance between the sensor nodes τ_i and τ_j, and C₀ and C are constants.

In an embodiment of the present invention, the step (3) specifically includes the following sub-steps:

(3.1) selecting fixed nodes and movable nodes from the undamaged sensor nodes; a fixed node is selected for each covered sub-grid and no longer moves, so as to preserve the coverage of that sub-grid; for each covered sub-grid, the Euclidean distances between its reconstruction point and the sensor nodes within range of that reconstruction point are calculated, and the sensor node with the minimum Euclidean distance is selected as the fixed node of the covered sub-grid and recorded as an F-node; all other sensor nodes are movable nodes and are recorded as R-nodes;

(3.2) determining a directional repair node for each vulnerability reconstruction point;

In an embodiment of the present invention, the step (3.2) specifically includes the following sub-steps:

(3.2.1) calculating the shortest coordinate distance from each R-node to each vulnerability reconstruction point j''; for each vulnerability reconstruction point j'', selecting an intention repair node from the R-nodes according to the shortest coordinate distance;

(3.2.2) if an R-node is selected as an intention repair node by only one vulnerability reconstruction point, a bidirectional selection is established, and the intention repair node becomes the directional repair node of that vulnerability reconstruction point; if the same R-node is selected as an intention repair node by several vulnerability reconstruction points, the intention repair node takes the vulnerability reconstruction point with the minimum shortest coordinate distance as its target repair reconstruction point and becomes the directional repair node of that point; each directional repair node is recorded as an M-node;

(3.2.3) deleting the directional repair nodes selected in step (3.2.2) from the set of selectable R-nodes; the vulnerability reconstruction points that have not established a bidirectional selection reselect an intention repair node in the next round;

(3.2.4) recording the shortest coordinate distance between the directional repair node and the target repair reconstruction point of the directional repair node as an initial node profit value of the directional repair node;

(3.2.5) repeating (3.2.1), (3.2.2), (3.2.3) and (3.2.4) until all vulnerability reconstruction points have selected corresponding directional repair nodes.

In one embodiment of the present invention, the step (4) comprises the following sub-steps:

(4.1) initializing the Q-Learning training model parameters; setting the learning rate α and the exploration probability ε, where α and ε are constants in [0, 1]; setting the coverage rate requirement of the target area and the number of training iterations t, establishing a Q table, and initializing the Q value of each sensor node to 0; since the initial repair energy consumption value is 0, the initial utility function value of a node is the initial node profit value recorded in step (3.2.4);

(4.2) selecting an action strategy for the M-node, and updating the state of the sensor node;

(4.3) learning; and updating the Q value of the state position corresponding to the sensor node, wherein the expression is as follows:

Q(s,a_i,a_-i)←(1-α)Q(s,a_i,a_-i)+α[R(s,a_i)+γπ(s')]

wherein Q(s, a_i, a_-i) is the Q value of the sensor node τ_i selecting strategy a_i in state s;

s' is the next state of the sensor node, Num(s', a_-i) is the number of times the adjacent nodes in the next state select strategies other than a_i; n(s') is the number of undamaged adjacent nodes in the next state, and Q(s', a_i, a_-i) is the Q value of the sensor node τ_i selecting strategy a_i in state s';

(4.4) repeating the steps (2), (3), (4.2) and (4.3) until the coverage rate meets the requirement or the number of iterations reaches the set upper limit.

In an embodiment of the present invention, the step (4.2) specifically includes the following sub-steps:

(4.2.1) exploration; a random number is generated, and with probability ε one of the four strategies of moving up, down, left or right is randomly selected as a_i;

(4.2.2) exploitation; with probability 1-ε, a strategy a_i is selected by considering the combined effect of the Q value of the M-node and the strategies of the other sensor nodes, such that a_i satisfies

where s represents the current state of the sensor node, a_-i represents the strategies in the strategy space other than a_i, Num(s, a_-i) is the number of times the adjacent nodes select strategies other than a_i, and n(s) is the number of undamaged adjacent nodes;

(4.2.3) calculating the utility function R; the utility function value is jointly determined by the node profit value P(a_i, a_-i) and the repair energy consumption value C(a_i); the expression is as follows:

R(a_i, a_-i) = μ_α P(a_i, a_-i) - μ_β C(a_i)

wherein R(a_i, a_-i) represents the utility value when the sensor node selects strategy a_i, P(a_i, a_-i) is the node profit value when the sensor node selects strategy a_i, and C(a_i) is the repair energy consumption value when the sensor node selects strategy a_i; μ_α and μ_β are weight coefficients that balance the node profit value against the repair energy consumption; the node profit value is determined by the shortest coordinate distance from the sensor node to the target vulnerability repair point, the repair energy consumption value is determined by the moving distance of the sensor node, and the expression is as follows:

wherein ω is a constant that balances the orders of magnitude of the moving distance and the energy consumption; the node profit value uses the shortest coordinate distance from the sensor node to the target repair reconstruction point; c is a constant that discourages unnecessary frequent back-and-forth movement of the sensor node, and Δd_i is the moving distance;

(4.2.4) node state protection; the state information of the sensor node comprises the position information, Q value, strategy value and utility function value of the node; if, when the sensor node would execute the strategy, the utility function value is smaller than the utility value of the previous state, namely R_i(t) < R_i(t-1), the current action strategy is not executed and the node remains in the current state s; otherwise, the current strategy is executed and the sensor node enters the next state s'.

Generally, compared with the prior art, the technical scheme of the invention has the following beneficial effects:

(1) the overall coverage rate after repair is high; the method exploits the spatial correlation of the monitored reconstruction points in the target area from the perspective of information cooperation, estimates the coverage error with the root mean square error to complete the coverage prediction, and thereby improves the coverage rate;

(2) the Q-Learning method of reinforcement learning uses the distance between the sensor node and the vulnerability reconstruction point as the reward, so that the sensor node is quickly guided to move, unnecessary movement is reduced, and repair time is saved;

(3) the energy consumption is low; the bidirectional selection method adopted by the invention selects directional repair nodes for the vulnerability sub-grids, which avoids the movement and energy consumption of unnecessary sensor nodes; meanwhile, the Q-Learning method guides the directional repair node to move along the shortest distance to repair the coverage hole, which saves repair energy consumption;

(4) the Internet of Things is a universal network suited to various application scenarios, and the credible information coverage model adopted in the invention can be applied to coverage over different terrains, regions and data monitoring targets.

Drawings

FIG. 1 is a flowchart of a coverage hole repairing method based on reinforcement learning according to an embodiment of the present invention;

FIG. 2 is a schematic representation of utility values of sensor nodes in an embodiment of the present invention;

FIG. 3 is an example of the visualized results of vulnerability repair in the embodiment of the present invention, where:

FIG. 3(a) is a schematic diagram of the coverage at the initial time;

FIG. 3(b) is a schematic diagram of the first coverage hole, at training iteration t = 1;

FIG. 3(c) is a schematic diagram at training iteration t = 99;

FIG. 3(d) is a schematic diagram of the second coverage hole, at training iteration t = 100;

FIG. 3(e) is a schematic diagram at training iteration t = 199;

FIG. 3(f) is a schematic diagram of the third coverage hole, at training iteration t = 200;

FIG. 3(g) is a schematic diagram at training iteration t = 299;

FIG. 3(h) is a schematic diagram of the fourth coverage hole, at training iteration t = 300;

FIG. 3(i) is a schematic diagram at training iteration t = 500;

In all the figures, the same reference numerals are used to denote the same elements or structures, wherein: F-node represents a fixed node, R-node represents a movable node, M-node represents a directional repair node, Rp_j(Rx_j, Ry_j) represents a vulnerability grid reconstruction point to be repaired, CR represents the variation range, Num represents the number of normal sensor nodes in the network in the current state, t represents the number of training iterations, and Per represents the coverage rate of the target area at the current training iteration.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

The technical terms of the present invention are first explained:

Variation range (CR): a distance threshold characterizing the spatial correlation of an environment variable. For a particular environment variable and spatial point, only the values at other spatial points within the variation range are correlated with the current spatial point.

Root mean square error (RMSE): a measure of the quality of reconstruction and estimation of the environment variable at unsampled spatial points, that is, a measure of the error between the estimated values and the reference values.

Credible information coverage: in a target monitoring area, if the root mean square error of the reconstructed information at a spatial point in the area is less than or equal to a threshold ε₀ set by the practical application requirements, the spatial point is covered by credible information.

Euclidean distance: the absolute distance between two points or vectors in a multidimensional space, that is, the square root of the sum of the squared coordinate differences. The Euclidean distance from point A(a_x, a_y) to point B(b_x, b_y) is

$$d(A, B) = \sqrt{(a_x - b_x)^2 + (a_y - b_y)^2}$$

Shortest coordinate distance: in each movement the sensor node moves 1 unit distance in one of the four directions up, down, left or right; the shortest coordinate distance from point A(a_x, a_y) to point B(b_x, b_y) is |a_x - b_x| + |a_y - b_y|.
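By way of non-limiting illustration, the two distance measures defined above may be sketched in Python as follows. The function names are illustrative only and do not appear in the original disclosure; points are assumed to be (x, y) tuples.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between points a = (ax, ay) and b = (bx, by)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def shortest_coordinate_distance(a, b):
    """Number of unit moves (up, down, left, right) needed to go from a to b."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 while the shortest coordinate distance is 7, which reflects the restriction to axis-aligned unit moves.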

Kriging interpolation: the kriging method is essentially a moving weighted average method that is optimal, linear and unbiased. Kriging is a regression algorithm that spatially models and predicts (interpolates) a random process or random field according to a covariance function. For specific random processes, such as intrinsically stationary processes, the kriging method gives the optimal linear unbiased estimate, and it is therefore also called the spatial optimal unbiased estimator in geostatistics.

Q-Learning: a model-free reinforcement learning algorithm that takes Markov decision processes (MDPs) as its theoretical basis and, through interaction with the environment, continuously improves its behavior strategy until an optimal one is obtained.

Utility function: the function used in Q-Learning to measure the return after a strategy is selected; it evaluates the quality of taking a certain action in a specific state.

Adjacent nodes: the other sensor nodes within the communication range R_c of a sensor node τ_i are the adjacent nodes of τ_i.

Q value: the value used in Q-Learning to evaluate an action; it represents the expected sum of rewards from the agent selecting the action until the final state.

argmax f(x): the value of x at which f(x) takes its maximum.

The solution to the difficulties existing in the prior art is as follows:

For the first difficulty, most existing methods use a disc model to define the coverage range of a sensor node, a model that is overly idealized and simple. With the credible information coverage model, information can be reconstructed cooperatively in the spatial domain, and the coverage of a reconstruction point is predicted from the correlation between spatial points. The credible information coverage model makes full use of spatial correlation and is well suited to practical application. For the second difficulty, most existing methods select repair nodes based on the shortest path and repair coverage holes with nodes chosen without regard to directionality; because they move network nodes globally or locally, their energy consumption is generally high. The bidirectional selection method between coverage holes and movable nodes, based on the shortest path, alleviates this problem and greatly reduces the movement energy consumption of unnecessary repair nodes. For the third difficulty, existing Voronoi diagram methods are commonly used for spatial geometric division and path selection, and topology control methods adjust network coverage by adjusting the positions and sensing radii of the sensor nodes; however, these methods suffer from excessive energy consumption or a large number of repetitions. The reinforcement learning method guides the movement of the sensor nodes through a reward mechanism, which improves energy utilization and repair speed.

As shown in fig. 1, the reinforcement learning-based internet of things coverage vulnerability repair method of the present invention includes the following steps:

(1) Establishing a network model according to the monitoring coverage requirement of the target area.

(1.1) Setting a variation range and an estimated root mean square error threshold according to the spatial correlation of the monitored target, and dividing the target coverage area into sub-grids according to the variation range.

(1.2) Recording the number i of each sensor node according to the distribution of the sensor nodes, and taking the center point of each divided coverage sub-grid as a reconstruction point, denoted p.
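The sub-grid division of step (1) may, for example, be sketched as follows. This is only a minimal sketch under stated assumptions: the target area is taken to be a rectangle, the sub-grids are taken to be squares of side `cell`, and the way `cell` is derived from the variation range is not specified in the disclosure, so it is passed in as a parameter. The function name is hypothetical.

```python
def reconstruction_points(width, height, cell):
    """Return the center (x, y) of every cell-by-cell sub-grid of the area.

    Assumption: width and height are whole multiples of cell; any
    remainder strip at the border is ignored in this sketch.
    """
    cols = int(width // cell)
    rows = int(height // cell)
    return [((c + 0.5) * cell, (r + 0.5) * cell)
            for r in range(rows)
            for c in range(cols)]
```

Each returned center plays the role of a reconstruction point p in the subsequent steps.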

(2) Establishing a network coverage model through the credible information coverage model, and calculating the coverage rate. This specifically includes the following sub-steps:

(2.1) In the credible information coverage model, for spatial points that are not sampled, the estimated value of the environment variable at the reconstruction point p is calculated with the ordinary kriging interpolation function, that is, as a weighted average of the measured values of the sensor nodes τ_i in the reconstruction neighborhood Z(p). The interpolation weight coefficients λ_i of the sensor nodes in the neighborhood satisfy

$$\sum_{i=1}^{n} \lambda_i = 1$$

where n is the number of sensor nodes τ_i in the reconstruction neighborhood Z(p). The calculation of λ_i comprises the following sub-steps:

(2.1.1) The interpolation weight coefficients λ_i are obtained as the optimal solution that minimizes the kriging variance. Introducing a Lagrange multiplier μ(p) yields a linear kriging system of n+1 equations in n+1 unknowns, which is solved to obtain the interpolation weight coefficients λ_i:

$$\begin{cases} \sum_{j=1}^{n} \lambda_j \, \gamma(\tau_i, \tau_j) + \mu(p) = \gamma(\tau_i, p), & i = 1, \dots, n \\ \sum_{j=1}^{n} \lambda_j = 1 \end{cases}$$

wherein γ(τ_i, τ_j) and γ(τ_i, p) can be calculated by the variation function.

(2.1.2) Calculating γ(τ_i, τ_j) and γ(τ_i, p) in step (2.1.1). A Gaussian variation function is selected as the variation function of the environment variable, to describe the spatial correlation between the data collected by the sensor nodes τ_i. The general formula of the Gaussian variation function is:

wherein d_τp is the Euclidean distance between the sensor node τ_i and the reconstruction point p, d_τiτj is the Euclidean distance between the sensor nodes τ_i and τ_j, and C₀ and C are constants; when C₀ = 0 and C = 1, it is a standard Gaussian function.
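For illustration only, a Gaussian variation function of the kind referred to in step (2.1.2) may be sketched as below. The exact expression is not reproduced in the present text, so the commonly used textbook form γ(d) = C₀ + C(1 - exp(-d²/a²)) is assumed here; the scale parameter `a` and the function name are assumptions of this sketch and are not taken from the disclosure.

```python
import math


def gaussian_variogram(d, c0=0.0, c=1.0, a=1.0):
    """Assumed Gaussian variation function of the distance d.

    c0 and c correspond to the constants C0 and C of step (2.1.2);
    a is a hypothetical scale parameter tied to the variation range.
    """
    return c0 + c * (1.0 - math.exp(-(d / a) ** 2))
```

With the default C₀ = 0 and C = 1 the function rises from 0 at d = 0 towards 1 as the distance grows, which is the behavior expected of a variation function describing decaying spatial correlation.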

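(2.2) Calculating the root mean square error Φ(p) of the reconstruction point p with the ordinary kriging interpolation function, the calculation expression being:

$$\Phi(p) = \sqrt{\sum_{i=1}^{n} \lambda_i \, \gamma(\tau_i, p) + \mu(p)}$$

wherein λ_i and μ(p) are solved by steps (2.1.1) and (2.1.2).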
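Steps (2.1) to (2.2) may be illustrated by the following sketch, which assembles the ordinary kriging system of n+1 equations, solves it for the weights λ_i and the Lagrange multiplier μ(p), and evaluates the root mean square error as the square root of the ordinary kriging variance. It is a sketch under assumptions, not the implementation of the disclosure: the variation function uses an assumed Gaussian form with a hypothetical scale parameter `a`, the linear solver is a plain Gaussian elimination written for self-containment, and all function names are illustrative.

```python
import math


def variogram(d, c0=0.0, c=1.0, a=1.0):
    # Assumed Gaussian form; the exact expression is not given in the text.
    return c0 + c * (1.0 - math.exp(-(d / a) ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def kriging_rmse(nodes, p):
    """Return (weights, mu, rmse) at point p for the nodes in Z(p)."""
    n = len(nodes)
    # Rows 1..n: sum_j lambda_j * gamma(tau_i, tau_j) + mu = gamma(tau_i, p)
    A = [[variogram(math.dist(nodes[i], nodes[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    # Row n+1: the weights sum to 1.
    A.append([1.0] * n + [0.0])
    b = [variogram(math.dist(t, p)) for t in nodes] + [1.0]
    sol = solve(A, b)
    lam, mu = sol[:n], sol[n]
    variance = sum(l * g for l, g in zip(lam, b[:n])) + mu
    return lam, mu, math.sqrt(max(variance, 0.0))
```

For two nodes placed symmetrically about p, the sketch returns equal weights of 0.5, and when p coincides with a node the error is zero, as expected of an exact interpolator. Gaussian variation functions can produce ill-conditioned systems for closely spaced nodes, so a production implementation would likely use a dedicated linear algebra library.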

(2.3) According to the definition of the credible information coverage model, if Φ(p) ≤ ε₀, that is, the root mean square error does not exceed the set coverage threshold, the sub-grid is covered; otherwise it is not covered. The number of each covered reconstruction point is recorded as j', and the number of each vulnerability reconstruction point, whose root mean square error is greater than the threshold ε₀, is recorded as j''.

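(2.4) Calculating the coverage rate of the target area, the calculation formula being:

$$Per = \frac{\sum_{j'=1}^{m} S_{j'}}{S}$$

wherein S is the total area of the target area, S_{j'} is the area of the sub-grid where the covered reconstruction point j' is located, and m is the total number of covered reconstruction points.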
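Steps (2.3) and (2.4) may be sketched as follows, assuming that the root mean square error of every reconstruction point has already been computed and that all sub-grids have the same area. The function and parameter names are illustrative and are not part of the disclosure.

```python
def classify_and_coverage(rmse_by_point, cell_area, total_area, eps0):
    """Split reconstruction points into covered points and coverage holes.

    rmse_by_point maps a reconstruction point number to its RMSE.
    A point is covered when its RMSE does not exceed eps0.
    Returns (covered numbers, hole numbers, coverage rate).
    """
    covered = [j for j, phi in rmse_by_point.items() if phi <= eps0]
    holes = [j for j, phi in rmse_by_point.items() if phi > eps0]
    coverage = len(covered) * cell_area / total_area
    return covered, holes, coverage
```

For example, with four unit-area sub-grids whose errors are 0.1, 0.5, 0.2 and 0.9 and a threshold of 0.3, two sub-grids are covered and the coverage rate is 0.5.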

(3) Determining the directional repair nodes of the vulnerability area by a bidirectional selection method based on the minimum shortest coordinate distance between vulnerability reconstruction points and movable nodes. This specifically includes the following sub-steps:

(3.1) Selecting fixed nodes and movable nodes from the undamaged sensor nodes. A fixed node is selected for each covered sub-grid and no longer moves, so as to preserve the coverage of that sub-grid. For each covered sub-grid, the Euclidean distances between its reconstruction point and the sensor nodes within range of that reconstruction point are calculated, and the sensor node with the minimum Euclidean distance is selected as the fixed node of the covered sub-grid and recorded as an F-node; all other sensor nodes are movable nodes and are recorded as R-nodes.
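The selection of fixed and movable nodes in step (3.1) may be sketched as below. The sketch assumes that "within range of the reconstruction point" means within a radius `cr` of it, which is an interpretation rather than a statement of the disclosure; the function name and data layout are likewise illustrative.

```python
import math


def split_fixed_and_movable(nodes, covered_points, cr):
    """Return (F-node ids, R-node ids).

    nodes maps a node id to its (x, y) position; covered_points is a
    list of covered reconstruction points; cr is the assumed radius
    within which a node is considered in range of a reconstruction point.
    """
    fixed = set()
    for p in covered_points:
        in_range = [i for i, pos in nodes.items() if math.dist(pos, p) <= cr]
        if in_range:
            # The closest in-range node anchors this covered sub-grid.
            fixed.add(min(in_range, key=lambda i: math.dist(nodes[i], p)))
    movable = set(nodes) - fixed
    return fixed, movable
```

Only the nodes returned in the second set are candidates for repairing coverage holes, so the covered sub-grids are not disturbed by the repair.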

(3.2) Determining a directional repair node for each vulnerability reconstruction point. This specifically includes the following sub-steps:

(3.2.1) Calculating the shortest coordinate distance from each R-node to each vulnerability reconstruction point j''. For each vulnerability reconstruction point j'', an intention repair node is selected from the R-nodes according to the shortest coordinate distance.

(3.2.2) If an R-node is selected as an intention repair node by only one vulnerability reconstruction point, a bidirectional selection is established, and the intention repair node becomes the directional repair node of that vulnerability reconstruction point. If the same R-node is selected as an intention repair node by several vulnerability reconstruction points, the intention repair node selects the vulnerability reconstruction point with the minimum shortest coordinate distance from itself as its target repair reconstruction point, and becomes the directional repair node of that point. Each directional repair node is recorded as an M-node.

(3.2.3) Deleting the directional repair nodes selected in step (3.2.2) from the set of selectable R-nodes; the vulnerability reconstruction points that have not established a bidirectional selection reselect an intention repair node in the next round.

(3.2.4) Recording the shortest coordinate distance between each directional repair node and its target repair reconstruction point as the initial node profit value of the directional repair node.
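The bidirectional selection of steps (3.2.1) to (3.2.4) may be illustrated by the following sketch. In each round every unmatched vulnerability reconstruction point proposes its nearest available R-node, each proposed R-node accepts the closest point that chose it, and matched pairs are removed before the next round. Ties are broken by insertion order, which is an assumption of this sketch; the names are illustrative.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def bidirectional_selection(holes, r_nodes):
    """Match each hole to an M-node.

    holes maps a hole number to its (x, y) reconstruction point;
    r_nodes maps an R-node id to its (x, y) position.
    Returns {hole number: (node id, initial node profit value)}.
    """
    free = dict(r_nodes)
    unmatched = dict(holes)
    assignment = {}
    while unmatched and free:
        # (3.2.1) each unmatched hole picks its nearest free R-node.
        proposals = {}
        for h, hp in unmatched.items():
            n = min(free, key=lambda i: manhattan(free[i], hp))
            proposals.setdefault(n, []).append(h)
        # (3.2.2) each chosen R-node accepts the closest hole that chose it.
        for n, hs in proposals.items():
            h = min(hs, key=lambda k: manhattan(free[n], unmatched[k]))
            # (3.2.4) the distance is kept as the initial node profit value.
            assignment[h] = (n, manhattan(free[n], unmatched[h]))
            # (3.2.3) the matched pair leaves the selection pool.
            del unmatched[h]
            del free[n]
    return assignment
```

If there are fewer R-nodes than coverage holes, the loop stops when the R-nodes are exhausted and the remaining holes stay unassigned in this sketch.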

(4) Training the directional repair nodes (M-nodes) by the Q-Learning method, and repairing the vulnerability sub-grids until the coverage rate meets the requirement or the number of iterations reaches a set upper limit. This includes the following sub-steps:

(4.1) Initializing the Q-Learning training model parameters. The learning rate α and the exploration probability ε are set, where α and ε are constants in [0, 1]. The coverage rate requirement of the target area and the number of training iterations t are set, a Q table is established, and the Q value of each sensor node is initialized to 0. Since the initial repair energy consumption value is 0, the initial utility function value of a node is the initial node profit value recorded in step (3.2.4).

(4.2) Selecting an action strategy for the M-node and updating the state of the sensor node. This specifically includes the following sub-steps:

(4.2.1) Exploration. A random number is generated, and with probability ε one of the four strategies of moving up, down, left or right is randomly selected as a_i.

(4.2.2) Exploitation. With probability 1-ε, a strategy a_i is selected by considering the combined effect of the Q value of the M-node and the strategies of the other sensor nodes, such that a_i satisfies

where s represents the current state of the sensor node, a_-i represents the strategies in the strategy space other than a_i, Num(s, a_-i) is the number of times the adjacent nodes select strategies other than a_i, and n(s) is the number of undamaged adjacent nodes.
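The exploration and exploitation of steps (4.2.1) and (4.2.2) follow the general pattern of ε-greedy action selection, which may be sketched as below. This sketch is a deliberate simplification: the exploitation branch takes the action with the largest Q value only, and it omits the weighting by the strategies of adjacent nodes (the Num(s, a_-i) and n(s) terms), whose exact expression is not reproduced in the present text. The names are illustrative.

```python
import random

ACTIONS = ("up", "down", "left", "right")


def choose_action(q_row, epsilon, rng=random):
    """Pick a move for an M-node.

    q_row maps each action to its Q value in the current state;
    missing actions are treated as having a Q value of 0.
    """
    if rng.random() < epsilon:
        # Exploration: any of the four moves, uniformly at random.
        return rng.choice(ACTIONS)
    # Exploitation (simplified): greedy on the node's own Q values.
    return max(ACTIONS, key=lambda a: q_row.get(a, 0.0))
```

With ε = 0 the selection is purely greedy, and with ε = 1 it is purely random; intermediate values trade off the two.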

(4.2.3) calculating the utility function R. Utility function value by node profit value P (a)_i,a_-i) And the repair energy consumption value C (a)_i) And (4) jointly determining. The expression is as follows:

R(a_i,a_-i)＝μ_αP(a_i,a_-i)-μ_βC(a_i)

wherein R (a)_i,a_-i) Representing a sensor node selection policy a_iUtility value of time, P (a)_i,a_-i) Selecting policy a for sensor node_iValue of node profit of time, C (a)_i) Selecting policy a for sensor node_iTimely repair energy consumption value. Mu.s_αAnd mu_βAnd balancing the node profit value and the weight coefficient of the repair energy consumption. The node profit value is determined by the shortest coordinate distance from the sensor node to the target vulnerability repair point, as shown in fig. 2, the repair energy consumption value is determined by the moving distance of the sensor node, and the expression is as follows:

where ω is a constant, which is an order of magnitude relationship for equalizing the moving distance and energy consumption.

The distance term is the shortest coordinate distance from the sensor node to the target repair reconstruction point. C is a constant that discourages unnecessary frequent back-and-forth movement of the sensor node, and Δd_i is the moving distance.
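As a small worked example of the utility expression R = μ_α P - μ_β C, the sketch below takes the node profit value and the repair energy consumption value as already-computed inputs, since their individual expressions are not reproduced here. The function name `utility` and the sample numbers are illustrative assumptions only.

```python
def utility(profit, energy_cost, mu_alpha, mu_beta):
    """Utility R = mu_alpha * P - mu_beta * C for one candidate strategy."""
    return mu_alpha * profit - mu_beta * energy_cost

# Hypothetical values: profit 10, energy cost 4, weights 0.5 and 0.25.
example = utility(profit=10.0, energy_cost=4.0, mu_alpha=0.5, mu_beta=0.25)
print(example)  # 0.5 * 10 - 0.25 * 4 = 4.0
```

Raising μ_β relative to μ_α penalizes long moves more heavily, so a node prefers strategies that gain coverage with little travel.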

(4.2.4) Node state protection. The state information of a sensor node includes its position information, Q value, strategy value, utility function value, and so on. If executing the strategy would give a utility function value smaller than the utility value of the previous state, i.e. R_i(t) < R_i(t-1), the current action strategy is not executed and the current state s is maintained. Otherwise, the current strategy is executed and the sensor node enters the next state s'.
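The state protection rule amounts to an accept-or-reject check on each candidate move. A minimal sketch, assuming states are opaque values and that the candidate utility has already been computed (the name `protected_step` and the returned tuple layout are hypothetical):

```python
def protected_step(state, candidate_state, prev_utility, candidate_utility):
    """Apply the move only if utility does not drop, i.e. unless R_i(t) < R_i(t-1).

    Returns (new_state, new_utility, moved).
    """
    if candidate_utility < prev_utility:
        # Reject: keep the current state s and its utility value.
        return state, prev_utility, False
    # Accept: the node enters the next state s'.
    return candidate_state, candidate_utility, True
```

Note that a move with equal utility is accepted under this reading, because the rejection condition in the text is a strict inequality.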

(4.3) Learning. Update the Q value of the state position corresponding to the sensor node according to the following expression:

Q(s,a_i,a_-i)←(1-α)Q(s,a_i,a_-i)+α[R(s,a_i)+γπ(s')]

where Q(s, a_i, a_-i) is the Q value of sensor node τ_i selecting strategy a_i in state s.

s' is the next state of the sensor node; Num(s', a_-i) is the number of times the adjacent nodes in the next state select strategies other than a_i; n(s') is the number of undamaged adjacent nodes in the next state; and Q(s', a_i, a_-i) is the Q value of sensor node τ_i selecting strategy a_i in state s'.
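The update Q ← (1-α)Q + α[R + γπ(s')] can be sketched numerically as follows. Two assumptions are made: γ is read as the usual discount factor, and the next-state value π(s') is passed in as an already-computed number because its defining expression is not reproduced here. The function name `q_update` is hypothetical.

```python
def q_update(q_old, reward, next_state_value, alpha, gamma):
    """One Q-Learning step: (1 - alpha) * Q + alpha * (R + gamma * pi(s'))."""
    return (1.0 - alpha) * q_old + alpha * (reward + gamma * next_state_value)

# Hypothetical numbers: Q starts at 0, reward 1, next-state value 2.
updated = q_update(q_old=0.0, reward=1.0, next_state_value=2.0, alpha=0.5, gamma=0.5)
print(updated)  # 0.5 * 0 + 0.5 * (1 + 0.5 * 2) = 1.0
```

With α = 0 the old Q value is kept unchanged, and with α = 1 it is replaced entirely by the new estimate, which is why α is constrained to [0, 1] in step (4.1).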

(4.4) Repeat. Repeat steps (2), (3), (4.2) and (4.3) until the coverage rate meets the requirement or the number of iterations reaches the set upper limit.
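The stopping logic of step (4.4) can be sketched as a loop with two exit conditions. In this illustration the coverage computation of step (2) and the node selection and training of steps (3), (4.2) and (4.3) are folded into two caller-supplied callables; the names `repair_loop`, `compute_coverage` and `train_step` are assumptions for the example.

```python
def repair_loop(compute_coverage, train_step, coverage_target, max_iterations):
    """Repeat training until coverage meets the target or the iteration cap is hit.

    compute_coverage() returns the current coverage rate.
    train_step() performs one round of node selection, action and Q update.
    Returns (final_coverage, iterations_used).
    """
    iterations = 0
    coverage = compute_coverage()
    while coverage < coverage_target and iterations < max_iterations:
        train_step()
        iterations += 1
        coverage = compute_coverage()
    return coverage, iterations
```

Checking coverage before the first training round means an area that already meets the requirement triggers no node movement at all.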

Fig. 3 shows an example visualization of coverage hole repair in the embodiment of the present invention, where: Fig. 3(a) is a schematic diagram of the initial coverage; Fig. 3(b) shows the first coverage hole, at training iteration t = 1; Fig. 3(c) shows training iteration t = 99; Fig. 3(d) shows the second coverage hole, at training iteration t = 100; Fig. 3(e) shows training iteration t = 199; Fig. 3(f) shows the third coverage hole, at training iteration t = 200; Fig. 3(g) shows training iteration t = 299; Fig. 3(h) shows the fourth coverage hole, at training iteration t = 300; and Fig. 3(i) shows training iteration t = 500. As can be seen from Fig. 3, the method avoids unnecessary, ineffective node movement, and the repair effect is evident.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. An Internet of things coverage vulnerability repairing method based on reinforcement learning is characterized by comprising the following steps:

2. The reinforcement learning-based internet of things coverage hole repairing method according to claim 1, wherein the step (2) specifically comprises the following substeps:

wherein

And μ (p) is solved by steps (2.1.1) and (2.1.2);

and (2.4) calculating the coverage rate of the target area.

3. The reinforcement learning-based internet of things coverage hole repairing method according to claim 2, wherein the coverage rate in the step (2.4) is calculated according to the following formula:

wherein S is the total area of the coverage area, j' is the index of a covered reconstruction point, S_j' is the area of the sub-grid of covered reconstruction point j', and m is the total number of covered reconstruction points.

4. The reinforcement learning-based internet of things coverage vulnerability repair method according to claim 2 or 3, wherein the step (2.1) comprises the following sub-steps:

wherein γ(τ_i, τ_j) and γ(τ_i, p) are calculated by a variogram function;

(2.1.2) calculating γ(τ_i, τ_j) and γ(τ_i, p) in step (2.1.1); selecting a Gaussian variogram function as the variogram function of the environment variable, to describe the spatial correlation between the data collected by the sensor nodes τ_i; the Gaussian variogram function is formulated as:

wherein the distance argument is the Euclidean distance between sensor nodes τ_i and τ_j, and C₀ and C are constants.

5. The reinforcement learning-based internet of things coverage vulnerability repair method according to claim 1 or 2, wherein the step (3) specifically comprises the following sub-steps:

and (3.2) determining a directional repairing node for each vulnerability reconstruction point.

6. The reinforcement learning-based internet of things coverage hole repairing method according to claim 5, wherein the step (3.2) specifically comprises the following sub-steps:

7. The reinforcement learning-based internet of things coverage vulnerability repair method according to claim 1 or 2, wherein the step (4) comprises the following sub-steps:

Q(s,a_i,a_-i)←(1-α)Q(s,a_i,a_-i)+α[R(s,a_i)+γπ(s')]

8. The reinforcement learning-based internet of things coverage vulnerability repair method according to claim 1 or 2, wherein the step (4.2) specifically comprises the following sub-steps:

R(a_i, a_-i) = μ_α P(a_i, a_-i) - μ_β C(a_i)

wherein R(a_i, a_-i) is the utility value when the sensor node selects strategy a_i, P(a_i, a_-i) is the node profit value when the sensor node selects strategy a_i, and C(a_i) is the repair energy consumption value when the sensor node selects strategy a_i; μ_α and μ_β are weight coefficients that balance the node profit value against the repair energy consumption; the node profit value is determined by the shortest coordinate distance from the sensor node to the target vulnerability repair point, the repair energy consumption value is determined by the moving distance of the sensor node, and the expressions are as follows: