CN114356535A - Resource management method and device for wireless sensor network - Google Patents
Resource management method and device for wireless sensor network
- Publication number: CN114356535A
- Application number: CN202210255790.2A
- Authority: CN (China)
- Prior art keywords: wireless sensor network, agent, reward, action
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Classification: Mobile Radio Communication Systems
Abstract
The application relates to a resource management method and device for a wireless sensor network. The method comprises the following steps: taking each sensor node in the wireless sensor network as an agent; setting network parameters for the wireless sensor network, wherein the network parameters comprise at least: environment state, action space, and reward function; performing iterative interaction of the multiple agents based on the network parameters to determine an optimal strategy; and performing resource allocation and task scheduling for the sensor nodes in the wireless sensor network according to the optimal strategy. This scheme applies the dynamic interaction theory of multiple agents to the wireless sensor network and solves its resource allocation and task scheduling problems, so that the wireless sensor network can actively perform resource allocation and task scheduling and provide an online monitoring function even when the network is inaccessible and cannot be managed from outside.
Description
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a resource management method and device of a wireless sensor network.
Background
Typically in wireless sensor networks, wireless sensor nodes are heterogeneous, energy-constrained, and tend to operate under dynamic and ambiguous conditions. In these cases, the nodes need to know how to collaborate on tasks and resources (including power and bandwidth).
In the related art, in some application scenarios the wireless sensor network sometimes disconnects from the outside and enters an inaccessible state in which external systems cannot schedule or manage it. In such cases, the wireless sensor network needs to perform resource allocation and task scheduling on its own.
Disclosure of Invention
To overcome at least some of the problems in the related art, the present application provides a resource management method and apparatus for a wireless sensor network.
According to a first aspect of embodiments of the present application, there is provided a resource management method for a wireless sensor network, including:
taking each sensor node in the wireless sensor network as an agent;
setting network parameters for the wireless sensor network, wherein the network parameters comprise at least: environment state, action space, and reward function;
performing iterative interaction of multiple agents based on the network parameters to determine an optimal strategy;
and performing resource allocation and task scheduling on the sensor nodes in the wireless sensor network according to the optimal strategy.
Further, the environment state includes: battery power and/or spectrum availability; the action space includes: receiving or sending a specified packet, and/or performing a specified task; the reward function includes: internal rewards and/or external rewards.
Further, the internal reward is a reward function defined based on internal variables, and the external reward is a reward function defined according to feedback from a central controller or from other nodes;
each sensor node is provided with a corresponding reward function; the other nodes are the sensor nodes in the wireless sensor network other than the node itself.
Further, taking each sensor node in the wireless sensor network as an agent includes:
modeling the wireless sensor network as a set of agents $\{1, \dots, N\}$, where $N$ is the number of sensor nodes in the wireless sensor network;
letting $\mathcal{S} = \mathcal{S}_0 \times \mathcal{S}_1 \times \cdots \times \mathcal{S}_N$ represent the state space, where $\mathcal{S}_0$ is a shared state space and $\mathcal{S}_i$ is the local state space of agent $i$, with $i \in \{1, \dots, N\}$;
letting $\mathcal{U} = \mathcal{U}_1 \times \cdots \times \mathcal{U}_N$ represent the action space, where $\mathcal{U}_i$ is the action space of the $i$-th agent.
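For illustration, when these spaces are small and discrete they can be enumerated directly. The following minimal Python sketch makes that assumption; all concrete state and action names in it are hypothetical, not taken from the application.

```python
from itertools import product

N = 3  # number of sensor nodes / agents

# Shared state space S0 (e.g. spectrum availability) and local state
# spaces S1..SN (e.g. battery level), enumerated explicitly.
S0 = ["spectrum_free", "spectrum_busy"]
S_local = [["battery_low", "battery_high"] for _ in range(N)]

# Per-agent action spaces U1..UN.
U = [["receive_packet", "send_packet", "run_task"] for _ in range(N)]

# S = S0 x S1 x ... x SN and U = U1 x ... x UN are Cartesian products;
# both grow exponentially with the number of agents and parameters.
state_space = list(product(S0, *S_local))
joint_action_space = list(product(*U))

print(len(state_space))         # 2 * 2**3 = 16 composite states
print(len(joint_action_space))  # 3**3 = 27 joint actions
```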
Further, performing the iterative interaction of the multiple agents includes:
defining an action value function and a state value function;
converging to an optimal action value function through iterative interaction of multiple agents;
and determining an optimal strategy according to the optimal action value function.
Further, the action value function is:

$$Q(s, u) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, u_t, s_{t+1}) \,\middle|\, s_0 = s,\ u_0 = u\right]$$

and the state value function is:

$$V(s) = \max_{u \in \mathcal{U}} Q(s, u)$$

where $r(s, u, s')$ denotes the reward obtained by the agent when it starts from state $s$, selects action $u$ from the action space, and enters the next state $s'$; $\gamma$ is the discount factor, with $0 \le \gamma \le 1$.
Further, the step of iterative interaction of the multiple agents comprises updating, at each step:

$$Q(s, u) \leftarrow (1 - \alpha)\, Q(s, u) + \alpha \left[\, r(s, u, s') + \gamma\, V(s') \,\right]$$

where $\alpha$ is the learning rate.
Further, determining the optimal strategy according to the optimal action value function includes computing:

$$\pi^{*}(s) = \arg\max_{u \in \mathcal{U}} Q^{*}(s, u)$$

where $\pi^{*}(s)$ denotes the optimal strategy, i.e., the action $u$ selected from the action space when in state $s$.
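For a tabular action value function, this argmax can be read off directly. A minimal Python sketch follows, assuming the Q-function is stored as a dictionary keyed by (state, action) pairs; that storage layout is an assumption of the sketch, not a detail of the application.

```python
def optimal_policy(Q, state, actions):
    """pi*(s) = argmax over u of Q*(s, u), with Q stored as a
    dict mapping (state, action) pairs to values."""
    return max(actions, key=lambda u: Q[(state, u)])

# Usage with a toy two-action table:
Q = {("s0", "send"): 1.2, ("s0", "sleep"): 0.4}
print(optimal_policy(Q, "s0", ["send", "sleep"]))  # -> "send"
```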
According to a second aspect of the embodiments of the present application, there is provided a resource management apparatus for a wireless sensor network, including:
the setting module is used for taking each sensor node in the wireless sensor network as an agent and setting network parameters for the wireless sensor network; the network parameters comprise at least: environment state, action space, and reward function;
the iteration module is used for carrying out iterative interaction of the multiple intelligent agents based on the network parameters and determining an optimal strategy;
and the management module is used for carrying out resource allocation and task scheduling on the sensor nodes in the wireless sensor network according to the optimal strategy.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
the scheme of the application applies the dynamic interaction theory of multiple agents to the wireless sensor network and solves the resource allocation and task scheduling problems in the wireless sensor network, so that the wireless sensor network can actively perform resource allocation and task scheduling and provide an online monitoring function under conditions where it is inaccessible and cannot be intervened in from outside, for example: controlling the temperature of a nuclear reactor, or invasive brain or muscle signal monitoring.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flowchart illustrating a resource management method of a wireless sensor network according to an example embodiment.
FIG. 2 is a schematic diagram of the interaction of an agent with an environment in multi-agent reinforcement learning.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all embodiments consistent with the present application; rather, they are merely examples of methods and apparatus consistent with certain aspects of the present application, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating a resource management method of a wireless sensor network according to an example embodiment. The method may comprise the steps of:
step S1, taking each sensor node in the wireless sensor network as an agent;
step S2, setting network parameters for the wireless sensor network, wherein the network parameters comprise at least: environment state, action space, and reward function;
step S3, carrying out iterative interaction of multiple agents based on the network parameters, and determining an optimal strategy;
and step S4, performing resource allocation and task scheduling on the sensor nodes in the wireless sensor network according to the optimal strategy.
The scheme of the application applies the dynamic interaction theory of multiple agents to the wireless sensor network and solves the resource allocation and task scheduling problems in the wireless sensor network, so that the wireless sensor network can actively perform resource allocation and task scheduling and provide an online monitoring function under conditions where it is inaccessible and cannot be intervened in from outside, for example: controlling the temperature of a nuclear reactor, or invasive brain or muscle signal monitoring.
It should be understood that, although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not bound to a strict order and may be performed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
To further detail the technical solution of the present application, the multi-agent reinforcement learning problem is first introduced briefly.
Multi-agent reinforcement learning consists of several agents that interact with the environment and receive rewards based on those interactions. To model a wireless sensor network with reinforcement learning, this scheme treats the wireless sensor nodes as agents; each node may regard its physical surroundings, or the other nodes it tends to interact with over a period of time, as its environment.
In reinforcement learning, there is an environment state; in some embodiments, this may be a set of measurements made by the nodes, such as their battery power and spectrum availability. The set of all environment states is defined as the state space, and the size of the state space grows exponentially as the number of parameters in the set increases.
Another element to be defined is the action space. A node may receive or send a specified packet, and may even perform a specified task.
Finally, it is necessary to define how the reward function is set. Two types of reward function are considered: (1) internal rewards, i.e., the agent defines a reward function for itself based on some internal variables, such as energy usage; (2) external rewards, i.e., the agent receives certain rewards from a central controller or from other nodes, e.g., a confirmation that a packet has been successfully received.
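A small Python sketch of the two reward types follows; the variable names and the equal weighting in the combined reward are illustrative assumptions, not details from the application.

```python
def internal_reward(energy_used, energy_budget):
    # Internal reward: defined by the agent from its own internal
    # variables, here penalizing energy usage relative to a budget.
    return -energy_used / energy_budget

def external_reward(ack_received):
    # External reward: granted by a central controller or another
    # node, e.g. an acknowledgement of successful packet reception.
    return 1.0 if ack_received else 0.0

def total_reward(energy_used, energy_budget, ack_received):
    # A node may combine both signals; equal weighting is an assumption.
    return internal_reward(energy_used, energy_budget) + external_reward(ack_received)
```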
The problem of multi-agent reinforcement learning is a broad research topic. This scheme mainly considers solutions related to Q-Learning, one of the classical approaches for settings in which no model of the environment is available.
To model the environment, Q-Learning treats it as a Markov decision process, in which the transition probability is a function of the current state, the action of the agent, and the next state.
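A toy transition model in this sense is sketched below, where the probability of each next state depends on the current state and the agent's action; the state names and probabilities are invented for illustration.

```python
import random

# P[(state, action)] maps each possible next state to its probability.
P = {
    ("battery_high", "send_packet"): {"battery_high": 0.7, "battery_low": 0.3},
    ("battery_high", "sleep"):       {"battery_high": 1.0},
    ("battery_low",  "send_packet"): {"battery_low": 1.0},
    ("battery_low",  "sleep"):       {"battery_low": 0.6, "battery_high": 0.4},
}

def step(state, action):
    """Sample the next state from the Markov transition distribution."""
    dist = P[(state, action)]
    return random.choices(list(dist), weights=list(dist.values()))[0]
```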
This scheme applies multi-agent Q-Learning to the management problem of communication resources in wireless sensor networks. Three main frameworks are used to solve the multi-agent Q-Learning problem of resource allocation in wireless sensor networks: (1) treating the wireless nodes as independent learners; (2) modeling the joint-learner scenario using the framework of stochastic games; (3) for the case of one leader and several followers, converging to the optimal action value function faster.
In order to make the objects, technical solutions, and advantages of the present application more apparent, embodiments of the present application are described in further detail below with reference to the accompanying drawings.
1. Q-Learning
This scheme first defines the main parameters. A stochastic game is a tuple:

$$\left(N,\ \mathcal{S},\ \{\mathcal{U}_i\}_{i=1}^{N},\ P,\ \{R_i\}_{i=1}^{N},\ \gamma\right)$$

where $\mathcal{S} = \mathcal{S}_0 \times \mathcal{S}_1 \times \cdots \times \mathcal{S}_N$ is a state space composed of a shared state space and local state spaces: $\mathcal{S}_0$ is the shared state space, and $\mathcal{S}_i$ is the local state space of agent $i$, with $i \in \{1, \dots, N\}$.
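In an implementation, the tuple can be carried around in a plain container; the field names in this sketch are assumptions chosen to mirror the symbols above, not terms from the application.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class StochasticGame:
    n_agents: int                      # N
    shared_states: Sequence            # S0
    local_states: Sequence[Sequence]   # S1 .. SN
    action_spaces: Sequence[Sequence]  # U1 .. UN
    transition: Callable               # P(s' | s, joint action)
    rewards: Sequence[Callable]        # R1 .. RN, one per agent
    gamma: float                       # discount factor
```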
In Q-Learning, the agent finds the optimal strategy through iterative interaction with the environment. At each step, the agent first observes the environment state, which is assumed to be fully observable, and models the environment as a Markov decision process (MDP) on the basis of its current policy function. It then takes the action that changes the environment state so as to maximize its expected cumulative reward, $\max_{u \in \mathcal{U}} Q(s, u)$. Based on the reward it receives from the environment and the next state it observes, it updates its decision.
To discuss Q-Learning mathematically, it is first necessary to define a state action value function and a state value function:

$$Q(s, u) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\ u_0 = u\right] \tag{1}$$

$$V(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s\right] \tag{2}$$

The state action value function (or Q-function, equation 1) gives the expected cumulative reward obtained by starting from state $s$ and taking action $u$ from the set of available actions, while the state value function (or V-function, equation 2) gives the expected cumulative reward obtained by starting from state $s$. Note that the Q-function conditions on both the state and the action, whereas the V-function conditions on the state only; both are expectations, but over different arguments.
The discount factor $\gamma$, with $0 \le \gamma \le 1$, determines how far ahead the agent looks when making a decision. The larger $\gamma$ is, the more future steps the agent takes into account, but the harder training becomes; the smaller $\gamma$ is, the more the agent focuses on immediate benefits, and the easier training becomes.
If the optimal action value function is known, the optimal strategy can be computed as:

$$\pi^{*}(s) = \arg\max_{u \in \mathcal{U}} Q^{*}(s, u) \tag{3}$$

In Q-Learning, the agent interacts with the environment iteratively. At each step, it updates its state action value function and state value function based on the state it started from, the action it took, the reward it obtained, and the state it arrived in:

$$Q(s, u) \leftarrow (1 - \alpha)\, Q(s, u) + \alpha \left[\, r + \gamma\, V(s') \,\right], \qquad V(s) = \max_{u \in \mathcal{U}} Q(s, u) \tag{4}$$

In equation 4, $\alpha$ denotes the learning rate. The goal of Q-Learning is to converge iteratively to the optimal state action value function and state value function.
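A minimal tabular implementation of the update in equation 4 is sketched below. It assumes an environment object exposing gym-style reset() and step() methods that return (next state, reward, done); that interface is an assumption of the sketch, not part of the application.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)], initialized to zero
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: explore occasionally, otherwise act greedily
            if random.random() < epsilon:
                u = random.choice(actions)
            else:
                u = max(actions, key=lambda a: Q[(s, a)])
            s_next, r, done = env.step(u)
            # V(s') = max over the action space, as in equation 4
            v_next = max(Q[(s_next, a)] for a in actions)
            Q[(s, u)] = (1 - alpha) * Q[(s, u)] + alpha * (r + gamma * v_next)
            s = s_next
    return Q
```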
2. Extending Q-Learning to Multi-Agent Scenarios
As shown in fig. 2, multiple agents interact with the same environment. The most obvious solution is to treat them as independent learners, each interacting with the environment on its own, and to add an agent index $i$ to the state action value function, the state value function, and the reward function:

$$Q_i(s, u_i), \qquad V_i(s), \qquad r_i(s, u_i, s') \tag{5}$$
This approach has several problems:
first, in this case, the agents may selfish attempt to maximize their expected cumulative rewards without regard to the actions of the other agents.
Second, a agent cannot unilaterally maximize its expected cumulative reward without regard to other agent behaviors.
Finally, the definition of the cost function is no longer valid. The expected jackpot cannot be updated by maximizing the operational cost function for the set of available operations for agent i.
To address the first and second problems, the actions of the other agents can be added to the state action value function and the reward function (equation 6):

$$Q_i(s, u_1, \dots, u_N), \qquad r_i(s, u_1, \dots, u_N, s') \tag{6}$$
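In tabular form, equation 6 amounts to one Q-table per agent keyed by the joint action; storage grows as $|\mathcal{U}|^N$, which is the dimensionality problem discussed below. A sketch, in which the plain max over joint actions is only a placeholder for the equilibrium-based values of Section 3.2:

```python
from collections import defaultdict

def make_joint_q_tables(n_agents):
    """One table per agent: Q[i][(state, joint_action)] -> value."""
    return [defaultdict(float) for _ in range(n_agents)]

def joint_update(Q, i, s, joint_u, r_i, s_next, joint_actions,
                 alpha=0.1, gamma=0.9):
    # Placeholder next-state value: a plain max over joint actions.
    # The equilibrium-based updates of Section 3.2 replace this max.
    v_next = max(Q[i][(s_next, ju)] for ju in joint_actions)
    Q[i][(s, joint_u)] = ((1 - alpha) * Q[i][(s, joint_u)]
                          + alpha * (r_i + gamma * v_next))
```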
3. Methods for Finding the Optimal Value Function
Generally, there are two main methods for updating the value function:
A. adopting the stochastic game framework, a generalized form of the Markov decision process, which is suitable for multiple agents interacting with the same environment simultaneously; B. using extensive-form games to model scenarios in which actions are taken sequentially.
In the application of wireless sensor network resource management, the methods for finding the optimal value function fall into two main frameworks.
3.1 Independent Agents
For the problem of wireless sensor network resource management, a multi-agent Q-Learning algorithm based on independent learners is provided. Although training the sensor nodes as joint-action learners is more accurate, in most cases the performance of the agents under the two frameworks is nearly the same.
This approach reduces training costs, whether for the entire network or for just one new sensor node, as well as the need for communication between nodes.
There are two cases in which this approach is not feasible: (1) when strictly coordinated actions must be performed among the agents for specific tasks; (2) when there is a delay between the action taken by an agent and the reward it obtains, for example when the agent needs to wait for some confirmation from the recipient, so that the node cannot connect the delayed reward with its action.
3.2 Stochastic Games
Modeling the multi-agent Q-Learning problem with the framework of stochastic games is the primary and classical approach. The three most successful algorithms for updating the value function are Nash Q-Learning, Friend-or-Foe Q-Learning, and Minimax Q-Learning.
The authors of these three methods all show that, in certain cases, the action value function converges to the optimal value.
The main challenge of this framework is the dimensionality, which makes it difficult to train a large number of agents.
In Minimax Q-Learning, the value functions of the two agents in a two-player zero-sum game can be updated as follows:

$$V_i(s) = \max_{\pi_i(s,\cdot)}\ \min_{u_{-i} \in \mathcal{U}_{-i}} \sum_{u_i \in \mathcal{U}_i} \pi_i(s, u_i)\, Q_i(s, u_i, u_{-i})$$

where $\pi_i(s,\cdot)$ is a mixed strategy of agent $i$ over its actions and $u_{-i}$ denotes the opponent's action.
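A simplified sketch of this minimax value, restricted to pure strategies for brevity (the full algorithm maximizes over mixed strategies, typically via linear programming):

```python
def minimax_value(Q1, s, my_actions, opp_actions):
    """V_1(s) = max over u1 of min over u2 of Q_1(s, u1, u2),
    with Q1 keyed as Q1[(s, u1, u2)] (e.g. a defaultdict)."""
    return max(
        min(Q1[(s, u1, u2)] for u2 in opp_actions)
        for u1 in my_actions
    )
```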
For a more general scenario, each agent has friends and foes. Based on this assumption, the value function in Friend-or-Foe Q-Learning can be updated as follows:

$$V_i(s) = \max_{u_i,\, u_{\text{friends}}}\ \min_{u_{\text{foes}}} Q_i\!\left(s,\, u_i,\, u_{\text{friends}},\, u_{\text{foes}}\right)$$

where $u_{\text{friends}}$ and $u_{\text{foes}}$ denote the joint actions of agent $i$'s friends and foes, respectively.
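A pure-strategy sketch of this update, where agent i's own action set is included among the friend action sets and the friend/foe partition is given in advance, as the framework assumes:

```python
from itertools import product

def friend_or_foe_value(Qi, s, friend_action_sets, foe_action_sets):
    """Max over joint friend actions, min over joint foe actions.
    Qi is keyed as Qi[(s, friend_joint, foe_joint)]."""
    friend_joints = list(product(*friend_action_sets))
    foe_joints = list(product(*foe_action_sets))
    return max(
        min(Qi[(s, fr, fo)] for fo in foe_joints)
        for fr in friend_joints
    )
```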
A general solution, called Nash Q-Learning, updates the value function using a Nash equilibrium of the stage game defined by the current Q-functions:

$$V_i(s) = \mathrm{Nash}_i\!\left(Q_1(s, \cdot), \dots, Q_N(s, \cdot)\right)$$

that is, agent $i$'s expected payoff at a Nash equilibrium of the stage game at state $s$.
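As an illustration only, the sketch below searches for a pure-strategy equilibrium of a two-agent stage game; practical Nash Q-Learning implementations must also handle mixed equilibria, which always exist.

```python
def pure_nash_2p(Q1, Q2, s, A1, A2):
    """Find a joint action (u1, u2) where each agent's action is a
    best response to the other's in the stage game at state s."""
    for u1 in A1:
        for u2 in A2:
            best1 = all(Q1[(s, u1, u2)] >= Q1[(s, a, u2)] for a in A1)
            best2 = all(Q2[(s, u1, u2)] >= Q2[(s, u1, a)] for a in A2)
            if best1 and best2:
                return u1, u2
    return None  # no pure equilibrium; a mixed one always exists
```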
This scheme investigates multi-agent Q-Learning algorithms, analyzes the different game-theoretic frameworks, and examines the applications each framework addresses. The target application of the scheme is resource management in wireless sensor networks: the Q-Learning algorithm is extended to the multi-agent scenario, and game-theoretic frameworks are used to solve the resource allocation and task scheduling problems in the wireless sensor network.
An embodiment of the present application further provides a resource management device for a wireless sensor network, including:
the setting module is used for taking each sensor node in the wireless sensor network as an agent and setting network parameters for the wireless sensor network; the network parameters comprise at least: environment state, action space, and reward function;
the iteration module is used for carrying out iterative interaction of the multiple intelligent agents based on the network parameters and determining an optimal strategy;
and the management module is used for carrying out resource allocation and task scheduling on the sensor nodes in the wireless sensor network according to the optimal strategy.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and is not elaborated here. The modules in the resource management device can be implemented wholly or partially in software, hardware, or a combination thereof. Each module can be embedded in hardware form in, or be independent of, a processor in a computer device, or be stored in software form in a memory of the computer device, so that the processor can invoke it and execute the operations corresponding to it.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present application, the meaning of "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flowcharts or otherwise described herein may be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those skilled in the art to which the embodiments of the present application belong.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.
Claims (10)
1. A resource management method of a wireless sensor network is characterized by comprising the following steps:
taking each sensor node in the wireless sensor network as an agent;
setting network parameters for the wireless sensor network, wherein the network parameters comprise at least: environment state, action space, and reward function;
performing iterative interaction of multiple agents based on the network parameters to determine an optimal strategy;
and performing resource allocation and task scheduling on the sensor nodes in the wireless sensor network according to the optimal strategy.
2. The method of claim 1, wherein the environment state comprises: battery power and/or spectrum availability; the action space comprises: receiving or sending a specified packet, and/or performing a specified task; and the reward function comprises: internal rewards and/or external rewards.
3. The method of claim 2, wherein the internal reward is a reward function defined based on internal variables, and the external reward is a reward function defined according to feedback from a central controller or from other nodes;
each sensor node is provided with a corresponding reward function; and the other nodes are the sensor nodes in the wireless sensor network other than the node itself.
4. The method according to any one of claims 1-3, wherein taking each sensor node in the wireless sensor network as an agent comprises:
modeling the wireless sensor network as a set of agents $\{1, \dots, N\}$, where $N$ is the number of sensor nodes in the wireless sensor network; and
letting $\mathcal{S} = \mathcal{S}_0 \times \mathcal{S}_1 \times \cdots \times \mathcal{S}_N$ represent the state space, where $\mathcal{S}_0$ is a shared state space and $\mathcal{S}_i$ is the local state space of agent $i$, with $i \in \{1, \dots, N\}$.
5. The method of claim 4, wherein taking each sensor node in the wireless sensor network as an agent further comprises letting $\mathcal{U} = \mathcal{U}_1 \times \cdots \times \mathcal{U}_N$ represent the action space, where $\mathcal{U}_i$ is the action space of the $i$-th agent.
6. The method of claim 5, wherein performing the iterative interaction of the multiple agents comprises:
defining an action value function and a state value function;
converging to an optimal action value function through iterative interaction of multiple agents;
and determining an optimal strategy according to the optimal action value function.
7. The method of claim 6, wherein the action value function is:

$$Q(s, u) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, u_t, s_{t+1}) \,\middle|\, s_0 = s,\ u_0 = u\right]$$

and the state value function is:

$$V(s) = \max_{u \in \mathcal{U}} Q(s, u)$$

where $r(s, u, s')$ denotes the reward obtained by the agent when it starts from state $s$, selects action $u$ from the action space, and enters the next state $s'$, and $\gamma$ is the discount factor with $0 \le \gamma \le 1$.
8. The method of claim 7, wherein the step of iterative interaction of the multiple agents comprises updating, at each step:

$$Q(s, u) \leftarrow (1 - \alpha)\, Q(s, u) + \alpha \left[\, r(s, u, s') + \gamma\, V(s') \,\right]$$

where $\alpha$ is the learning rate.
9. The method of claim 8, wherein determining the optimal strategy according to the optimal action value function comprises computing:

$$\pi^{*}(s) = \arg\max_{u \in \mathcal{U}} Q^{*}(s, u)$$

where $\pi^{*}(s)$ denotes the optimal strategy, i.e., the action $u$ selected from the action space when in state $s$.
10. A resource management apparatus of a wireless sensor network, comprising:
the setting module is used for taking each sensor node in the wireless sensor network as an agent and setting network parameters for the wireless sensor network; the network parameters comprise at least: environment state, action space, and reward function;
the iteration module is used for carrying out iterative interaction of the multiple intelligent agents based on the network parameters and determining an optimal strategy;
and the management module is used for carrying out resource allocation and task scheduling on the sensor nodes in the wireless sensor network according to the optimal strategy.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202210255790.2A | 2022-03-16 | 2022-03-16 | Resource management method and device for wireless sensor network

Publications (1)

Publication Number | Publication Date
---|---
CN114356535A | 2022-04-15

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202210255790.2A | Resource management method and device for wireless sensor network | 2022-03-16 | 2022-03-16
Citations (8)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
US20090187641A1 | 2006-03-29 | 2009-07-23 | Cong Li | Optimization of network protocol options by reinforcement learning and propagation
CN106358203A | 2016-08-30 | 2017-01-25 | 湖南大学 | Method for spectrum allocation in a distributed cognitive wireless sensor network based on Q-learning
CN109462858A | 2017-11-08 | 2019-03-12 | 北京邮电大学 | Adaptive parameter adjustment method for wireless sensor networks
CN111065145A | 2020-01-13 | 2020-04-24 | 清华大学 | Q-learning ant colony routing method for underwater multi-agent systems
CN111641681A | 2020-05-11 | 2020-09-08 | 国家电网有限公司 | Internet of things service offloading decision method based on edge computing and deep reinforcement learning
CN113141592A | 2021-04-11 | 2021-07-20 | 西北工业大学 | Adaptive multi-path routing mechanism for long-life-cycle underwater acoustic sensor networks
CN113938917A | 2021-08-30 | 2022-01-14 | 北京工业大学 | Heterogeneous B5G/RFID intelligent resource allocation system for the industrial Internet of things
CN114095940A | 2021-11-17 | 2022-02-25 | 北京邮电大学 | Slice resource allocation method and equipment for hybrid-access cognitive wireless networks

Non-Patent Citations (1)

Title
---
我勒个矗: "强化学习(Reinforcement Learning)知识整理", https://zhuanlan.zhihu.com/p/25319023
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 2022-04-15