CN116137628A - Relay node selection method, device, equipment and computer readable storage medium - Google Patents

Relay node selection method, device, equipment and computer readable storage medium

Info

Publication number
CN116137628A
CN116137628A (Application CN202111375086.2A)
Authority
CN
China
Prior art keywords
signal
node
noise ratio
relay
relay node
Prior art date
Legal status
Pending
Application number
CN202111375086.2A
Other languages
Chinese (zh)
Inventor
杨科
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Suzhou Software Technology Co Ltd
Priority to CN202111375086.2A
Publication of CN116137628A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/104 Peer-to-peer [P2P] networks
    • H04L 67/1061 Peer-to-peer [P2P] networks using node-based peer discovery mechanisms
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 45/00 Routing or path finding of packets in data switching networks
    • H04L 45/12 Shortest path evaluation
    • H04L 45/123 Evaluation of link metrics
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The embodiment of the invention relates to the technical field of the Internet of Things and discloses a relay node selection method applied to a source node, comprising the following steps: transmitting a signal to each relay node and to the destination node, where each relay node is a relay node in the cooperative relay group; acquiring a first signal-to-noise ratio and a second signal-to-noise ratio at the destination node, where the first signal-to-noise ratio is the signal-to-noise ratio at the destination node when the source node transmits the signal directly to the destination node, and the second signal-to-noise ratio is the signal-to-noise ratio at the destination node when the source node transmits the signal to the destination node over a relay node link; and determining an optimal relay node from among the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio. In this way, the embodiments of the invention make the determination of the optimal relay node both simple and accurate.

Description

Relay node selection method, device, equipment and computer readable storage medium
Technical Field
The embodiment of the invention relates to the technical field of the Internet of things, in particular to a relay node selection method, a device, equipment and a computer readable storage medium.
Background
At present, modern wireless Internet of Things communication networks contain large numbers of functionally simple relay devices. Some of these relay devices cannot be recycled or recharged, and replacing them manually is economically prohibitive. To save energy and avoid wasting relay equipment, one research focus in wireless cooperative Internet of Things communication systems is the relay selection problem: selecting an optimal relay node to participate in forwarding so that communication performance is guaranteed while a number of relay nodes are spared, thereby extending the service life of the relay equipment, reducing cost and saving energy. In the course of implementing the embodiments of the invention, the inventor of the present application found that the prior-art process for determining the optimal relay node is complex and has low accuracy.
Disclosure of Invention
In view of the above problems, embodiments of the present invention provide a relay node selection method, apparatus, device and computer readable storage medium, which are used to solve the problems in the prior art that the process of determining an optimal relay node is complex and of low accuracy.
According to an aspect of the embodiment of the present invention, there is provided a relay node selection method applied to a source node, the method including:
Transmitting signals to each relay node and the destination node; the relay node is any relay node in the cooperative relay group;
acquiring a first signal-to-noise ratio and a second signal-to-noise ratio of the destination node; the first signal-to-noise ratio is the signal-to-noise ratio of the destination node when the source node directly transmits a signal to the destination node, and the second signal-to-noise ratio is the signal-to-noise ratio of the destination node when the source node transmits the signal to the destination node through a link of a relay node;
and determining an optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio.
In an alternative manner, after the signal is sent to each relay node and the destination node, the method includes: decoding the signal at each relay node; and re-encoding the signal at each relay node that decoded it successfully and sending the re-encoded signal to the destination node.
In an optional manner, re-encoding the signal at each relay node that decoded it successfully and sending it to the destination node includes: comparing the signal-to-noise ratio at each relay node after it receives the signal sent by the source node with an access threshold value to determine whether that relay node decoded successfully, thereby determining the relay nodes that decoded successfully.
In an optional manner, before determining the optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio, the method further includes:
determining the average throughput according to the following formula:

R = (1/2) log₂(1 + γ_s,d + γ_i,d)

where γ_s,d is the first signal-to-noise ratio of the link over which the source node s sends the signal directly to the destination node d, and γ_i,d is the second signal-to-noise ratio of the link from the source node s to the destination node d via the i-th relay node.
In an optional manner, determining an optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio includes: taking the set of all relay nodes as the state space set; forming the action space set from the indices of the relay nodes; determining a reward function according to the first signal-to-noise ratio and the second signal-to-noise ratio; and iteratively updating the Q-value matrix according to the state space set, the action space set, the reward function and the state transition function until training ends, to obtain the optimal relay node.
In an optional manner, iteratively updating the Q-value matrix according to the state space set, the action space set, the reward function and the state transition function until training ends, to obtain the optimal relay node, includes: randomly selecting one state from the state space set as the current state; determining the current action according to the probability of selecting each action in the action space in the current state; executing the current action to obtain a reward value; updating the Q-value function according to the reward value, the current state and the current action; and updating the annealing temperature and the learning rate, updating the current state to the next state, and returning to the step of determining the current action according to the probability of selecting each action in the action space in the current state, until training ends and the optimal relay node is obtained.
In an optional manner, after determining the optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio, the method includes: transmitting the signal to the optimal relay node; and transmitting the signal to the destination node through the optimal relay node.
According to another aspect of the embodiment of the present invention, there is provided a relay node selection apparatus, including:
the sending module is used for sending signals to each relay node and the destination node; the relay node is any relay node in the cooperative relay group;
the acquisition module is used for acquiring a first signal-to-noise ratio and a second signal-to-noise ratio of the destination node; the first signal-to-noise ratio is the signal-to-noise ratio of the destination node when the source node directly transmits a signal to the destination node, and the second signal-to-noise ratio is the signal-to-noise ratio of the destination node when the source node transmits the signal to the destination node through a link of a relay node;
and the determining module is used for determining an optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio.
According to another aspect of an embodiment of the present invention, there is provided a computing device including: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is configured to store at least one executable instruction, where the executable instruction causes the processor to perform the operations of the relay node selection method described above.
According to yet another aspect of an embodiment of the present invention, there is provided a computer-readable storage medium having stored therein at least one executable instruction that, when executed on a computing device, causes the computing device to perform the operations of the relay node selection method described above.
In the embodiment of the invention, a signal is sent to each relay node and to the destination node, where each relay node is a relay node in the cooperative relay group; a first signal-to-noise ratio and a second signal-to-noise ratio at the destination node are acquired; and an optimal relay node is determined from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio, so that the optimal relay node can be determined quickly and accurately.
The foregoing is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented according to the content of the specification, specific embodiments of the present invention are described below.
Drawings
The drawings are only for purposes of illustrating embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
fig. 1 is a schematic flow chart of a relay node selection method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a wireless cooperative networking system to which the relay node selection method according to the embodiment of the present invention is applied;
FIG. 3 shows a schematic diagram of throughput simulation employing QL-RSA and R-RSA optimal relay node selection;
fig. 4 is a three-dimensional diagram of a probability distribution of relay node selection provided by the embodiment of the invention;
fig. 5 is a schematic structural diagram of a relay node selection device according to an embodiment of the present invention;
FIG. 6 illustrates a schematic diagram of a computing device provided by an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein.
The Internet of Things is one of the three application scenarios of the fifth-generation mobile communication system and has penetrated into every aspect of daily life. Sensor node devices in the Internet of Things are simple, numerous and power-limited, which makes them unsuited to long-distance transmission; on their own they cannot form a fully covered Internet of Things communication system, and thus the goal of interconnecting everything cannot be achieved. With the commercial deployment of 5G, future communication systems require lower latency, faster transmission rates and higher-quality communication than existing systems. Relay cooperation in wireless communication networks is one of the key technologies for addressing these needs. Relay cooperation deploys relay stations between the original communication stations to assist them in communicating, which can increase channel capacity, extend communication distance, enlarge the coverage area and improve diversity gain, thereby meeting users' demand for high-quality communication.
Modern wireless Internet of Things communication networks contain large numbers of functionally simple relay devices. Some of these devices cannot be recycled or recharged, and replacing them manually is economically prohibitive. To save energy and avoid wasting relay equipment, one research focus in wireless cooperative Internet of Things communication systems is the relay selection problem: how to formulate a suitable selection criterion and select an optimal relay node to participate in forwarding, so that communication performance is guaranteed while a number of relay nodes are spared, thereby extending the service life of the relay equipment, reducing cost and saving energy. Existing relay selection schemes that choose which relay nodes participate in transmission can be roughly divided into the following three types:
1. Relay selection based on the maximum received signal-to-noise ratio at the destination. This is a common relay selection technique in which the relay is chosen so as to maximize the signal-to-noise ratio at the destination.
2. Relay selection scheme based on physical topology location of nodes. The basic idea of this scheme is to abstract the source-to-relay and relay-to-destination distances as "average hops" and then select the relay with the smallest "average hops" among the candidate relay nodes to assist the system in completing the communication.
3. Relay selection based on instantaneous channel state information. Its principle is to select the best link combination from a number of communication links for cooperative transmission; in theory it is the most convenient scheme and gives the best performance.
The inventor of the present application has found that all three relay selection technologies have certain drawbacks:
1. The scheme based on the maximum received signal-to-noise ratio at the destination relies on channel state statistics, so once the corresponding relay has been selected, relay selection is not performed again regardless of whether the channel changes; the selected relay node therefore does not adapt to channel variations.
2. For a relay selection scheme based on the physical topological location of the nodes, the computational complexity of the scheme can increase greatly with the number of candidate relay nodes. For the practical scene that the wireless internet of things has large-scale candidate relay nodes and meets the low-delay and high-quality communication requirements, the scheme clearly faces a great challenge.
3. For the scheme based on the instantaneous channel state information, although the optimal performance can be obtained most simply in theory, in most practical scenes, it is difficult for the relay end to estimate the instantaneous channel state information in real time, so that the scheme is not widely applied in practice.
In recent years, artificial intelligence has been studied and applied in fields such as the Internet of Things. The inventor of the present application therefore introduces reinforcement learning, a branch of artificial intelligence, into cooperative communication as a relay selection algorithm. In reinforcement learning, an agent finds an optimal strategy by continually interacting with an unknown environment, with the aim of reaching a given goal at minimum cost. Unlike supervised learning, reinforcement learning relies mainly on an environmental feedback signal (typically a scalar) to continually correct its behaviour strategy, and does not require manually defined judgment criteria or intervention. Compared with traditional relay selection techniques, adopting a reinforcement learning algorithm in the present application has the following advantages:
1. the final decision of reinforcement learning is only derived from the rewards of the environmental feedback, excessive external factor intervention is not needed, and when the channel state changes, the rewards of the environmental feedback also change, so that the reinforcement learning algorithm can adaptively select the relay nodes for cooperative communication.
2. The reward required for reinforcement learning is a scalar signal, so that the relay is not required to estimate instantaneous channel state information in real time, and the requirement on relay hardware equipment is greatly reduced.
3. The iterative rules of reinforcement learning are simple: the optimal strategy is learned by maximizing the cumulative reward. Different return values can be designed for different communication criteria to obtain the corresponding optimal strategy. No large amount of complex formula derivation is needed, the computational complexity is greatly reduced, and ideal performance can still be achieved, so the algorithm design is simple and general.
Fig. 1 shows a flowchart of a relay node selection method provided by an embodiment of the present invention, which is performed by a computing device. The computing device may be a server, a terminal, a source node in the internet of things, other computer devices, such as a personal computer, a tablet computer, etc., or other intelligent agent devices. As shown in fig. 1, the method comprises the steps of:
step 110: transmitting signals to each relay node and the destination node; the relay node is any relay node in the cooperative relay group.
Fig. 2 shows a wireless cooperative Internet of Things communication network consisting of a source node (S), a destination node (D) and m relay nodes (R). Since Internet of Things nodes are simply configured, it is assumed that each node is equipped with a single antenna, that the signals on the S-R-D and S-D paths use orthogonal channels through time-division multiple access (TDMA), and that every channel link experiences small-scale Rayleigh fading. The source node sends the signal to each relay node, and the signal is forwarded to the destination node by the relay nodes that decode it successfully. In addition, the source node also transmits the signal directly to the destination node over the channel that connects the source node directly to the destination node. Thus, after sending the signal to each relay node and the destination node, the method includes: decoding the signal at each relay node; and re-encoding the signal at each relay node that decoded it successfully and sending the re-encoded signal to the destination node. In the embodiment of the invention, the signal-to-noise ratio at each relay node after it receives the signal sent by the source node is compared with an access threshold value to determine whether that relay node decoded successfully, thereby determining the relay nodes that decoded successfully.
Specifically, the transmission of a signal is divided into two time slots. In the first time slot, the source node broadcasts the signal to the destination node and to all relay nodes; the signals received by the destination node and by the i-th relay node r_i can then be expressed as:

y_s,d = √P_s · h_s,d · x + η_s,d

y_s,i = √P_s · h_s,i · x + η_s,i

where y_s,d and y_s,i are the signals received by the destination node and by the i-th relay node respectively, P_s is the transmit power of the source node, h_s,d and h_s,i are the channel parameters of the S-D and S-r_i links respectively, η_s,d and η_s,i are additive white Gaussian noise with powers δ²_s,d and δ²_s,i respectively, and x is the signal broadcast by the source node, satisfying E{|x|²} = 1.
In the second time slot, the DF (decode-and-forward) protocol is used: if a relay node successfully decodes the signal transmitted by the source node, it re-encodes it and sends it to the destination node; if decoding fails, the relay node stays silent. A threshold value γ_th is therefore introduced and compared with the signal-to-noise ratio γ_s,i received by the relay node in the first stage to decide whether decoding succeeded. The signal y_i,d that the destination receives from relay r_i can be expressed as:

y_i,d = √P_i · h_i,d · x̂ + η_i,d

where P_i is the transmit power of the i-th relay node, h_i,d is the channel parameter of the r_i-D link, x̂ is the successfully decoded signal, and γ_s,i is the signal-to-noise ratio (SNR) at the i-th relay after it receives the signal, with γ_s,i = P_s |h_s,i|² / δ²_s,i. η_i,d is independent, identically distributed additive white Gaussian noise at the destination with power δ²_i,d. The access threshold γ_th is not specifically limited in the embodiment of the invention; a person skilled in the art sets it according to the specific scenario.
Step 120: and acquiring a first signal-to-noise ratio and a second signal-to-noise ratio of the destination node.
In the embodiment of the invention, the first signal-to-noise ratio is the signal-to-noise ratio of the destination node when the source node directly transmits the signal to the destination node, and the second signal-to-noise ratio is the signal-to-noise ratio of the destination node when the source node transmits the signal to the destination node through a link of the relay node.
Under the DF protocol, if relay node r_i is selected, the destination node combines the received signals by maximal-ratio combining, and the average throughput at the destination node can therefore be expressed as:

R = (1/2) log₂(1 + γ_s,d + γ_i,d)

where γ_s,d and γ_i,d are the first signal-to-noise ratio of the S-D link and the second signal-to-noise ratio of the r_i-D link respectively, with:

γ_s,d = P_s |h_s,d|² / δ²_s,d

γ_i,d = P_i |h_i,d|² / δ²_i,d
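The maximal-ratio-combining throughput above can be evaluated as in the short sketch below. The helper names and the numeric inputs are illustrative assumptions of this rewrite, not the patent's.

```python
import numpy as np

def link_snr(power, h, noise_var):
    """Instantaneous link SNR: P * |h|^2 / delta^2."""
    return power * abs(h) ** 2 / noise_var

def df_throughput(gamma_sd, gamma_id=0.0):
    """Destination throughput under DF with maximal-ratio combining.

    The factor 1/2 accounts for the two time slots; if the selected relay
    stayed silent, pass gamma_id = 0 so only the direct S-D branch counts.
    """
    return 0.5 * np.log2(1.0 + gamma_sd + gamma_id)

# Illustrative channel draws (assumed values):
gamma_sd = link_snr(1.0, 0.4 + 0.3j, 0.1)   # direct S-D link  -> first SNR
gamma_id = link_snr(1.0, 0.8 - 0.2j, 0.1)   # r_i-D link       -> second SNR
print(df_throughput(gamma_sd, gamma_id))
```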
step 130: and determining an optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio.
In practical communication it is very difficult for a relay node to estimate the instantaneous channel information of every link quickly and accurately, so the present application assumes that the relay nodes do not know the instantaneous channel information. The destination node feeds its received signal-to-noise ratio back to the source node over a feedback channel, and the source node uses this feedback as the reward in reinforcement learning to guide its selection of the best relay node for cooperative transmission.
In the embodiment of the invention, the optimal node can be selected with a temporal-difference reinforcement learning algorithm or with a Monte Carlo reinforcement learning algorithm. In the Monte Carlo algorithm, the agent (the device executing the relay node selection method, such as a server or the source node in the Internet of Things) estimates the value of the current state by sampling over many complete episodes; when enough samples are taken, the state value can be computed accurately, so Monte Carlo reinforcement learning is an unbiased estimation method. However, obtaining the return of a state with the Monte Carlo method requires sampling a complete episode, during which many random states and actions are experienced. If the distribution of action rewards spreads widely, repeated sampling can make the variance of the state return very large, which a communication system cannot tolerate; moreover, when the environment has no complete state sequences, the Monte Carlo method may not find the optimal strategy at all. In another embodiment of the invention, a temporal-difference reinforcement learning algorithm is therefore used, in which the return of the current state is estimated from the next state, so the temporal-difference method is a biased estimator. Unlike the Monte Carlo method, the temporal-difference method uses only the next random state and action, so the randomness of the estimated state return is smaller, the corresponding variance is smaller than with Monte Carlo, and the optimal strategy can be found even for Markov decision processes without complete episodes.
In one embodiment of the invention, the off-policy temporal-difference reinforcement learning algorithm Q-learning (QL) is used to determine the optimal relay node from among the relay nodes. The Q-value function (i.e. the action-value function) is updated as:

Q_{t+1}(s, a) = (1 − α) · Q_t(s, a) + α · [ r + β · max_{a'} Q_t(s', a') ]    (5)

where β ∈ [0, 1) is the discount factor, α = 1/(1 + visit(s, a)) with α ∈ (0, 1] is the learning rate, and visit(s, a) is the total number of times the Q-learning algorithm has visited the state-action pair (s, a). As the number of visits (training iterations) t → ∞, every state-action pair is updated infinitely often and α approaches zero; at that point Q_t(s, a) converges to the optimal Q*(s, a), and the optimal strategy is obtained from:

π*(s) = argmax_a Q*(s, a)

Here the training time t corresponds to the communication time, so the optimal relay node is determined as the communication time keeps increasing.
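A minimal sketch of this tabular Q-learning update with the visit-count learning rate α = 1/(1 + visit(s, a)) follows; the discount factor value used here is an assumption for illustration.

```python
import numpy as np

def q_update(Q, visits, s, a, reward, s_next, discount=0.8):
    """One Q-learning step: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*[r + discount*max_a' Q(s',a')]."""
    visits[s, a] += 1
    alpha = 1.0 / (1.0 + visits[s, a])        # learning rate decays with the visit count
    td_target = reward + discount * np.max(Q[s_next])
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * td_target
    return Q[s, a]

m = 10
Q = np.zeros((m, m))        # Q-value matrix: one row per state, one column per action
visits = np.zeros((m, m))   # visit counts for each state-action pair
q_update(Q, visits, s=2, a=5, reward=1.3, s_next=5)
best_action = int(np.argmax(Q[2]))   # greedy policy pi*(s) = argmax_a Q(s, a)
```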
Therefore, according to the first signal-to-noise ratio and the second signal-to-noise ratio, determining an optimal relay node from the relay nodes by adopting a reinforcement learning algorithm comprises:
taking the set of all the relay nodes as a state space set;
forming an action space set from the indices of the relay nodes;
determining a reward function according to the first signal-to-noise ratio and the second signal-to-noise ratio;
and iteratively updating the Q-value matrix according to the state space set, the action space set, the reward function and the state transition function until training ends, to obtain the optimal relay node.
Specifically, the source node selects a relay node according to the Softmax selection policy and transmits the signal in broadcast form in the first time slot (the selected relay node is activated, while the other relay nodes remain silent). The reward value is obtained through the feedback channel between the source node and the destination node and is used to update the Q-value matrix and guide future policy selection. Each relay node is regarded as one state of the QL system. The states and actions in the cooperative communication network are therefore defined as follows:
1. State space set (S). In the QL algorithm proposed in the present application each relay node represents one state, so the state space set S = {s_1, s_2, ..., s_m} is the set of relay nodes.
2. Action space set (A). A is formed from the indices of the relay nodes:

A = {a_1, a_2, ..., a_m}

When action a_i is executed, the system enters the next state s_i.
3. State transition function S × A → S. When the system is in state s_j and action a_k is executed, the state transition function is defined as:

f(s_j, a_k) = s_k
4. Reward function r. Its design is based on a system performance metric: the goal is to maximize the throughput at the destination node after an action is executed, and the higher the SNR obtained by the destination node, the greater its throughput. Under the DF forwarding protocol the reward function may therefore be defined as:

r = (1/2) log₂(1 + γ_s,d + γ_i,d)   if γ_s,i ≥ γ_th (the selected relay decodes successfully);
r = (1/2) log₂(1 + γ_s,d)            otherwise (the selected relay stays silent).
5. Q-value matrix Q(s_t, a_t). Whenever the agent selects an action it receives a reward and then updates the Q-table according to equation (5). At the start of training every entry of the Q-table is set to zero. In the QL algorithm the agent selects an action a_t ∈ A and then transitions to the corresponding state s_{t+1} ∈ S in order to update the Q-table. The Q-table is defined in matrix form, called the Q-value matrix, and can be written as:

Q = [ Q(s_j, a_k) ]_{j,k=1,...,m} ∈ R^{m×m}

where R^{m×m} denotes the set of m × m matrices. The system updates the Q values by continual trial and error and stores them in the Q-value matrix. (A code sketch of the reward function defined in item 4 and of this Q-value matrix is given directly below this list.)
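As noted at the end of item 5, the following sketch writes out the reward function of item 4, the state transition of item 3 and the zero-initialised Q-value matrix of item 5. The function names are this rewrite's own, and the threshold test mirrors the DF rule described earlier.

```python
import numpy as np

def reward(gamma_sd, gamma_id, gamma_si, gamma_th):
    """Reward for selecting relay r_i: the destination throughput under DF.

    If the chosen relay decoded successfully (gamma_si >= gamma_th), the relay
    branch is combined with the direct branch; otherwise the relay stays
    silent and only the direct S-D link contributes.
    """
    if gamma_si >= gamma_th:
        return 0.5 * np.log2(1.0 + gamma_sd + gamma_id)
    return 0.5 * np.log2(1.0 + gamma_sd)

def transition(s_j, a_k):
    """State transition f(s_j, a_k) = s_k: the chosen action names the next state."""
    return a_k

m = 10
Q = np.zeros((m, m))   # Q-value matrix in R^{m x m}, every entry zero at the start of training
```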
In the embodiment of the present invention, iteratively updating the Q-value matrix according to the state space set, the action space set, the reward function and the state transition function until training is completed, so as to obtain the optimal relay node, specifically includes:
randomly selecting one state from the state space set as a current state;
determining a current action according to the probability of each action in the action space selected in the current state;
executing the current action to obtain a reward value;
updating a Q value function according to the reward value, the current state and the current action;
updating the annealing temperature and the learning rate, updating the current state to the next state, and returning to the step of determining the current action according to the probability of selecting each action in the action space in the current state, until training ends and the optimal relay node is obtained.
In order to balance "exploration" and "utilization", the present application uses a Softmax selection algorithm to select actions. The action probabilities of the Softmax selection algorithm follow the Boltzmann distribution:
P(a_i | s) = exp( Q(s, a_i) / T ) / Σ_{j=1}^{m} exp( Q(s, a_j) / T )    (8)

where P(a_i | s) is the probability that the agent selects action a_i in state s, T > 0 is the annealing temperature, and Q(s, a_i) is the value of selecting action a_i in state s. For example, if in the communication of a given time slot the source node selects the 3rd relay node to transmit the information, the state s at that moment is 3, and which node (a_i) is selected for the next transmission is determined by equation (8). As equation (8) shows, a large value of T makes the action probability distribution over the relay nodes close to uniform, so the randomness is strong and the source node can fully "explore" relay nodes whose current Q values are not yet high, in the hope of obtaining a larger return. Conversely, when T is small, the probability that an action with a higher Q value is selected increases, so the source node "utilizes" the knowledge it has learned and selects the action it believes yields the highest reward. To ensure that the Q-learning algorithm explores fully in the early stage of training and maximizes utilization in the later stage, a very large initial value of T is set, which gradually decreases to a final value T_final as the number of training iterations increases, achieving a smooth transition from "exploration" to "utilization".
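A sketch of the Softmax (Boltzmann) action selection of equation (8), together with a geometric cooling step, is given below; the decay factor and temperature floor match the simulation settings described later in this description, but the function structure is an illustrative assumption.

```python
import numpy as np

def softmax_select(q_row, T, rng):
    """Choose action a_i with probability exp(Q(s,a_i)/T) / sum_j exp(Q(s,a_j)/T)."""
    logits = q_row / T
    logits = logits - logits.max()     # subtract the max for numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum()
    return rng.choice(len(q_row), p=probs), probs

def anneal(T, T_final=0.1, decay=0.9):
    """Geometric cooling: a large T gives near-uniform (exploration) probabilities,
    a small T concentrates probability on high-Q actions (utilization)."""
    return max(T * decay, T_final)

rng = np.random.default_rng(0)
action, probs = softmax_select(np.array([0.2, 1.5, 0.3]), T=10.0, rng=rng)
```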
Specifically, as known from the definitions of action sets, state sets and immediate return values in reinforcement learning, the relay selection algorithm based on Q-learning directs the source node to select the relay node in a direction of maximizing the received signal-to-noise ratio of the destination node through the design of the return values, so that the system obtains the maximum throughput. The single relay selection algorithm based on Q-learning is as follows:
(Algorithm listing rendered as an image in the original; not reproduced here.)
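The following Python sketch assembles the pieces defined in this description (Softmax selection with an annealing temperature, the threshold-based DF reward, the deterministic state transition f(s, a) = a, and the visit-count learning rate) into a single QL-RSA training loop. It is a sketch under stated assumptions: the per-relay average channel gains, powers, noise level, episode count and discount factor are illustrative and are not the patent's simulation values.

```python
import numpy as np

def ql_rsa(m=10, episodes=5000, P=1.0, noise_var=0.1, gamma_th=1.0,
           discount=0.8, T0=1e50, T_final=0.1, decay=0.9, seed=0):
    """Sketch of a Q-learning based relay selection loop (QL-RSA)."""
    rng = np.random.default_rng(seed)
    # Illustrative average channel gains per relay (a relay with a stronger
    # relay-to-destination gain should end up being selected more often).
    gain_sr = rng.uniform(0.5, 1.5, size=m)   # S -> r_i average gain
    gain_rd = rng.uniform(0.5, 1.5, size=m)   # r_i -> D average gain

    Q = np.zeros((m, m))
    visits = np.zeros((m, m))
    T = T0
    s = int(rng.integers(m))                  # random initial state

    def rayleigh(mean_gain):
        return np.sqrt(mean_gain / 2) * (rng.normal() + 1j * rng.normal())

    for _ in range(episodes):
        # Softmax (Boltzmann) action selection over the current state's Q row.
        logits = Q[s] / T
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = int(rng.choice(m, p=probs))       # action a_k: try relay r_k

        # Environment step: draw the fading links seen during this transmission
        # and compute the reward the destination would feed back.
        g_sd = P * abs(rayleigh(1.0)) ** 2 / noise_var
        g_sa = P * abs(rayleigh(gain_sr[a])) ** 2 / noise_var
        g_ad = P * abs(rayleigh(gain_rd[a])) ** 2 / noise_var
        if g_sa >= gamma_th:                  # relay decoded: MRC of both branches
            r = 0.5 * np.log2(1.0 + g_sd + g_ad)
        else:                                 # relay silent: direct link only
            r = 0.5 * np.log2(1.0 + g_sd)

        # State transition f(s, a) = a, then the tabular Q-learning update.
        s_next = a
        visits[s, a] += 1
        alpha = 1.0 / (1.0 + visits[s, a])
        Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * (r + discount * Q[s_next].max())

        T = max(T * decay, T_final)           # anneal towards utilization
        s = s_next

    return int(np.argmax(Q.max(axis=0)))      # relay index the learned policy favours

best_relay = ql_rsa()
print("relay selected by QL-RSA:", best_relay)
```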
To demonstrate the effectiveness of the method, the performance of the Q-learning based relay selection algorithm (QL-RSA) is analysed by simulation; the data in the simulation figures are obtained from 5000 independent experiments. As a comparison baseline, the present application also simulates the performance of a random relay selection algorithm (R-RSA), in which the source node selects any relay node for transmission with equal probability. The simulation parameters are set as follows: m relay nodes are uniformly distributed in an x-y plane disc of radius r = 0.5 centred at the origin, and the source and destination nodes are located at (-0.5, 0) and (0.5, 0) respectively. The channel between two nodes is modelled as Rayleigh fading whose average power decays with distance,

h_i,j ~ CN(0, d_i,j^(−v))

where d_i,j is the distance between the two nodes and the path-loss exponent of the channel is set to v = 2.5. To ensure that all relay nodes can be fully "explored" in the early stage of Q-learning training and that only the optimal relay node is "utilized" in the later stage, the initial temperature of the annealing process is set to T = 10^50 and decreases with a negative-exponential law of factor 0.9 to the final temperature T_final, which is set to 0.1. Fig. 3 compares the performance of QL-RSA and R-RSA for m = 10 relay nodes under different access thresholds. As expected, the smaller the access threshold (hereinafter simply the threshold), the greater the throughput achieved by the system, and the better QL-RSA performs relative to R-RSA. When the threshold γ_th is 3 dB, QL-RSA performs essentially the same as R-RSA. This is because in DF mode, when the threshold requirement is too high, the relay nodes essentially never decode and forward successfully, and the system essentially always uses the source-destination path for transmission; the source node therefore obtains essentially the same return whichever action QL-RSA selects from set A, which results in essentially the same performance as R-RSA. When the threshold is smaller, the probability that the relay nodes decode and forward successfully increases, relay nodes closer to the destination node can be chosen to obtain higher system throughput, and QL-RSA identifies these relay nodes through continual interactive learning with the environment, making the optimal action selection from set A and achieving throughput clearly superior to that of R-RSA.
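The node layout and distance-dependent Rayleigh channel described for the simulation could be generated as follows. Modelling h_i,j as CN(0, d_i,j^(−v)) is an assumption consistent with the text above, and the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, v = 10, 2.5                                 # number of relays, path-loss exponent

# Relays uniformly distributed in a disc of radius 0.5 centred at the origin.
radius = 0.5 * np.sqrt(rng.uniform(size=m))    # sqrt gives uniform density over the disc
angle = rng.uniform(0.0, 2.0 * np.pi, size=m)
relays = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)

src = np.array([-0.5, 0.0])                    # source node S
dst = np.array([0.5, 0.0])                     # destination node D

def channel(p, q):
    """One Rayleigh-faded channel draw whose variance decays as d^(-v)."""
    d = np.linalg.norm(p - q)
    return np.sqrt(d ** (-v) / 2.0) * (rng.normal() + 1j * rng.normal())

h_sr = np.array([channel(src, r) for r in relays])   # S -> r_i links
h_rd = np.array([channel(r, dst) for r in relays])   # r_i -> D links
```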
Fig. 4 shows a three-dimensional plot of the probability distribution of the simulated node selection. The coordinates of the relay nodes are: [(-0.089, -0.481), (0.436, 0.115), (-0.363, 0.323), (0.021, -0.040), (-0.317, 0.011), (0.294, -0.160), (-0.238, -0.148), (-0.401, -0.257), (-0.022, -0.485), (-0.106, 0.387)]. At the start of training, the annealing temperature T of the Softmax selection strategy tends to infinity, so the probability of each node being selected is close to uniform and Q-learning is in the "exploration" stage. With continual interaction with the environment, the annealing temperature decreases towards T_final; as equation (8) shows, the selection probability of each node then depends on the Q value of that node, and Q-learning is in the "utilization" stage. As the figure shows, the second relay node has the highest Q value, so as the number of training iterations keeps increasing, the probability of selecting the second relay node tends to 1 while the probability of selecting the other nodes tends to 0. Relay node 2 is therefore the optimal node at this time.
According to the first signal-to-noise ratio and the second signal-to-noise ratio, after an optimal relay node is determined from the relay nodes by adopting a reinforcement learning algorithm, the embodiment of the invention further transmits the signal to the optimal relay node, and the signal is transmitted to the destination node through the optimal relay node.
In the embodiment of the invention, a signal is sent to each relay node and to the destination node, where each relay node is a relay node in the cooperative relay group; a first signal-to-noise ratio and a second signal-to-noise ratio at the destination node are acquired; and an optimal relay node is determined from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio, so that the optimal relay node can be determined quickly and accurately.
Fig. 5 shows a schematic structural diagram of a relay node selection device according to an embodiment of the present invention. As shown in fig. 5, the apparatus 300 includes: a sending module 310, an obtaining module 320 and a determining module 330.
A transmitting module 310, configured to transmit signals to each relay node and destination node; the relay node is any relay node in the cooperative relay group;
an obtaining module 320, configured to obtain a first signal-to-noise ratio and a second signal-to-noise ratio of the destination node; the first signal-to-noise ratio is the signal-to-noise ratio of a destination node when the source node directly transmits a signal to the destination node, and the second signal-to-noise ratio is the signal-to-noise ratio of the destination node when the source node transmits the signal to the destination node through a link of a relay node;
And a determining module 330, configured to determine an optimal relay node from the relay nodes by using a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio.
In an alternative manner, after the signal is sent to each relay node and the destination node, the method includes: decoding the signal at each relay node; and re-encoding the signal at each relay node that decoded it successfully and sending the re-encoded signal to the destination node.
In an optional manner, re-encoding the signal at each relay node that decoded it successfully and sending it to the destination node includes: comparing the signal-to-noise ratio at each relay node after it receives the signal sent by the source node with an access threshold value to determine whether that relay node decoded successfully, thereby determining the relay nodes that decoded successfully.
In an optional manner, before determining the optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio, the method further includes:
determining the average throughput according to the following formula:

R = (1/2) log₂(1 + γ_s,d + γ_i,d)

where γ_s,d is the first signal-to-noise ratio of the link over which the source node s sends the signal directly to the destination node d, and γ_i,d is the second signal-to-noise ratio of the link from the source node s to the destination node d via the i-th relay node.
In an optional manner, determining an optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio includes: taking the set of all relay nodes as the state space set; forming the action space set from the indices of the relay nodes; determining a reward function according to the first signal-to-noise ratio and the second signal-to-noise ratio; and iteratively updating the Q-value matrix according to the state space set, the action space set, the reward function and the state transition function until training ends, to obtain the optimal relay node.
In an optional manner, iteratively updating the Q-value matrix according to the state space set, the action space set, the reward function and the state transition function until training ends, to obtain the optimal relay node, includes: randomly selecting one state from the state space set as the current state; determining the current action according to the probability of selecting each action in the action space in the current state; executing the current action to obtain a reward value; updating the Q-value function according to the reward value, the current state and the current action; and updating the annealing temperature and the learning rate, updating the current state to the next state, and returning to the step of determining the current action according to the probability of selecting each action in the action space in the current state, until training ends and the optimal relay node is obtained.
In an optional manner, after determining the optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio, the method includes: transmitting the signal to the optimal relay node; and transmitting the signal to the destination node through the optimal relay node.
The specific working process of the relay node selection device in the embodiment of the present invention is substantially identical to that of the above embodiment, and will not be described herein.
In the embodiment of the invention, a signal is sent to each relay node and to the destination node, where each relay node is a relay node in the cooperative relay group; a first signal-to-noise ratio and a second signal-to-noise ratio at the destination node are acquired; and an optimal relay node is determined from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio, so that the optimal relay node can be determined quickly and accurately.
FIG. 6 illustrates a schematic diagram of a computing device according to an embodiment of the present invention, and the embodiment of the present invention is not limited to a specific implementation of the computing device. The computing device may be a source node in the internet of things, may be a server, or may be other computing devices.
As shown in fig. 6, the computing device may include: a processor 402, a communication interface (Communications Interface) 404, a memory 406, and a communication bus 408.
Wherein: processor 402, communication interface 404, and memory 406 communicate with each other via communication bus 408. A communication interface 404 for communicating with network elements of other devices, such as clients or other servers. The processor 402 is configured to execute the program 410, and may specifically perform the relevant steps in the embodiment of the relay node selection method described above.
In particular, program 410 may include program code including computer-executable instructions.
The processor 402 may be a central processing unit CPU, or a specific integrated circuit ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included by the computing device may be the same type of processor, such as one or more CPUs; but may also be different types of processors such as one or more CPUs and one or more ASICs.
Memory 406 for storing programs 410. Memory 406 may comprise high-speed RAM memory or may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
Program 410 may be specifically invoked by processor 402 to cause a computing device to:
transmitting signals to each relay node and the destination node; the relay node is any relay node in the cooperative relay group;
acquiring a first signal-to-noise ratio and a second signal-to-noise ratio of the destination node; the first signal-to-noise ratio is the signal-to-noise ratio of the destination node when the source node directly transmits a signal to the destination node, and the second signal-to-noise ratio is the signal-to-noise ratio of the destination node when the source node transmits the signal to the destination node through a link of a relay node;
and determining an optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio.
In an alternative manner, after the signal is sent to each relay node and the destination node, the method includes: decoding the signal at each relay node; and re-encoding the signal at each relay node that decoded it successfully and sending the re-encoded signal to the destination node.
In an optional manner, re-encoding the signal at each relay node that decoded it successfully and sending it to the destination node includes: comparing the signal-to-noise ratio at each relay node after it receives the signal sent by the source node with an access threshold value to determine whether that relay node decoded successfully, thereby determining the relay nodes that decoded successfully.
In an optional manner, before determining the optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio, the method further includes:
determining the average throughput according to the following formula:

R = (1/2) log₂(1 + γ_s,d + γ_i,d)

where γ_s,d is the first signal-to-noise ratio of the link over which the source node s sends the signal directly to the destination node d, and γ_i,d is the second signal-to-noise ratio of the link from the source node s to the destination node d via the i-th relay node.
In an optional manner, determining an optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio includes: taking the set of all relay nodes as the state space set; forming the action space set from the indices of the relay nodes; determining a reward function according to the first signal-to-noise ratio and the second signal-to-noise ratio; and iteratively updating the Q-value matrix according to the state space set, the action space set, the reward function and the state transition function until training ends, to obtain the optimal relay node.
In an optional manner, iteratively updating the Q-value matrix according to the state space set, the action space set, the reward function and the state transition function until training ends, to obtain the optimal relay node, includes: randomly selecting one state from the state space set as the current state; determining the current action according to the probability of selecting each action in the action space in the current state; executing the current action to obtain a reward value; updating the Q-value function according to the reward value, the current state and the current action; and updating the annealing temperature and the learning rate, updating the current state to the next state, and returning to the step of determining the current action according to the probability of selecting each action in the action space in the current state, until training ends and the optimal relay node is obtained.
In an optional manner, after determining the optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio, the method includes: transmitting the signal to the optimal relay node; and transmitting the signal to the destination node through the optimal relay node.
The specific working process of the computing device according to the embodiment of the present invention is substantially the same as that of the foregoing embodiment, and will not be described herein.
In the embodiment of the invention, a signal is sent to each relay node and to the destination node, where each relay node is a relay node in the cooperative relay group; a first signal-to-noise ratio and a second signal-to-noise ratio at the destination node are acquired; and an optimal relay node is determined from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio, so that the optimal relay node can be determined quickly and accurately.
Embodiments of the present invention provide a computer readable storage medium storing at least one executable instruction that, when executed on a computing device, causes the computing device to perform a relay node selection method according to any of the method embodiments described above.
The executable instructions may be particularly useful for causing a computing device to:
transmitting signals to each relay node and the destination node; the relay node is any relay node in the cooperative relay group;
acquiring a first signal-to-noise ratio and a second signal-to-noise ratio of the destination node; the first signal-to-noise ratio is the signal-to-noise ratio of the destination node when the source node directly transmits a signal to the destination node, and the second signal-to-noise ratio is the signal-to-noise ratio of the destination node when the source node transmits the signal to the destination node through a link of a relay node;
and determining an optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio.
In an alternative manner, after the signal is sent to each relay node and the destination node, the method includes: decoding the signal at each relay node; and re-encoding the signal at each relay node that decoded it successfully and sending the re-encoded signal to the destination node.
In an optional manner, re-encoding the signal at each relay node that decoded it successfully and sending it to the destination node includes: comparing the signal-to-noise ratio at each relay node after it receives the signal sent by the source node with an access threshold value to determine whether that relay node decoded successfully, thereby determining the relay nodes that decoded successfully.
In an optional manner, before determining the optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio, the method further includes:
determining the average throughput according to the following formula:

R = (1/2) log₂(1 + γ_s,d + γ_i,d)

where γ_s,d is the first signal-to-noise ratio of the link over which the source node s sends the signal directly to the destination node d, and γ_i,d is the second signal-to-noise ratio of the link from the source node s to the destination node d via the i-th relay node.
In an optional manner, determining an optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio includes: taking the set of all relay nodes as the state space set; forming the action space set from the indices of the relay nodes; determining a reward function according to the first signal-to-noise ratio and the second signal-to-noise ratio; and iteratively updating the Q-value matrix according to the state space set, the action space set, the reward function and the state transition function until training ends, to obtain the optimal relay node.
In an optional manner, iteratively updating the Q-value matrix according to the state space set, the action space set, the reward function and the state transition function until training ends, to obtain the optimal relay node, includes: randomly selecting one state from the state space set as the current state; determining the current action according to the probability of selecting each action in the action space in the current state; executing the current action to obtain a reward value; updating the Q-value function according to the reward value, the current state and the current action; and updating the annealing temperature and the learning rate, updating the current state to the next state, and returning to the step of determining the current action according to the probability of selecting each action in the action space in the current state, until training ends and the optimal relay node is obtained.
In an optional manner, after determining the optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio, the method includes: transmitting the signal to the optimal relay node; and transmitting the signal to the destination node through the optimal relay node.
The specific working process of the instructions stored in the computer storage medium of the embodiments of the present invention, when executed on a computing device, is substantially the same as in the above embodiments and is not repeated here.
In the embodiments of the present invention, the source node sends signals to each relay node and the destination node, where each relay node belongs to the cooperative relay group; a first signal-to-noise ratio and a second signal-to-noise ratio of the destination node are acquired; and an optimal relay node is determined from the relay nodes by means of a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio, so that the optimal relay node can be determined quickly and accurately.
The embodiment of the invention provides a relay node selection device which is used for executing the relay node selection method.
Embodiments of the present invention provide a computer program that is callable by a processor to cause a computing device to perform the relay node selection method of any of the method embodiments described above.
An embodiment of the present invention provides a computer program product comprising a computer program stored on a computer readable storage medium, the computer program comprising program instructions which, when run on a computer, cause the computer to perform the relay node selection method in any of the method embodiments described above.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The structure required to construct such a system is apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided to disclose the enablement and best mode of the present invention.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the invention, various features of the embodiments are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding the understanding of one or more of the various inventive aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component, and they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.

Claims (10)

1. A relay node selection method, applied to a source node, comprising:
transmitting signals to each relay node and the destination node; the relay node is any relay node in the cooperative relay group;
acquiring a first signal-to-noise ratio and a second signal-to-noise ratio of the destination node; the first signal-to-noise ratio is the signal-to-noise ratio at the destination node when the source node transmits a signal directly to the destination node, and the second signal-to-noise ratio is the signal-to-noise ratio at the destination node when the source node transmits the signal to the destination node through the link of a relay node;
and determining an optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio.
2. The method of claim 1, wherein after the signal is sent to each relay node and the destination node, the method comprises:
decoding the signal by each relay node;
and re-encoding the signal by the successfully decoded relay node and then sending the re-encoded signal to the destination node.
3. The method according to claim 2, wherein re-encoding the signal by the successfully decoded relay node and then transmitting the re-encoded signal to the destination node comprises:
comparing the signal-to-noise ratio at each relay node, after it receives the signal sent by the source node, with an access threshold value to determine whether that relay node has decoded successfully, thereby determining the successfully decoded relay nodes.
4. The method of claim 1, wherein prior to determining an optimal relay node from each of the relay nodes using a reinforcement learning algorithm based on the first signal-to-noise ratio and the second signal-to-noise ratio, the method further comprises:
the average throughput is determined according to the following formula:
[Equation image in the original claims: the average throughput expressed in terms of γ_{s,d} and γ_{i,d}]
where γ_{s,d} is the first signal-to-noise ratio of the link over which the source node s sends the signal directly to the destination node d, and γ_{i,d} is the second signal-to-noise ratio of the link from the source node s to the destination node d via the i-th relay node.
5. The method according to any one of claims 1-4, wherein the determining an optimal relay node from the relay nodes using a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio comprises:
taking the set of all the relay nodes as a state space set;
taking the serial numbers of the relay nodes as an action space set;
determining a reward function according to the first signal-to-noise ratio and the second signal-to-noise ratio;
and iteratively updating the Q-value matrix according to the state space set, the action space set, the reward function and the state transition function until training ends, to obtain the optimal relay node.
6. The method of claim 5, wherein iteratively updating the Q-value matrix according to the state space set, the action space set, the reward function, and the state transition function until training is completed, to obtain an optimal relay node, comprises:
randomly selecting one state from the state space set as a current state;
determining a current action according to the probability of selecting each action in the action space in the current state;
executing the current action to obtain a reward value;
updating a Q value function according to the reward value, the current state and the current action;
updating the annealing temperature and the learning rate, updating the current state to the next state, and returning to the step of determining the current action according to the probability of selecting each action in the action space in the current state, until training ends, to obtain the optimal relay node.
7. The method of claim 1, wherein after determining an optimal relay node from each of the relay nodes using a reinforcement learning algorithm based on the first signal-to-noise ratio and the second signal-to-noise ratio, the method comprises:
transmitting the signal to the optimal relay node;
and transmitting the signal to the destination node through the optimal relay node.
8. A relay node selection apparatus, the apparatus comprising:
the sending module is used for sending signals to each relay node and the destination node; the relay node is any relay node in the cooperative relay group;
the acquisition module is used for acquiring a first signal-to-noise ratio and a second signal-to-noise ratio of the destination node; the first signal-to-noise ratio is the signal-to-noise ratio at the destination node when the source node transmits a signal directly to the destination node, and the second signal-to-noise ratio is the signal-to-noise ratio at the destination node when the source node transmits the signal to the destination node through the link of a relay node;
and the determining module is used for determining an optimal relay node from the relay nodes by adopting a reinforcement learning algorithm according to the first signal-to-noise ratio and the second signal-to-noise ratio.
9. A computing device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with each other through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform the operations of the relay node selection method according to any one of claims 1-7.
10. A computer readable storage medium, wherein at least one executable instruction is stored in the storage medium, and the executable instruction, when executed on a computing device, causes the computing device to perform the operations of the relay node selection method according to any one of claims 1-7.
CN202111375086.2A 2021-11-17 2021-11-17 Relay node selection method, device, equipment and computer readable storage medium Pending CN116137628A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111375086.2A CN116137628A (en) 2021-11-17 2021-11-17 Relay node selection method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111375086.2A CN116137628A (en) 2021-11-17 2021-11-17 Relay node selection method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116137628A true CN116137628A (en) 2023-05-19

Family

ID=86333208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111375086.2A Pending CN116137628A (en) 2021-11-17 2021-11-17 Relay node selection method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116137628A (en)

Similar Documents

Publication Publication Date Title
Zhao et al. Deep reinforcement learning based mobile edge computing for intelligent Internet of Things
Chen et al. iRAF: A deep reinforcement learning approach for collaborative mobile edge computing IoT networks
Tang et al. Computational intelligence and deep learning for next-generation edge-enabled industrial IoT
Li et al. NOMA-enabled cooperative computation offloading for blockchain-empowered Internet of Things: A learning approach
Wei et al. Joint optimization of caching, computing, and radio resources for fog-enabled IoT using natural actor–critic deep reinforcement learning
Nguyen et al. Non-cooperative energy efficient power allocation game in D2D communication: A multi-agent deep reinforcement learning approach
Nguyen et al. Distributed deep deterministic policy gradient for power allocation control in D2D-based V2V communications
Xu et al. Outage probability performance analysis and prediction for mobile IoV networks based on ICS-BP neural network
CN112261674A (en) Performance optimization method of Internet of things scene based on mobile edge calculation and block chain collaborative enabling
Chen et al. Deep Q-Network based resource allocation for UAV-assisted Ultra-Dense Networks
Geng et al. Hierarchical reinforcement learning for relay selection and power optimization in two-hop cooperative relay network
Chen et al. Edge intelligence computing for mobile augmented reality with deep reinforcement learning approach
Sun et al. Graph-reinforcement-learning-based task offloading for multiaccess edge computing
Arroyo-Valles et al. A censoring strategy for decentralized estimation in energy-constrained adaptive diffusion networks
Sacco et al. A self-learning strategy for task offloading in UAV networks
Chua et al. Resource allocation for mobile metaverse with the Internet of Vehicles over 6G wireless communications: A deep reinforcement learning approach
Fang et al. Smart collaborative optimizations strategy for mobile edge computing based on deep reinforcement learning
CN111741520B (en) Cognitive underwater acoustic communication system power distribution method based on particle swarm
CN113923743A (en) Routing method, device, terminal and storage medium for electric power underground pipe gallery
Geng et al. Deep deterministic policy gradient for relay selection and power allocation in cooperative communication network
CN116827515A (en) Fog computing system performance optimization algorithm based on blockchain and reinforcement learning
CN116137628A (en) Relay node selection method, device, equipment and computer readable storage medium
Wu et al. Online learning to optimize transmission over an unknown gilbert-elliott channel
Dinh et al. Deep reinforcement learning-based offloading for latency minimization in 3-tier v2x networks
CN115442812A (en) Deep reinforcement learning-based Internet of things spectrum allocation optimization method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination