Disclosure of Invention
The invention overcomes the above defects by providing a method, a device and a system for predicting a wireless communication receiving window, which avoid redundant reception by predicting the receiving window, thereby reducing receiving power consumption.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a method for predicting a wireless communication receiving window, the method having a training mode and a utilization mode,
wherein the training mode comprises the steps of:
obtaining global information including current state information and information on whether a receiving action taken in the current time slot is correct,
performing deep reinforcement learning training based on the global information to obtain an optimal receiving strategy,
updating the current receiving strategy with the optimal receiving strategy;
and wherein the utilization mode comprises the steps of:
obtaining the current state information,
generating a receiving control signal according to the current state information and based on a current receiving strategy obtained by deep reinforcement learning training to decide whether to execute a data receiving action in the current time slot,
wherein the current state information indicates which receiving time slot the current time slot is after the last data packet reception was completed.
Further, a deep reinforcement learning technique is adopted to obtain the current receiving strategy or the optimal receiving strategy, wherein modeling is performed with a predetermined decision process according to a state space formed by the state information, an action space formed by the data-receiving actions executed by the wireless communication device, a state transition rule function and a reward function; a reinforcement learning task is executed to generate an action strategy, and the action strategy is iteratively updated according to the multi-step discounted cumulative reward until it converges to the optimal receiving strategy or the current receiving strategy; wherein the receiving strategy is a probability distribution of the receiving action over the state space.
Further, in the utilization mode, a specific method for selecting whether to perform reception is as follows:
generating a random number uniformly distributed over the interval [0, 1) and comparing it with the probability of executing the receiving action in the current state; reception is selected when the random number is less than or equal to the receiving probability, and is not selected otherwise.
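This sampling step can be sketched as follows (the function name and the injectable random source are illustrative, not part of the specification):

```python
import random

def decide_receive(recv_prob, rng=random.random):
    """Draw a uniform [0, 1) sample and compare it with the probability
    pi(s, a=1) of executing the receiving action in the current state;
    receive when the sample does not exceed that probability."""
    q = rng()
    return q <= recv_prob
```

With recv_prob close to 1 the slot is almost always received; with recv_prob close to 0 the radio stays off in almost every slot.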
Further, when the result of the deep reinforcement learning training converges in the training mode, the optimal receiving strategy is obtained, the training mode ends, and the device switches to the utilization mode; when in the utilization mode, if the data reception does not meet the reliability requirement, the device switches to the training mode.
Further, data reception failing to meet the reliability requirement includes that a packet is missed during data reception or that the packet loss rate is greater than a set threshold value.
Further, when the probability distribution of the receiving action over the state space reaches a peak at a state p, the peak state p and the k consecutive states before and after it are set as forced receiving states in which the data receiving action is executed, where k is a positive integer.
Further, in the utilization mode, after receiving a data packet from the transmitting end, setting m subsequent consecutive states as a forced receiving state, and continuously performing a data receiving action, where m is a positive integer.
Further, in the utilization mode, if a received data packet fails the CRC check or is incomplete, the data receiving action is executed continuously until a received data packet passes the CRC check and has no packet header error.
Further, the predetermined decision process is a Markov decision process;
the reinforcement learning task is modeled by the Markov decision process and represented as a quadruple E = <S, A, P, R>, where S is the state space, A is the action space, P is the state transition rule function and R is the reward function, and the output result of the reinforcement learning task is a receiving strategy π;
the action space is A = {0, 1}, where action a = 0 indicates that radio frequency reception is turned off in the current time slot and a = 1 indicates that radio frequency reception is turned on in the current time slot;
the state space is expressed as a set S = {1, 2, ..., i, ..., N}, wherein state i represents that the current time slot is the i-th receiving time slot after the latest data packet reception was completed, and N represents the maximum number of receiving time slots between two adjacently received data packets, its selection covering the idle time slot set {n_1, n_2, ..., n_L}, wherein n_j indicates the number of idle time slots of the j-th transmitting device and L indicates that there are L such values n_j in the set;
The state transition rule function P is specifically: in state i, if action a = 0 is executed, that is, reception is not performed, the state transitions to state i+1; if a = 1, i.e. reception is performed, the state transition depends on the reception result: if no data packet is received, the state transitions to state i+1; if a data packet is received, the state returns to state 1 after the reception is completed;
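A minimal sketch of this transition rule (illustrative only; the bookkeeping for multi-slot packets discussed later in the description is omitted):

```python
def next_state(state, action, packet_received):
    """State transition rule P: completing a packet reception (a = 1 and a
    packet arrived) resets to state 1; any other outcome, including skipping
    reception with a = 0, advances from state i to state i + 1."""
    if action == 1 and packet_received:
        return 1
    return state + 1
```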
the reward function R specifically means that, in the iterative process of the deep reinforcement learning task, the action a selected in state s is scored according to the global information as feedback to the deep reinforcement learning model, and subsequent actions are adjusted according to the current reward;
the probability of selecting action a in state s is π(s, a), subject to the probability normalization condition ∑_{a∈A} π(s, a) = 1;
the iterative mode of the deep reinforcement learning task adopts a Q-learning algorithm, the algorithm is converged to an optimal receiving strategy by maximizing the Q value,
estimating the Q value by using a deep neural network, wherein the definition of the convergence of the algorithm is as follows:
|π_t(s, a) - π_{t-1}(s, a)|_max ≤ 10^-6.
a wireless communication receiving window prediction device is used for controlling data transceiving of a wireless signal transceiving unit and comprises a link control module and a protection module, wherein the device is provided with a training mode and a utilization mode,
when the device is operating in the training mode,
a prediction module for obtaining global information from the link control module, performing deep reinforcement learning training to obtain an optimal receiving strategy, and updating the current receiving strategy in the protection module with the optimal receiving strategy,
wherein the global information includes current state information and information whether a receiving action taken by the wireless communication device in a current time slot is correct;
when the device is operating in the utilization mode,
the protection module is used for receiving the current state information from the link control module and generating, based on the current receiving strategy obtained by deep reinforcement learning training, a receiving control signal for deciding whether to execute a data receiving action in the current time slot,
the link control module is used for controlling the wireless signal transceiving unit to receive data based on the receiving control signal from the protection module,
and the current state information indicates which receiving time slot the current time slot is after the last data packet reception was completed.
Further, when the device is in the training mode and the result of the deep reinforcement learning training converges, the optimal receiving strategy is obtained, the training mode ends, and the device switches to the utilization mode; when the device is in the utilization mode, if the data reception does not meet the reliability requirement, the device switches to the training mode; while the device is in the training mode, the link control module controls the wireless signal transceiving unit to remain in the data receiving state in every receiving time slot.
Further, the current receiving strategy in the protection module is obtained by offline training, wherein global information collected in advance is used on an offline platform, the strategy is obtained by deep reinforcement learning training and loaded into the protection module and/or the prediction module; or the current receiving strategy in the protection module is obtained by online training, wherein, when the prediction module works in the training mode for the first time, it obtains the current receiving strategy through deep reinforcement learning training from scratch using the global information provided by the link control module, and loads it into the protection module.
Further, data reception failing to meet the reliability requirement includes that a packet is missed during data reception or that the packet loss rate is greater than a set threshold value.
Further, the processing unit comprises a wireless communication reception window prediction apparatus according to one of claims 11 to 15 and executes a wireless communication reception window prediction method according to one of claims 1 to 10.
When the invention works in the utilization mode, the current state information is acquired and a receiving control signal is generated based on the current receiving strategy obtained by deep reinforcement learning training, so as to determine whether to execute the data receiving action in the current time slot; the receiving strategy is thus used to predict the receiving window, controlling data reception, avoiding redundant reception and reducing receiving power consumption.
Furthermore, the invention also has a training mode, in which deep reinforcement learning training is carried out using the acquired global information to obtain and update the current receiving strategy, so that the receiving strategy is continuously optimized. After data has been received for a period of time, as the actual channel conditions change, if the current receiving strategy deviates such that data reception no longer meets the reliability requirement, the device switches back to the training mode to update the receiving strategy. Redundant reception can therefore be effectively avoided while reliable reception is ensured.
Detailed Description
The core idea of the invention is to adopt a deep reinforcement learning technique to predict the receiving window of the wireless communication device, so as to determine whether to execute the data receiving action in the current time slot, thereby avoiding redundant reception as far as possible and effectively reducing the power consumption of the wireless communication device.
The technical solution of the present invention will be further described in detail with reference to specific embodiments. To make the technical solutions and advantages of the embodiments of the present application clearer, the exemplary embodiments of the present application are described below in further detail with reference to the accompanying drawings; obviously, the described embodiments are only a part of the embodiments of the present application and are not exhaustive. The embodiments and the features of the embodiments in the present application may be combined with each other in the absence of conflict.
Example 1
Fig. 1 is a schematic block diagram of a wireless communication receiving window prediction apparatus according to embodiment 1 of the present invention. The apparatus includes a link control module, a protection module and a prediction module, and has a utilization mode and a training mode.
When the device works in a utilization mode, the protection module is used for receiving current state information from the link control module and generating a receiving control signal for determining whether to receive and transmit data in the current time slot based on a current receiving strategy obtained through deep reinforcement learning training; and the link control module is used for controlling the wireless signal receiving and transmitting unit to receive data based on the receiving control signal from the protection module.
The current state information indicates that the current time slot is the receiving time slot after the data packet is received last time.
The current receiving strategy can be pre-stored in the protection module or obtained through the training of a prediction module.
When the device works in a training mode, the prediction module is used for obtaining global information from the link control module, performing deep reinforcement learning training to obtain an optimal receiving strategy, and updating the current receiving strategy in the protection module according to the optimal receiving strategy; wherein the global information includes the current state information and information whether a reception action taken by the wireless communication device in the current time slot is correct.
The deep reinforcement learning task can be realized by adopting various existing or future deep reinforcement learning techniques. As a preferred implementation, the current receiving strategy or the optimal receiving strategy is obtained with a deep reinforcement learning technique as follows: according to a state space S formed by the state information s (s ∈ S), an action space A formed by the data-receiving actions a (a ∈ A) executed by the wireless communication device, a state transition rule function P and a reward function R, modeling is performed with a predetermined decision process; a reinforcement learning task is executed to generate an action strategy, and the action strategy is iteratively updated according to the multi-step discounted cumulative reward until it converges to the optimal receiving strategy or the current receiving strategy;
wherein the receiving strategy is a probability distribution of the receiving action in a state space.
The utilization mode and the training mode can be switched between each other. When the device is in the training mode and the result of the deep reinforcement learning training converges, the optimal receiving strategy is obtained, the training mode ends and the device switches to the utilization mode; when the device is in the utilization mode, if the data reception does not meet the reliability requirement, the device switches to the training mode. The link control module can perform intelligent reception through the receiving control signal output by the protection module, turning radio frequency reception off in redundant time slots to reduce power consumption.
Example 2
An embodiment 2 of the present invention provides a wireless communication device, as shown in fig. 2, which is a schematic block diagram of the wireless communication device in this embodiment, and includes a processing unit and a wireless signal transceiver unit, where the processing unit may be formed by a chip, and the wireless signal transceiver unit is formed by a radio frequency antenna, a power amplifier, and the like. The processing unit comprises a wireless communication receiving window prediction device which is composed of a link control module, a protection module and a prediction module and is used for controlling data receiving and transmitting of the wireless signal receiving and transmitting unit.
Example 3
Embodiment 3 of the present invention provides a method for predicting a wireless communication receiving window. Fig. 3 is a flowchart of the method provided in this embodiment, which may include a training mode and a utilization mode:
acquiring global information in the training mode, wherein the global information comprises the current state information and information whether a receiving action taken in the current time slot is correct or not; performing deep reinforcement learning training based on the global information, and obtaining an optimal receiving strategy when the result of the deep reinforcement learning training is converged; and updating the optimal receiving strategy as the current receiving strategy, ending the training mode, and switching to a utilization mode.
And under the utilization mode, acquiring current state information, and generating a receiving control signal according to the current state information and a current receiving strategy obtained by deep reinforcement learning training so as to determine whether to execute a data receiving action in the current time slot.
If the data reception does not meet the preset reliability requirement, the method switches to the training mode. Failing to meet the reliability requirement includes, but is not limited to, the following: a packet is missed during data reception, or the packet loss rate is greater than a set threshold value.
The method and apparatus for predicting the wireless communication receiving window in the above embodiments and the above wireless communication device will be further described in detail below.
In the embodiment of the present invention, the wireless communication receiving window prediction apparatus and the wireless communication device may provide two working modes, the training mode and the utilization mode, which can be switched between each other. In fig. 2, the utilization mode is shown in solid lines and the training mode in dashed lines.
The device may be in the utilization mode during normal operation. The protection module receives the current state information from the link control module and generates a receiving control signal for determining whether to transceive data in the current time slot based on the current receiving strategy obtained through deep reinforcement learning training; and the link control module controls, based on the receiving control signal from the protection module, whether the wireless signal transceiving unit executes the receiving action in the current time slot.
After the device has been in the utilization mode for a period of time, as the actual channel conditions change, if the data reception no longer meets the preset reliability requirement, the device switches to the training mode: the link control module activates the prediction module and the deep reinforcement learning of the prediction module is trained. After training ends, the link control module switches the device back to the utilization mode. Because the receiving strategy adopted by the wireless communication device is obtained through deep reinforcement learning training, wireless data reception can be controlled intelligently, redundant reception can be avoided, and the device stays in a low power consumption mode. In addition, the device can switch repeatedly between the training mode and the utilization mode during operation, adjusting the receiving strategy in time to adapt to various communication environments.
In addition, in a preferred embodiment, the link control module may ignore the receiving control signal from the protection module during training, i.e. not act on the output of the prediction module, and instead control the wireless signal transceiving unit to remain in the data receiving state in every receiving time slot, so that the normal wireless signal receiving activity of the device is not affected and poor channel conditions are handled effectively.
The prediction module executes a deep reinforcement learning task, and the training process is a process which continuously iterates and converges according to the global information and the output feedback.
As a preferred embodiment, the deep reinforcement learning training may be modeled by a Markov decision process, represented as a quadruple E = <S, A, P, R>, where S is the state space, A is the action space, P is the state transition rule function and R is the reward function; the output result of the reinforcement learning task is a receiving strategy π.
The action space is denoted A = {0, 1}, where action a = 0 denotes that radio frequency reception is turned off in the current time slot and a = 1 denotes that radio frequency reception is turned on in the current time slot.
The state space is represented as the set S = {1, 2, ..., i, ..., N}, where state i represents that the current time slot is the i-th receiving time slot after the last completed packet reception, and N represents the maximum number of receiving time slots between adjacent received packets. If the two communicating parties use time division duplexing, for example, in the Bluetooth specification the receiving time slots of the slave device are the even time slots, the state space does not include the odd time slots, and state i of the state space may correspond to Bluetooth time slot 2i-1, i.e. counting starts from the 1st master-device transmitting time slot after each completed data packet reception.
As shown in fig. 4, in state i, if action a = 0 is executed, that is, reception is not performed, the state transition rule function P transitions to state i+1; if a = 1, i.e. radio frequency reception is started, the state transition depends on the reception result: if no data packet is received, the state transitions to state i+1; if a data packet is received, the state returns directly to state 1 after the reception is completed, rather than entering state i+1. It should be noted that a received data packet may occupy multiple slots; for example, the 2-DH5 and 3-DH5 packets in Bluetooth are 5-slot packets. The state transition is performed only after the reception is completed, and the state remains unchanged during reception.
As an embodiment, special cases such as a received data packet failing the CRC check, being discarded and awaiting retransmission by the transmitting end may be left unhandled in the training mode and handled only in the utilization mode to ensure communication reliability. Thus, in the training mode, once the device receives a packet from the transmitting end, the packet is by default regarded as correctly received, and the state returns directly from state s to state 1 (the variable marking the state is set to 1) after the reception is completed.
For different transmitting devices or different QoS conditions, the time slot interval (i.e. the number of idle time slots) from the successful transmission of the current data packet to the transmission of the next data packet differs and is denoted as the set {n_1, n_2, ..., n_L}, which the selection of N in the state space must cover.
Here n_j indicates the number of idle time slots of the j-th transmitting device; for example, a brand-A mobile phone has 38 idle time slots, denoted n_1 = 38, a brand-B mobile phone has 36 idle time slots, denoted n_2 = 36, and so on; L indicates that there are L such values n_j in the set. For the various values in the set, the actually corresponding global information differs, the reinforcement learning process converges to different results, and different receiving strategies are obtained.
In the iterative process, the reward function R scores the action a selected in state s according to the global information (the current time slot and whether the receiving action taken is correct) as feedback, and subsequent actions are adjusted according to the current reward. Selecting reception when there is a packet to send and selecting non-reception when there is no packet to send are defined as correct receiving actions, and vice versa as incorrect receiving actions; if a correct receiving action is selected in the current state, the score is positive, and otherwise it is penalized. The specific reward value settings are shown in table 1.
TABLE 1 Reward value settings

  State of the transmitting end     a = 0    a = 1
  no data packet to send             +2       -1
  data packet to be sent             -5       +3

If the transmitting end does not send data in the current state, selecting action a = 0 is correct and reduces power consumption, scoring +2, while a = 1 is a wrong action belonging to redundant reception, scoring -1; if the transmitting end has a data packet to send in the current state, selecting a = 1 is the correct action and scores +3, while selecting a = 0 misses the packet, which is a serious error and is penalized with -5. The selection of the action in the current state depends on the size of the discounted cumulative reward of the previous time slots.
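The reward values above can be sketched as a small helper (illustrative; the -5 penalty sign is inferred from the text, which describes the missed packet as a serious error to be penalized):

```python
def reward(action, packet_pending):
    """Reward values of Table 1: +2 for correctly skipping an idle slot,
    -1 for redundant reception, +3 for correctly receiving a pending packet,
    -5 (inferred penalty sign) for missing a pending packet."""
    if packet_pending:
        return 3 if action == 1 else -5
    return 2 if action == 0 else -1
```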
The receiving strategy is the output result of the reinforcement learning task and can be expressed as the probability distribution of the action over the state space. If the probability of selecting action a in state s is denoted π(s, a), the probability normalization condition ∑_{a∈A} π(s, a) = 1 holds.
The iterative mode selects the Q-learning algorithm of reinforcement learning, which converges to the optimal receiving strategy by maximizing the Q value with the standard Q-learning update Q(s, a) ← Q(s, a) + α[R + γ·max_{a'} Q(s', a') - Q(s, a)].
In order to improve the convergence speed and accuracy of the algorithm, a deep neural network is used to estimate the Q value. The convergence of the algorithm is defined as
|π_t(s, a) - π_{t-1}(s, a)|_max ≤ 10^-6. (4)
At this point, the obtained output result of the prediction module is the receiving strategy pi. In the utilization mode, the protection module generates a receiving control signal by adopting a receiving strategy pi output by the prediction module according to the current state information from the link control module, so that the link control module controls the signal receiving and sending states of the wireless signal receiving and sending unit under the action of the receiving control signal.
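The convergence test of formula (4) can be sketched as a comparison of two successive policy tables (assuming, for illustration, that the policy is stored as a dict keyed by (state, action) pairs):

```python
def has_converged(pi_t, pi_prev, tol=1e-6):
    """Formula (4): the largest change of pi(s, a) between two successive
    iterations must not exceed the tolerance 10^-6."""
    return max(abs(pi_t[sa] - pi_prev[sa]) for sa in pi_t) <= tol
```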
In the utilization mode, the specific method by which the protection module selects whether to receive is as follows: a random number q uniformly distributed over the interval [0, 1) is generated and compared with the reception probability. For example, if the probability of executing the receiving action in the current state s is r, i.e. π(s, a=1) = r, then reception is selected when q ≤ r and not selected when q > r.
In order to deal with various special situations and ensure the reliability of reception, the utilization mode can also adopt one or more of the following reception strategies:
1) When the probability distribution obtained by the protection module from the current receiving strategy reaches a peak at a state p, the protection module sends a receiving control signal so that the link control module executes the receiving action in the peak state. To improve the robustness of the system, the peak state and the k consecutive states before and after it may be set as forced receiving states, in which the wireless signal transceiving unit is controlled to execute the receiving action, where k is a positive integer. For example, when k is 3, π(s, a=1) = 1 for s ∈ [p-3, p+3], i.e. the peak state and the 3 states on each side of it are set as forced receiving states.
2) The m consecutive states after the device receives a data packet from the transmitting end are set as forced receiving states; for example, when m is 3, π(s, a=1) = 1 for s = 1, 2, 3. This accounts for the possibility that the transmitting end retransmits data packets after the current data packet has been received, for example when it fails to receive the returned ACK, in which case the receiving end must still receive them. Thus, in one embodiment, m may be chosen with reference to the number of retransmissions.
3) If a received data packet fails the CRC check or is incomplete, the current receiving strategy obtained by deep reinforcement learning training is not used; instead, the wireless signal transceiving unit is controlled to receive continuously, waiting for retransmission by the transmitting end. Once a data packet is received that passes the CRC check and has no packet header error, the control signal is again generated using the current receiving strategy obtained by deep reinforcement learning training.
4) Normally, the state transition process returns to state 1 before reaching state N, i.e. state N should not occur. If state N does occur, or the T_poll-timeout of the Bluetooth specification is exceeded so that a packet is missed because of a large deviation, the system considers that data reception no longer meets the reliability requirement and switches to the training mode to update the strategy. After training ends and the algorithm has converged as in formula (4), the optimal action strategy is obtained again and the system switches back to the utilization mode.
5) In the utilization mode, after data has been received for a period of time, as the actual channel conditions change, if the packet loss rate counted by the system is too high, data reception is considered not to meet the reliability requirement and the system switches to the training mode again to update the receiving strategy.
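Strategies 1) and 2) above can be sketched together as a forced-reception set (function and parameter names are illustrative, not part of the specification):

```python
def forced_receive_states(peak, k, m, n_states):
    """States in which reception is forced, i.e. pi(s, a=1) = 1: the k states
    on each side of the probability peak p (strategy 1) and the first m states
    after a packet has been received (strategy 2), clipped to [1, N]."""
    around_peak = {s for s in range(peak - k, peak + k + 1) if 1 <= s <= n_states}
    after_packet = set(range(1, m + 1))
    return around_peak | after_packet
```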
The training mode in the embodiment of the invention can be carried out either online or offline, and the current receiving strategy in the protection module can accordingly be obtained by offline or online training. In offline training, global information is acquired in advance by capturing data packets, the model is trained with the acquired global information data on an offline platform such as Windows/Linux, and the training result, i.e. the receiving strategy, is loaded into the protection module as the initial receiving strategy. In online training, when the prediction module works in the training mode for the first time, the current receiving strategy is obtained through deep reinforcement learning training of the prediction module from scratch, using the global information provided by the link control module. If offline training is adopted, the prediction module only needs to update the strategy on its first run instead of training the model from the beginning; since the initial receiving strategy used for this update was trained offline and is directly used as the current receiving strategy, it is already close to the converged receiving strategy, so convergence is reached faster and the amount of computation is effectively reduced.
Example 4
Unlike the wireless communication receiving window prediction apparatus described in embodiment 1, the apparatus of this embodiment includes only a link control module and a protection module, and has only the utilization mode. The current receiving strategy in the protection module, obtained by deep reinforcement learning training, can be pre-trained on an offline platform and updated by subsequent training on that platform. The apparatus of this embodiment may also collect the global information, provide it to the offline platform, and initiate the training and updating of a new receiving strategy.
The working principle of the device of this embodiment can refer to the principle and method in the foregoing embodiments, and therefore, the detailed description is omitted.
The foregoing describes in detail the method and apparatus for predicting a wireless communication receiving window and the wireless communication device provided in the embodiments of the present invention. Specific examples have been used herein to explain the principle and implementation of the present invention, and the description of the foregoing embodiments is only intended to help understand the method and its core idea. Meanwhile, for those skilled in the art, the specific embodiments and the scope of application may vary according to the idea of the present invention. In summary, the above description is only a specific embodiment of the present invention and is not intended to limit its scope; any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included in the scope of the present invention.
It will be further appreciated by those of ordinary skill in the art that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or a combination of both; the components and steps of the examples have been described above in general functional terms to illustrate the interchangeability of hardware and software clearly. Whether these functions are performed in hardware or software depends on the particular application and design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as departing from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.