CN107094321B - Multi-agent Q learning-based vehicle-mounted communication MAC layer channel access method - Google Patents


Info

Publication number
CN107094321B
CN107094321B (application CN201710205247.0A)
Authority
CN
China
Prior art keywords
vehicle
action
learning
node
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710205247.0A
Other languages
Chinese (zh)
Other versions
CN107094321A (en)
Inventor
Zhao Haitao
Yu Hongsu
Shen Ruoyi
Du Aiqian
Zhu Hongbo
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201710205247.0A priority Critical patent/CN107094321B/en
Publication of CN107094321A publication Critical patent/CN107094321A/en
Application granted granted Critical
Publication of CN107094321B publication Critical patent/CN107094321B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W74/00 Wireless channel access
    • H04W74/08 Non-scheduled access, e.g. ALOHA
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/12 Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a vehicle-mounted communication MAC layer channel access method based on multi-agent Q learning, in which each vehicle node constructs its own joint state-action mapping relation and joint strategy in the VANETs environment. It is then judged whether a new vehicle node has joined the VANET network. If so, the newly joined vehicle node quickly acquires the action space, state space and reward function through transfer learning, and each vehicle node then updates its joint state-action mapping relation and joint strategy. If not, it is judged whether the current vehicle node has data to send. If there are data to send, the action-strategy solutions satisfying the correlated equilibrium are determined according to the eEQ algorithm; the action that enables the multi-agent system to finally reach the correlated equilibrium is selected from the action set; the CW value is determined and data are sent by accessing the wireless channel with that CW value. The invention raises the probability of successful data transmission, reduces the number of back-offs, and effectively improves the packet reception rate, the end-to-end transmission delay, and so on.

Description

Multi-agent Q learning-based vehicle-mounted communication MAC layer channel access method
Technical Field
The invention belongs to the technical field of Internet of things, and relates to a method for realizing MAC layer channel access based on multi-agent Q learning in vehicle-mounted communication.
Background
Since motor vehicles were invented in the second industrial revolution, and with the rapid development of the automotive field, automobiles have become an indispensable part of modern life. As the pace of daily life accelerates, the use of vehicles such as buses and private cars has become increasingly common. While the automobile makes daily travel convenient, it also causes many problems such as traffic congestion, environmental pollution and traffic accidents. Traffic congestion has become a serious social problem: it burdens road users and causes a great deal of fuel and time to be wasted every year. Not only do people waste a large amount of time on the road, but the haze caused by wasted fuel and exhaust emissions seriously threatens human health, and traffic accidents have become one of the biggest threats to human life. In view of this, future vehicular travel needs to be safer, greener (e.g., with lower exhaust emissions), fully automated, and able to offer passengers a more comfortable entertainment experience. To make the traffic infrastructure safer and more efficient, the traffic system must be sufficiently intelligent. ITS (Intelligent Transportation Systems) have been developed to improve road traffic safety, alleviate traffic congestion, reduce automobile fuel consumption and protect the environment, and have received extensive attention in both academia and industry. ITS aims to improve the quality, efficiency and safety of future traffic systems using information and communication technologies. More advanced ITS technology will be deployed in the future to manage urban traffic effectively and to improve highway and road safety. In addition, access to broadband networks via ITS technology is expected to revolutionize entertainment applications and the QoE (quality of experience) of passengers and drivers. The vehicular ad hoc network (VANET) can support ITS applications; as an important component of the ITS it improves traffic safety and traffic efficiency, reduces fuel consumption by relieving congestion, protects the environment and provides a safe and comfortable experience for passengers, enabling many novel applications (such as mobile infotainment) to come into service. VANET applications can be divided into the following categories: safety-related applications, traffic management and traffic efficiency applications, user entertainment services, and network connectivity applications, among others. These applications place varying demands on the VANET network: safety messages must be guaranteed fast access and short transmission delay, and such messages are only valid for a short time, whereas entertainment services carry large data volumes and have strict synchronization requirements. As VANETs are expected to serve a wide variety of applications, the network must support a wide variety of needs. Safety applications should be able to broadcast warning messages wirelessly between adjacent vehicles in order to inform drivers of dangerous situations quickly. To ensure efficiency, safety applications should preferably transmit data with a bounded, periodic delay, and the MAC (Media Access Control) protocol plays a crucial role in enabling VANETs to provide efficient data transmission. The MAC protocol is located at the data link layer; it must not only ensure fairness in channel access but also provide multi-channel cooperation and error control.
It is therefore necessary to design efficient and reliable MAC protocols for VANET.
At present, various VANET MAC protocols have been proposed. The WAVE standard adopts IEEE 802.11p for the MAC layer, which is based on CSMA/CA. However, when the backoff counters of multiple vehicles decrement to zero and the vehicles access the channel simultaneously, CSMA-based protocols inevitably collide, especially in high-density scenarios, causing unbounded growth of the access delay and serious packet loss. Besides the CSMA protocol, many researchers prefer TDMA-based access mechanisms in VANETs, especially for safety applications. A TDMA protocol allocates different time slots to nearby vehicles, so it has deterministic channel-access delay, good scalability and little transmission interference. However, owing to the high-speed mobility of the vehicular environment and the dynamic nature of the network density, distributed slot scheduling in VANETs becomes very difficult. In addition, some works improve the conventional backoff algorithm: the MILD and EIED algorithms have been studied and compared on the basis of the conventional binary exponential backoff algorithm, and both improve network performance after optimization. On the basis of the newMILD algorithm, a backoff algorithm based on counting statistics was then proposed: after a vehicle node successfully accesses the wireless channel and sends data, the contention window is reduced, but the algorithm sets a threshold to increase the opportunity for vehicle nodes that failed to transmit data to access the wireless channel. When the number of consecutive times a node accesses the wireless channel and successfully sends data exceeds the threshold, the node's contention window is set to the maximum value; similarly, when the number of consecutive failures to access the wireless channel and send data exceeds the threshold, the node's contention window is set to the minimum value. Simulations show that this algorithm effectively reduces the influence of hidden nodes on network performance and improves the fairness of node access to the wireless channel. Another work proposes a minimum contention window adjustment algorithm based on estimating the number of neighbor nodes, the Adaptive CWmin algorithm, which changes the adjustment rule of the minimum CW (Contention Window) and dynamically adjusts CWmin according to channel usage. The relation between the CW value and the number of nodes is derived from an IEEE 802.11 broadcast backoff Markov model, and the minimum CW value is adjusted dynamically by estimating the number of neighbor nodes; simulations show that the algorithm outperforms other methods in improving the broadcast reception rate. In addition, after a node successfully sends data, the optimal CWmin value adapted to the vehicular network conditions is calculated according to this function. The algorithm proposed in that work selects a reasonable CW after packet retransmission, shortens the time that contending nodes wait to retransmit, and increases network throughput.
However, the prior art above builds on the BEB algorithm, in which, in general, the CW value is doubled when a collision forces a backoff and is reset to 15 after data are transmitted successfully; if several nodes finish transmitting data at the same time, their CW values are all reset to 15, and when they transmit again they collide once more. The network load is given little consideration, so these methods are not suitable for networks with different load levels, i.e., they do not scale to traffic flows of different densities, and channel-access fairness is not effectively improved.
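For reference, the BEB behaviour criticized above can be summarized in a few lines. The sketch below is illustrative only: the constants follow the CW set used later in this document, the update rule is the standard doubling rule, and the function name beb_update is a hypothetical label rather than code from any cited work.

# Minimal sketch of standard binary exponential backoff (BEB): CW doubles on a
# collision and snaps back to the minimum (15) after a successful transmission,
# so nodes that finished at the same time pick small CWs again and re-collide.

CW_MIN, CW_MAX = 15, 1023

def beb_update(cw: int, success: bool) -> int:
    """Return the next contention window under binary exponential backoff."""
    if success:
        return CW_MIN                      # reset regardless of network load
    return min(2 * (cw + 1) - 1, CW_MAX)   # 15 -> 31 -> 63 -> ... -> 1023

if __name__ == "__main__":
    cw = CW_MIN
    for outcome in [False, False, True, False]:
        cw = beb_update(cw, outcome)
        print(cw)                          # 31, 63, 15, 31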
Disclosure of Invention
Aiming at these problems in the prior art, the invention provides a method for vehicle-mounted communication MAC layer channel access based on multi-agent Q learning: an IEEE 802.11p MAC layer data transmission method based on multi-agent Q learning, the QL-CW^Multi-Agent algorithm. It is completely different from the traditional BEB algorithm: in the VANET network environment, each vehicle node uses the Q-learning algorithm to learn the surrounding environment through continuous interaction. Vehicle nodes repeatedly try and err in the VANETs environment and dynamically adjust the contention window (CW) according to the feedback signals (i.e., reward values) obtained from the surrounding environment, while vehicle nodes newly joining the VANET network environment learn the network environment more quickly by means of transfer learning. A vehicle node must learn not only its own state-action mapping relation from the environment but also the state-action relations of the other vehicle nodes in the network environment, so as to construct for itself a joint state-action relation constrained by the other vehicle nodes and finally obtain its joint strategy. According to the joint strategy, it selects a CW value that also enables the other vehicle nodes to obtain the highest reward value, so that the node always accesses the channel with the optimal CW (i.e., the CW value selected when the reward value obtained from the surrounding environment is maximum), thereby reducing the data-frame collision rate and the transmission delay and improving the fairness of node access to the channel.
Therefore, the technical scheme adopted by the invention is a vehicle-mounted communication MAC layer channel access method based on multi-agent Q learning, comprising the following steps:
Step 1: in the VANETs environment, each vehicle node constructs its own joint state-action mapping relation and joint strategy according to the current network environment and the other vehicle nodes;
Step 2: judge whether a new vehicle node has joined the VANET network;
Step 3: if so, the newly joined vehicle node quickly acquires the action space, state space and reward function through transfer learning, and each vehicle node then updates its joint state-action mapping relation and joint strategy;
Step 4: if not, judge whether the current vehicle node has data to send;
Step 5: if there are data to send, determine the action-strategy solutions satisfying the correlated equilibrium according to the eEQ algorithm;
Step 6: select, from the action set {I, K, R}, the action that enables the multi-agent system to finally reach the correlated equilibrium;
Step 7: determine the CW value after the action has been executed, and access the wireless channel to send data with that CW value;
Step 8: check whether the current vehicle node still has a message to send; if not, end; if so, return to step 2.
Further, in step 3, if a new vehicle node is added into the VANET, the newly added node can quickly acquire a state space, an action space and a reward function through transfer learning, and construct a joint state-action pair mapping relationship and a joint strategy that are constrained by other vehicle nodes.
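To make the control flow of steps 4 to 8 concrete, the following minimal Python sketch walks one node through a few transmissions; the class, the method names and the stubbed eEQ policy are illustrative assumptions, not identifiers taken from the patent, and the CW transition rule simply moves one step through the state set defined later.

import random

CW_VALUES = [15, 31, 63, 127, 255, 511, 1023]   # state space S from the text
ACTIONS = ["I", "K", "R"]                        # Increase / Keep / Reduce CW

class VehicleNode:
    def __init__(self, name):
        self.name = name
        self.cw_index = 0                        # start at CW = 15

    def solve_eEQ_policy(self):
        # Placeholder for the correlated-equilibrium (eEQ) solution of step 5.
        return {a: 1 / len(ACTIONS) for a in ACTIONS}

    def pick_action(self, policy):
        # Step 6: sample an action from the equilibrium action distribution.
        return random.choices(ACTIONS, weights=[policy[a] for a in ACTIONS])[0]

    def apply_action(self, action):
        # Step 7: map the chosen action onto the next contention-window value.
        if action == "I":
            self.cw_index = min(self.cw_index + 1, len(CW_VALUES) - 1)
        elif action == "R":
            self.cw_index = max(self.cw_index - 1, 0)
        return CW_VALUES[self.cw_index]

def run(node, pending_messages):
    while pending_messages:                      # steps 4 and 8
        policy = node.solve_eEQ_policy()         # step 5
        action = node.pick_action(policy)        # step 6
        cw = node.apply_action(action)           # step 7
        print(f"{node.name}: action={action}, access channel with CW={cw}")
        pending_messages -= 1

run(VehicleNode("vehicle-1"), pending_messages=3)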
Compared with the prior art, the invention has the beneficial effects that:
1. The vehicle node of the invention interacts continuously with the surrounding environment using the Q-learning algorithm and dynamically adjusts the contention window according to the reward signal fed back by the network environment, so that the node can always access the channel with the optimal CW value the next time it sends data; this raises the probability of successful transmission, reduces the number of back-offs, and effectively improves the packet reception rate, the end-to-end transmission delay, and so on.
2. A vehicle node newly joining the network environment quickly learns the state-action mapping relation and obtains the joint strategy through transfer learning. A communication node running the proposed QL-CW^Multi-agent algorithm can adapt quickly to an unknown environment, effectively improving the packet reception rate and the packet transmission delay; more importantly, the QL-CW^Multi-agent algorithm provides higher fairness for node channel access and is suitable for network environments with different load levels.
3. The invention reduces the data-frame collision rate and the transmission delay and improves the fairness of node access to the channel. Different vehicle nodes perform Q learning in the VANET and access the wireless channel with different CW values according to the learning results. If a vehicle node's information is sent successfully, the CW value is not simply reset to 15; instead, the CW value is reduced gradually through Q learning and continuous exploration, while the opportunity for other vehicle nodes to access the wireless channel is also taken into account, so the fairness of vehicle-node access to the wireless channel in the vehicular ad hoc network is markedly improved. The algorithm remains applicable no matter how many vehicle nodes are in the network, i.e., the wireless-channel access method provided by the invention scales to different network-load scenarios.
Drawings
Fig. 1 is a flow chart of a vehicle node accessing a wireless channel by using the invention in vehicle-mounted communication.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
As shown in fig. 1, the method of the present invention comprises the steps of:
Step 1: in the VANETs environment, each vehicle node constructs its own joint state-action mapping relation and joint strategy according to the current network environment and the other vehicle nodes;
Step 2: judge whether a new vehicle node has joined the VANET network;
Step 3: if so, the newly joined vehicle node quickly acquires the action space, state space and reward function through transfer learning, and each vehicle node then updates its joint state-action mapping relation and joint strategy;
Step 4: if not, judge whether the current vehicle node has data to send;
Step 5: if there are data to send, determine the action-strategy solutions satisfying the correlated equilibrium according to the eEQ algorithm;
Step 6: select, from the action set {I, K, R}, the action that enables the multi-agent system to finally reach the correlated equilibrium;
Step 7: determine the CW value after the action has been executed, and access the wireless channel to send data with that CW value;
Step 8: check whether the current vehicle node still has a message to send; if not, end; if so, return to step 2.
The QL-CW^Multi-agent algorithm comprises the following contents:
The number of vehicles in the whole vehicular ad hoc network is N, i.e., the agent set in the multi-agent Q-learning system is N = {1, 2, ..., N}. A_n ∈ {I, K, R} denotes the discrete set of actions that vehicle n can perform during back-off while accessing the channel in the vehicular ad hoc network, namely increasing (Increase) the contention window, keeping (Keep) the contention window unchanged, or reducing (Reduce) the contention window. At a given time, vehicle n selects from A_n the action to execute, denoted a_n. The joint action set from which the N vehicles select contention-window values during back-off is then A = A_1 × A_2 × ... × A_N. The contention-window value used by a vehicle to access the wireless channel at a given time, i.e., the discrete set of environment states, is S = {15, 31, 63, 127, 255, 511, 1023}. R_n denotes the reward function by which vehicle n obtains a reward from the network environment when it successfully sends data while accessing the channel; since the reward value of the multi-agent system depends on the joint action of all vehicles, it is a mapping S × A → R. At time t, vehicle n adopts a fixed one-step strategy π_n(t), and the joint strategy is denoted π.
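The spaces just defined can be encoded directly. The snippet below is a hedged illustration: the number of vehicles and the reward shaping in reward_n are made-up placeholders, since the patent only states that R_n maps S × A to a real-valued reward.

# Illustrative encoding of the per-vehicle action set A_n = {I, K, R}, the
# joint action set A = A_1 x ... x A_N, and the contention-window state set S.

from itertools import product

N = 3                                            # number of vehicles (example)
A_n = ("I", "K", "R")                            # Increase / Keep / Reduce CW
S = (15, 31, 63, 127, 255, 511, 1023)            # contention-window states

joint_actions = list(product(A_n, repeat=N))     # A = A_1 x A_2 x ... x A_N
print(len(joint_actions))                        # 3**N = 27 joint actions

def reward_n(state: int, joint_action: tuple) -> float:
    """Hypothetical reward R_n: S x A -> R. Here, successful low-delay access
    (small CW, few other vehicles increasing contention) scores higher."""
    return 1.0 / state - 0.1 * joint_action.count("I")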
In the back-off process in which vehicle nodes in the vehicular ad hoc network need to send data and access the wireless channel, the action model, state space and reward function of any two vehicle nodes are the same. Therefore, when a new vehicle joins the vehicular ad hoc network, the knowledge already learned by an existing vehicle node can be used to reinforce the learning of other vehicle nodes, which improves their learning speed and efficiency: the new vehicle node can learn directly from the other vehicle nodes by means of transfer learning, so as to quickly learn its own state-action mapping relation and the Q-value iteration method for updating the Q table. The final aim is that a vehicle node newly joining the vehicular ad hoc network can quickly learn to adapt to the environment and solve its task using the least prior knowledge learned from the other vehicle nodes. Knowledge transfer can therefore be performed among the agents of the multi-agent system, and newly joined vehicle nodes learn the network environment more quickly by using transfer learning. The transfer learning process is as follows:
what is migrated: the action space, the state space and the reward function of any two vehicle nodes in the Q learning process are the same, so that the Q table obtained by the vehicle nodes in the vehicle-mounted self-organizing network through the Q learning can be migrated to the vehicle node newly added into the vehicle-mounted self-organizing network through the migration learning, and only the first Q maximum items in the Q table are migrated (sorted according to the Q values) in consideration of communication overhead.
How to migrate: the learned information is broadcast upon request using broadcast communication.
When to migrate: the migration is performed when a new vehicle node joins the vehicular ad hoc network.
The specific migration process is as follows: when a new vehicle node joins the vehicular ad hoc network, it broadcasts a migration request message. Each vehicle node that receives the message starts a timer whose value is inversely proportional to the inter-vehicle distance. The vehicle whose timer expires first broadcasts the q largest entries of its Q table. Once the newly joined vehicle node receives the migration information, it updates its own Q table accordingly, thereby accelerating the learning process.
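A minimal sketch of this migration step follows. The function names, the timer constant k and the merge rule are assumptions; the text only specifies a timer inversely proportional to distance and the broadcast of the q largest Q-table entries.

def migration_timer(distance_m: float, k: float = 100.0) -> float:
    """Timer inversely proportional to distance: the vehicle whose timer
    expires first answers the newcomer's migration request."""
    return k / max(distance_m, 1e-6)

def top_q_entries(q_table: dict, q: int = 5) -> dict:
    """Only the q largest Q values are broadcast, to limit communication overhead."""
    return dict(sorted(q_table.items(), key=lambda kv: kv[1], reverse=True)[:q])

def merge_migrated(own_q: dict, migrated: dict) -> dict:
    """The newcomer seeds its (empty or sparse) Q table with the migrated entries."""
    merged = dict(own_q)
    for key, value in migrated.items():
        merged[key] = max(value, merged.get(key, float("-inf")))
    return merged

# Usage: neighbours at 20 m and 80 m receive the request; their timers are
# 100/20 = 5.0 and 100/80 = 1.25, so the 80 m vehicle answers first and
# broadcasts its two largest Q-table entries to the newcomer.
donor_q = {((15,), ("R",)): 0.9, ((31,), ("K",)): 0.4, ((63,), ("I",)): -0.2}
newcomer_q = merge_migrated({}, top_q_entries(donor_q, q=2))
print(newcomer_q)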
The Q-learning algorithm depends largely on the action-value function, i.e., the Q function. In single-agent Q learning, the strategy selected by the agent (namely the mapping from a state to the probability of selecting each action) is π*(s), and the Q-value function Q(s, a) is the expected reward the agent obtains from the environment after performing action a in state s; the agent then follows the policy π*(s') = argmax_{a'} Q(s', a') to execute the action of the next state. In a multi-agent system, the Q-value function Q_n of vehicle n depends on the joint action a of all agents and is constrained by the joint policy π, and is expressed as follows:
Q_n(s(t), a_1(t), ..., a_N(t)) = R_n(s(t), a_1(t), ..., a_N(t)) + γ · Σ_{s(t+1) ∈ S} T(s(t), a_1(t), ..., a_N(t), s(t+1)) · Σ_{a(t+1) ∈ A} π(s(t+1), a(t+1)) · Q_n(s(t+1), a_1(t+1), ..., a_N(t+1))    (Formula 1)
where s(t+1) denotes the next state, i.e., the contention-window value used when vehicle n, having performed action a_n(t), again needs to access the wireless channel to send data. T: S × A × S → [0, 1] is the state-transition probability function, so T(s(t), a_1(t), a_2(t), ..., a_N(t), s(t+1)) is the probability of moving from state s(t) to state s(t+1). The sum over a(t+1) represents the reward value Q_n(s(t+1), a_1(t+1), ..., a_N(t+1)) obtained after each agent performs its action a_n(t+1) according to the strategy π_n, i.e., the reward obtainable from the network environment given the CW value s(t+1) with which vehicle n re-accesses the wireless channel after executing the I/K/R action (increase CW / keep CW unchanged / reduce CW). γ ∈ [0, 1) is the discount factor; a larger γ places more weight on subsequent reward values, while a smaller γ emphasizes the current reward value. Formula 1 states that when vehicle n has data to send at time t and accesses the wireless channel through contention window s(t), while the other vehicles respectively choose to execute actions a_1 through a_N (each action being increase CW / keep CW unchanged / reduce CW), the vehicles keep learning interactively in the vehicular ad hoc network environment according to the strategy, so that whenever a vehicle needs to access the wireless channel to send data, it can perform its back-off with an optimal CW value and then access the wireless channel to transmit.
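One sample-based way to realize Formula 1 in a tabular learner is sketched below; the learning rate ALPHA, the function name q_update and the tabular layout are assumptions, since the patent states the Bellman relation but not an explicit update schedule.

from collections import defaultdict

GAMMA = 0.9      # discount factor, gamma in [0, 1)
ALPHA = 0.1      # learning rate (assumed)

Q = defaultdict(float)   # keyed by (state, joint_action) for one vehicle n

def q_update(state, joint_action, reward, next_state_value):
    """One sample-based update of Q_n(s, a_1..a_N) toward
    R_n + gamma * V_n(s'), where V_n(s') is the policy-weighted value of the
    next contention-window state."""
    key = (state, joint_action)
    target = reward + GAMMA * next_state_value
    Q[key] += ALPHA * (target - Q[key])
    return Q[key]

# Example: vehicle n used CW = 127, the joint action was ("R", "K", "R"), the
# transmission succeeded (reward 1.0), and the next state's value is 0.5.
print(q_update(127, ("R", "K", "R"), 1.0, 0.5))   # approx. 0.145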
The final goal of reinforcement learning is that each agent finds the optimal strategy and selects the action with the maximum value function. In a cooperative game, a correlated equilibrium is a probability distribution over the joint action space. The Q-learning method that finally reaches the correlated equilibrium defines a state-value function as a linear combination of Q functions under the correlated action strategy, defined as follows:
V_n^k(s_k) = Σ_{a ∈ A} π_n*(s_k, a) · Q_n^(k-1)(s_k, a)    (Formula 3)
where V_n^k(s_k) denotes the state-value function of agent n in state s_k at the k-th iteration, which expresses the degree of correlated-equilibrium cooperation of the multi-agent system in that state; a = [a_1, ..., a_n, ..., a_N], where a_n is the action performed by the n-th agent and N is the number of agents in the multi-agent system; A is the set of joint actions available to the multi-agent system in state s_k; Q_n^(k-1)(s_k, a) denotes the Q-value function of agent n for executing joint action a in state s_k during the (k-1)-th iteration; and π_n*(s_k, a) is a probability distribution vector over the joint action set A, representing the optimal correlated-equilibrium action strategy of agent n in state s_k.
In multi-agent reinforcement learning, the joint action strategy of each agent takes the decisions and Q-value functions of the other agents into account, so that the accumulated reward values of all agents increase. For state s_k, the action assigned to the n-th agent by the joint action strategy can be determined as a correlated-equilibrium action strategy through the following inequality constraints:
Σ_{a_-n ∈ A_-n} π_n(s_k, (a_-n, a_n)) · [ Q_n^(k-1)(s_k, (a_-n, a_n)) - Q_n^(k-1)(s_k, (a_-n, a_n')) ] ≥ 0   for all a_n' ∈ A_n,
A_-n = Π_{m≠n} A_m,   a_-n = Π_{m≠n} a_m,   a = (a_-n, a_n)    (Equation 4)
where A_n denotes the action set of the n-th agent, A_-n denotes the joint action set of the agents other than agent n, a_n ∈ A_n is the action of the n-th agent, a_-n ∈ A_-n is the joint action of the agents other than agent n, and a_n' is any action in the action set of agent n; π_n denotes a feasible solution of action strategies (i.e., action probabilities) for the n-th agent that satisfies the correlated-equilibrium constraints above. Equation 4 defines a set of linear inequality constraints for solving the optimal correlated-equilibrium point, in which π_n is the unknown variable and the Q-value function is the known variable.
After the action-strategy solutions satisfying the correlated equilibrium are determined according to Equation 4, π_n is obtained by the eEQ (correlated-equilibrium Q-learning) algorithm, i.e., by maximizing the minimum reward over all agents, and the action that always maximizes the system state-value function is determined for each agent according to Formula 3, so that the multi-agent system finally reaches the correlated equilibrium.
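The eEQ step (choose the correlated-equilibrium distribution that maximizes the minimum expected reward, subject to the Equation 4 constraints) can be posed as a linear program. The sketch below uses made-up Q values and scipy's linprog purely as an illustration of that formulation; it is not the patent's own implementation.

import numpy as np
from itertools import product
from scipy.optimize import linprog

ACTIONS = ["I", "K", "R"]
N = 2
JOINT = list(product(ACTIONS, repeat=N))            # 9 joint actions
rng = np.random.default_rng(0)
Q = {n: {a: float(rng.uniform(-1, 1)) for a in JOINT} for n in range(N)}

# Variables: x = [pi(a) for a in JOINT] + [t]; minimize -t, i.e. maximize t.
num_pi = len(JOINT)
c = np.zeros(num_pi + 1); c[-1] = -1.0

A_ub, b_ub = [], []
# (1) t <= sum_a pi(a) * Q_n(a) for every agent n  ->  -sum(...) + t <= 0
for n in range(N):
    row = np.zeros(num_pi + 1)
    for j, a in enumerate(JOINT):
        row[j] = -Q[n][a]
    row[-1] = 1.0
    A_ub.append(row); b_ub.append(0.0)
# (2) correlated-equilibrium constraints (Equation 4), rewritten as <= 0
for n in range(N):
    for a_n in ACTIONS:
        for a_alt in ACTIONS:
            if a_alt == a_n:
                continue
            row = np.zeros(num_pi + 1)
            for j, a in enumerate(JOINT):
                if a[n] != a_n:
                    continue
                a_dev = list(a); a_dev[n] = a_alt
                row[j] = -(Q[n][a] - Q[n][tuple(a_dev)])
            A_ub.append(row); b_ub.append(0.0)

A_eq = [np.ones(num_pi + 1)]; A_eq[0][-1] = 0.0     # probabilities sum to 1
res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              A_eq=np.array(A_eq), b_eq=[1.0],
              bounds=[(0, 1)] * num_pi + [(None, None)])
pi = dict(zip(JOINT, res.x[:num_pi]))
print("max-min value:", -res.fun)
print("most likely joint action:", max(pi, key=pi.get))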
In the VANETs environment, vehicle nodes use the Q-learning algorithm to try and err repeatedly in the surrounding environment and to keep learning interactively with it, and the contention window (CW) is dynamically adjusted during the node back-off process according to the feedback signal given by the VANETs environment, so that a node can always access the channel with the optimal CW (the CW selected when the reward value obtained from the surrounding environment is maximum).
The invention applies the multi-agent Q-learning algorithm to the vehicle-mounted communication MAC channel access method, and derives the joint action set of multiple vehicle nodes in the Q-learning process and the iterative Q-value expression constrained by the joint strategy π. When a vehicle node accesses the wireless channel using the Q-learning method in the vehicular ad hoc network, it chooses to execute a joint action related to the other vehicle nodes in order to reduce contention with them. Meanwhile, transfer learning is introduced into the multi-agent Q-learning system, which speeds up the learning of a vehicle node newly joining the vehicular ad hoc network and greatly reduces the delay for that node to access the wireless channel and send data. Finally, to allow the multi-agent system to reach the correlated equilibrium, the optimal action-strategy solution is computed according to the eEQ algorithm (maximizing the minimum reward over all agents, i.e., maximizing the number of times vehicle nodes successfully access the wireless channel and send data), and actions that always maximize the reward value are then assigned to the vehicle nodes according to the optimal action strategy, so that each vehicle node can access the wireless channel with the optimal CW value and succeed in sending data to the greatest possible extent, markedly improving the fairness with which the vehicle nodes access the wireless channel.

Claims (2)

1. A vehicle-mounted communication MAC layer channel access method based on multi-agent Q learning is characterized by comprising the following steps:
step 1: in the VANETs environment, each vehicle node constructs its own joint state-action mapping relation and joint strategy according to the current network environment and the other vehicle nodes;
step 2: judging whether a new vehicle node has joined the VANET network;
step 3: if so, the newly joined vehicle node quickly acquires the action space, state space and reward function through transfer learning, and each vehicle node then updates its joint state-action mapping relation and joint strategy;
step 4: if not, judging whether the current vehicle node has data to send;
step 5: if there are data to send, determining the action-strategy solutions satisfying the correlated equilibrium according to the eEQ algorithm;
step 6: selecting, from the action set {I, K, R}, the action that enables the multi-agent system to finally reach the correlated equilibrium;
step 7: determining the CW value after the action has been executed, and accessing the wireless channel to send data with that CW value;
step 8: checking whether the current vehicle node still has a message to send; if not, ending; if so, returning to execute step 2;
the QL-CW^Multi-agent algorithm comprises the following contents:
the number of vehicles in the whole vehicular ad hoc network is N, i.e., the agent set in the multi-agent Q-learning system is N = {1, 2, ..., N}; A_n ∈ {I, K, R} denotes the discrete set of actions that vehicle n can perform during back-off while accessing the channel in the vehicular ad hoc network, namely increasing (Increase) the contention window, keeping (Keep) the contention window unchanged, or reducing (Reduce) the contention window; at a given time, vehicle n selects from A_n the action to execute, denoted a_n; the joint action set from which the N vehicles select contention-window values during back-off is A = A_1 × A_2 × ... × A_N; the contention-window value used by a vehicle to access the wireless channel at a given time, i.e., the discrete set of environment states, is S = {15, 31, 63, 127, 255, 511, 1023}; R_n denotes the reward function by which vehicle n obtains a reward from the network environment when it successfully sends data while accessing the channel, and since the reward value depends on the joint action of all vehicles, it is a mapping S × A → R; at time t, vehicle n adopts a fixed one-step strategy π_n(t), and the joint strategy is denoted π;
in the back-off process in which vehicle nodes in the vehicular ad hoc network need to send data and access the wireless channel, the action model, state space and reward function of any two vehicle nodes are the same; therefore, when a new vehicle joins the vehicular ad hoc network, the knowledge already learned by an existing vehicle node can be used to reinforce the learning of other vehicle nodes and improve their learning speed and efficiency; the new vehicle node learns directly from the other vehicle nodes by means of transfer learning so as to quickly learn its state-action mapping relation and the Q-value iteration method for updating the Q table, the final aim being that a vehicle node newly joining the vehicular ad hoc network can quickly learn to adapt to the environment and solve its task using the least prior knowledge learned from the other vehicle nodes; knowledge transfer is therefore performed among the agents of the multi-agent system, and the newly joined vehicle node learns the network environment more quickly by using transfer learning; the transfer learning process is as follows:
what is migrated: the action space, state space and reward function of any two vehicle nodes in the Q-learning process are the same, so the Q table obtained by a vehicle node in the vehicular ad hoc network through Q learning is migrated, via transfer learning, to the vehicle node newly joining the vehicular ad hoc network, and in view of the communication overhead only the first q largest entries of the Q table (sorted by Q value) are migrated;
how to migrate: the learned information is broadcast upon request using broadcast communication;
when to migrate: the migration is performed when a new vehicle node joins the vehicular ad hoc network;
the specific migration process is as follows: when a new vehicle node joins the vehicular ad hoc network, it broadcasts a migration request message; each vehicle node receiving the message starts a timer whose value is inversely proportional to the inter-vehicle distance; the vehicle whose timer expires first broadcasts the q largest entries of its Q table; and once the newly joined vehicle node receives the migration information, it updates its own Q table accordingly, thereby accelerating the learning process.
2. The multi-agent Q learning-based vehicle-mounted communication MAC layer channel access method according to claim 1, characterized in that in step 3, if a new vehicle node joins in VANET, the newly joined node can rapidly acquire a state space, an action space and a reward function through transfer learning, and construct a joint state-action mapping relation and a joint strategy constrained by other vehicle nodes.
CN201710205247.0A 2017-03-31 2017-03-31 Multi-agent Q learning-based vehicle-mounted communication MAC layer channel access method Active CN107094321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710205247.0A CN107094321B (en) 2017-03-31 2017-03-31 Multi-agent Q learning-based vehicle-mounted communication MAC layer channel access method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710205247.0A CN107094321B (en) 2017-03-31 2017-03-31 Multi-agent Q learning-based vehicle-mounted communication MAC layer channel access method

Publications (2)

Publication Number Publication Date
CN107094321A CN107094321A (en) 2017-08-25
CN107094321B true CN107094321B (en) 2020-04-28

Family

ID=59646410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710205247.0A Active CN107094321B (en) 2017-03-31 2017-03-31 Multi-agent Q learning-based vehicle-mounted communication MAC layer channel access method

Country Status (1)

Country Link
CN (1) CN107094321B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108924944B (en) * 2018-07-19 2021-09-14 重庆邮电大学 LTE and WiFi coexistence competition window value dynamic optimization method based on Q-learning algorithm
CN109582022B (en) * 2018-12-20 2021-11-02 驭势科技(北京)有限公司 Automatic driving strategy decision system and method
CN110488781B (en) * 2019-08-26 2021-09-21 华南理工大学 Production system scheduling method based on migration reinforcement learning
CN113347596B (en) * 2021-05-21 2022-09-20 武汉理工大学 Internet of vehicles MAC protocol optimization method for neighbor quantity detection and Q learning
CN114375066B (en) * 2022-01-08 2024-03-15 山东大学 Distributed channel competition method based on multi-agent reinforcement learning


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103490413A (en) * 2013-09-27 2014-01-01 华南理工大学 Intelligent electricity generation control method based on intelligent body equalization algorithm
CN104967670A (en) * 2015-06-01 2015-10-07 南京邮电大学 Vehicle network-accessing method based on IEEE 802.11p
CN105306176A (en) * 2015-11-13 2016-02-03 南京邮电大学 Realization method for Q learning based vehicle-mounted network media access control (MAC) protocol
CN106026084A (en) * 2016-06-24 2016-10-12 华南理工大学 AGC power dynamic distribution method based on virtual generation tribe

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on channel access technology based on Q learning in vehicular communication; Du Aiqian; Zhao Haitao; Liu Nanjie; Computer Technology and Development; 2017-03-10; abstract, Section 0 and Section 2 *

Also Published As

Publication number Publication date
CN107094321A (en) 2017-08-25

Similar Documents

Publication Publication Date Title
CN107094321B (en) Multi-agent Q learning-based vehicle-mounted communication MAC layer channel access method
Stanica et al. Enhancements of IEEE 802.11 p protocol for access control on a VANET control channel
CN102244683B (en) Method for improving service quality of mixed businesses in vehicular networking application
CN107864028B (en) Self-adaptive frame aggregation method in vehicle self-organizing network
CN105306176A (en) Realization method for Q learning based vehicle-mounted network media access control (MAC) protocol
CN109905921B (en) Multi-channel environment Internet of vehicles V2R/V2V cooperative data transmission scheduling method
CN109474897B (en) Hidden Markov model-based vehicle networking safety message single-hop cooperative broadcasting method
Alcaraz et al. Control-based scheduling with QoS support for vehicle to infrastructure communications
Feukeu et al. Dynamic broadcast storm mitigation approach for VANETs
Nguyen et al. Joint offloading and IEEE 802.11 p-based contention control in vehicular edge computing
CN104967670A (en) Vehicle network-accessing method based on IEEE 802.11p
CN111132083A (en) NOMA-based distributed resource allocation method in vehicle formation mode
CN110691349B (en) Adaptive control method for safe application-oriented combined power and competition window in Internet of vehicles
Deng et al. Implementing distributed TDMA using relative distance in vehicular networks
Facchina et al. Speed based distributed congestion control scheme for vehicular networks
Ali Shah et al. Coverage differentiation based adaptive tx-power for congestion and awareness control in vanets
CN113423087B (en) Wireless resource allocation method facing vehicle queue control requirement
CN109257830B (en) QoS-based vehicle-mounted network self-adaptive back-off method
CN106851765A (en) A kind of method for optimizing of the transmission trunking node of In-vehicle networking emergency safety message
Lu et al. Predictive contention window-based broadcast collision mitigation strategy for vanet
Lee et al. Back-off improvement by using q-learning in ieee 802.11 p vehicular network
CN108934081B (en) Wireless vehicle-mounted network channel access method
Gopinath et al. Channel status based contention algorithm for non-safety applications in IEEE802. 11p vehicular network
CN114916087A (en) Dynamic spectrum access method based on India buffet process in VANET system
Şahin et al. Scheduling out-of-coverage vehicular communications using reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant