WO2024032228A1

WO2024032228A1 - Reinforcement learning training method and related device

Info

Publication number: WO2024032228A1
Application number: PCT/CN2023/104247
Authority: WO
Inventors: 刘鹏; 郭子阳; 罗嘉俊; 舒同欣; 杨讯; 颜敏
Original assignee: 华为技术有限公司
Priority date: 2022-08-12
Filing date: 2023-06-29
Publication date: 2024-02-15
Also published as: CN117651346A

Abstract

The present application provides a reinforcement learning training method and a related device. The method comprises: determining a first reward value according to the actions of a plurality of stations, wherein the first reward value is a reward value of a first station in the plurality of stations, and the first reward value is used for the first station to perform reinforcement learning training; and sending the first reward value to the first station. It can be seen that a reward value is determined according to the actions of a plurality of stations, such that the calculation of the reward value can be performed by considering the mutual influence between users, thereby improving the accuracy of the reward value. Thus, the actual application effect can be improved after the station uses the reward value to perform reinforcement learning training. The present application can be applied to WLAN systems such as EHT, Wi-Fi 7, or Wi-Fi 8.

Description

A training method and related devices for reinforcement learning

This application claims priority to the Chinese patent application filed with the China Patent Office on August 12, 2022, with application number 202210968171.8 and the title "A training method for reinforcement learning and related devices", the entire content of which is incorporated by reference in this application.

Technical field

The present application relates to the fields of computer technology and communication technology, and in particular, to a reinforcement learning training method and related devices.

Background technique

Reinforcement learning is a general method for sequential decision-making. The agent learns in a "trial and error" manner, and the reward value obtained by interacting with the environment through actions guides its behavior. The goal is to enable the agent to obtain the maximum reward value. At present, reinforcement learning training often relies on actions, environmental states and reward values. However, in existing schemes the accuracy of the obtained reward value is low, which leads to poor practical application results after reinforcement learning training is performed with the actions, environmental states and reward values.

Contents of the invention

This application provides a reinforcement learning training method and related devices that can improve the accuracy of the reward value, so that a station achieves a better practical application effect after using the reward value for reinforcement learning training.

In a first aspect, a reinforcement learning training method is provided. The method includes: determining a first reward value based on the actions of multiple stations, where the first reward value is the reward value of a first station among the multiple stations and is used for reinforcement learning training at the first station; and sending the first reward value to the first station. It can be seen that, by determining the reward value based on the actions of multiple stations, the calculation of the reward value can take the mutual influence between users into account, which improves the accuracy of the reward value and thus allows the station to achieve a better practical application effect after using the reward value for reinforcement learning training.

Optionally, the action of a station includes at least one of the following: the station initiates channel access, the station performs channel selection, the station performs power control, and the station performs rate adaptation.

It should be understood that the first station can be any station among the multiple stations. This means that, for any one of the multiple stations, the access point determines the reward value of that station based on the actions of the multiple stations. For example, the access point determines the reward value of station 1 based on the actions of station 1, station 2, and station 3; the access point determines the reward value of station 2 based on the actions of station 1, station 2, and station 3; and the access point determines the reward value of station 3 based on the actions of station 1, station 2, and station 3.

Optionally, the actions of different stations among the multiple stations can be exactly the same, partially the same, or completely different, which is not limited here. For example, the action of station #1 is to initiate channel access, the action of station #2 is to initiate channel access, and the action of station #3 is to initiate channel access. In this case the actions of the three stations are exactly the same. In another example, the action of station #1 is to initiate channel access, the action of station #2 is to initiate channel access, and the action of station #3 is to perform power control. In this case the actions of the three stations are partially the same. In another example, the action of station #1 is to initiate channel access, the action of station #2 is to perform rate adaptation, and the action of station #3 is to perform power control. In this case the actions of the three stations are completely different.

Reinforcement learning (RL) is used to describe and solve the problem of an agent learning strategies to maximize returns or achieve specific goals during its interaction with the environment. A common model of reinforcement learning is the Markov decision process (MDP), a mathematical model for analyzing decision-making problems. In reinforcement learning, the agent learns in a "trial and error" manner, and the rewards obtained by interacting with the environment through actions guide its behavior. The goal is to enable the agent to obtain the maximum reward. It should be understood that in this application, an agent can be understood as an AI model that includes a large number of parameters and calculation formulas (or calculation rules). Rewards can also be called reward values, return values, evaluations, etc.

Reinforcement learning uses the reinforcement signal (i.e. the reward) provided by the environment to evaluate the quality of an action, rather than telling the reinforcement learning system how to produce the correct action. Since the external environment provides little information, the agent must rely on its own experience to learn. In this way, the agent acquires knowledge in an action-evaluation (i.e., reward) environment and improves its action plan to adapt to the environment. Common reinforcement learning algorithms include the deep Q-network (DQN), proximal policy optimization (PPO), etc.

Optionally, combined with the first aspect, determining the first reward value based on the actions of multiple stations includes: determining the first reward value based on the actions of the multiple stations and the times corresponding to the actions of the multiple stations. It can be seen that, by determining the reward value based on both the actions of the multiple stations and the times corresponding to those actions, the calculation of the reward value can take into account the mutual influence between users as well as the times corresponding to the actions of different stations. This enriches the information used to determine the reward value and improves its accuracy, so that the station achieves a better practical application effect after using the reward value for reinforcement learning training.

Optionally, combined with the first aspect, the actions of the multiple stations correspond to the same time. It can be seen that, because the times corresponding to the actions of the multiple stations are the same, the accuracy of the reward value can be improved when the access point determines the reward value based on the actions of the multiple stations and the times corresponding to those actions, so that the station achieves a better practical application effect after using the reward value for reinforcement learning training.

Optionally, combined with the first aspect, the first reward value is the reward value corresponding to the first time, and the first time is the time corresponding to the action of the first site. It can be seen that because the return value is the return value corresponding to a certain time, the site can learn the actions and environmental status corresponding to that time, which in turn allows the site to improve the actual application effect after using the return value for reinforcement learning training.

Optionally, combined with the first aspect, sending the first reward value to the first station includes: sending a broadcast frame to the first station, where the broadcast frame includes the first reward value. It can be seen that, because the first reward value is carried in the broadcast frame, other stations can also receive the broadcast frame.

The broadcast frame may be, for example, a beacon frame or a trigger frame.

Optionally, combined with the first aspect, the multiple stations further include a second station, and the method further includes: if the first station and the second station send messages at the same time and transmission fails as a result, determining the reward value of the second station, where the reward value of the second station is the same as the first reward value; and sending the broadcast frame to the second station. It can be seen that, when the reward values of different stations are the same, sending a broadcast frame lets the different stations all obtain the reward value, which saves overhead.
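The overhead saving in the collision case can be illustrated with a minimal Python sketch. The `BroadcastFrame` type, its fields, and the function name are hypothetical and are not defined in this application; the sketch only shows one frame carrying one shared reward value to every station involved in the collision.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class BroadcastFrame:
    """Hypothetical container for a broadcast frame (e.g. a beacon or trigger frame)."""
    receivers: List[str]   # stations that collided and share the reward value
    reward: float          # the single reward value carried by the frame


def build_collision_broadcast(colliding_stations: Iterable[str], reward: float) -> BroadcastFrame:
    # All stations whose simultaneous transmissions failed get the same reward,
    # so one broadcast frame replaces one unicast frame per station.
    return BroadcastFrame(receivers=sorted(colliding_stations), reward=reward)


frame = build_collision_broadcast({"STA2", "STA1"}, reward=-3.0)
frames_saved = len(frame.receivers) - 1  # unicast frames avoided
```

With two colliding stations, a single broadcast frame is sent instead of two separate frames.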

Optionally, combined with the first aspect, sending the first reward value to the first station includes: sending a response frame of a first message to the first station, where the response frame of the first message includes the first reward value, the first reward value corresponds to a second message, and the second message is received before the first message. It can be seen that the reward value corresponding to the second message can be carried in the response frame of the first message. Because the second message is received before the first message, the reward value corresponding to the second message is sent with a delay, which provides more time for calculating the reward value.

Wherein, the correspondence between the first return value and the second message can be understood as: the first return value corresponds to the action of the first station in the second message. The time corresponding to the action of the first station is the time when the access point receives the second message.

In this application, the response frame may be, for example, an acknowledgment (ACK) frame, a clear to send (CTS) frame, or a block acknowledgment (block ACK, BA) frame, etc.

Optionally, combined with the first aspect, the response frame of the first message further includes identification information of the second message or a timestamp of the second message. It can be seen that since the response frame of the first message also includes the identification information of the second message or the timestamp of the second message, the first station can learn which message the first return value corresponds to.

In a possible implementation, the identification information of the second message may be, for example, the index value of the second message. In another possible implementation, the identification information of the second message may be, for example, the difference between the index value of the first message and the index value of the second message. For example, the index value of the first message is 10, the index value of the second message is 4, and the identification information of the second message can be 4 or 6.
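The two encodings of the identification information can be sketched as follows. This is a minimal illustration only; the function names and the `relative` flag are assumptions introduced here and are not part of this application.

```python
def encode_identification(first_index: int, second_index: int, relative: bool) -> int:
    """Return the identification information of the second message.

    relative=False: carry the index value of the second message itself.
    relative=True:  carry the difference between the two index values.
    """
    return first_index - second_index if relative else second_index


def decode_identification(first_index: int, ident: int, relative: bool) -> int:
    """Recover the index value of the second message at the receiving station."""
    return first_index - ident if relative else ident


# Example from the text: first message index 10, second message index 4.
absolute_id = encode_identification(10, 4, relative=False)  # 4
relative_id = encode_identification(10, 4, relative=True)   # 6
```

In either encoding the station recovers index 4, so it can tell which of its messages the reward value in the response frame refers to.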

Optionally, combined with the first aspect, the timestamp of the second message is the reception time of the second message; or, the timestamp of the second message is the difference between the reception time of the first message and the reception time of the second message.

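Optionally, combined with the first aspect, the first reward value satisfies:

$$\text{first reward value} = \begin{cases} d_0 - (N-1)\,d_1, & \text{if the first station's transmission succeeds} \\ -N, & \text{if the first station's transmission fails} \end{cases}$$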

Among them, d₀ is the time interval between the first station and the last time it received the acknowledgment frame from the first station, N is the number of stations, d₁ is the time interval between the first station and the last time it heard the acknowledgment frame from other stations, and the other stations are the stations other than the first station among the multiple stations. It can be seen that the calculation of the reward value can take the mutual influence between users into account, which improves the accuracy of the reward value, so that the station achieves a better practical application effect after using the reward value for reinforcement learning training.

It should be understood that in this application, when the first station successfully transmits the message and the other stations successfully transmit the message, the first reward value is d₀ - (N-1)*d₁. When the first station successfully transmits the message and the other stations fail to transmit the message, the first reward value is d₀ - (N-1)*d₁. When the first station fails to transmit the message and the other stations fail to transmit the message, the first reward value is -N. When the first station fails to transmit the message and the other stations successfully transmit the message, the first reward value is -N.
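The four cases above reduce to a single branch on whether the first station's own transmission succeeded. A minimal Python sketch follows; the function and parameter names are assumptions for illustration, and the units of d₀ and d₁ are not specified in this application.

```python
def first_reward(first_succeeded: bool, d0: float, d1: float, n_stations: int) -> float:
    """Reward of the first station, following the four cases listed above.

    d0, d1 and n_stations correspond to d₀, d₁ and N in the text.
    The outcome of the other stations does not change which branch applies.
    """
    if first_succeeded:
        return d0 - (n_stations - 1) * d1
    return -float(n_stations)


# Illustrative values only (not taken from this application).
reward_on_success = first_reward(True, d0=5.0, d1=1.0, n_stations=3)   # 5 - 2*1
reward_on_failure = first_reward(False, d0=5.0, d1=1.0, n_stations=3)  # -N
```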

In a second aspect, a communication device is provided. The device includes a processing module and a transceiver module. The processing module is configured to determine a first reward value based on the actions of multiple stations, where the first reward value is the reward value of a first station among the multiple stations and is used for reinforcement learning training at the first station. The transceiver module is configured to send the first reward value to the first station.

Optionally, combined with the second aspect, an action of a station includes at least one of the following: the station initiates channel access, the station performs channel selection, the station performs power control, and the station performs rate adaptation.

Optionally, combined with the second aspect, when determining the first reward value based on the actions of multiple stations, the processing module is configured to determine the first reward value based on the actions of the multiple stations and the times corresponding to the actions of the multiple stations.

Optionally, combined with the second aspect, the actions of multiple sites correspond to the same time.

Optionally, combined with the second aspect, the first reward value is the reward value corresponding to the first time, and the first time is the time corresponding to the action of the first site.

Optionally, combined with the second aspect, when sending the first reward value to the first station, the transceiver module is configured to send a broadcast frame to the first station, where the broadcast frame includes the first reward value.

Optionally, combined with the second aspect, the multiple stations further include a second station. The processing module is further configured to determine the reward value of the second station if the first station and the second station send messages at the same time and transmission fails as a result, where the reward value of the second station is the same as the first reward value. The transceiver module is further configured to send the broadcast frame to the second station.

Optionally, combined with the second aspect, when sending the first reward value to the first station, the transceiver module is configured to send a response frame of a first message to the first station, where the response frame of the first message includes the first reward value, the first reward value corresponds to a second message, and the second message is received before the first message.

Optionally, combined with the second aspect, the response frame of the first message also includes identification information of the second message or a timestamp of the second message.

Optionally, combined with the second aspect, the timestamp of the second message is the reception time of the second message; or, the timestamp of the second message is the difference between the reception time of the first message and the reception time of the second message.

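Optionally, combined with the second aspect, the first reward value satisfies:

$$\text{first reward value} = \begin{cases} d_0 - (N-1)\,d_1, & \text{if the first station's transmission succeeds} \\ -N, & \text{if the first station's transmission fails} \end{cases}$$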

Among them, d₀ is the time interval between the first station and the last time it received the acknowledgment frame from the first station, N is the number of stations, d₁ is the time interval between the first station and the last time it heard the acknowledgment frame from other stations, and the other stations are the stations other than the first station among the multiple stations.

In a third aspect, a chip is provided. The chip includes at least one processor and an interface. The processor is used to read and execute instructions stored in a memory. When the instructions are executed, the chip executes the method described in any one of the first aspects.

In a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program. The computer program includes program instructions that, when executed by a computer, cause the computer to execute the method described in any one of the first aspects.

In a fifth aspect, a communication device is provided, including a processor, a memory, an input interface and an output interface. The input interface is used to receive information from communication devices other than the communication device, the output interface is used to output information to communication devices other than the communication device, and the processor calls the computer program stored in the memory to implement the method described in any one of the first aspects.

In a possible design, the communication device may be a chip that implements the method in the first aspect or a device containing the chip.

A sixth aspect provides a computer program product, which when a computer reads and executes the computer program product, causes the computer to execute the method described in any one of the first aspects.

A seventh aspect provides a communication system, including an access point and/or a station for implementing any one of the methods in the first aspect.

Description of drawings

The drawings needed to describe the embodiments are briefly introduced below:

Figure 1 is a network architecture diagram of a WLAN provided by an embodiment of the present application;

Figure 2 is a schematic diagram of reinforcement learning provided by an embodiment of the present application;

Figure 3 shows a schematic diagram of the hardware structure of a communication device applicable to embodiments of the present application;

Figure 4 is a schematic flow chart of a reinforcement learning training method provided by an embodiment of the present application;

Figure 5 is a schematic diagram of a delayed feedback reward value provided by an embodiment of the present application;

Figure 6 is a diagram of the beneficial effects provided by an embodiment of the present application;

Figure 7 is a schematic structural diagram of a communication device provided by an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. The terms "system" and "network" in the embodiments of this application can be used interchangeably. Unless otherwise specified, "/" indicates that the related objects are in an "or" relationship; for example, A/B can mean A or B. "And/or" in this application merely describes an association between related objects and indicates that three relationships are possible; for example, A and/or B can mean: A exists alone, A and B exist at the same time, or B exists alone, where A and B can be singular or plural. Furthermore, in the description of this application, unless otherwise specified, "plurality" means two or more. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of a single item or a plurality of items. For example, at least one of a, b, or c can mean: a, b, c, ab, ac, bc, or abc, where a, b, and c can each be one or more. In addition, to facilitate a clear description of the technical solutions of the embodiments of the present application, words such as "first" and "second" are used to distinguish between identical or similar items that have basically the same function. Those skilled in the art can understand that words such as "first" and "second" do not limit the number or the execution order.

Reference in describing an embodiment of the application to "one embodiment" or "some embodiments" or the like means that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Therefore, the phrases "in one embodiment", "in some embodiments", "in other embodiments", etc. appearing in different places in this specification do not necessarily refer to the same embodiment, but rather mean "one or more but not all embodiments" unless specifically stated otherwise. The terms "including", "includes", "having", and variations thereof all mean "including but not limited to", unless otherwise specifically emphasized.

The following specific implementations further describe the objectives, technical solutions and beneficial effects of the present application in detail. It should be understood that the following are only specific implementations of the present application and are not intended to limit the scope of protection of the present application. Any modifications, equivalent substitutions, improvements, etc. made on the basis of the technical solutions of this application shall be included in the protection scope of this application.

In the various embodiments of this application, unless otherwise specified or logically conflicting, the terms and/or descriptions in different embodiments are consistent and can be referenced to each other. The technical features in different embodiments can be combined, based on their inherent logical relationships, to form new embodiments.

It should be understood that the embodiments of the present application can be applied to wireless local area network (WLAN) scenarios and to IEEE 802.11 system standards, such as 802.11a/b/g, 802.11n, 802.11ac, 802.11ax, or their next generations, such as 802.11be or later standards. The embodiments of this application can also be applied to Internet of things (IoT) networks, vehicle-to-everything (V2X) networks, narrowband Internet of things (NB-IoT) systems, and other short-range communication systems (such as Bluetooth and ultra-wideband (UWB)). Of course, the embodiments of the present application can also be applied to other possible communication systems, such as a long term evolution (LTE) system, an LTE frequency division duplex (FDD) system, an LTE time division duplex (TDD) system, a universal mobile telecommunication system (UMTS), a worldwide interoperability for microwave access (WiMAX) communication system, and future 6G communication systems.

The following takes a scenario in which the embodiments of the present application are applied to WLAN as an example. It should be understood that WLAN started with the 802.11a/g standard and has evolved through 802.11n, 802.11ac, and 802.11ax to 802.11be and Wi-Fi 8, which are under discussion today. 802.11n can also be called high throughput (HT); 802.11ac can also be called very high throughput (VHT); 802.11ax can also be called high efficiency (HE) or Wi-Fi 6; 802.11be can also be called extremely high throughput (EHT) or Wi-Fi 7; and standards before HT, such as 802.11a/b/g, are collectively called non-high throughput (Non-HT).

Refer to Figure 1, which is a network architecture diagram of a WLAN provided by an embodiment of the present application. Figure 1 takes a WLAN that includes 1 wireless access point (AP) and 2 stations (STAs) as an example. A STA associated with an AP can receive wireless frames sent by the AP and can also send wireless frames to the AP. In addition, the embodiments of the present application are also applicable to communication between APs; for example, APs can communicate with each other through a distributed system (DS). The embodiments of the present application are also applicable to communication between STAs. It should be understood that the numbers of APs and STAs in Figure 1 are only an example, and there may be more or fewer.

The STA involved in the embodiments of this application can be any of various user terminals, user devices, access devices, subscriber stations, subscriber units, mobile stations, user agents, user equipment or devices with other names that have wireless communication functions. The user terminal can include various handheld devices, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to wireless modems with wireless communication functions, as well as various forms of user equipment (UE), mobile stations (MS), terminals, terminal equipment, portable communications devices, handsets, portable computing devices, entertainment devices, gaming devices or systems, global positioning system devices, or any other suitable equipment configured for network communication via a wireless medium. For example, a STA can be a router, switch, bridge, etc. Here, for convenience of description, the above-mentioned devices are collectively called stations or STAs.

The APs and STAs involved in the embodiments of this application may be APs and STAs applicable to the IEEE 802.11 system standard. An AP is a device deployed in a wireless communication network to provide wireless communication functions for its associated STAs. The AP can be used as the hub of the communication system. It is usually a network-side product that supports the MAC and PHY of the 802.11 system standard. For example, it can be a base station, router, gateway, repeater, communication server, switch, bridge or other communication equipment, where the base stations may include various forms of macro base stations, micro base stations, relay stations, etc. Here, for convenience of description, the above-mentioned devices are collectively referred to as APs. A STA is usually a terminal product that supports the media access control (MAC) and physical layer (PHY) of the 802.11 system standard, such as a mobile phone or a laptop.

The embodiments of this application can also be applied to a scenario where one node performs data transmission with one or more nodes; to a single-user uplink/downlink data transmission scenario or a multi-user uplink/downlink data transmission scenario; and to device-to-device (D2D) data transmission scenarios. Any of the above nodes can be an AP or a STA.

This solution can be applied to wireless communication systems. The wireless communication system can be a wireless local area network or a cellular network. This solution can be implemented by a communication device in the wireless communication system or by a chip or processor in the communication device. The communication device can be a wireless communication device that supports parallel transmission over multiple links, for example a device called a multi-link device or multi-band device. Compared with devices that only support single-link transmission, multi-link devices have higher transmission efficiency and higher throughput. A multi-link device includes one or more affiliated stations (affiliated STAs). An affiliated STA is a logical station and can work on one link. The affiliated station can be an access point (AP) or a non-access point station (non-AP STA). For convenience of description, in this application, a multi-link device whose affiliated station is an AP can be called a multi-link AP, a multi-link AP device, or an AP multi-link device, and a multi-link device whose affiliated station is a non-AP STA can be called a multi-link STA, a multi-link STA device, or a STA multi-link device. It should be understood that each station in the multi-link device can work on one link respectively, but multiple stations are allowed to work on the same link.

It should be noted that in this application, the AP and STA can have certain artificial intelligence (AI) capabilities. For example, they can use neural networks for reasoning and decision-making, and can also perform neural network training. It should be understood that in this application, it mainly involves the training of reinforcement learning, and reinforcement learning can be, for example, deep reinforcement learning (DRL).

Refer to Figure 2, which is a schematic diagram of reinforcement learning provided by an embodiment of the present application. As shown in Figure 2, reinforcement learning mainly contains five elements: agent, environment, state, action and reward. Among them, the input of the agent is the state and the output is the action. The training process of reinforcement learning is: through multiple interactions between the agent and the environment, the actions, states, and rewards of each interaction are obtained; these multiple groups (actions, states, rewards) are used as training data to train the agent once. Using the above process, the agent is trained for the next round until the convergence conditions are met.

For example, the process of obtaining the action, state, and reward of one interaction is shown in Figure 2. The current state S0 of the environment is input to the agent, and the action A0 output by the agent is obtained. The reward R0 of this interaction is calculated according to the relevant performance indicators of the environment under the action A0. At this point, the action A0, state S0 and reward R0 of this interaction have been obtained and are recorded for subsequent use in training the agent. The next state S1 of the environment under the action A0 can also be recorded in order to carry out the next interaction between the agent and the environment.
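The interaction process described here can be sketched as a simple data-collection loop. This is a minimal illustration under stated assumptions: `policy` stands in for the agent, `step` stands in for the environment, and both are toy placeholders rather than anything defined in this application.

```python
from typing import Callable, List, Tuple

Transition = Tuple[int, int, float]  # (state, action, reward)


def collect_transitions(
    policy: Callable[[int], int],
    step: Callable[[int, int], Tuple[float, int]],
    initial_state: int,
    num_interactions: int,
) -> List[Transition]:
    """Run several agent/environment interactions and record (state, action, reward)."""
    state = initial_state
    transitions: List[Transition] = []
    for _ in range(num_interactions):
        action = policy(state)                     # S0 -> A0
        reward, next_state = step(state, action)   # environment yields R0 and S1
        transitions.append((state, action, reward))
        state = next_state                         # S1 becomes the next input state
    return transitions


# Toy placeholders: the action is the state's parity, the reward equals the action.
toy_policy = lambda s: s % 2
toy_step = lambda s, a: (float(a), s + 1)
batch = collect_transitions(toy_policy, toy_step, initial_state=0, num_interactions=3)
```

The recorded batch is what would be used as training data for one round of training; the training update itself is outside the scope of this sketch.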

In this application, the action is the action of the station, and the action of the station may include at least one of the following: the station initiates channel access, the station performs channel selection, the station performs power control, and the station performs rate adaptation.

Optionally, when the station's action is initiating channel access, the state may be one or more of the carrier sensing results, such as channel quality, packet loss rate, etc. When the station's action is performing channel selection, the state may be the load condition of the channel, etc. When the station's action is performing power control, the state may be one or more of the station's location, channel quality, throughput, etc. When the station's action is performing rate adaptation, the state may be one or more of the carrier sensing results, such as channel quality, packet loss rate, etc.

Optionally, each device (such as an AP or a STA) in Figure 1 can be implemented by one device, can be implemented jointly by multiple devices, or can be a functional module in one device. This is not specifically limited in the embodiments of this application. It can be understood that the above functions can be network elements in hardware devices, software functions running on dedicated hardware, or virtualized functions instantiated on a platform (e.g., a cloud platform).

For example, each device in Figure 1 can be implemented by the communication device 300 in Figure 3. Figure 3 is a schematic diagram of the hardware structure of a communication device applicable to embodiments of the present application. The communication device 300 includes at least one processor 301, a communication line 302, a memory 303, and at least one communication interface 304.

The processor 301 may be one or more of a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a neural-network processing unit (NPU), etc. The processor 301 may also be one or more integrated circuits used to control the execution of the program of the present application.

Communication line 302 may include a path for communicating information between the above-mentioned components.

The communication interface 304 is any transceiver-like device (such as an antenna) used to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).

The memory 303 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM) or another type of dynamic storage device that can store information and instructions. It may also be an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may exist independently and be connected to the processor through the communication line 302. The memory may also be integrated with the processor. The memory provided by the embodiments of the present application may generally be non-volatile.

The memory 303 is used to store computer-executable instructions for executing the solution of the present application, and their execution is controlled by the processor 301. The processor 301 is used to execute the computer-executable instructions stored in the memory 303, thereby implementing the methods provided by the following embodiments of this application.

Optionally, the computer-executable instructions in the embodiments of the present application may also be called application code, which is not specifically limited in the embodiments of the present application.

In a possible implementation, the processor 301 may include one or more CPUs, such as CPU0 and CPU1 in Figure 3.

In a possible implementation, the communication device 300 may include multiple processors, such as the processor 301 and the processor 307 in Figure 3. Each of these processors may be a single-CPU processor or a multi-CPU processor. A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).

In a possible implementation, the communication device 300 may also include an output device 305 and an input device 306. The output device 305 communicates with the processor 301 and can display information in a variety of ways. For example, the output device 305 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, a projector, or the like. The input device 306 communicates with the processor 301 and can receive user input in a variety of ways. For example, the input device 306 may be a mouse, a keyboard, a touch screen device, a sensing device, or the like.

When the communication device is turned on, the processor 301 can read the software program in the memory 303, interpret and execute the instructions of the software program, and process the data of the software program. When data needs to be sent wirelessly, the processor 301 performs baseband processing on the data to be sent and then outputs the baseband signal to the radio frequency circuit. The radio frequency circuit performs radio frequency processing on the baseband signal and then sends the radio frequency signal out through the antenna in the form of electromagnetic waves. When data is sent to the communication device, the radio frequency circuit receives the radio frequency signal through the antenna, converts the radio frequency signal into a baseband signal, and outputs the baseband signal to the processor 301. The processor 301 converts the baseband signal into data and processes the data.

In another implementation, the radio frequency circuit and the antenna can be arranged independently of the processor that performs baseband processing. For example, in a distributed scenario, the radio frequency circuit and the antenna can be deployed remotely, independent of the communication device.

Optionally, the neural network processor may include, for example, a training module and an inference module, which are not shown in Figure 3. The inputs of the training module may include, for example, actions, states, and reward values, and its outputs may be neural network parameters. Generally speaking, the trained neural network parameters can be fed back to the inference module. It should be understood that the neural network processor can interact with various modules of the communication device 300, for example, controlling the transmission of data on the communication interface to save energy, or interacting with the antenna to control the orientation of the antenna. In a possible implementation, the communication device 300 may also include a media access control (MAC) layer, which is not shown in Figure 3. The neural network processor can also interact with the MAC layer to control channel access, channel selection, and spatial multiplexing decisions.

The above-mentioned communication device 300 may be a general-purpose device or a special-purpose device. In a specific implementation, the communication device 300 may be a desktop computer, a portable computer, a network server, a personal digital assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, an embedded device, or a device with a structure similar to that shown in Figure 3. The embodiments of the present application do not limit the type of the communication device 300.

The following describes the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.

Referring to Figure 4, Figure 4 is a schematic flow chart of a reinforcement learning training method provided by an embodiment of the present application. As shown in Figure 4, the method includes but is not limited to the following steps:

401. The access point determines a first reward value based on the actions of multiple stations, and the first reward value is the reward value of the first station among the multiple stations.

Optionally, in this application, the access point can learn the action of the first station in any of the following ways. It should be understood that whether the access point uses method 1.1 or method 1.2 to learn the action of the first station may depend on the implementation of the access point, a prior agreement, or a standard definition.

Method 1.1: Before step 401, the access point receives the action of the first station sent by the first station. For example, the packet sent by the first station is received by the access point; because the packet includes the action of the first station, the access point can learn the action of the first station from the packet. In another example, the packet sent by the first station is not received by the access point. Because the packet is lost, the access point cannot learn the action of the first station from it, so the first station can resend the action of the lost packet to the access point. That is, when the packet sent by the first station is not received by the access point, the access point learns the action of the first station through the resent action of the lost packet.

Method 1.2: The access point determines the action of the first station by itself. For example, the first station does not send packets, so the access point does not receive the packets sent by the first station. At this time, the access point can determine the action of the first station by itself. In another example, the packet sent by the first station is received by the access point, and the packet includes one or more of the rate information and time length information of the packet. For example, the packet header of the packet includes one or more of the packet's rate information, time length information, etc. Therefore, the access point can determine the action of the first station based on one or more of the packet rate information, time length information, and the like. It should be understood that when the access point determines the action of the first station by itself, the action of the first station may be empty.

It can be seen that either method 1.1 or method 1.2 enables the access point to learn the action of the station, which prepares the access point to determine the reward value subsequently.

The first reward value can be used by the first station for reinforcement learning training.

Optionally, step 401 may include: the access point determines the first reward value based on the actions of multiple stations and the times corresponding to the actions of the multiple stations. It can be seen that, by determining the reward value based on both the actions of multiple stations and the times corresponding to those actions, the calculation of the reward value takes into account the mutual influence between users as well as the times corresponding to the actions of different stations. This enriches the information used to determine the reward value and improves its accuracy, so that the station achieves a better practical effect after using the reward value for reinforcement learning training.

It should be noted that, when the packet sent by the first station is received by the access point, because the packet includes the action of the first station, the time corresponding to the action of the first station may be, for example, the time at which the access point receives the packet. When the packet sent by the first station is not received by the access point, the time corresponding to the action of the first station may be, for example, the sending time of the lost packet. In the case where the access point determines the action of the first station by itself, the time corresponding to the action of the first station may be, for example, the time when the first station initiates channel access.

The actions of the multiple stations correspond to the same time. Because the actions of the multiple stations correspond to the same time, the access point can improve the accuracy of the reward value when determining it based on the actions of the multiple stations and the time corresponding to those actions, so that the station achieves a better practical effect after using the reward value for reinforcement learning training.

Optionally, the first reward value is the reward value corresponding to a first time, and the first time is the time corresponding to the action of the first station. Because the reward value corresponds to a specific time, the station can identify the action and environment state corresponding to that time, which in turn allows the station to achieve a better practical effect after using the reward value for reinforcement learning training.

In this application, the first return value

It can be seen that the calculation of the reward value takes into account the interaction between users, which improves the accuracy of the reward value, so that the station achieves a better practical effect after using the reward value for reinforcement learning training.

Optionally, the method may also include step 402.

402. The access point sends the first reward value to the first station.

Correspondingly, the first station receives the first reward value sent by the access point.

Optionally, in this application, step 402 can be implemented in any of the following ways. It should be understood that whether the access point uses method 2.1 or method 2.2 to send the first reward value may depend on the implementation of the access point, a prior agreement, or a standard definition.

Method 2.1: The access point sends a broadcast frame or a multicast frame to the first station. Correspondingly, the first station receives the broadcast frame or multicast frame sent by the access point. The broadcast frame or multicast frame includes the first reward value, and the multicast frame may also include the address of the first station. The broadcast frame may be, for example, a beacon frame or a trigger frame. It should be understood that, in method 2.1, the access point can use method 1.1 above to learn the action of the first station through the action of the lost packet, or the access point can use method 1.2 above to learn the action of the first station. Because the first reward value is carried in a broadcast frame or multicast frame, other stations can also receive it.

Method 2.2: The access point sends a response frame of a first message to the first station. Correspondingly, the first station receives the response frame of the first message sent by the access point. The response frame of the first message includes the first reward value, the first reward value corresponds to a second message, and the second message is received before the first message.

The correspondence between the first reward value and the second message can be understood as follows: the first reward value corresponds to the action of the first station in the second message, and the time corresponding to that action is the time when the access point receives the second message. In addition, in this application, the response frame may be, for example, an acknowledgment (ACK) frame, a clear to send (CTS) frame, or a block acknowledgment (block ACK, BA) frame.

For example, see Figure 5, which is a schematic diagram of delayed feedback of a reward value provided by an embodiment of the present application. As shown in Figure 5, in step 501, station 1 sends message 1 to the access point, and message 1 includes the action of station 1. In step 502, the access point sends a response frame of message 1 to station 1. In step 503, station 1 sends message 2 to the access point. In step 504, the access point sends a response frame of message 2 to station 1. The response frame of message 2 includes the reward value corresponding to message 1. The reward value corresponding to message 1 is determined based on the actions of multiple stations, and station 1 is one of the multiple stations. That is, because the access point needs time to calculate the reward value corresponding to message 1, the access point can carry that reward value in the response frame of message 2.
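The exchange in Figure 5 can be mimicked with a small sketch on the access point side. Everything named here is an assumption made for illustration: the `ResponseFrame` fields, the `AccessPoint` class, its `finish_reward` and `respond` methods, and the placeholder reward value 0.5 are hypothetical and do not describe an actual frame format.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ResponseFrame:
    acked_index: int                  # message this frame responds to
    reward_for_index: Optional[int]   # earlier message the reward refers to
    reward: Optional[float]           # reward value, if one is ready


class AccessPoint:
    def __init__(self) -> None:
        # Rewards whose calculation has finished, keyed by message index.
        self._ready_rewards: Dict[int, float] = {}

    def finish_reward(self, index: int, reward: float) -> None:
        """Called when the reward calculation for a message completes."""
        self._ready_rewards[index] = reward

    def respond(self, index: int) -> ResponseFrame:
        """Build the response frame for a newly received message.

        If a reward for an earlier message is ready, piggyback the oldest
        one; otherwise send a response frame without a reward value.
        """
        earlier = [i for i in self._ready_rewards if i < index]
        if not earlier:
            return ResponseFrame(index, None, None)
        oldest = min(earlier)
        return ResponseFrame(index, oldest, self._ready_rewards.pop(oldest))


ap = AccessPoint()
frame_1 = ap.respond(1)        # step 502: reward for message 1 not ready yet
ap.finish_reward(1, 0.5)       # calculation finishes (placeholder value)
frame_2 = ap.respond(2)        # step 504: carries the reward for message 1
```

In this sketch `frame_1` carries no reward, while `frame_2` acknowledges message 2 and carries the reward value that corresponds to message 1.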

It can be seen that, in method 2.2, the reward value corresponding to the second message can be carried in the response frame of the first message. Because the second message is received before the first message, the reward value corresponding to the second message is sent with a delay, which leaves more time for calculating the reward value.

Optionally, for method 2.1, the multiple stations may also include a second station. The method also includes: if the first station and the second station send messages at the same time and thereby cause a transmission failure, the access point determines the reward value of the second station, where the reward value of the second station is the same as the first reward value; and the access point sends the broadcast frame to the second station. Here, the first station and the second station sending messages at the same time and causing a transmission failure can be understood as: the first station and the second station send messages at the same time, so that the transmission of the first station fails and the transmission of the second station also fails. It can be seen that, when the reward values of different stations are the same, sending a broadcast frame allows the different stations to obtain the reward value while saving overhead.

For example, at time t, the actions of station 1 and station 2 are both channel access, that is, station 1 and station 2 send packets to the access point at the same time. Because of the collision, the packet transmission fails. To penalize the behavior of station 1 and station 2 at time t, the AP can set the reward value to a large negative value, such as -100. In this case, the reward value of station 1 is the same as the reward value of station 2. The most cost-effective way is therefore to broadcast, which allows both station 1 and station 2 to obtain the reward value.
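The collision example can be written out as a short sketch. Only the value -100 comes from the example above; the success reward of 1.0, the function names, and the rule for choosing the delivery method are assumptions added for illustration, and the actual reward calculation of this application is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

COLLISION_REWARD = -100.0  # large negative value from the example above
SUCCESS_REWARD = 1.0       # assumption: not specified in the text


def rewards_by_time(
    accesses: List[Tuple[str, int]]
) -> Dict[int, Dict[str, float]]:
    """Compute per-station rewards from (station_id, time) access attempts.

    Stations whose channel access falls on the same time collide and all
    receive the same penalty; a lone access is treated as a success.
    """
    stations_at: Dict[int, List[str]] = defaultdict(list)
    for station, t in accesses:
        stations_at[t].append(station)

    rewards: Dict[int, Dict[str, float]] = {}
    for t, stations in stations_at.items():
        value = COLLISION_REWARD if len(stations) > 1 else SUCCESS_REWARD
        rewards[t] = {s: value for s in stations}
    return rewards


def choose_delivery(slot_rewards: Dict[str, float]) -> str:
    """Broadcast once when several stations share the same reward value."""
    same_value = len(set(slot_rewards.values())) == 1
    if len(slot_rewards) > 1 and same_value:
        return "broadcast"
    return "response_frame"


rewards = rewards_by_time([("sta1", 7), ("sta2", 7), ("sta1", 8)])
delivery_at_7 = choose_delivery(rewards[7])
```

At time 7 both stations receive -100 and a single broadcast frame suffices; at time 8 only station 1 transmitted, so the sketch falls back to a response frame.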

Optionally, for method 2.2, the response frame of the first message may also include any of the following. It should be understood that whether the access point carries the first type or the second type in the response frame of the first message may depend on the implementation of the access point, a prior agreement, or a standard definition.

First type: identification information of the second message. In a possible implementation, the identification information of the second message may be, for example, the index value of the second message. In another possible implementation, the identification information of the second message may be, for example, the difference between the index value of the first message and the index value of the second message. For example, if the index value of the first message is 10 and the index value of the second message is 4, the identification information of the second message can be 4 or 6.

Second type: the timestamp of the second message. In a possible implementation, the timestamp of the second message may be, for example, the reception time of the second message. In another possible implementation, the timestamp of the second message may be, for example, the difference between the reception time of the first message and the reception time of the second message. In this application, the reception time of a message can be understood as the time when the access point receives the message.

It can be seen that, because the response frame of the first message also includes the identification information of the second message or the timestamp of the second message, the first station can learn which message the first reward value corresponds to.
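On the station side, recovering which message a reward value refers to is a small calculation. The sketch below assumes that a difference is computed as the value for the first message minus the value for the second message, which matches the example of 10, 4, and 6 above; the function names and the `is_difference` flag are hypothetical.

```python
def resolve_second_index(first_index: int, ident: int,
                         is_difference: bool) -> int:
    """Return the index value of the message the reward corresponds to.

    ident is either the index value itself or, if is_difference is True,
    the first message's index minus the second message's index.
    """
    return first_index - ident if is_difference else ident


def resolve_second_rx_time(first_rx_time: float, stamp: float,
                           is_difference: bool) -> float:
    """Same idea for timestamps: absolute reception time or a difference."""
    return first_rx_time - stamp if is_difference else stamp


# Index example from the text: first message 10, second message 4.
assert resolve_second_index(10, 4, is_difference=False) == 4
assert resolve_second_index(10, 6, is_difference=True) == 4
```

Carrying the difference rather than the absolute value can keep the field short, at the cost of requiring the station to remember the index or reception time of the first message.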

It should be understood that, in this application, the specific information carried by the response frame of the first message can be found, for example, in Table 1 or Table 2. In Table 1, the response frame of the first message includes the first reward value and the identification information of the second message. In Table 2, the response frame of the first message includes the first reward value and the timestamp of the second message.

Table 1

Table 2

Optionally, the method may also include step 403.

403. The first station performs reinforcement learning training based on the first reward value.

Optionally, step 403 may include: the first station obtains the state and action corresponding to the first reward value; and the first station performs reinforcement learning training based on the first reward value, the state, and the action. Here, the first station performing reinforcement learning training based on the first reward value, the state, and the action can be understood as: the first station performs reinforcement learning training on the agent based on the first reward value, the state, and the action.
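Step 403 can be illustrated with a deliberately simple stand-in for the training step. The sketch below is not DQN, PPO, or the method of this application: it uses a plain tabular running-average update so that the bookkeeping is visible. The `StationAgent` class, the use of a message index as the lookup key, and the learning rate are all assumptions made for the example.

```python
from typing import Dict, Hashable, Tuple


class StationAgent:
    """Hypothetical station-side agent with a tabular value estimate."""

    def __init__(self, learning_rate: float = 0.5) -> None:
        self.lr = learning_rate
        # Value estimate per (state, action) pair.
        self.q: Dict[Tuple[Hashable, Hashable], float] = {}
        # (state, action) pairs still waiting for their reward value.
        self._pending: Dict[int, Tuple[Hashable, Hashable]] = {}

    def record(self, msg_index: int, state: Hashable,
               action: Hashable) -> None:
        """Remember the state and action that go with a sent message."""
        self._pending[msg_index] = (state, action)

    def on_reward(self, msg_index: int, reward: float) -> bool:
        """Look up the state and action for this reward, then train.

        Returns False if no matching state and action were recorded.
        """
        state_action = self._pending.pop(msg_index, None)
        if state_action is None:
            return False
        old = self.q.get(state_action, 0.0)
        # Stand-in update: move the estimate toward the received reward.
        self.q[state_action] = old + self.lr * (reward - old)
        return True


agent = StationAgent(learning_rate=0.5)
agent.record(4, state="idle", action="access")
agent.on_reward(4, -100.0)   # estimate for ("idle", "access") becomes -50.0
```

A real station would replace the update in `on_reward` with its own training algorithm; the point of the sketch is only that the reward value must be matched to the state and action it belongs to before training.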

Optionally, in this application, different stations among the multiple stations can use different reinforcement learning algorithms for reinforcement learning training. For example, station 1 uses DQN for reinforcement learning training, station 2 uses PPO for reinforcement learning training, and so on.

It can be seen that, by determining the reward value based on the actions of multiple stations, the calculation of the reward value takes into account the interaction between users, which improves the accuracy of the reward value, so that the station achieves a better practical effect after using the reward value for reinforcement learning training.

Referring to Figure 6, Figure 6 is a diagram of the beneficial effects provided by an embodiment of the present application. In Figure 6, different stations among the multiple stations use different reinforcement learning algorithms. For example, the number of stations using DQN is 1, and the number of stations using PPO is also 1. In the scenario where these stations initiate channel access, the spectrum can still be shared fairly and efficiently. Specifically, in 6-1 of Figure 6, the abscissa is the access delay (Delay) in seconds, and the ordinate is the cumulative probability distribution (Probability); it shows the distribution of access delays for the stations using different algorithms. In 6-2 of Figure 6, the abscissa is time in seconds, and the ordinate is throughput. The total throughput of these stations stabilizes between 0.8 and 1, and the throughput of each station stabilizes between 0.4 and 0.6, which means that the throughput of each station does not differ much, so these stations can share the spectrum fairly and efficiently.

The above mainly introduces the solution provided by this application from the perspective of interaction between various devices. It can be understood that, in order to realize the above functions, the above-mentioned implementation devices include corresponding hardware structures and/or software modules for executing each function. Persons skilled in the art should easily realize that, with the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein, the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or computer software driving the hardware depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each specific application, but such implementations should not be considered beyond the scope of this application.

Embodiments of the present application can divide access points, stations, etc. into functional modules according to the above method examples. For example, each functional module can be divided corresponding to each function, or two or more functions can be integrated into one processing module. The above integrated modules can be implemented in the form of hardware or software functional modules. It should be noted that the division of modules in the embodiments of the present application is schematic and is only a logical function division. In actual implementation, there may be other division methods.

In the case of using an integrated module, see Figure 7, which is a schematic structural diagram of a communication device provided by an embodiment of the present application. The communication device 700 can be applied to the method shown in Figure 4. As shown in Figure 7, the communication device 700 includes a processing module 701 and a transceiver module 702. The processing module 701 may be one or more processors, and the transceiver module 702 may be a transceiver or a communication interface. The communication device can be used to implement the station or access point involved in any of the above method embodiments, or to implement the functions of the network element involved in any of the above method embodiments. The network element or network function can be a network element in a hardware device, a software function running on dedicated hardware, or a virtualized function instantiated on a platform (e.g., a cloud platform). Optionally, the communication device 700 may also include a storage module 703 for storing program code and data of the communication device 700.

In one example, the communication device serves as an access point, or is a chip applied in the access point, and performs the steps performed by the access point in the above method embodiment. The transceiver module 702 is used to support communication with stations and the like. Specifically, the transceiver module performs the sending and/or receiving actions performed by the access point in Figure 4, for example, supporting the access point in performing step 402, and/or other processes of the technology described herein. The processing module 701 may be used to support the communication device 700 in performing the processing actions in the above method embodiments, for example, supporting the access point in performing step 401, and/or other processes of the technology described herein.

Exemplarily, the processing module 701 is used to determine a first reward value based on the actions of multiple stations, where the first reward value is the reward value of the first station among the multiple stations and is used by the first station for reinforcement learning training; the transceiver module 702 is used to send the first reward value to the first station.

Optionally, when determining the first reward value based on the actions of multiple stations, the processing module 701 is configured to determine the first reward value based on the actions of the multiple stations and the times corresponding to the actions of the multiple stations.

Optionally, the actions of the multiple stations correspond to the same time.

Optionally, the first reward value is the reward value corresponding to a first time, and the first time is the time corresponding to the action of the first station.

Optionally, when sending the first reward value to the first station, the transceiver module 702 is configured to send a broadcast frame to the first station, where the broadcast frame includes the first reward value.

Optionally, the multiple stations also include a second station. The processing module 701 is also used to determine the reward value of the second station if the first station and the second station send messages at the same time and thereby cause a transmission failure, where the reward value of the second station is the same as the first reward value; the transceiver module 702 is also used to send the broadcast frame to the second station.

Optionally, when sending the first reward value to the first station, the transceiver module 702 is configured to send a response frame of a first message to the first station, where the response frame of the first message includes the first reward value, the first reward value corresponds to a second message, and the second message is received before the first message.

Optionally, the response frame of the first message also includes identification information of the second message or a timestamp of the second message.

Optionally, the timestamp of the second message is the reception time of the second message; or, the timestamp of the second message is the difference between the reception time of the first message and the reception time of the second message.

Optional, first return value

In a possible implementation, when the access point or station is a chip, the transceiver module 702 may be a communication interface, a pin, a circuit, etc. The communication interface can be used to input data to be processed to the processor and to output the processing results of the processor. In specific implementation, the communication interface can be a general-purpose input/output (GPIO) interface, which can be connected to multiple peripheral devices (such as a display (LCD), a camera, a radio frequency (RF) module, an antenna, etc.). The communication interface is connected to the processor through a bus.

The processing module 701 may be a processor, and the processor may execute computer execution instructions stored in the storage module, so that the chip executes the method involved in the embodiment of FIG. 4 .

Further, the processor may include a controller, an arithmetic unit, and registers. For example, the controller is mainly responsible for decoding instructions and sending control signals for the operations corresponding to the instructions. The arithmetic unit is mainly responsible for performing fixed-point or floating-point arithmetic operations, shift operations, and logical operations; it can also perform address operations and conversions. The registers are mainly responsible for storing register operands and intermediate operation results that are temporarily stored during instruction execution. In specific implementation, the hardware architecture of the processor can be an application-specific integrated circuit (ASIC) architecture, a microprocessor without interlocked pipeline stages (MIPS) architecture, an advanced RISC machines (ARM) architecture, a network processor (NP) architecture, etc. The processor can be single-core or multi-core.

The storage module can be a storage module within the chip, such as a register or a cache. The storage module can also be a storage module located outside the chip, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).

It should be noted that the corresponding functions of the processor and the interface can be realized through hardware design, software design, or a combination of software and hardware. There are no restrictions here.

Embodiments of the present application also provide a communication device, including a processor, a memory, an input interface and an output interface. The input interface is used to receive information from other communication devices other than the communication device, and the output interface is used to send information to other communication devices other than the communication device. Other communication devices output information, and the processor calls the computer program stored in the memory to implement the embodiment shown in Figure 4.

An embodiment of the present application also provides a chip. The chip includes at least one processor and an interface. The processor is configured to read and execute instructions stored in the memory. When the instructions are executed, the chip executes the embodiment shown in Figure 4 .

Embodiments of the present application also provide a computer-readable storage medium. The computer-readable storage medium stores a computer program. The computer program includes program instructions. When executed by a computer, the program instructions cause the computer to execute the embodiment shown in Figure 4.

An embodiment of the present application also provides a computer program product. When a computer reads and executes the computer program product, the computer executes the embodiment shown in Figure 4.
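As a purely illustrative sketch of the kind of program instructions such a computer program product might contain, the following Python fragment determines a reward value for each station from the actions of multiple stations that correspond to the same time, and packs the reward values into a broadcast frame. The application text reproduced here does not give these details: the reward rule (a fixed success, collision, and idle value), the `Action` record, the function names, and the dictionary frame layout are all assumptions made for illustration. The only behaviour taken from the application is that stations whose simultaneous transmissions fail receive the same reward value.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One station action (hypothetical record, not defined in the application)."""
    station_id: str   # assumed station identifier
    time: int         # time corresponding to the action, e.g. a slot index
    transmits: bool   # True if the station initiated channel access


def determine_rewards(actions, success=1.0, collision=-1.0, idle=0.0):
    """Map (station_id, time) to a reward value.

    Assumed rule: a lone transmitter at a given time succeeds; several
    transmitters at the same time collide and all get the same reward;
    a station that did not transmit gets the idle reward.
    """
    # Collect the stations that transmitted at each time.
    senders = defaultdict(list)
    for a in actions:
        if a.transmits:
            senders[a.time].append(a.station_id)

    rewards = {}
    for a in actions:
        key = (a.station_id, a.time)
        if not a.transmits:
            rewards[key] = idle
        elif len(senders[a.time]) == 1:
            rewards[key] = success
        else:
            rewards[key] = collision
    return rewards


def build_broadcast_frame(rewards, time):
    """Pack the reward values for one time into an assumed frame layout."""
    return {
        "type": "broadcast",
        "time": time,
        "rewards": {sid: r for (sid, t), r in rewards.items() if t == time},
    }


if __name__ == "__main__":
    actions = [
        Action("STA1", 0, True),
        Action("STA2", 0, True),   # collides with STA1 at time 0
        Action("STA1", 1, True),   # transmits alone at time 1
    ]
    rewards = determine_rewards(actions)
    print(build_broadcast_frame(rewards, 0))
    print(build_broadcast_frame(rewards, 1))
```

Because the reward value of a station is computed from the actions of all stations at the same time rather than from that station's action alone, the mutual influence between stations is reflected in the value each station uses for training.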

The units described above as separate components may or may not be physically separated. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments of the present application. In addition, the network element units in the various embodiments of the present application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software network element unit.

If the above integrated unit is implemented in the form of a software network element unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the part of the technical solution of the present application that makes the essential contribution, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a terminal device, a cloud server, a network device, or the like) to execute all or part of the steps of the methods in the various embodiments of the present application. The aforementioned storage media include a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any modification or replacement that a person familiar with the technical field can readily conceive within the technical scope disclosed in the present application shall be covered by the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims

1. A reinforcement learning training method, characterized in that the method includes:

determining a first reward value based on actions of multiple stations, wherein the first reward value is the reward value of a first station among the multiple stations, and the first reward value is used by the first station to perform reinforcement learning training; and

sending the first reward value to the first station.
2. The method according to claim 1, characterized in that the action of a station includes at least one of the following: the station initiates channel access, the station performs channel selection, the station performs power control, and the station performs rate adaptation.
3. The method according to claim 1 or 2, characterized in that determining the first reward value based on the actions of multiple stations includes:

determining the first reward value based on the actions of the multiple stations and the times corresponding to the actions of the multiple stations.
4. The method according to claim 3, characterized in that the actions of the multiple stations correspond to the same time.
5. The method according to any one of claims 1 to 4, characterized in that the first reward value is a reward value corresponding to a first time, and the first time is the time corresponding to the action of the first station.
6. The method according to any one of claims 1 to 5, characterized in that sending the first reward value to the first station includes:

sending a broadcast frame to the first station, wherein the broadcast frame includes the first reward value.
7. The method according to claim 6, characterized in that the multiple stations further include a second station, and the method further includes:

if the first station and the second station send messages at the same time and the transmission fails, determining a reward value of the second station, wherein the reward value of the second station is the same as the first reward value; and

sending the broadcast frame to the second station.
8. The method according to any one of claims 1 to 5, characterized in that sending the first reward value to the first station includes:

sending a response frame of a first message to the first station;

wherein the response frame of the first message includes the first reward value, the first reward value corresponds to a second message, and the second message is received after the first message.
9. The method according to claim 8, characterized in that the response frame of the first message further includes identification information of the second message or a timestamp of the second message.
10. The method according to claim 9, characterized in that:

the timestamp of the second message is the reception time of the second message; or,

the timestamp of the second message is the difference between the reception time of the first message and the reception time of the second message.
11. The method according to any one of claims 1 to 10, characterized in that,

The first reward value

wherein d_0 is the time interval since the first station last received an acknowledgment frame of the first station, N is the number of stations, d_1 is the time interval since the first station last overheard an acknowledgment frame of the other stations, and the other stations are the stations among the multiple stations other than the first station.
12. A communication device, characterized in that the device includes a processing module and a transceiver module, wherein:

the processing module is configured to determine a first reward value based on actions of multiple stations, wherein the first reward value is the reward value of a first station among the multiple stations, and the first reward value is used by the first station to perform reinforcement learning training; and

the transceiver module is configured to send the first reward value to the first station.
13. The device according to claim 12, characterized in that the action of a station includes at least one of the following: the station initiates channel access, the station performs channel selection, the station performs power control, and the station performs rate adaptation.
14. The device according to claim 12 or 13, characterized in that, when determining the first reward value based on the actions of multiple stations, the processing module is configured to determine the first reward value based on the actions of the multiple stations and the times corresponding to the actions of the multiple stations.
15. The device according to claim 14, characterized in that the actions of the multiple stations correspond to the same time.
16. The device according to claim 14, characterized in that the first reward value is a reward value corresponding to a first time, and the first time is the time corresponding to the action of the first station.
17. The device according to any one of claims 12 to 16, characterized in that, when sending the first reward value to the first station, the transceiver module is configured to send a broadcast frame to the first station, wherein the broadcast frame includes the first reward value.
18. The device according to claim 17, characterized in that:

the multiple stations further include a second station, and the processing module is further configured to determine a reward value of the second station if the first station and the second station send messages at the same time and the transmission fails, wherein the reward value of the second station is the same as the first reward value; and

the transceiver module is further configured to send the broadcast frame to the second station.
19. The device according to any one of claims 12 to 16, characterized in that, when sending the first reward value to the first station, the transceiver module is configured to send a response frame of a first message to the first station;

wherein the response frame of the first message includes the first reward value, the first reward value corresponds to a second message, and the second message is received after the first message.
20. The device according to claim 19, characterized in that the response frame of the first message further includes identification information of the second message or a timestamp of the second message.
21. The device according to claim 19, characterized in that:

the timestamp of the second message is the reception time of the second message; or,

the timestamp of the second message is the difference between the reception time of the first message and the reception time of the second message.
22. The device according to any one of claims 12 to 21, characterized in that,

The first reward value

wherein d_0 is the time interval since the first station last received an acknowledgment frame of the first station, N is the number of stations, d_1 is the time interval since the first station last overheard an acknowledgment frame of the other stations, and the other stations are the stations among the multiple stations other than the first station.
23. A chip, characterized in that the chip includes at least one processor and an interface, the processor is configured to read and execute instructions stored in a memory, and when the instructions are executed, the chip executes the method according to any one of claims 1 to 11.
24. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program includes program instructions, and when executed by a computer, the program instructions cause the computer to execute the method according to any one of claims 1 to 11.