WO2024032228A1 - Reinforcement learning training method and related device - Google Patents

Reinforcement learning training method and related device

Info

Publication number
WO2024032228A1
Authority
WO
WIPO (PCT)
Prior art keywords
station
message
site
value
actions
Prior art date
Application number
PCT/CN2023/104247
Other languages
French (fr)
Chinese (zh)
Inventor
刘鹏
郭子阳
罗嘉俊
舒同欣
杨讯
颜敏
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2024032228A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W74/00 Wireless channel access
    • H04W74/002 Transmission of channel access control information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L5/00 Arrangements affording multiple use of the transmission path
    • H04L5/003 Arrangements for allocating sub-channels of the transmission path
    • H04L5/0053 Allocation of signaling, i.e. of overhead other than pilot signals
    • H04L5/0055 Physical resource allocation for ACK/NACK
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W48/00 Access restriction; Network selection; Access point selection
    • H04W48/08 Access restriction or access information delivery, e.g. discovery data delivery
    • H04W48/10 Access restriction or access information delivery, e.g. discovery data delivery using broadcasted information
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W74/00 Wireless channel access

Definitions

  • the present application relates to the fields of computer technology and communication technology, and in particular, to a reinforcement learning training method and related devices.
  • Reinforcement learning is a general method used to implement sequence decision-making.
  • the agent learns in a "trial and error" manner, and the reward value obtained by interacting with the environment through actions guides its behavior.
  • the goal is for the agent to obtain the maximum reward value.
  • it is often necessary to use actions, environmental states and reward values for reinforcement learning training.
  • the accuracy of the reward value obtained is low, which leads to poor practical application results after using actions, environmental states and reward values for reinforcement learning training.
  • This application provides a reinforcement learning training method and related devices, which can improve the accuracy of the return value, thereby enabling the site to improve the actual application effect after using the return value for reinforcement learning training.
  • a reinforcement learning training method includes: determining a first reward value based on the actions of multiple sites, where the first reward value is the reward value of a first site among the multiple sites and is used for reinforcement learning training at the first site; and sending the first reward value to the first site. It can be seen that by determining the reward value based on the actions of multiple sites, the calculation of the reward value can take into account the interaction between users, which improves the accuracy of the reward value and thus allows the site to achieve a better practical application effect after using the reward value for reinforcement learning training.
  • the action of a station includes at least one of the following: the station initiates channel access, the station performs channel selection, the station performs power control, and the station performs rate adaptation.
  • the first site can be any site among the multiple sites. This means that for any one of the multiple sites, the access point determines the reward value of that site based on the actions of the multiple sites. For example, the access point determines the reward value of site 1 based on the actions of site 1, site 2, and site 3; the access point determines the reward value of site 2 based on the actions of site 1, site 2, and site 3; and the access point determines the reward value of site 3 based on the actions of site 1, site 2, and site 3.
  • the actions of different sites in multiple sites can be exactly the same, partially the same, or completely different, which is not limited here.
  • for example, the action of site #1 is to initiate channel access, the action of site #2 is to initiate channel access, and the action of site #3 is to initiate channel access. So the actions of the three sites are exactly the same.
  • for another example, the action of site #1 is to initiate channel access, the action of site #2 is to initiate channel access, and the action of site #3 is to perform power control. So the actions of the three sites are partially the same.
  • for another example, the action of site #1 is to initiate channel access, the action of site #2 is to perform rate adaptation, and the action of site #3 is to perform power control. So the actions of the three sites are completely different.
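The three cases above can be told apart by counting how many distinct actions the sites take. The following is a minimal sketch under stated assumptions: the `Action` enum and the `compare_actions` helper are hypothetical names introduced only for illustration, and each site is assumed to take exactly one action.

```python
from enum import Enum


class Action(Enum):
    # The four action types listed in the text; the enum itself is illustrative.
    CHANNEL_ACCESS = "initiate channel access"
    CHANNEL_SELECTION = "perform channel selection"
    POWER_CONTROL = "perform power control"
    RATE_ADAPTATION = "perform rate adaptation"


def compare_actions(actions):
    """Classify the actions of several sites (one action per site)."""
    if not actions:
        raise ValueError("at least one site action is required")
    distinct = len(set(actions))
    if distinct == 1:
        return "exactly the same"
    if distinct == len(actions):
        return "completely different"
    return "partially the same"


print(compare_actions([Action.CHANNEL_ACCESS] * 3))
print(compare_actions([Action.CHANNEL_ACCESS, Action.CHANNEL_ACCESS, Action.POWER_CONTROL]))
print(compare_actions([Action.CHANNEL_ACCESS, Action.RATE_ADAPTATION, Action.POWER_CONTROL]))
```

The three calls correspond, in order, to the three examples with sites #1, #2 and #3.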
  • Reinforcement learning is used to describe and solve the problem of an agent learning strategies to maximize returns or achieve specific goals during its interaction with the environment.
  • a common model of reinforcement learning is the Markov decision process (MDP).
  • MDP is a mathematical model for analyzing decision-making problems.
  • In reinforcement learning, the agent learns in a "trial and error" manner, and the rewards obtained by interacting with the environment through actions guide its behavior. The goal is to enable the agent to obtain the maximum reward.
  • an intelligent agent can be understood as an AI model, including a large number of parameters and calculation formulas (or calculation rules). Rewards can also be called return value, evaluation, etc.
  • Reinforcement learning can use the reinforcement signal (i.e. reward) provided by the environment to evaluate the quality of the action, rather than telling the reinforcement learning system how to produce the correct action. Since the external environment provides little information, the agent must rely on its own experience to learn. In this way, the agent acquires knowledge in an action-evaluation (i.e., reward) environment and improves its action plan to adapt to the environment.
  • reinforcement learning algorithms include deep Q-network (DQN), proximal policy optimization (PPO), etc.
  • determining the first reward value based on the actions of multiple sites includes: determining the first reward value based on the actions of the multiple sites and the time corresponding to the actions of the multiple sites. It can be seen that by determining the reward value based on the actions of multiple sites and the time corresponding to those actions, the calculation of the reward value can take into account both the mutual influence between users and the time corresponding to the actions of different sites. This enriches the information used to determine the reward value and improves its accuracy, thereby allowing the site to achieve a better practical application effect after using the reward value for reinforcement learning training.
  • the actions of multiple sites correspond to the same time. It can be seen that because the times corresponding to the actions of the multiple sites are the same, when the access point determines the reward value based on the actions of the multiple sites and the time corresponding to those actions, the accuracy of the reward value can be improved, thereby allowing the site to achieve a better practical application effect after using the reward value for reinforcement learning training.
  • the first reward value is the reward value corresponding to a first time, and the first time is the time corresponding to the action of the first site. It can be seen that because the reward value corresponds to a certain time, the site can learn the action and environmental state corresponding to that time, which in turn allows the site to achieve a better practical application effect after using the reward value for reinforcement learning training.
  • sending the first reward value to the first station includes: sending a broadcast frame to the first station, where the broadcast frame includes the first reward value. It can be seen that because the first reward value is carried by the broadcast frame, other stations can also receive the broadcast frame.
  • the broadcast frame may be, for example, a beacon frame or a trigger frame.
  • the multiple sites also include a second site, and the method further includes: if the first site and the second site send messages at the same time and cause transmission failure, determining the reward value of the second site, where the reward value of the second site is the same as the first reward value; and sending the broadcast frame to the second site. It can be seen that when the reward values of different sites are the same, sending a broadcast frame lets the different sites obtain the reward value and saves overhead.
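The collision case above can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_broadcasts`, the dictionary layout standing in for a broadcast frame, and the default reward of -1.0 are all hypothetical and are not taken from the application.

```python
from collections import defaultdict


def build_broadcasts(send_times, collision_reward=-1.0):
    """Group sites by send time and build one broadcast per collision.

    send_times maps a site identifier to the time at which it sent a message.
    Sites that sent at the same time are treated as having collided and share
    a single reward value, carried once in a broadcast (modelled as a dict).
    """
    by_time = defaultdict(list)
    for site, t in send_times.items():
        by_time[t].append(site)

    frames = []
    for t, sites in sorted(by_time.items()):
        if len(sites) > 1:  # simultaneous transmission, assumed to fail
            frames.append({"time": t, "reward": collision_reward, "sites": sorted(sites)})
    return frames


frames = build_broadcasts({"STA1": 5, "STA2": 5, "STA3": 9})
print(frames)
```

One broadcast serves both colliding sites, so the shared reward value is sent once rather than once per site.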
  • sending the first reward value to the first station includes: sending a response frame of the first message to the first station, where the response frame of the first message includes the first reward value, the first reward value corresponds to the second message, and the second message is received after the first message. It can be seen that the reward value corresponding to the second message can be carried in the response frame of the first message. Because the second message is received after the first message, delayed sending of the reward value corresponding to the second message is achieved, which provides more time for the calculation of the reward value.
  • the correspondence between the first return value and the second message can be understood as: the first return value corresponds to the action of the first station in the second message.
  • the time corresponding to the action of the first station is the time when the access point receives the second message.
  • the response frame may be, for example, an acknowledgment (ACK) frame, a clear to send (CTS) frame, or a block acknowledgment (block ACK, BA) frame, etc.
  • the response frame of the first message further includes identification information of the second message or a timestamp of the second message. It can be seen that since the response frame of the first message also includes the identification information of the second message or the timestamp of the second message, the first station can learn which message the first return value corresponds to.
  • the identification information of the second message may be, for example, the index value of the second message.
  • the identification information of the second message may be, for example, the difference between the index value of the first message and the index value of the second message.
  • for example, if the index value of the first message is 10 and the index value of the second message is 4, the identification information of the second message can be 4 (the index value itself) or 6 (the difference between the two index values).
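The two ways of identifying the second message can be sketched as below. The function names are hypothetical; the sketch only assumes that the station knows the index value of the first message, so that a difference can be converted back to an index.

```python
def identification_options(first_index, second_index):
    """Return the two candidate identifiers for the second message.

    absolute: the index value of the second message itself.
    relative: the difference between the first and second index values.
    """
    absolute = second_index
    relative = first_index - second_index
    return absolute, relative


def resolve_second_index(first_index, relative):
    """Recover the second message's index value from the difference form."""
    return first_index - relative


print(identification_options(10, 4))  # (4, 6)
print(resolve_second_index(10, 6))    # 4
```

The difference form lets the station recover the index value as long as it knows which message the response frame answers.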
  • the timestamp of the second message is the reception time of the second message; or, the timestamp of the second message is the difference between the reception time of the first message and the reception time of the second message.
  • $d_0$ is the time interval between the first station and the last time it received the acknowledgment frame from the first station, $N$ is the number of stations, and $d_1$ is the time interval between the first station and the last time it heard the acknowledgment frame from other stations, where the other stations are the stations other than the first station among the multiple stations.
  • in one case, the first reward value is $d_0 - (N-1) \cdot d_1$.
  • in another case, the first reward value is $-N$.
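The two reward expressions can be evaluated as in the sketch below. The function names and the sample numbers are hypothetical, and the conditions under which each expression applies are not stated in this text, so the functions only compute the expressions.

```python
def reward_from_intervals(d0, d1, n_stations):
    """Evaluate d0 - (N - 1) * d1 for the first station."""
    return d0 - (n_stations - 1) * d1


def penalty_reward(n_stations):
    """Evaluate the alternative reward value -N."""
    return -n_stations


# Illustrative numbers only: N = 3 stations, d0 = 6.0 and d1 = 2.0 time units.
print(reward_from_intervals(6.0, 2.0, 3))  # 6.0 - 2 * 2.0 = 2.0
print(penalty_reward(3))                   # -3
```

With these sample values the first expression is positive and the second is negative.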
  • in a second aspect, a communication device is provided, which includes a processing module and a transceiver module.
  • the processing module is configured to determine a first reward value based on the actions of multiple sites, where the first reward value is the reward value of a first site among the multiple sites and is used for reinforcement learning training at the first site;
  • the transceiver module is configured to send the first reward value to the first site.
  • an action of a station includes at least one of the following: the station initiates channel access, the station performs channel selection, the station performs power control, and the station performs rate adaptation.
  • when determining the first reward value based on the actions of multiple sites, the processing module is configured to determine the first reward value based on the actions of the multiple sites and the time corresponding to the actions of the multiple sites.
  • the actions of multiple sites correspond to the same time.
  • the first reward value is the reward value corresponding to the first time
  • the first time is the time corresponding to the action of the first site.
  • when sending the first reward value to the first station, the transceiver module is configured to send a broadcast frame to the first station, where the broadcast frame includes the first reward value.
  • the multiple sites also include a second site. The processing module is also configured to determine the reward value of the second site if the first site and the second site send messages at the same time and cause transmission failure, where the reward value of the second site is the same as the first reward value; the transceiver module is also configured to send the broadcast frame to the second site.
  • when sending the first reward value to the first site, the transceiver module is configured to send a response frame of the first message to the first site, where the response frame of the first message includes the first reward value, the first reward value corresponds to the second message, and the second message is received after the first message.
  • the response frame of the first message also includes identification information of the second message or a timestamp of the second message.
  • the timestamp of the second message is the reception time of the second message; or, the timestamp of the second message is the difference between the reception time of the first message and the reception time of the second message.
  • $d_0$ is the time interval between the first station and the last time it received the acknowledgment frame from the first station, $N$ is the number of stations, and $d_1$ is the time interval between the first station and the last time it heard the acknowledgment frame from other stations, where the other stations are the stations other than the first station among the multiple stations.
  • in a third aspect, a chip is provided, which includes at least one processor and an interface.
  • the processor is used to read and execute instructions stored in a memory. When the instructions are executed, the chip executes the method described in any one of the first aspect.
  • in a fourth aspect, a computer-readable storage medium is provided, which stores a computer program.
  • the computer program includes program instructions. When executed by a computer, the program instructions cause the computer to execute the method described in any one of the first aspect.
  • in a fifth aspect, a communication device is provided, including a processor, a memory, an input interface and an output interface.
  • the input interface is used to receive information from communication devices other than the communication device, and the output interface is used to output information to communication devices other than the communication device.
  • the processor calls the computer program stored in the memory to implement the method described in any one of the first aspect.
  • the communication device may be a chip that implements the method in the first aspect or a device containing the chip.
  • a sixth aspect provides a computer program product, which when a computer reads and executes the computer program product, causes the computer to execute the method described in any one of the first aspects.
  • a seventh aspect provides a communication system, including an access point and/or a station for implementing any one of the methods in the first aspect.
  • FIG. 1 is a network architecture diagram of a WLAN provided by an embodiment of the present application.
  • Figure 2 is a schematic diagram of reinforcement learning provided by an embodiment of the present application.
  • Figure 3 is a schematic diagram of the hardware structure of a communication device applicable to embodiments of the present application.
  • Figure 4 is a schematic flow chart of a reinforcement learning training method provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of a delayed feedback reward value provided by an embodiment of the present application.
  • Figure 6 is a beneficial effect diagram provided by the embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a communication device provided by an embodiment of the present application.
  • At least one of the following or similar expressions thereof refers to any combination of these items, including any combination of a single item (items) or a plurality of items (items).
  • at least one of a, b, or c can mean: a, b, c, ab, ac, bc, or abc, where each of a, b, and c can be one or more.
  • words such as "first" and "second" are used to distinguish identical or similar items that have basically the same function. Those skilled in the art can understand that words such as "first" and "second" do not limit the number or the execution order.
  • the embodiments of the present application can be applied to wireless local area network (WLAN) scenarios and to IEEE 802.11 system standards, such as 802.11a/b/g, 802.11n, 802.11ac, 802.11ax, or their next generation, such as 802.11be or later standards.
  • the embodiments of this application can also be applied to Internet of Things (IoT), vehicle to X (V2X), and narrowband Internet of Things (NB-IoT) systems, and to other short-distance communication systems (such as Bluetooth and ultra wide band (UWB)), etc.
  • LTE long term evolution
  • FDD frequency division duplex
  • TDD time division duplex
  • UMTS universal mobile telecommunication system
  • WiMAX global interoperability for microwave access
  • WLAN starts with the 802.11a/g standard and goes through 802.11n, 802.11ac, 802.11ax and the 802.11be and Wi-Fi 8 that are being discussed today.
  • 802.11n can also be called high throughput (HT);
  • 802.11ac can also be called very high throughput (VHT);
  • 802.11ax can also be called high efficiency (HE) or Wi-Fi 6;
  • 802.11be can also be called extremely high throughput (EHT) or Wi-Fi 7, while standards before HT, such as 802.11a/b/g, are collectively called non-high throughput (Non-HT).
  • Figure 1 is a network architecture diagram of a WLAN provided by an embodiment of the present application.
  • Figure 1 takes the WLAN as an example including 1 wireless access point (AP) and 2 stations (STAs).
  • a STA associated with an AP can receive wireless frames sent by the AP and can also send wireless frames to the AP.
  • the embodiments of the present application are also applicable to the communication between APs.
  • each AP can communicate with each other through a distributed system (DS).
  • the embodiments of the present application are also applicable to the communication between STAs. It should be understood that the numbers of APs and STAs in Figure 1 are only examples, and there may be more or fewer.
  • the STA involved in the embodiments of this application can be various user terminals, user devices, access devices, subscriber stations, subscriber units, mobile stations, user agents, user equipment or devices of other names with wireless communication functions, where the user terminal can include various handheld devices, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to wireless modems with wireless communication functions, as well as various forms of user equipment (UE), mobile stations (MS), terminals, terminal equipment, portable communication devices, handsets, portable computing devices, entertainment devices, gaming devices or systems, global positioning system devices, or any other suitable equipment configured for network communication via a wireless medium.
  • STA can be a router, switch, bridge, etc.
  • for convenience of description, the devices mentioned above are collectively called sites or STAs.
  • the APs and STAs involved in the embodiments of this application may be APs and STAs applicable to the IEEE 802.11 system standard.
  • AP is a device deployed in a wireless communication network to provide wireless communication functions for its associated STAs.
  • the AP can be used as the hub of the communication system. It is usually a network-side product that supports the MAC and PHY of the 802.11 system standard. For example, it can be a base station, router, gateway, repeater, communication server, switch, bridge or other communication equipment, where the base station may include various forms of macro base stations, micro base stations, relay stations, etc.
  • the above-mentioned devices are collectively referred to as APs.
  • a STA is usually a terminal product that supports the media access control (MAC) and physical layer (PHY) of the 802.11 system standard, such as a mobile phone or a laptop.
  • the embodiments of this application can also be applied to a scenario where one node performs data transmission with one or more nodes; they can also be applied to a single-user uplink/downlink data transmission scenario or a multi-user uplink/downlink data transmission scenario; and they can also be applied to device-to-device (D2D) data transmission scenarios.
  • Any of the above nodes can be AP or STA.
  • the wireless communication system can be a wireless local area network or a cellular network.
  • This solution can be implemented by a communication device in the wireless communication system or a chip or processor in the communication device.
  • the communication device can be a wireless communication device that supports parallel transmission over multiple links, for example called a multi-link device or a multi-band device. Compared with devices that only support single-link transmission, multi-link devices have higher transmission efficiency and higher throughput.
  • Multi-link devices include one or more affiliated STAs (affiliated STAs). An affiliated STA is a logical site and can work on one link.
  • the affiliated station can be an access point (Access Point, AP) or a non-access point station (non-Access Point Station, non-AP STA).
  • a multi-link device whose affiliated stations are APs can be called a multi-link AP, a multi-link AP device, or an AP multi-link device, and a multi-link device whose affiliated stations are non-AP STAs can be called a multi-link STA, a multi-link STA device, or a STA multi-link device. It should be understood that each station in the multi-link device can work on one link respectively, but multiple stations are allowed to work on the same link.
  • the AP and STA can have certain artificial intelligence (AI) capabilities.
  • they can use neural networks for reasoning and decision-making, and can also perform neural network training.
  • reinforcement learning can be, for example, deep reinforcement learning (DRL).
  • reinforcement learning mainly contains five elements: agent, environment, state, action and reward. Among them, the input of the agent is the state and the output is the action.
  • the training process of reinforcement learning is: through multiple interactions between the agent and the environment, the actions, states, and rewards of each interaction are obtained; these multiple groups (actions, states, rewards) are used as training data to train the agent once. Using the above process, the agent is trained for the next round until the convergence conditions are met.
  • the process of obtaining the actions, status, and rewards of an interaction is shown in Figure 2.
  • the current state S0 of the environment is input to the agent, and the action A0 output by the agent is obtained.
  • the reward R0 of this interaction is calculated based on the relevant performance indicators of the environment under the action A0.
  • the action A0, state S0 and reward R0 of this interaction are thus obtained and recorded for subsequent use in training the agent.
  • the next state S1 of the environment under the action A0 can also be recorded in order to realize the next interaction between the agent and the environment.
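The interaction described above (state in, action out, reward computed, next state recorded) can be sketched as a short data-collection loop. Everything in this sketch is a toy stand-in introduced for illustration: `ToyChannelEnv`, its two-valued state, its reward rule, and `simple_agent` are hypothetical and do not come from the application.

```python
import random


class ToyChannelEnv:
    """Hypothetical environment: state 0 means channel idle, 1 means busy."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        # Illustrative reward rule: transmitting (action 1) on an idle channel
        # earns 1.0, transmitting on a busy channel costs 1.0, waiting earns 0.0.
        if action == 1:
            reward = 1.0 if self.state == 0 else -1.0
        else:
            reward = 0.0
        self.state = self.rng.randint(0, 1)  # next state of the environment
        return self.state, reward


def simple_agent(state):
    """Stand-in policy: transmit only when the channel is idle."""
    return 1 if state == 0 else 0


def collect(env, agent, num_steps):
    """Record (state, action, reward, next_state) for each interaction."""
    records = []
    state = env.state
    for _ in range(num_steps):
        action = agent(state)                  # S0 -> A0
        next_state, reward = env.step(action)  # R0 and S1 under A0
        records.append((state, action, reward, next_state))
        state = next_state                     # S1 drives the next interaction
    return records


records = collect(ToyChannelEnv(seed=0), simple_agent, 5)
print(records[0][:3])  # (0, 1, 1.0)
```

The recorded tuples play the role of the (action, state, reward) groups used as training data; a real agent would be updated from them once enough interactions have been collected.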
  • the action is the action of the station
  • the action of the station may include at least one of the following: the station initiates channel access, the station performs channel selection, the station performs power control, and the station performs rate adaptation.
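  • when the station's action is to initiate channel access, the state may be one or more of the carrier sensing results, channel quality, packet loss rate, etc.
  • when the station's action is to perform channel selection, the state may be the load condition of the channel, etc.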
  • when the station's action is to perform power control, the state may be one or more of the station's location, channel quality, throughput, etc.
  • when the station's action is to perform rate adaptation, the state may be one or more of the carrier sensing results, channel quality, packet loss rate, etc.
  • each device (such as AP, STA, etc.) in Figure 1 can be implemented by one device, can also be implemented jointly by multiple devices, or can be a functional module in one device, which is not specifically limited in the embodiments of this application. It can be understood that the above functions can be network elements in hardware devices, software functions running on dedicated hardware, or virtualization functions instantiated on a platform (e.g., a cloud platform).
  • FIG. 3 is a schematic diagram of the hardware structure of a communication device applicable to embodiments of the present application.
  • the communication device 300 includes at least one processor 301, a communication line 302, a memory 303, and at least one communication interface 304.
  • the processor 301 may be one or more of a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a neural-network processing unit (NPU), etc.
  • the processor 301 may also be one or more integrated circuits used to control the execution of the program of the present application.
  • Communication line 302 may include a path for communicating information between the above-mentioned components.
  • the communication interface 304 is any transceiver-like device (such as an antenna, etc.) used to communicate with other devices or communication networks, such as Ethernet, RAN, wireless local area networks (WLAN), etc.
  • the memory 303 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM) or another type of dynamic storage device that can store information and instructions. It can also be an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
  • the memory may exist independently and be connected to the processor through a communication line 302 . Memory can also be integrated with the processor.
  • the memory provided by the embodiment of the present application may generally be non-volatile.
  • the memory 303 is used to store computer execution instructions for executing the solution of the present application, and is controlled by the processor 301 for execution.
  • the processor 301 is used to execute computer execution instructions stored in the memory 303, thereby implementing the methods provided by the following embodiments of the application.
  • the computer-executed instructions in the embodiments of the present application may also be called application codes, which are not specifically limited in the embodiments of the present application.
  • the processor 301 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 3 .
  • the communication device 300 may include multiple processors, such as the processor 301 and the processor 307 in FIG. 3 . Each of these processors may be a single-CPU processor or a multi-CPU processor.
  • a processor here may refer to one or more devices, circuits, and/or processing cores for processing data (eg, computer program instructions).
  • the communication device 300 may also include an output device 305 and an input device 306.
  • Output device 305 communicates with processor 301 and can display information in a variety of ways.
  • the output device 305 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector, etc.
  • Input device 306 communicates with processor 301 and can receive user input in a variety of ways.
  • the input device 306 may be a mouse, a keyboard, a touch screen device, a sensing device, or the like.
  • the processor 301 can read the software program in the memory 303, interpret and execute the instructions of the software program, and process the data of the software program.
  • the processor 301 performs baseband processing on the data to be sent, and then outputs the baseband signal to the radio frequency circuit.
  • the radio frequency circuit performs radio frequency processing on the baseband signal and then sends the radio frequency signal out in the form of electromagnetic waves through the antenna.
  • the radio frequency circuit receives the radio frequency signal through the antenna, converts the radio frequency signal into a baseband signal, and outputs the baseband signal to the processor 301.
  • the processor 301 converts the baseband signal into data and processes the data.
  • the radio frequency circuit and the antenna can be arranged independently of the processor that performs baseband processing.
  • the radio frequency circuit and the antenna can be arranged independently of the communication device in a remote arrangement.
  • the neural network processor may include, for example, a training module and an inference module not shown in Figure 3.
  • the inputs of the training module may include, for example, actions, states, reward values, etc., and the outputs may be neural network parameters.
  • the trained neural network parameters can be fed back to the inference module.
  • the neural network processor can interact with various modules of the communication device 300, such as controlling the transmission of data on the communication interface to save energy; or interacting with the antenna to control the orientation of the antenna.
  • the communication device 300 may also include a media access control (media access control, MAC) not shown in FIG. 3 .
  • the neural network processor can also interact with the MAC to control channel access, channel selection, and spatial multiplexing decisions.
  • the above-mentioned communication device 300 may be a general-purpose device or a special-purpose device.
  • the communication device 300 may be a desktop computer, a portable computer, a network server, a personal digital assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, an embedded device, or a device with a structure similar to that shown in Figure 3.
  • the embodiment of the present application does not limit the type of communication device 300.
  • Figure 4 is a schematic flow chart of a reinforcement learning training method provided by an embodiment of the present application. As shown in Figure 4, the method includes but is not limited to the following steps:
  • Step 401: the access point determines a first reward value based on the actions of multiple stations, and the first reward value is the reward value of the first station among the multiple stations.
  • the first site can be any site among the multiple sites. This means that for any one of the multiple sites, the access point determines the reward value of that site based on the actions of the multiple sites. For example, the access point determines the reward value of site 1 based on the actions of site 1, site 2, and site 3; the access point determines the reward value of site 2 based on the actions of site 1, site 2, and site 3; and the access point determines the reward value of site 3 based on the actions of site 1, site 2, and site 3.
  • the access point can learn the action of the first station in any of the following ways. It should be understood that whether the access point uses method 1.1 or method 1.2 to learn the action of the first station may depend on the implementation of the access point, a prior agreement, or a standard definition.
  • Method 1.1 The access point receives the action of the first station sent by the first station. For example, the packet sent by the first station is received by the access point. Because the packet includes the action of the first station, the access point can learn the action of the first station from the packet. In another example, the message sent by the first station is not received by the access point. Because the message is lost, the access point cannot learn the action of the first station from it, so the first station can send the action of the lost message to the access point again. That is, when the packet sent by the first station is not received by the access point, the access point learns the action of the first station through the action of the lost packet.
  • Method 1.2 The access point determines the action of the first station by itself. For example, the first station does not send packets, so the access point does not receive the packets sent by the first station. At this time, the access point can determine the action of the first station by itself.
  • the packet sent by the first station is received by the access point, and the packet includes one or more of the rate information and time length information of the packet.
  • the packet header of the packet includes one or more of the packet's rate information, time length information, etc. Therefore, the access point can determine the action of the first station based on one or more of the packet rate information, time length information, and the like. It should be understood that when the access point determines the action of the first station by itself, the action of the first station may be empty.
  • either method 1.1 or 1.2 enables the access point to learn the action of the site, thereby preparing for subsequent access points to determine the return value.
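Methods 1.1 and 1.2 can be pictured with a minimal Python sketch. The `Packet` fields, the `learn_action` helper, and the rule that header rate or time length information implies a channel-access action are illustrative assumptions, not definitions from this application.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    """Hypothetical view of a received packet (field names are assumptions)."""
    station: str
    action: Optional[str] = None         # action carried explicitly (method 1.1)
    rate_mbps: Optional[float] = None    # header rate information (method 1.2)
    duration_us: Optional[float] = None  # header time length information (method 1.2)


def learn_action(packet: Optional[Packet]) -> Optional[str]:
    """Return the station's action as the access point would learn it."""
    if packet is None:
        # Nothing was received: the access point decides by itself,
        # and the action may be empty.
        return None
    if packet.action is not None:
        # Method 1.1: the packet itself carries the action.
        return packet.action
    if packet.rate_mbps is not None or packet.duration_us is not None:
        # Method 1.2: infer the action from header information.
        # Mapping these header fields to "channel_access" is an assumption.
        return "channel_access"
    return None
```

In this sketch an empty action is represented by `None`, which matches the statement that the action may be empty when the access point determines it by itself.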
  • the actions of different sites in multiple sites can be exactly the same, partially the same, or completely different, which is not limited here.
  • the action of site #1 is to initiate channel access
  • the action of site #2 is to initiate channel access
  • the action of site #3 is to initiate channel access. So all three sites behave exactly the same.
  • the action of station #1 is to initiate channel access
  • the action of station #2 is to initiate channel access
  • the action of station #3 is to perform power control. So the actions of the three sites are partially the same.
  • the action of station #1 is to initiate channel access, the action of station #2 is to perform rate adaptation, and the action of station #3 is to perform power control. So the actions of the three sites are completely different.
  • the first reward value can be used at the first site for reinforcement learning training.
  • step 401 may include: the access point determines the first reward value based on the actions of multiple stations and the times corresponding to the actions of the multiple stations. It can be seen that by determining the reward value based on the actions of multiple sites and the times corresponding to those actions, the calculation of the reward value can take into account both the mutual influence between users and the times corresponding to the actions of different sites. This enriches the information used to determine the reward value and improves its accuracy, thereby allowing the site to improve the actual application effect after using the reward value for reinforcement learning training.
  • the time corresponding to the action of the first station may be, for example, the time when the access point receives the packet.
  • the time corresponding to the action of the first station may be, for example, the sending time of the lost packet.
  • the time corresponding to the action of the first station may be, for example, the time when the first station initiates channel access.
  • the actions of multiple sites correspond to the same time. It can be seen that because the actions of multiple sites correspond to the same time, the accuracy of the return value is improved when the access point determines the return value based on the actions of multiple sites and the time corresponding to those actions, which in turn allows the site to improve the actual application effect after using the return value for reinforcement learning training.
  • the first reward value is the reward value corresponding to the first time
  • the first time is the time corresponding to the action of the first site. It can be seen that because the return value is the return value corresponding to a certain time, the site can learn the actions and environmental status corresponding to that time, which in turn allows the site to improve the actual application effect after using the return value for reinforcement learning training.
  • d_0 is the time interval since the acknowledgment frame of the first station was last received
  • N is the number of stations
  • d_1 is the time interval since an acknowledgment frame of the other stations was last heard
  • the other stations are the stations other than the first station among the multiple stations.
  • the first return value is d_0 - (N-1)*d_1.
  • the first reward value is d_0 - (N-1)*d_1.
  • the first report value is -N.
  • the first report value is -N.
  • the calculation of the return value can be combined with the interaction between users, which improves the accuracy of the return value, thereby allowing the site to improve the actual application effect after using the return value for reinforcement learning training.
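As a minimal sketch, the expression d_0 - (N-1)*d_1 given above can be evaluated as follows. The function and argument names are hypothetical, and the conditions under which the text uses -N instead are not recoverable here, so that case is not modeled.

```python
def first_reward_value(d0: float, d1: float, n_stations: int) -> float:
    """Evaluate d0 - (N - 1) * d1, with N the number of stations."""
    if n_stations < 1:
        raise ValueError("n_stations must be at least 1")
    return d0 - (n_stations - 1) * d1
```

With three stations, `first_reward_value(5.0, 1.0, 3)` evaluates 5.0 - 2 * 1.0. The weight (N-1) on d_1 grows with the number of other stations, which is how the value reflects the mutual influence between users.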
  • the method may also include step 402.
  • Step 402: the access point sends the first reward value to the first station.
  • the first station receives the first report value sent by the access point.
  • step 402 can be implemented in any of the following ways. It should be understood that whether the access point uses method 2.1 or method 2.2 to send the first return value may depend on the implementation of the access point, prior agreement, or standard definition.
  • Method 2.1 The access point sends a broadcast frame or a multicast frame to the first station.
  • the first station receives the broadcast frame or multicast frame sent by the access point.
  • the broadcast frame or the multicast frame includes the first report value
  • the multicast frame may also include the address of the first station.
  • the broadcast frame may be, for example, a beacon frame or a trigger frame. It should be understood that for the situation of method 2.1, the access point can use the above method 1.1 to learn the action of the first station through the action of losing the packet, or the access point can use the above method 1.2 to learn the action of the first station. It can be seen that because the first report value is carried by the broadcast frame or the multicast frame, other stations can also receive the broadcast frame or the multicast frame.
  • Method 2.2 The access point sends a response frame of the first message to the first station.
  • the first station receives the response frame of the first message sent by the access point.
  • the response frame of the first message includes a first return value, the first return value corresponds to the second message, and the second message is received before the first message.
  • the correspondence between the first return value and the second message can be understood as: the first return value corresponds to the action of the first station in the second message.
  • the time corresponding to the action of the first station is the time when the access point receives the second message.
  • the response frame may be, for example, an acknowledgment (ACK) frame, a clear to send (CTS) frame, or a block acknowledgment (block ACK, BA) frame, etc.
  • FIG. 5 is a schematic diagram of a delayed feedback reward value provided by an embodiment of the present application.
  • station 1 sends message 1 to the access point, and message 1 includes the action of station 1.
  • the access point sends a response frame of message 1 to station 1.
  • station 1 sends message 2 to the access point.
  • the access point sends a response frame of message 2 to station 1.
  • the response frame of message 2 includes the return value corresponding to message 1.
  • the return value corresponding to message 1 is determined based on the actions of multiple stations.
  • station 1 is one of the multiple sites. That is, because it takes time for the access point to calculate the return value corresponding to message 1, the access point can carry the return value corresponding to message 1 in the response frame of message 2.
  • the return value corresponding to the second message can be carried in the response frame of the first message. Because the second message is received before the first message, the return value corresponding to the second message is sent with a delay, which provides more time for the calculation of the return value.
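The delayed feedback of Figure 5 (the response frame of message 2 carrying the return value for message 1) can be sketched as below. The class, method, and dictionary key names are hypothetical, and a real response frame is a binary frame rather than a Python dictionary.

```python
from typing import Dict, Tuple


class DelayedRewardAP:
    """Hypothetical access point that piggybacks finished rewards on later responses."""

    def __init__(self) -> None:
        # station id -> (index of the earlier message, its reward value)
        self._ready: Dict[str, Tuple[int, float]] = {}

    def reward_computed(self, station: str, msg_index: int, reward: float) -> None:
        """Record that the reward for an earlier message is now available."""
        self._ready[station] = (msg_index, reward)

    def build_response(self, station: str, msg_index: int) -> dict:
        """Build the response for the message just received from the station."""
        frame = {"ack_for": msg_index}
        pending = self._ready.pop(station, None)
        if pending is not None:
            # Carry the reward of the earlier message in this response.
            frame["reward_for"], frame["reward"] = pending
        return frame
```

In this sketch the response to message 1 carries no reward because the calculation has not finished; once `reward_computed("sta1", 1, ...)` is called, the response to message 2 carries it.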
  • the multiple sites may also include a second site.
  • the method also includes: if the first site and the second site send messages at the same time and cause transmission failure, the access point determines the reward value of the second site, where the reward value of the second site is the same as the first reward value; the access point sends a broadcast frame to the second station.
  • the first station and the second station send packets at the same time and cause transmission failure, which can be understood as: the first station and the second station send packets at the same time, causing the first station to fail to transmit, and the second station also fails to transmit. It can be seen that when the return values of different sites are the same, by sending broadcast frames, different sites can obtain the return values, saving overhead.
  • the actions of station 1 and station 2 are channel access, that is, station 1 and station 2 send packets to the access point at the same time. Due to the collision, the packet transmission fails.
  • for the behavior of site 2 at time t, the AP can set the return value to a large negative value, such as -100.
  • the return value of site 1 is the same as the return value of site 2.
  • the most cost-effective way is to broadcast, which allows both site 1 and site 2 to obtain the return value.
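The collision example can be sketched as follows. Only the value -100 comes from the text; the rewards for a successful access and for staying silent, the action label `"channel_access"`, and the helper names are assumptions made for illustration.

```python
from typing import Dict, List, Optional

COLLISION_REWARD = -100.0  # example value given in the text


def slot_rewards(actions: Dict[str, Optional[str]]) -> Dict[str, float]:
    """Map each station to a reward based on the actions of all stations."""
    senders = [s for s, a in actions.items() if a == "channel_access"]
    rewards = {}
    for station in actions:
        if station in senders and len(senders) > 1:
            rewards[station] = COLLISION_REWARD  # simultaneous access collided
        elif station in senders:
            rewards[station] = 1.0  # assumed value for a successful access
        else:
            rewards[station] = 0.0  # assumed value for staying silent
    return rewards


def frame_type(rewards: Dict[str, float], targets: List[str]) -> str:
    """Pick "broadcast" when several targets share one reward, else "unicast"."""
    values = {rewards[t] for t in targets}
    return "broadcast" if len(targets) > 1 and len(values) == 1 else "unicast"
```

When two stations collide they receive the same value, so one broadcast frame can serve both, which is the overhead saving described above.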
  • the response frame of the first message may also include any of the following. It should be understood that whether the access point carries the first type or the second type in the response frame of the first message may depend on the implementation of the access point, pre-agreement or standard definition.
  • the identification information of the second message may be, for example, the index value of the second message.
  • the identification information of the second message may be, for example, the difference between the index value of the first message and the index value of the second message.
  • the index value of the first message is 10
  • the index value of the second message is 4, and the identification information of the second message can be 4 or 6.
  • the timestamp of the second message may be, for example, the reception time of the second message.
  • the timestamp of the second message may be, for example, the difference between the reception time of the first message and the reception time of the second message.
  • the reception time of the message can be understood as the time when the access point receives the message.
  • the first station can learn which message the first return value corresponds to.
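A short sketch of how the first station might resolve the identification information, using the example above (index values 10 and 4, identification information 4 or 6). The function name and the `is_difference` flag are hypothetical; how the station knows which encoding is in use is left to implementation, agreement, or standard definition.

```python
def second_message_index(first_index: int, ident: int, is_difference: bool) -> int:
    """Recover the index of the message that the reward corresponds to."""
    if is_difference:
        # ident is (index of first message) - (index of second message)
        return first_index - ident
    # ident is the index of the second message itself
    return ident
```

Both encodings from the example resolve to index 4. The same arithmetic would apply to the timestamp variant, with reception times in place of index values.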
  • the specific information carried by the response frame of the first message can be referred to Table 1 or Table 2, for example.
  • the response frame of the first message includes the first reward value and the identification information of the second message.
  • the response frame of the first message includes the first report value and the timestamp of the second message.
  • the method may also include step 403.
  • Step 403: the first site performs reinforcement learning training based on the first return value.
  • step 403 may include: the first station obtains the status and action corresponding to the first reward value; and the first station performs reinforcement learning training based on the first reward value, status and action.
  • the first station performing reinforcement learning training based on the first reward value, status and action can be understood as: the first station performs reinforcement learning training on the agent based on the first reward value, status and action.
  • different sites among multiple sites can use different reinforcement learning algorithms for reinforcement learning training.
  • site 1 uses DQN for reinforcement learning training
  • site 2 uses PPO for reinforcement learning training, etc.
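Step 403 can be sketched with a station that stores the state and action of each message and trains once the matching reward value arrives. Tabular Q-learning is used here only because it is short; it stands in for DQN or PPO and is not the algorithm of this application. The class name, hyperparameter values, and state labels are assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


class TabularStationAgent:
    """Hypothetical station-side learner using tabular Q-learning."""

    def __init__(self, actions: List[str], alpha: float = 0.1, gamma: float = 0.9):
        self.actions = actions
        self.alpha = alpha  # learning rate (assumed value)
        self.gamma = gamma  # discount factor (assumed value)
        self.q = defaultdict(float)  # (state, action) -> value estimate
        self.pending: Dict[int, Tuple[Hashable, str]] = {}

    def remember(self, msg_index: int, state: Hashable, action: str) -> None:
        """Store the state and action tied to a message until its reward arrives."""
        self.pending[msg_index] = (state, action)

    def on_reward(self, msg_index: int, reward: float, next_state: Hashable) -> None:
        """Look up the matching state and action, then apply one Q-learning update."""
        state, action = self.pending.pop(msg_index)
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```

Keying the buffer by message index lets the station pair a delayed reward value with the state and action it corresponds to, regardless of which learning algorithm consumes the resulting sample.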
  • Figure 6 is a beneficial effect diagram provided by the embodiment of the present application.
  • different sites among multiple sites use different deep learning algorithms.
  • the number of sites using DQN is 1, and the number of sites using PPO is also 1.
  • spectrum can still be shared fairly and efficiently.
  • the abscissa is the access delay (Delay) in seconds
  • the ordinate is the cumulative probability distribution (Probability). The figure shows the distribution of access delays for the sites that use different algorithms.
  • the abscissa is time in seconds, and the ordinate is throughput.
  • the total throughput of these sites tends to be stable between 0.8 and 1, and the throughput of each site tends to be stable between 0.4 and 0.6, which means that the throughput of each site is not very different, so these sites can share the spectrum fairly and efficiently.
  • the above-mentioned implementation devices include corresponding hardware structures and/or software modules for executing each function.
  • the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or computer software driving the hardware depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each specific application, but such implementations should not be considered beyond the scope of this application.
  • Embodiments of the present application can divide access points, sites, etc. into functional modules according to the above method examples.
  • each functional module can be divided corresponding to each function, or two or more functions can be integrated into one processing module.
  • the above integrated modules can be implemented in the form of hardware or software function modules. It should be noted that the division of modules in the embodiment of the present application is schematic and is only a logical function division. In actual implementation, there may be other division methods.
  • FIG. 7 is a schematic structural diagram of a communication device provided by an embodiment of the present application.
  • the communication device 700 can be applied to the method shown in FIG. 4 .
  • the communication device 700 includes: a processing module 701 and a transceiver module 702 .
  • the processing module 701 may be one or more processors, and the transceiver module 702 may be a transceiver or a communication interface.
  • the communication device can be used to implement the site or access point involved in any of the above method embodiments, or to implement the functions of the network element involved in any of the above method embodiments.
  • the network element or network function can be a network element in a hardware device, a software function running on dedicated hardware, or a virtualized function instantiated on a platform (eg, cloud platform).
  • the communication device 700 may also include a storage module 703 for storing program codes and data of the communication device 700 .
  • the communication device serves as an access point or is a chip applied in the access point, and performs the steps performed by the access point in the above method embodiment.
  • the transceiver module 702 is used to support communication with sites, etc.
  • the transceiver module specifically performs the sending and/or receiving actions performed by the access point in Figure 4, such as supporting the access point to perform step 402, and/or other processes of the technology described herein.
  • the processing module 701 may be used to support the communication device 700 to perform processing actions in the above method embodiments, for example, to support the access point to perform one or more steps in step 401, etc., and/or other processes of the technology described herein.
  • the processing module 701 is used to determine a first reward value based on the actions of multiple sites.
  • the first reward value is the reward value of the first site among the multiple sites.
  • the first reward value is used for the first site to perform reinforcement learning training; the transceiver module 702 is used to send the first reward value to the first site.
  • the action of a station includes at least one of the following: the station initiates channel access, the station performs channel selection, the station performs power control, and the station performs rate adaptation.
  • the processing module 701 when determining the first reward value based on the actions of multiple sites, is configured to determine the first reward value based on the actions of the multiple sites and the time corresponding to the actions of the multiple sites.
  • the actions of multiple sites correspond to the same time.
  • the first reward value is the reward value corresponding to the first time
  • the first time is the time corresponding to the action of the first site.
  • the transceiving module 702 when sending the first report value to the first station, is configured to send a broadcast frame to the first station, where the broadcast frame includes the first report value.
  • the multiple sites also include a second site.
  • the processing module 701 is also used to determine the reward value of the second site if the first site and the second site send messages at the same time and cause transmission failure.
  • the reward value of the second site is the same as the first reward value; the transceiver module 702 is also used to send a broadcast frame to the second station.
  • the transceiver module 702 when sending the first reward value to the first station, is configured to send a response frame of the first message to the first station; wherein the response frame of the first message includes the first reward value, the first reward value corresponds to the second message, and the second message is received before the first message.
  • the response frame of the first message also includes identification information of the second message or a timestamp of the second message.
  • the timestamp of the second message is the reception time of the second message; or, the timestamp of the second message is the difference between the reception time of the first message and the reception time of the second message.
  • d_0 is the time interval since the acknowledgment frame of the first station was last received
  • N is the number of stations
  • d_1 is the time interval since an acknowledgment frame of the other stations was last heard
  • the other stations are the stations other than the first station among the multiple stations.
  • the transceiver module 702 may be a communication interface, pin or circuit, etc.
  • the communication interface can be used to input data to be processed to the processor, and can output the processing results of the processor to the outside.
  • the communication interface can be a general purpose input output (GPIO) interface, which can be connected to multiple peripheral devices (such as a display (LCD), a camera, a radio frequency (RF) module, an antenna, etc.).
  • the communication interface is connected to the processor through a bus.
  • the processing module 701 may be a processor, and the processor may execute computer execution instructions stored in the storage module, so that the chip executes the method involved in the embodiment of FIG. 4 .
  • the processor may include a controller, arithmetic unit, and a register.
  • the controller is mainly responsible for decoding instructions and sending control signals for operations corresponding to the instructions.
  • the arithmetic unit is mainly responsible for performing fixed-point or floating-point arithmetic operations, shift operations, and logical operations. It can also perform address operations and conversions.
  • Registers are mainly responsible for storing register operands and intermediate operation results temporarily stored during instruction execution.
  • the hardware architecture of the processor can be an application-specific integrated circuit (ASIC) architecture, a microprocessor without interlocked pipeline stages (MIPS) architecture, an advanced RISC machines (ARM) architecture, or a network processor (NP) architecture, etc.
  • the processor can be single-core or multi-core.
  • the storage module can be a storage module within the chip, such as a register, cache, etc.
  • the storage module can also be a storage module located outside the chip, such as a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM), etc.
  • the processor and the interface can be realized through hardware design, software design, or a combination of software and hardware, which is not limited here.
  • Embodiments of the present application also provide a communication device, including a processor, a memory, an input interface and an output interface.
  • the input interface is used to receive information from communication devices other than the communication device, and the output interface is used to output information to communication devices other than the communication device.
  • the processor calls the computer program stored in the memory to implement the embodiment shown in Figure 4.
  • An embodiment of the present application also provides a chip.
  • the chip includes at least one processor and an interface.
  • the processor is configured to read and execute instructions stored in the memory. When the instructions are executed, the chip executes the embodiment shown in Figure 4 .
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • the computer program includes program instructions. When executed by a computer, the program instructions cause the computer to execute the embodiment shown in Figure 4.
  • An embodiment of the present application also provides a computer program product.
  • when a computer reads and executes the computer program product, the computer executes the embodiment shown in Figure 4 .
  • each network element unit in various embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or software network element unit.
  • if the above integrated unit is implemented in the form of a software network element unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which can be a personal computer, a terminal device, a cloud server, or a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive (U disk), a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

Abstract

The present application provides a reinforcement learning training method and a related device. The method comprises: determining a first reward value according to the actions of a plurality of stations, wherein the first reward value is a reward value of a first station in the plurality of stations, and the first reward value is used for the first station to perform reinforcement learning training; and sending the first reward value to the first station. It can be seen that a reward value is determined according to the actions of a plurality of stations, such that the calculation of the reward value can be performed by considering the mutual influence between users, thereby improving the accuracy of the reward value. Thus, the actual application effect can be improved after the station uses the reward value to perform reinforcement learning training. The present application can be applied to WLAN systems such as EHT, Wi-Fi 7, or Wi-Fi 8.

Description

一种强化学习的训练方法及相关装置A training method and related devices for reinforcement learning
本申请要求在2022年8月12日提交中国专利局、申请号为202210968171.8、申请名称为“一种强化学习的训练方法及相关装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims priority to the Chinese patent application filed with the China Patent Office on August 12, 2022, with application number 202210968171.8 and the application title "A training method for reinforcement learning and related devices", the entire content of which is incorporated by reference. in this application.
Technical Field
This application relates to the fields of computer technology and communication technology, and in particular to a reinforcement learning training method and a related device.
Background
Reinforcement learning is a general method for sequential decision-making. An agent learns by trial and error: the reward value it obtains by interacting with the environment through actions guides its behavior, and the goal is for the agent to obtain the maximum reward value. At present, reinforcement learning training generally relies on actions, environment states, and reward values. In existing solutions, however, the reward values obtained are inaccurate, so a model trained on these actions, environment states, and reward values performs poorly in actual use.
Summary
This application provides a reinforcement learning training method and a related device that can improve the accuracy of the reward value, so that a station that uses the reward value for reinforcement learning training performs better in actual use.
According to a first aspect, a reinforcement learning training method is provided. The method includes: determining a first reward value based on actions of a plurality of stations, where the first reward value is the reward value of a first station among the plurality of stations and is used by the first station for reinforcement learning training; and sending the first reward value to the first station. Because the reward value is determined from the actions of a plurality of stations, its calculation can account for the mutual influence between users. This improves the accuracy of the reward value, so that a station that uses it for reinforcement learning training performs better in actual use.
Optionally, an action of a station includes at least one of the following: the station initiates channel access, the station performs channel selection, the station performs power control, or the station performs rate adaptation.
It should be understood that the first station may be any one of the plurality of stations. In other words, for any one of the plurality of stations, the access point determines that station's reward value based on the actions of the plurality of stations. For example, the access point determines the reward value of station 1 based on the actions of station 1, station 2, and station 3; it determines the reward value of station 2 based on the same three actions; and it determines the reward value of station 3 based on the same three actions.
Optionally, the actions of different stations among the plurality of stations may be identical, partially identical, or entirely different; this is not limited here. For example, if station #1, station #2, and station #3 each initiate channel access, the actions of the three stations are identical. As another example, if station #1 and station #2 initiate channel access and station #3 performs power control, the actions of the three stations are partially identical. As yet another example, if station #1 initiates channel access, station #2 performs rate adaptation, and station #3 performs power control, the actions of the three stations are entirely different.
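The point that every station's reward value is computed from the actions of all stations, rather than from that station's action alone, can be illustrated with a small Python sketch. The function name, the action labels "transmit" and "idle", and the scoring rule are all assumptions made for illustration; the application does not define them.

```python
from typing import Dict


def rewards_from_joint_actions(actions: Dict[str, str]) -> Dict[str, float]:
    """Map every station to a reward computed from ALL stations' actions.

    The scoring rule is a toy assumption for illustration only:
    a lone transmitter earns +1, colliding transmitters earn -1 each,
    and stations that stay idle earn 0.
    """
    transmitters = [sta for sta, act in actions.items() if act == "transmit"]
    rewards: Dict[str, float] = {}
    for sta, act in actions.items():
        if act != "transmit":
            rewards[sta] = 0.0   # station did not contend for the channel
        elif len(transmitters) == 1:
            rewards[sta] = 1.0   # only transmitter, so no collision
        else:
            rewards[sta] = -1.0  # several stations transmitted together
    return rewards


# The same joint action set is the input for every station's reward.
joint = {"STA1": "transmit", "STA2": "transmit", "STA3": "idle"}
print(rewards_from_joint_actions(joint))
```

Under this toy rule, the reward of STA1 changes when STA2 changes its action, which is the kind of mutual influence that a per-station reward would miss.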
Reinforcement learning (RL) describes and solves the problem of an agent learning a policy, while interacting with an environment, to maximize its return or achieve a specific goal. A common model for reinforcement learning is the Markov decision process (MDP), a mathematical model for analyzing decision problems. In reinforcement learning, the agent learns by trial and error: the reward it obtains by interacting with the environment through actions guides its behavior, and the goal is for the agent to obtain the maximum reward. It should be understood that, in this application, an agent may be understood as an AI model comprising a large number of parameters and calculation formulas (or calculation rules). A reward may also be called a reward value, an evaluation, and so on.
In reinforcement learning, the reinforcement signal (that is, the reward) provided by the environment evaluates how good an action is; it does not tell the reinforcement learning system how to produce the correct action. Because the external environment provides little information, the agent must learn from its own experience. In this way, the agent acquires knowledge in an action-evaluation (that is, reward) environment and improves its course of action to adapt to the environment. Common reinforcement learning algorithms include deep Q-learning (DQN) and proximal policy optimization (PPO).
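The action-evaluation loop described above can be made concrete with a single tabular Q-learning update, a simpler relative of the deep Q-learning named in the text. This is a generic textbook sketch and not the training procedure of this application; the state labels, action labels, learning rate `alpha`, and discount factor `gamma` are illustrative assumptions.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def q_update(q: QTable, state, action, reward: float, next_state,
             actions: Sequence, alpha: float = 0.1, gamma: float = 0.9) -> float:
    """Apply one Q-learning step and return the updated Q(state, action).

    The reward only scores the action that was taken; it never says which
    action would have been correct, so the estimate improves by trial and error.
    """
    # Best value the agent currently expects from the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Move the old estimate a fraction alpha toward the bootstrapped target.
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]


# Hypothetical usage: the agent transmitted on a busy channel and was rewarded.
q_table: QTable = defaultdict(float)
q_update(q_table, "busy", "transmit", 1.0, "idle", ["transmit", "wait"])
```

A deep Q-learning agent would replace the table with a neural network, but the role of the reward value in the update is the same.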
Optionally, with reference to the first aspect, determining the first reward value based on the actions of the plurality of stations includes: determining the first reward value based on the actions of the plurality of stations and the times corresponding to those actions. Because the reward value is determined from both the actions and the times corresponding to them, its calculation can account for the mutual influence between users and for the times at which different stations act. This enriches the information used to determine the reward value and improves its accuracy, so that a station that uses the reward value for reinforcement learning training performs better in actual use.
Optionally, with reference to the first aspect, the actions of the plurality of stations correspond to the same time. Because the actions correspond to the same time, the access point can determine a more accurate reward value from the actions and their corresponding times, so that a station that uses the reward value for reinforcement learning training performs better in actual use.
Optionally, with reference to the first aspect, the first reward value is the reward value corresponding to a first time, and the first time is the time corresponding to the action of the first station. Because the reward value corresponds to a specific time, the station can identify the action and environment state corresponding to that time, so that a station that uses the reward value for reinforcement learning training performs better in actual use.
Optionally, with reference to the first aspect, sending the first reward value to the first station includes: sending a broadcast frame to the first station, where the broadcast frame includes the first reward value. Because the first reward value is carried in a broadcast frame, other stations can also receive the broadcast frame.
The broadcast frame may be, for example, a beacon frame or a trigger frame.
Optionally, with reference to the first aspect, the plurality of stations further includes a second station, and the method further includes: if the first station and the second station send messages at the same time and the transmissions fail as a result, determining the reward value of the second station, where the reward value of the second station is the same as the first reward value; and sending the broadcast frame to the second station. When different stations have the same reward value, sending a broadcast frame lets all of them obtain the reward value, which reduces overhead.
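The overhead saving from broadcasting a shared reward value can be sketched as follows. The grouping policy, the function name `plan_reward_frames`, and the "unicast" label for an individually addressed frame are assumptions made for illustration; the application only states that stations with the same reward value can be served by one broadcast frame.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Frame = Tuple[str, List[str], float]  # (frame kind, recipients, reward value)


def plan_reward_frames(rewards: Dict[str, float]) -> List[Frame]:
    """Group stations by reward value and plan one frame per distinct value.

    Stations that share a reward value (for example, two stations whose
    simultaneous transmissions collided) are covered by a single broadcast
    frame instead of one frame each.
    """
    by_value: Dict[float, List[str]] = defaultdict(list)
    for sta, value in rewards.items():
        by_value[value].append(sta)

    frames: List[Frame] = []
    for value, stations in by_value.items():
        kind = "broadcast" if len(stations) > 1 else "unicast"
        frames.append((kind, sorted(stations), value))
    return frames


# Two colliding stations share a reward, so one frame serves both.
print(plan_reward_frames({"STA1": -2.0, "STA2": -2.0}))
```

With two colliding stations, the plan contains one frame rather than two; the saving grows with the number of stations that share a reward value.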
Optionally, with reference to the first aspect, sending the first reward value to the first station includes: sending a response frame of a first message to the first station, where the response frame of the first message includes the first reward value, the first reward value corresponds to a second message, and the second message is received after the first message. The reward value corresponding to the second message can thus be carried in the response frame of the first message. Because the second message is received after the first message, the reward value corresponding to the second message is sent with a delay, which leaves more time for calculating the reward value.
That the first reward value corresponds to the second message may be understood as follows: the first reward value corresponds to the action of the first station in the second message. The time corresponding to the action of the first station is the time at which the access point receives the second message.
In this application, the response frame may be, for example, an acknowledgment (ACK) frame, a clear to send (CTS) frame, or a block acknowledgment (block ACK, BA) frame.
Optionally, with reference to the first aspect, the response frame of the first message further includes identification information of the second message or a timestamp of the second message. This allows the first station to learn which message the first reward value corresponds to.
In one possible implementation, the identification information of the second message may be, for example, the index value of the second message. In another possible implementation, the identification information may be the difference between the index value of the first message and the index value of the second message. For example, if the index value of the first message is 10 and the index value of the second message is 4, the identification information of the second message may be 4 (the index value itself) or 6 (the difference).
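How a station could recover the index of the second message from either form of identification information is sketched below in Python. The function name and the `is_difference` flag are hypothetical; the application does not say how a station learns which of the two encodings is in use.

```python
def resolve_second_index(first_index: int, ident: int, is_difference: bool) -> int:
    """Return the index of the message that the carried reward value refers to.

    first_index   -- index of the message whose response frame was received
    ident         -- identification information carried in that response frame
    is_difference -- True if ident encodes (first index - second index),
                     False if ident is the second message's index itself
    """
    if is_difference:
        return first_index - ident
    return ident


# Worked example from the text: first index 10, second index 4.
assert resolve_second_index(10, 4, is_difference=False) == 4
assert resolve_second_index(10, 6, is_difference=True) == 4
```

Both encodings identify the same message; the difference form may need fewer bits when the two indices are close together, although the application does not state this motivation.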
Optionally, with reference to the first aspect, the timestamp of the second message is the reception time of the second message; or the timestamp of the second message is the difference between the reception time of the first message and the reception time of the second message.
Optionally, with reference to the first aspect, the first reward value $r$ is
$$r = \begin{cases} d_0 - (N-1)\,d_1, & \text{if the first station transmits its message successfully} \\ -N, & \text{if the first station fails to transmit its message} \end{cases}$$
where $d_0$ is the time elapsed since the first station last received an acknowledgment frame of its own, $N$ is the number of stations, $d_1$ is the time elapsed since the first station last overheard an acknowledgment frame of another station, and the other stations are the stations other than the first station among the plurality of stations. The calculation of the reward value thus accounts for the mutual influence between users, which improves the accuracy of the reward value, so that a station that uses the reward value for reinforcement learning training performs better in actual use.
It should be understood that, in this application, when the first station transmits its message successfully and the other stations also transmit successfully, the first reward value is $d_0-(N-1)d_1$. When the first station transmits successfully and the other stations fail, the first reward value is $d_0-(N-1)d_1$. When the first station fails and the other stations also fail, the first reward value is $-N$. When the first station fails and the other stations succeed, the first reward value is $-N$.
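The four cases above reduce to a two-branch rule, because the value depends only on whether the first station's own transmission succeeded. A minimal Python sketch follows; the function and parameter names are assumptions, and the application does not specify the time unit of the intervals.

```python
def first_reward(tx_success: bool, d0: float, d1: float, n_stations: int) -> float:
    """Reward value of the first station, following the four cases above.

    tx_success -- whether the first station's own transmission succeeded
    d0         -- interval since the first station last received its own ACK
    d1         -- interval since it last overheard another station's ACK
    n_stations -- N, the number of stations
    (All names are illustrative; they are not taken from the application.)
    """
    if n_stations < 1:
        raise ValueError("n_stations must be at least 1")
    if tx_success:
        # Same value whether or not the other stations succeeded.
        return d0 - (n_stations - 1) * d1
    # Failure yields -N regardless of the other stations' outcome.
    return -float(n_stations)


# Hypothetical numbers: three stations, d0 = 5.0, d1 = 1.0.
assert first_reward(True, 5.0, 1.0, 3) == 3.0
assert first_reward(False, 5.0, 1.0, 3) == -3.0
```

The outcome of the other stations does not select the branch; their influence enters only through $d_1$ and $N$.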
According to a second aspect, a communication device is provided. The device includes a processing module and a transceiver module. The processing module is configured to determine a first reward value based on actions of a plurality of stations, where the first reward value is the reward value of a first station among the plurality of stations and is used by the first station for reinforcement learning training. The transceiver module is configured to send the first reward value to the first station.
Optionally, with reference to the second aspect, an action of a station includes at least one of the following: the station initiates channel access, the station performs channel selection, the station performs power control, or the station performs rate adaptation.
Optionally, with reference to the second aspect, when determining the first reward value based on the actions of the plurality of stations, the processing module is configured to determine the first reward value based on the actions of the plurality of stations and the times corresponding to those actions.
Optionally, with reference to the second aspect, the actions of the plurality of stations correspond to the same time.
Optionally, with reference to the second aspect, the first reward value is the reward value corresponding to a first time, and the first time is the time corresponding to the action of the first station.
Optionally, with reference to the second aspect, when sending the first reward value to the first station, the transceiver module is configured to send a broadcast frame to the first station, where the broadcast frame includes the first reward value.
Optionally, with reference to the second aspect, the plurality of stations further includes a second station. The processing module is further configured to: if the first station and the second station send messages at the same time and the transmissions fail as a result, determine the reward value of the second station, where the reward value of the second station is the same as the first reward value. The transceiver module is further configured to send the broadcast frame to the second station.
Optionally, with reference to the second aspect, when sending the first reward value to the first station, the transceiver module is configured to send a response frame of a first message to the first station, where the response frame of the first message includes the first reward value, the first reward value corresponds to a second message, and the second message is received after the first message.
Optionally, with reference to the second aspect, the response frame of the first message further includes identification information of the second message or a timestamp of the second message.
Optionally, with reference to the second aspect, the timestamp of the second message is the reception time of the second message; or the timestamp of the second message is the difference between the reception time of the first message and the reception time of the second message.
Optionally, with reference to the second aspect, the first reward value $r$ is
$$r = \begin{cases} d_0 - (N-1)\,d_1, & \text{if the first station transmits its message successfully} \\ -N, & \text{if the first station fails to transmit its message} \end{cases}$$
where $d_0$ is the time elapsed since the first station last received an acknowledgment frame of its own, $N$ is the number of stations, $d_1$ is the time elapsed since the first station last overheard an acknowledgment frame of another station, and the other stations are the stations other than the first station among the plurality of stations.
According to a third aspect, a chip is provided. The chip includes at least one processor and an interface. The processor is configured to read and execute instructions stored in a memory, and when the instructions are run, the chip performs the method according to any implementation of the first aspect.
According to a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program that includes program instructions, and when the program instructions are executed by a computer, the computer performs the method according to any implementation of the first aspect.
According to a fifth aspect, a communication device is provided, including a processor, a memory, an input interface, and an output interface. The input interface is configured to receive information from communication devices other than this communication device, and the output interface is configured to output information to communication devices other than this communication device. The processor invokes a computer program stored in the memory to implement the method according to any implementation of the first aspect.
In a possible design, the communication device may be a chip that implements the method of the first aspect, or a device that contains the chip.
According to a sixth aspect, a computer program product is provided. When a computer reads and executes the computer program product, the computer performs the method according to any implementation of the first aspect.
According to a seventh aspect, a communication system is provided, including an access point and/or a station configured to implement the method according to any implementation of the first aspect.
Brief Description of the Drawings
The following briefly describes the accompanying drawings used in describing the embodiments.
In the drawings:
Figure 1 is a network architecture diagram of a WLAN according to an embodiment of this application;
Figure 2 is a schematic diagram of the principle of reinforcement learning according to an embodiment of this application;
Figure 3 is a schematic diagram of the hardware structure of a communication device applicable to the embodiments of this application;
Figure 4 is a schematic flowchart of a reinforcement learning training method according to an embodiment of this application;
Figure 5 is a schematic diagram of delayed feedback of a reward value according to an embodiment of this application;
Figure 6 is a diagram of the beneficial effects provided by an embodiment of this application;
Figure 7 is a schematic structural diagram of a communication device according to an embodiment of this application.
Detailed Description
The following describes the technical solutions in the embodiments of this application with reference to the accompanying drawings. The terms "system" and "network" may be used interchangeably in the embodiments of this application. Unless otherwise specified, "/" indicates an "or" relationship between the associated objects; for example, A/B may represent A or B. The term "and/or" in this application merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may represent three cases: A alone, both A and B, and B alone, where A and B may each be singular or plural. In the description of this application, unless otherwise specified, "a plurality of" means two or more. "At least one of the following" or a similar expression refers to any combination of the listed items, including any combination of single or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c may be single or multiple. In addition, to describe the technical solutions clearly, the embodiments of this application use words such as "first" and "second" to distinguish between network elements and between identical or similar items whose functions are basically the same. A person skilled in the art will understand that words such as "first" and "second" do not limit quantity or execution order, and do not necessarily indicate that the items are different.
Reference in this application to "one embodiment", "some embodiments", or the like means that a particular feature, structure, or characteristic described in connection with that embodiment is included in one or more embodiments of this application. Accordingly, phrases such as "in one embodiment", "in some embodiments", "in some other embodiments", and "in still other embodiments" appearing in different places in this specification do not necessarily all refer to the same embodiment; rather, they mean "one or more but not all embodiments", unless specifically emphasized otherwise. The terms "include", "comprise", "have", and their variants all mean "including but not limited to", unless specifically emphasized otherwise.
The following specific implementations describe the objectives, technical solutions, and beneficial effects of this application in further detail. It should be understood that they are merely specific implementations of this application and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made on the basis of the technical solutions of this application shall fall within the scope of protection of this application.
In the embodiments of this application, unless otherwise specified or logically contradictory, terms and descriptions are consistent across embodiments and may be cross-referenced, and technical features of different embodiments may be combined, according to their inherent logical relationships, to form new embodiments.
It should be understood that the embodiments of this application are applicable to wireless local area network (WLAN) scenarios and to IEEE 802.11 system standards, such as 802.11a/b/g, 802.11n, 802.11ac, and 802.11ax, or their next generation, such as 802.11be or a later standard. The embodiments are also applicable to the Internet of things (IoT), vehicle to X (V2X) networks, narrowband Internet of things (NB-IoT) systems, and other short-range communication systems such as Bluetooth and ultra-wideband (UWB). The embodiments may further be applied to other possible communication systems, for example, a long term evolution (LTE) system, an LTE frequency division duplex (FDD) system, an LTE time division duplex (TDD) system, a universal mobile telecommunication system (UMTS), a worldwide interoperability for microwave access (WiMAX) communication system, and a future 6G communication system.
The following uses a WLAN scenario as an example. WLAN standards began with 802.11a/g and have progressed through 802.11n, 802.11ac, and 802.11ax to 802.11be and Wi-Fi 8, which are currently under discussion. 802.11n is also called high throughput (HT); 802.11ac is also called very high throughput (VHT); 802.11ax is also called high efficiency (HE) or Wi-Fi 6; and 802.11be is also called extremely high throughput (EHT) or Wi-Fi 7. Standards that predate HT, such as 802.11a/b/g, are collectively called non-high throughput (Non-HT).
Refer to Figure 1, which is a network architecture diagram of a WLAN according to an embodiment of this application. In Figure 1, the WLAN includes one wireless access point (AP) and two stations (STAs) as an example. A STA associated with an AP can receive wireless frames sent by the AP and can send wireless frames to the AP. The embodiments of this application also apply to communication between APs; for example, APs may communicate with one another through a distributed system (DS). The embodiments also apply to communication between STAs. It should be understood that the numbers of APs and STAs in Figure 1 are only examples, and there may be more or fewer.
A STA in the embodiments of this application may be any of various user terminals, user devices, access devices, subscriber stations, subscriber units, mobile stations, user agents, or user equipment with a wireless communication function, or may go by another name. A user terminal may include various handheld devices, vehicle-mounted devices, wearable devices, and computing devices with a wireless communication function, or other processing devices connected to a wireless modem, as well as various forms of user equipment (UE), mobile stations (MS), terminals, terminal equipment, portable communication devices, handsets, portable computing devices, entertainment devices, gaming devices or systems, global positioning system devices, or any other suitable device configured for network communication over a wireless medium. For example, a STA may be a router, a switch, or a bridge. For ease of description, the devices mentioned above are collectively referred to as stations or STAs.
The APs and STAs in the embodiments of this application may be APs and STAs that conform to the IEEE 802.11 system standards. An AP is a device deployed in a wireless communication network to provide a wireless communication function for its associated STAs, and it can serve as the hub of the communication system. An AP is usually a network-side product that supports the MAC and PHY of the 802.11 system standards, for example, a communication device such as a base station, a router, a gateway, a repeater, a communication server, a switch, or a bridge, where the base station may include various forms of macro base stations, micro base stations, and relay stations. For ease of description, the devices mentioned above are collectively referred to as APs. A STA is usually a terminal product that supports the media access control (MAC) and physical layer (PHY) of the 802.11 system standards, such as a mobile phone or a laptop computer.
The embodiments of this application may also be applied to scenarios in which one node performs data transmission with one or more nodes, to single-user uplink/downlink data transmission scenarios, to multi-user uplink/downlink data transmission scenarios, and to device-to-device (D2D) data transmission scenarios. Any of these nodes may be an AP or a STA.
This solution may be applied to a wireless communication system, which may be a wireless local area network or a cellular network. The solution may be implemented by a communication device in the wireless communication system, or by a chip or processor in the communication device. The communication device may be a wireless communication device that supports parallel transmission over multiple links, referred to for example as a multi-link device or a multi-band device. Compared with a device that supports transmission over only a single link, a multi-link device has higher transmission efficiency and higher throughput. A multi-link device includes one or more affiliated stations (affiliated STAs). An affiliated STA is a logical station and can operate on one link. An affiliated station may be an access point (AP) or a non-access point station (non-AP STA). For ease of description, in this application a multi-link device whose affiliated stations are APs may be called a multi-link AP, a multi-link AP device, or an AP multi-link device, and a multi-link device whose affiliated stations are non-AP STAs may be called a multi-link STA, a multi-link STA device, or a STA multi-link device. It should be understood that the stations in a multi-link device may each operate on a separate link, but multiple stations are also allowed to operate on the same link.
It should be noted that, in this application, the AP and the STA may have certain artificial intelligence (AI) capabilities. For example, they may use a neural network for inference and decision-making, and may also train a neural network. It should be understood that this application mainly concerns the training of reinforcement learning, which may be, for example, deep reinforcement learning (DRL).
Reinforcement learning (RL) describes and solves the problem of an agent learning a policy, while interacting with an environment, to maximize its reward or achieve a specific goal. A common model for reinforcement learning is the Markov decision process (MDP), a mathematical model for analyzing decision problems. In reinforcement learning the agent learns by trial and error: the rewards it obtains by acting on the environment guide its behavior, and the goal is for the agent to obtain the maximum reward. It should be understood that, in this application, the agent can be understood as an AI model that includes a large number of parameters and calculation formulas (or calculation rules). The reward may also be called a reward value, an evaluation, and so on.
In reinforcement learning, the reinforcement signal (that is, the reward) provided by the environment evaluates how good an action was, rather than telling the reinforcement learning system how to produce the correct action. Because the external environment provides little information, the agent must learn from its own experience. In this way, the agent gains knowledge in an action-evaluation (that is, reward) environment and improves its course of action to adapt to the environment. Common reinforcement learning algorithms include deep Q-learning (DQN) and proximal policy optimization (PPO).
Refer to Figure 2, which is a schematic diagram of the reinforcement learning principle provided by an embodiment of this application. As shown in Figure 2, reinforcement learning mainly involves five elements: the agent, the environment, the state, the action, and the reward. The input of the agent is the state and its output is the action. The training process is as follows: the agent interacts with the environment multiple times, and the action, state, and reward of each interaction are obtained; these multiple (action, state, reward) tuples are then used as training data to train the agent once. The same process is repeated for the next round of training until a convergence condition is met.
For example, the process of obtaining the action, state, and reward of one interaction is shown in Figure 2. The current state S0 of the environment is input to the agent, and the action A0 output by the agent is obtained. The reward R0 of this interaction is calculated from the relevant performance indicators of the environment under action A0. At this point the action A0, state S0, and reward R0 of this interaction have been obtained, and they are recorded for later use in training the agent. The next state S1 of the environment under action A0 may also be recorded so that the agent can carry out its next interaction with the environment.
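The interaction described above can be sketched in Python as a loop that records one tuple per interaction. This is only an illustrative sketch: `collect_transitions`, `toy_policy`, and `toy_env_step` are hypothetical names, and the two-state channel is an assumption made so the example runs; none of them comes from the application.

```python
from typing import Callable, List, Tuple

# One recorded interaction. The tuple order (state, action, reward) is a choice
# made for this sketch; the text only requires that all three are recorded.
Transition = Tuple[int, int, float]


def collect_transitions(
    policy: Callable[[int], int],
    env_step: Callable[[int, int], Tuple[float, int]],
    initial_state: int,
    num_steps: int,
) -> List[Transition]:
    """Let the agent interact with the environment and record each interaction."""
    transitions: List[Transition] = []
    state = initial_state
    for _ in range(num_steps):
        action = policy(state)                        # state S0 in, action A0 out
        reward, next_state = env_step(state, action)  # reward R0 and next state S1
        transitions.append((state, action, reward))   # record this interaction
        state = next_state                            # S1 drives the next interaction
    return transitions


# Hypothetical toy setup: state 0 = channel idle, 1 = channel busy;
# action 1 = transmit, 0 = wait.
def toy_policy(state: int) -> int:
    return 1 if state == 0 else 0


def toy_env_step(state: int, action: int) -> Tuple[float, int]:
    if action == 1:
        reward = 1.0 if state == 0 else -1.0
    else:
        reward = 0.0
    return reward, 1 - state  # the toy channel simply alternates idle/busy


history = collect_transitions(toy_policy, toy_env_step, initial_state=0, num_steps=4)
```

The recorded `history` is the training data for one training round; a real agent would then update its parameters from it and repeat until convergence.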
In this application, the action is an action of a station, and may include at least one of the following: the station initiates channel access, the station performs channel selection, the station performs power control, or the station performs rate adaptation.
Optionally, when the action is the station initiating channel access, the state may be a carrier sensing result, such as one or more of channel quality, packet loss rate, and the like. When the action is the station performing channel selection, the state may be the load on the channel or similar. When the action is the station performing power control, the state may be one or more of the station's location, channel quality, throughput, and the like. When the action is the station performing rate adaptation, the state may be a carrier sensing result, such as one or more of channel quality, packet loss rate, and the like.
Optionally, each device in Figure 1 (for example, the AP or a STA) may be implemented by a single device, implemented jointly by multiple devices, or implemented as a functional module within a device; the embodiments of this application do not specifically limit this. It can be understood that the above functions may be network elements in a hardware device, software functions running on dedicated hardware, or virtualized functions instantiated on a platform (for example, a cloud platform).
For example, each device in Figure 1 may be implemented by the communication apparatus 300 in Figure 3. Figure 3 is a schematic diagram of the hardware structure of a communication apparatus applicable to the embodiments of this application. The communication apparatus 300 includes at least one processor 301, a communication line 302, a memory 303, and at least one communication interface 304.
The processor 301 may be one or more of a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a neural-network processing unit (NPU), and the like. The processor 301 may also be one or more integrated circuits used to control the execution of the program of the solution of this application.
The communication line 302 may include a path for transferring information between the above components.
The communication interface 304 is any transceiver-type apparatus (such as an antenna) used to communicate with other devices or with a communication network, such as Ethernet, a RAN, or a wireless local area network (WLAN).
The memory 303 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without being limited to these. The memory may exist independently and be connected to the processor through the communication line 302, or it may be integrated with the processor. The memory provided in the embodiments of this application may generally be non-volatile.
The memory 303 stores the computer-executable instructions for carrying out the solution of this application, and their execution is controlled by the processor 301. The processor 301 executes the computer-executable instructions stored in the memory 303, thereby implementing the methods provided in the following embodiments of this application.
Optionally, the computer-executable instructions in the embodiments of this application may also be referred to as application program code; the embodiments of this application do not specifically limit this.
In a possible implementation, the processor 301 may include one or more CPUs, for example CPU0 and CPU1 in Figure 3.
In a possible implementation, the communication apparatus 300 may include multiple processors, for example the processor 301 and the processor 307 in Figure 3. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).
In a possible implementation, the communication apparatus 300 may further include an output device 305 and an input device 306. The output device 305 communicates with the processor 301 and can display information in a variety of ways. For example, the output device 305 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector. The input device 306 communicates with the processor 301 and can receive user input in a variety of ways. For example, the input device 306 may be a mouse, a keyboard, a touchscreen device, or a sensing device.
After the communication apparatus is powered on, the processor 301 can read the software program in the memory 303, interpret and execute its instructions, and process its data. When data needs to be sent wirelessly, the processor 301 performs baseband processing on the data to be sent and outputs a baseband signal to the radio frequency circuit; the radio frequency circuit performs radio frequency processing on the baseband signal and sends the resulting radio frequency signal out through the antenna as electromagnetic waves. When data is sent to the communication apparatus, the radio frequency circuit receives a radio frequency signal through the antenna, converts it into a baseband signal, and outputs the baseband signal to the processor 301, which converts the baseband signal into data and processes the data.
In another implementation, the radio frequency circuit and the antenna may be arranged independently of the processor that performs baseband processing. For example, in a distributed scenario, the radio frequency circuit and the antenna may be deployed remotely, separate from the communication apparatus.
Optionally, the neural-network processing unit may include, for example, a training module and an inference module that are not shown in Figure 3. The inputs of the training module may include, for example, actions, states, and reward values, and its output is the neural network parameters. In general, the trained neural network parameters can be fed back to the inference module. It should be understood that the neural-network processing unit may interact with the various modules of the communication apparatus 300, for example by controlling data transmission on the communication interface to save energy, or by interacting with the antenna to control its orientation. In a possible implementation, the communication apparatus 300 may further include a media access control (MAC) module that is not shown in Figure 3. The neural-network processing unit may also interact with the MAC to control channel access, channel selection, spatial multiplexing decisions, and the like.
The communication apparatus 300 may be a general-purpose device or a special-purpose device. In a specific implementation, the communication apparatus 300 may be a desktop computer, a portable computer, a network server, a personal digital assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, an embedded device, or a device with a structure similar to that in Figure 3. The embodiments of this application do not limit the type of the communication apparatus 300.
The technical solutions provided by the embodiments of this application are described below with reference to the accompanying drawings.
Refer to Figure 4, which is a schematic flowchart of a reinforcement learning training method provided by an embodiment of this application. As shown in Figure 4, the method includes but is not limited to the following steps:
401. The access point determines a first reward value according to the actions of multiple stations, where the first reward value is the reward value of a first station among the multiple stations.
It should be understood that the first station may be any one of the multiple stations. This means that, for any one of the multiple stations, the access point determines that station's reward value according to the actions of all of the multiple stations. For example, the access point determines the reward value of station 1 according to the actions of station 1, station 2, and station 3; it determines the reward value of station 2 according to the actions of station 1, station 2, and station 3; and it determines the reward value of station 3 according to the actions of station 1, station 2, and station 3.
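The point that each station's reward value depends on the actions of all stations can be sketched as follows. This is a minimal illustration only: `rewards_for_all` and `toy_reward` are hypothetical names, and the toy reward rule (reward a lone transmitter, penalize colliding transmitters) is an assumption made for the example, not the reward defined in this application.

```python
from typing import Callable, Dict


def toy_reward(station: str, actions: Dict[str, str]) -> float:
    """Hypothetical reward rule that looks at the joint actions of all stations."""
    transmitters = [s for s, a in actions.items() if a == "access"]
    if actions[station] != "access":
        return 0.0                      # this station did not transmit
    return 1.0 if len(transmitters) == 1 else -1.0  # alone vs. colliding


def rewards_for_all(
    actions: Dict[str, str],
    reward_fn: Callable[[str, Dict[str, str]], float],
) -> Dict[str, float]:
    """For every station, compute its reward from the actions of ALL stations."""
    return {station: reward_fn(station, actions) for station in actions}


alone = rewards_for_all(
    {"sta1": "access", "sta2": "idle", "sta3": "idle"}, toy_reward
)
collided = rewards_for_all(
    {"sta1": "access", "sta2": "access", "sta3": "idle"}, toy_reward
)
```

Station 1 takes the same action in both calls, yet its reward differs because the action of station 2 changed, which is the inter-station influence the access point accounts for.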
Optionally, in this application, the access point may learn the action of the first station in either of the following ways. It should be understood that whether the access point uses manner 1.1 or manner 1.2 may depend on the implementation of the access point, on a prior agreement, or on a definition in a standard.
Manner 1.1: Before step 401, the access point receives the action of the first station sent by the first station. In one example, a message sent by the first station is received by the access point; because the message includes the action of the first station, the access point learns the action from the message. In another example, a message sent by the first station is not received by the access point; because the message is lost, the access point cannot learn the action of the first station from it, so the first station may send the action of the lost message to the access point again. That is, when the message sent by the first station is not received by the access point, the access point learns the action of the first station from the retransmitted action of the lost message.
Manner 1.2: The access point determines the action of the first station by itself. In one example, the first station does not send a message, so the access point receives no message from it; in this case the access point may determine the action of the first station by itself. In another example, a message sent by the first station is received by the access point, and the message (for example, its header) includes one or more of rate information, duration information, and the like. The access point can then determine the action of the first station from one or more of these items. It should be understood that when the access point determines the action of the first station by itself, the action of the first station may be empty.
It can be seen that either manner 1.1 or manner 1.2 enables the access point to learn the action of the station, which prepares the access point to determine the reward value afterwards.
Optionally, the actions of different stations among the multiple stations may be entirely the same, partly the same, or entirely different; this is not limited here. In one example, station #1, station #2, and station #3 each initiate channel access, so the actions of the three stations are entirely the same. In another example, station #1 and station #2 initiate channel access and station #3 performs power control, so the actions of the three stations are partly the same. In yet another example, station #1 initiates channel access, station #2 performs rate adaptation, and station #3 performs power control, so the actions of the three stations are entirely different.
The first reward value may be used by the first station for reinforcement learning training.
Optionally, step 401 may include: the access point determines the first reward value according to the actions of the multiple stations and the times corresponding to those actions. It can be seen that determining the reward value in this way lets the calculation account both for the mutual influence between users and for the times at which the different stations acted. This enriches the information used to determine the reward value and improves its accuracy, so that a station that uses the reward value for reinforcement learning training can perform better in practice.
It should be noted that, when a message sent by the first station is received by the access point, the message includes the action of the first station, so the time corresponding to that action may be, for example, the time at which the access point receives the message. When a message sent by the first station is not received by the access point, the time corresponding to the action may be, for example, the time at which the lost message was sent. When the access point determines the action of the first station by itself, the time corresponding to the action may be, for example, the time at which the first station initiates channel access.
The times corresponding to the actions of the multiple stations are the same. It can be seen that, because the actions of the multiple stations correspond to the same time, the reward value that the access point determines from those actions and their times is more accurate, so that a station that uses the reward value for reinforcement learning training can perform better in practice.
Optionally, the first reward value is the reward value corresponding to a first time, and the first time is the time corresponding to the action of the first station. It can be seen that, because the reward value is tied to a specific time, the station can identify the action and environment state that correspond to that time, so that a station that uses the reward value for reinforcement learning training can perform better in practice.
In this application, the first reward value is:

\[
\text{first reward value} =
\begin{cases}
d_0 - (N-1)\,d_1, & \text{if the first station transmits its message successfully} \\
-N, & \text{if the first station fails to transmit its message}
\end{cases}
\]

Here, \(d_0\) is the time interval since the first station last received an acknowledgment frame for the first station, \(N\) is the number of stations, and \(d_1\) is the time interval since the first station last overheard an acknowledgment frame for another station, where the other stations are the stations among the multiple stations other than the first station.

It should be understood that, in this application:
when the first station transmits successfully and the other stations transmit successfully, the first reward value is \(d_0 - (N-1)\,d_1\);
when the first station transmits successfully and the other stations fail, the first reward value is \(d_0 - (N-1)\,d_1\);
when the first station fails and the other stations fail, the first reward value is \(-N\);
when the first station fails and the other stations transmit successfully, the first reward value is \(-N\).
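The four cases above can be sketched as a small Python function. This is an illustrative sketch only: the function name `first_reward` and its parameter names are hypothetical, and the time unit of `d0` and `d1` is not specified in the text, so the numbers below are arbitrary.

```python
def first_reward(first_success: bool, d0: float, d1: float, n: int) -> float:
    """Reward of the first station, following the four cases in the text.

    first_success: whether the first station's own transmission succeeded.
    d0: time since the first station last received an acknowledgment frame
        for itself.
    d1: time since the first station last overheard an acknowledgment frame
        for another station.
    n:  number of stations.

    In the four cases, the outcome of the other stations does not change the
    expression used, so it is not a parameter here; the other stations affect
    the result only through d1.
    """
    if first_success:
        return d0 - (n - 1) * d1
    return -float(n)


# Arbitrary illustrative values (assumed, not from the application).
r_success = first_reward(True, d0=2.0, d1=5.0, n=3)   # 2.0 - 2 * 5.0
r_failure = first_reward(False, d0=2.0, d1=5.0, n=3)  # -3.0
```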
It can be seen that the calculation of the reward value can account for the mutual influence between users, which improves the accuracy of the reward value, so that a station that uses the reward value for reinforcement learning training can perform better in practice.
Optionally, the method may further include step 402.
402. The access point sends the first reward value to the first station.
Correspondingly, the first station receives the first reward value sent by the access point.
Optionally, in this application, step 402 may be implemented in either of the following ways. It should be understood that whether the access point uses manner 2.1 or manner 2.2 to send the first reward value may depend on the implementation of the access point, on a prior agreement, or on a definition in a standard.
Manner 2.1: The access point sends a broadcast frame or a multicast frame to the first station, and correspondingly the first station receives the broadcast frame or multicast frame sent by the access point. The broadcast frame or multicast frame includes the first reward value, and the multicast frame may also include the address of the first station. The broadcast frame may be, for example, a beacon frame or a trigger frame. It should be understood that, in the case of manner 2.1, the access point may learn the action of the first station from the action of the lost message as in manner 1.1, or it may learn the action of the first station using manner 1.2. It can be seen that, because the first reward value is carried in a broadcast frame or a multicast frame, other stations can also receive that frame.
Manner 2.2: The access point sends a response frame for a first message to the first station, and correspondingly the first station receives the response frame for the first message sent by the access point. The response frame for the first message includes the first reward value, the first reward value corresponds to a second message, and the second message is received after the first message.
That the first reward value corresponds to the second message can be understood as follows: the first reward value corresponds to the action of the first station carried in the second message, and the time corresponding to that action is the time at which the access point receives the second message. In addition, in this application, the response frame may be, for example, an acknowledgment (ACK) frame, a clear to send (CTS) frame, or a block acknowledgment (block ACK, BA) frame.
For example, refer to Figure 5, which is a schematic diagram of delayed feedback of a reward value provided by an embodiment of this application. As shown in Figure 5, in step 501, station 1 sends message 1 to the access point, and message 1 includes the action of station 1. In step 502, the access point sends a response frame for message 1 to station 1. In step 503, station 1 sends message 2 to the access point. In step 504, the access point sends a response frame for message 2 to station 1. The response frame for message 2 includes the reward value corresponding to message 1, which is determined according to the actions of multiple stations, station 1 being one of them. That is, because the access point needs time to calculate the reward value corresponding to message 1, it can carry that reward value in the response frame for message 2.
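The Figure 5 exchange can be sketched in Python as an access point that answers each message with a response frame carrying the reward for the station's previous message. This is a simplified illustration: the class `DelayedRewardAP`, the dictionary used as a response frame, and the stand-in reward rule are all hypothetical, and a real reward would be computed from the actions of multiple stations rather than from one action.

```python
from typing import Callable, Dict, Optional


class DelayedRewardAP:
    """Toy access point that reports each reward one message late."""

    def __init__(self, compute_reward: Callable[[str, str], float]) -> None:
        self._compute_reward = compute_reward
        # Per station: the action of the last message whose reward is still unsent.
        self._pending: Dict[str, str] = {}

    def on_message(self, station: str, action: str) -> Dict[str, object]:
        """Handle one message and build its response frame."""
        previous = self._pending.get(station)
        reward: Optional[float] = None
        if previous is not None:
            # The reward for the earlier message is ready by now (steps 503/504).
            reward = self._compute_reward(station, previous)
        self._pending[station] = action  # this message's reward goes out next time
        return {"ack": True, "reward_for_previous": reward}


# Stand-in reward rule, assumed only for the example.
ap = DelayedRewardAP(lambda station, action: 1.0 if action == "access" else 0.0)

resp1 = ap.on_message("sta1", "access")         # steps 501/502: no reward yet
resp2 = ap.on_message("sta1", "power_control")  # steps 503/504: reward for message 1
```

Deferring the reward by one exchange gives the access point the interval between the two messages to finish the calculation.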
可以看出,方式2.2中,第二报文对应的回报值可以在第一报文的响应帧中携带,因为第二报文在第一报文之后接收,所以实现了延迟发送第二报文对应的回报值,这为回报值的计算提供了更多的时间。It can be seen that in method 2.2, the return value corresponding to the second message can be carried in the response frame of the first message. Because the second message is received after the first message, delayed sending of the second message is achieved. The corresponding return value, which provides more time for the calculation of the return value.
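The delayed feedback of FIG. 5 can be sketched as a small simulation. This is a minimal illustration that follows the FIG. 5 walkthrough (the reward for message 1 rides in the response frame of message 2); the class names, field names, and the placeholder reward rule are assumptions for illustration and are not taken from the application.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ResponseFrame:
    """Response frame (e.g. an ACK) sent by the access point; field names are hypothetical."""
    acked_index: int                 # index of the message being acknowledged
    reward_for_index: Optional[int]  # index of the earlier message the reward belongs to
    reward: Optional[float]          # reward value for that earlier message, if available


class AccessPoint:
    def __init__(self) -> None:
        # station id -> (message index, reward) waiting to be delivered
        self._pending: Dict[str, Tuple[int, float]] = {}

    def receive(self, station_id: str, index: int, action: str) -> ResponseFrame:
        # Fetch the reward computed for the previous message from this station, if any.
        previous = self._pending.pop(station_id, None)
        # Store the reward for the current message; it is delivered with a later response frame.
        self._pending[station_id] = (index, self._compute_reward(action))
        if previous is None:
            return ResponseFrame(index, None, None)
        return ResponseFrame(index, previous[0], previous[1])

    @staticmethod
    def _compute_reward(action: str) -> float:
        # Placeholder rule (assumption): the real value depends on the actions of multiple stations.
        return 1.0 if action == "access" else 0.0


ap = AccessPoint()
first = ap.receive("station1", 1, "access")  # steps 501-502: no reward is available yet
second = ap.receive("station1", 2, "wait")   # steps 503-504: carries the reward for message 1
```

In this sketch the response to message 2 reports `reward_for_index == 1`, which gives the access point one full message exchange to finish the calculation.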
Optionally, for method 2.1, the multiple stations may further include a second station, and the method further includes: if the first station and the second station send messages at the same time and the transmission fails, the access point determines a reward value of the second station, where the reward value of the second station is the same as the first reward value; and the access point sends the broadcast frame to the second station. That the first station and the second station send messages at the same time and the transmission fails can be understood as follows: because the first station and the second station send messages at the same time, the transmission of the first station fails and the transmission of the second station also fails. It can be seen that, when different stations have the same reward value, sending a broadcast frame allows all of these stations to obtain the reward value, which reduces overhead.
For example, at time t, the actions of station 1 and station 2 are both channel access, that is, station 1 and station 2 send messages to the access point at the same time, and the messages fail to be sent because of the collision. To penalize the behavior of station 1 and station 2 at time t, the AP can set the reward value to a negative value with a large magnitude, for example -100. In this case, the reward value of station 1 and the reward value of station 2 are both -100. The approach with the lowest overhead is then to broadcast, so that both station 1 and station 2 can obtain the reward value.
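The collision example can be sketched as follows. The penalty of -100 comes from the text; the function names, the action label `"access"`, and the +1 reward for a single successful sender are assumptions made only to keep the sketch complete.

```python
from typing import Dict, List, Tuple

COLLISION_PENALTY = -100.0  # example value given in the text


def rewards_at_time(actions: Dict[str, str]) -> Dict[str, float]:
    """Map the actions taken by the stations at one time instant to reward values."""
    senders = [station for station, action in actions.items() if action == "access"]
    if len(senders) > 1:
        # Collision: every transmitting station receives the same penalty.
        return {station: COLLISION_PENALTY for station in senders}
    # Assumption for illustration only: a single sender succeeds and earns +1.
    return {station: 1.0 for station in senders}


def frames_to_send(rewards: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (destination, reward) pairs; use one broadcast frame when all rewards are equal."""
    values = set(rewards.values())
    if len(rewards) > 1 and len(values) == 1:
        return [("broadcast", values.pop())]
    return list(rewards.items())


collision = rewards_at_time({"station1": "access", "station2": "access"})
frames = frames_to_send(collision)
```

With two colliding stations, `frames` holds a single broadcast entry instead of two unicast frames, which is the overhead saving the text describes.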
Optionally, for method 2.2, the response frame of the first message may further include either of the following. It should be understood that whether the access point carries the first type or the second type in the response frame of the first message may depend on the implementation of the access point, a prior agreement, or the definition in a standard.
Type 1: identification information of the second message. In one possible implementation, the identification information of the second message may be, for example, the index value of the second message. In another possible implementation, the identification information of the second message may be, for example, the difference between the index value of the first message and the index value of the second message. For example, if the index value of the first message is 10 and the index value of the second message is 4, the identification information of the second message may be 4 (the index value) or 6 (the difference).
Type 2: timestamp of the second message. In one possible implementation, the timestamp of the second message may be, for example, the reception time of the second message. In another possible implementation, the timestamp of the second message may be, for example, the difference between the reception time of the first message and the reception time of the second message. In this application, the reception time of a message can be understood as the time at which the access point receives the message.
It can be seen that, because the response frame of the first message also includes the identification information of the second message or the timestamp of the second message, the first station can determine which message the first reward value corresponds to.
It should be understood that, in this application, for the information carried in the response frame of the first message, refer to, for example, Table 1 or Table 2. In Table 1, the response frame of the first message includes the first reward value and the identification information of the second message. In Table 2, the response frame of the first message includes the first reward value and the timestamp of the second message.
Table 1
Table 2
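The way a station might use the identification information to attribute a reward value can be sketched as follows. The arithmetic reproduces the worked example above (index 10 for the first message, index 4 for the second message); the helper names and the buffer structure are hypothetical.

```python
from typing import Dict, Tuple


def resolve_index(first_index: int, identification: int, is_difference: bool) -> int:
    """Recover the index of the second message from the identification information.

    If is_difference is True, identification is (first index - second index);
    otherwise it is the index of the second message itself.
    """
    return first_index - identification if is_difference else identification


class ExperienceBuffer:
    """Station-side store of (state, action) pairs keyed by message index (hypothetical)."""

    def __init__(self) -> None:
        self._sent: Dict[int, Tuple[str, str]] = {}

    def record(self, index: int, state: str, action: str) -> None:
        self._sent[index] = (state, action)

    def match(self, index: int, reward: float) -> Tuple[str, str, float]:
        # Pair the reward with the state and action stored for that message.
        state, action = self._sent.pop(index)
        return state, action, reward


buffer = ExperienceBuffer()
buffer.record(4, "channel_idle", "access")
sample = buffer.match(resolve_index(10, 6, is_difference=True), -100.0)
```

The timestamp variant would work the same way, with reception times (or their difference) in place of index values.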
Optionally, the method may further include step 403.
403. The first station performs reinforcement learning training based on the first reward value.
Optionally, step 403 may include: the first station obtains the state and the action corresponding to the first reward value; and the first station performs reinforcement learning training based on the first reward value, the state, and the action. That the first station performs reinforcement learning training based on the first reward value, the state, and the action can be understood as follows: the first station performs reinforcement learning training on an agent based on the first reward value, the state, and the action.
Optionally, in this application, different stations among the multiple stations may use different reinforcement learning algorithms for reinforcement learning training. For example, station 1 uses DQN for reinforcement learning training, and station 2 uses PPO for reinforcement learning training.
It can be seen that, because the reward value is determined based on the actions of multiple stations, the calculation of the reward value can take the mutual influence between users into account. This improves the accuracy of the reward value, so that a station that uses the reward value for reinforcement learning training can achieve better results in actual use.
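Step 403 can be illustrated with a deliberately simplified update rule. The application names DQN and PPO as possible algorithms; the sketch below swaps in a tabular running-average value update instead, only to show how a (state, action, reward) triple feeds training. The class name, state labels, and learning rate are assumptions.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple


class TabularAgent:
    """Simplified stand-in for the agent trained by the first station (not DQN or PPO)."""

    def __init__(self, learning_rate: float = 0.1) -> None:
        self.learning_rate = learning_rate
        # (state, action) -> estimated value, initialised to 0.0
        self.q: DefaultDict[Tuple[str, str], float] = defaultdict(float)

    def train(self, state: str, action: str, reward: float) -> float:
        """Move the estimate for (state, action) toward the received reward value."""
        key = (state, action)
        self.q[key] += self.learning_rate * (reward - self.q[key])
        return self.q[key]


agent = TabularAgent(learning_rate=0.5)
after_first = agent.train("channel_idle", "access", -100.0)   # 0 + 0.5 * (-100 - 0)
after_second = agent.train("channel_idle", "access", -100.0)  # -50 + 0.5 * (-100 + 50)
```

Repeated penalties drive the estimate for accessing the channel in that state toward -100, so the agent gradually learns to avoid the colliding action.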
Refer to FIG. 6, which illustrates the beneficial effects provided by an embodiment of this application. In FIG. 6, different stations among the multiple stations use different deep learning algorithms; for example, one station uses DQN and one station uses PPO. When these stations initiate channel access, they can still share the spectrum fairly and efficiently. Specifically, in panel 6-1 of FIG. 6, the horizontal axis is the access delay (Delay) in seconds, and the vertical axis is the cumulative probability distribution (Probability); the panel shows the distribution of access delays for the stations using the different algorithms when they initiate channel access. In panel 6-2 of FIG. 6, the horizontal axis is time (Time) in seconds, and the vertical axis is throughput (Throughput). The total throughput of these stations stabilizes between 0.8 and 1, and the throughput of each station stabilizes between 0.4 and 0.6. In other words, the throughput differs little between stations, so these stations can share the spectrum fairly and efficiently.
The foregoing mainly describes the solutions provided in this application from the perspective of interaction between devices. It can be understood that, to implement the foregoing functions, each device includes corresponding hardware structures and/or software modules for performing each function. A person skilled in the art should readily appreciate that, with the units and algorithm steps of the examples described in the embodiments disclosed herein, this application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such an implementation should not be considered beyond the scope of this application.
In the embodiments of this application, the access point, the station, and the like may be divided into functional modules according to the foregoing method examples. For example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division of modules in the embodiments of this application is illustrative and is merely a division by logical function; other divisions are possible in an actual implementation.
When integrated modules are used, refer to FIG. 7, which is a schematic structural diagram of a communication device according to an embodiment of this application. The communication device 700 can be applied to the method shown in FIG. 4. As shown in FIG. 7, the communication device 700 includes a processing module 701 and a transceiver module 702. The processing module 701 may be one or more processors, and the transceiver module 702 may be a transceiver or a communication interface. The communication device can be used to implement the functions of the station or the access point in any of the foregoing method embodiments, or to implement the functions of a network element in any of the foregoing method embodiments. The network element or network function may be a network element in a hardware device, a software function running on dedicated hardware, or a virtualized function instantiated on a platform (for example, a cloud platform). Optionally, the communication device 700 may further include a storage module 703, configured to store program code and data of the communication device 700.
In one example, the communication device serves as the access point or as a chip applied in the access point, and performs the steps performed by the access point in the foregoing method embodiments. The transceiver module 702 is configured to support communication with stations and the like. Specifically, the transceiver module performs the sending and/or receiving actions performed by the access point in FIG. 4, for example, supporting the access point in performing step 402 and/or other processes of the technology described herein. The processing module 701 may be configured to support the communication device 700 in performing the processing actions in the foregoing method embodiments, for example, supporting the access point in performing one or more steps such as step 401 and/or other processes of the technology described herein.
For example, the processing module 701 is configured to determine a first reward value based on the actions of multiple stations, where the first reward value is the reward value of a first station among the multiple stations and is used by the first station for reinforcement learning training; and the transceiver module 702 is configured to send the first reward value to the first station.
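The division into a processing module and a transceiver module can be sketched as two small classes. This is only a structural illustration: the class and method names are hypothetical, the transceiver records frames in a list instead of driving a radio, and the reward rule is an assumed example rather than the formula of the application.

```python
from typing import Callable, Dict, List, Tuple

RewardFn = Callable[[Dict[str, str], str], float]


class ProcessingModule:
    """Stands in for processing module 701: determines the reward value."""

    def __init__(self, reward_fn: RewardFn) -> None:
        self._reward_fn = reward_fn

    def determine_reward(self, actions: Dict[str, str], station: str) -> float:
        return self._reward_fn(actions, station)


class TransceiverModule:
    """Stands in for transceiver module 702: records outgoing frames for inspection."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, float]] = []

    def send(self, station: str, reward: float) -> None:
        self.sent.append((station, reward))


class CommunicationDevice:
    def __init__(self, processing: ProcessingModule, transceiver: TransceiverModule) -> None:
        self.processing = processing
        self.transceiver = transceiver

    def handle(self, actions: Dict[str, str], station: str) -> float:
        # Determine the reward from the actions of all stations, then send it to one station.
        reward = self.processing.determine_reward(actions, station)
        self.transceiver.send(station, reward)
        return reward


def example_reward(actions: Dict[str, str], station: str) -> float:
    # Assumed rule: penalise a station that accesses the channel together with another station.
    others = [action for name, action in actions.items() if name != station]
    if actions[station] == "access" and "access" in others:
        return -100.0
    return 0.0


device = CommunicationDevice(ProcessingModule(example_reward), TransceiverModule())
result = device.handle({"station1": "access", "station2": "access"}, "station1")
```

Keeping the reward rule as an injected function mirrors the text's point that the modules are a logical division and that the same structure can host different implementations.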
Optionally, the action of a station includes at least one of the following: the station initiates channel access, the station performs channel selection, the station performs power control, or the station performs rate adaptation.
Optionally, when determining the first reward value based on the actions of the multiple stations, the processing module 701 is configured to determine the first reward value based on the actions of the multiple stations and the times corresponding to the actions of the multiple stations.
Optionally, the times corresponding to the actions of the multiple stations are the same.
Optionally, the first reward value is the reward value corresponding to a first time, and the first time is the time corresponding to the action of the first station.
Optionally, when sending the first reward value to the first station, the transceiver module 702 is configured to send a broadcast frame to the first station, where the broadcast frame includes the first reward value.
Optionally, the multiple stations further include a second station. The processing module 701 is further configured to determine a reward value of the second station if the first station and the second station send messages at the same time and the transmission fails, where the reward value of the second station is the same as the first reward value. The transceiver module 702 is further configured to send the broadcast frame to the second station.
Optionally, when sending the first reward value to the first station, the transceiver module 702 is configured to send a response frame of the first message to the first station, where the response frame of the first message includes the first reward value, the first reward value corresponds to the second message, and the second message is received after the first message.
Optionally, the response frame of the first message further includes identification information of the second message or a timestamp of the second message.
Optionally, the timestamp of the second message is the reception time of the second message; or the timestamp of the second message is the difference between the reception time of the first message and the reception time of the second message.
Optionally, the first reward value is determined from d0, d1, and N, where d0 is the time elapsed since an acknowledgment frame of the first station was last received, N is the number of stations, and d1 is the time elapsed since an acknowledgment frame of another station was last overheard, the other stations being the stations among the multiple stations other than the first station.
In one possible implementation, when the access point or the station is a chip, the transceiver module 702 may be a communication interface, a pin, a circuit, or the like. The communication interface can be used to input data to be processed to the processor and to output the processing results of the processor. In a specific implementation, the communication interface may be a general purpose input output (GPIO) interface, which can be connected to multiple peripheral devices (such as a display (LCD), a camera, a radio frequency (RF) module, and an antenna). The communication interface is connected to the processor through a bus.
The processing module 701 may be a processor, and the processor may execute computer-executable instructions stored in the storage module, so that the chip performs the method in the embodiment of FIG. 4.
Further, the processor may include a controller, an arithmetic unit, and registers. For example, the controller is mainly responsible for decoding instructions and issuing control signals for the operations corresponding to the instructions. The arithmetic unit is mainly responsible for performing fixed-point or floating-point arithmetic operations, shift operations, logical operations, and the like, and may also perform address calculation and translation. The registers are mainly responsible for holding register operands and intermediate results that are stored temporarily during instruction execution. In a specific implementation, the hardware architecture of the processor may be an application specific integrated circuit (ASIC) architecture, a microprocessor without interlocked pipelined stages (MIPS) architecture, an advanced RISC machines (ARM) architecture, a network processor (NP) architecture, or the like. The processor may be single-core or multi-core.
The storage module may be a storage module inside the chip, such as a register or a cache. The storage module may also be a storage module located outside the chip, such as a read-only memory (ROM), another type of static storage device that can store static information and instructions, or a random access memory (RAM).
It should be noted that the functions of the processor and the interface may be implemented through hardware design, through software design, or through a combination of software and hardware; this is not limited herein.
An embodiment of this application further provides a communication device, including a processor, a memory, an input interface, and an output interface. The input interface is configured to receive information from communication devices other than the communication device, the output interface is configured to output information to communication devices other than the communication device, and the processor invokes a computer program stored in the memory to implement the embodiment shown in FIG. 4.
An embodiment of this application further provides a chip. The chip includes at least one processor and an interface. The processor is configured to read and execute instructions stored in a memory, and when the instructions are run, the chip is caused to perform the embodiment shown in FIG. 4.
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is caused to perform the embodiment shown in FIG. 4.
An embodiment of this application further provides a computer program product. When a computer reads and executes the computer program product, the computer is caused to perform the embodiment shown in FIG. 4.
The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions in the embodiments of this application. In addition, the network element units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software network element unit.
If the integrated unit is implemented in the form of a software network element unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the part of the technical solution of this application that makes the essential contribution, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a terminal device, a cloud server, a network device, or the like) to perform all or some of the steps of the methods in the embodiments of this application. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc. The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (24)

  1. A reinforcement learning training method, characterized in that the method comprises:
    determining a first reward value based on actions of multiple stations, wherein the first reward value is the reward value of a first station among the multiple stations, and the first reward value is used by the first station for reinforcement learning training; and
    sending the first reward value to the first station.
  2. The method according to claim 1, characterized in that the action of a station comprises at least one of the following: the station initiates channel access, the station performs channel selection, the station performs power control, or the station performs rate adaptation.
  3. The method according to claim 1 or 2, characterized in that the determining a first reward value based on actions of multiple stations comprises:
    determining the first reward value based on the actions of the multiple stations and the times corresponding to the actions of the multiple stations.
  4. The method according to claim 3, characterized in that the times corresponding to the actions of the multiple stations are the same.
  5. The method according to any one of claims 1-4, characterized in that the first reward value is the reward value corresponding to a first time, and the first time is the time corresponding to the action of the first station.
  6. The method according to any one of claims 1-5, characterized in that the sending the first reward value to the first station comprises:
    sending a broadcast frame to the first station, wherein the broadcast frame comprises the first reward value.
  7. The method according to claim 6, characterized in that the multiple stations further comprise a second station, and the method further comprises:
    if the first station and the second station send messages at the same time and the transmission fails, determining a reward value of the second station, wherein the reward value of the second station is the same as the first reward value; and
    sending the broadcast frame to the second station.
  8. The method according to any one of claims 1-5, characterized in that the sending the first reward value to the first station comprises:
    sending a response frame of a first message to the first station;
    wherein the response frame of the first message comprises the first reward value, the first reward value corresponds to a second message, and the second message is received after the first message.
  9. The method according to claim 8, characterized in that the response frame of the first message further comprises identification information of the second message or a timestamp of the second message.
  10. The method according to claim 9, characterized in that:
    the timestamp of the second message is the reception time of the second message; or
    the timestamp of the second message is the difference between the reception time of the first message and the reception time of the second message.
  11. The method according to any one of claims 1-10, characterized in that:
    the first reward value is determined from d0, d1, and N,
    wherein d0 is the time elapsed since an acknowledgment frame of the first station was last received, N is the number of stations, and d1 is the time elapsed since an acknowledgment frame of the other stations was last overheard, the other stations being the stations among the multiple stations other than the first station.
  12. 一种通信装置,其特征在于,所述装置包括处理模块和收发模块,A communication device, characterized in that the device includes a processing module and a transceiver module,
    处理模块,用于根据多个站点的动作,确定第一回报值,所述第一回报值为所述多个站点中第一站点的回报值,所述第一回报值用于所述第一站点进行强化学习训练;A processing module configured to determine a first return value based on actions of multiple sites, where the first return value is the return value of a first site among the multiple sites, and the first return value is used for the first site. The site conducts reinforcement learning training;
    收发模块,用于向所述第一站点发送所述第一回报值。A transceiver module, configured to send the first report value to the first site.
  13. 根据权利要求12所述的装置,其特征在于,一个站点的动作包括以下至少一项:所述站点发起信道接入、所述站点进行信道选择、所述站点进行功率控制、所述站点进行速率自适应。The device according to claim 12, wherein the action of a station includes at least one of the following: the station initiates channel access, the station performs channel selection, the station performs power control, and the station performs rate control. Adaptive.
  14. 根据权利要求12或13所述的装置,其特征在于,在根据多个站点的动作,确定第一回报值时,所述处理模块,用于根据所述多个站点的动作和所述多个站点的动作对应的时间,确定所述第一回报值。The device according to claim 12 or 13, characterized in that, when determining the first reward value according to the actions of multiple sites, the processing module is configured to determine the first reward value based on the actions of the multiple sites and the multiple The first reward value is determined based on the time corresponding to the site's action.
  15. 根据权利要求14所述的装置,其特征在于,所述多个站点的动作对应的时间相同。The device according to claim 14, characterized in that the actions of the plurality of stations correspond to the same time.
  16. 根据权利要求14所述的装置,其特征在于,所述第一回报值为第一时间对应的回报值,所述第一 时间为所述第一站点的动作对应的时间。The device according to claim 14, characterized in that the first reward value is a reward value corresponding to a first time, and the first The time is the time corresponding to the action of the first site.
  17. 根据权利要求12-16任意一项所述的装置,其特征在于,在向所述第一站点发送所述第一回报值时,所述收发模块,用于向所述第一站点发送广播帧,所述广播帧包括所述第一回报值。The device according to any one of claims 12-16, characterized in that, when sending the first report value to the first station, the transceiver module is used to send a broadcast frame to the first station. , the broadcast frame includes the first report value.
  18. 根据权利要求17所述的装置,其特征在于,The device according to claim 17, characterized in that:
    所述多个站点还包括第二站点,所述处理模块,还用于若所述第一站点和所述第二站点同时发送报文并导致传输失败,则确定所述第二站点的回报值,所述第二站点的回报值与所述第一回报值相同;The multiple sites also include a second site, and the processing module is also configured to determine the return value of the second site if the first site and the second site send messages at the same time and cause transmission failure. , the return value of the second site is the same as the first return value;
    所述收发模块,还用于向所述第二站点发送所述广播帧。The transceiver module is also used to send the broadcast frame to the second station.
  19. 根据权利要求12-16任意一项所述的装置,其特征在于,在向所述第一站点发送所述第一回报值时,所述收发模块,用于向所述第一站点发送第一报文的响应帧;The device according to any one of claims 12 to 16, characterized in that, when sending the first report value to the first site, the transceiver module is used to send the first report value to the first site. The response frame of the message;
    其中,所述第一报文的响应帧包括所述第一回报值,所述第一回报值与第二报文对应,所述第二报文在所述第一报文之后接收。Wherein, the response frame of the first message includes the first return value, the first return value corresponds to the second message, and the second message is received after the first message.
  20. 根据权利要求19所述的装置,其特征在于,所述第一报文的响应帧还包括所述第二报文的标识信息或所述第二报文的时间戳。The device according to claim 19, wherein the response frame of the first message further includes identification information of the second message or a timestamp of the second message.
  21. The device according to claim 19, characterized in that:
    the timestamp of the second message is the reception time of the second message; or,
    the timestamp of the second message is the difference between the reception time of the first message and the reception time of the second message.
  22. The device according to any one of claims 12 to 21, characterized in that:
    the first reward value is [formula not recoverable from the source text];
    where d0 is the time elapsed since the first station most recently received an acknowledgment frame for the first station, N is the number of stations, and d1 is the time elapsed since the first station most recently overheard an acknowledgment frame of the other stations, the other stations being the stations among the plurality of stations other than the first station.
  23. A chip, characterized in that the chip includes at least one processor and an interface, the processor is configured to read and execute instructions stored in a memory, and the instructions, when run, cause the chip to perform the method according to any one of claims 1 to 11.
  24. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 11.
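
The frame contents recited in claims 18 to 21 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The names `ResponseFrame`, `build_response_frame`, and `collision_rewards` are hypothetical, as are the field layout and the sign convention of the relative timestamp (second reception time minus first reception time). Claim 20 recites identification information or a timestamp; the sketch carries both only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ResponseFrame:
    """Hypothetical response frame for a first message (claims 19 to 21)."""
    acked_message_id: int                     # the first message being answered
    reward: float                             # reward value tied to the second message
    second_message_id: Optional[int] = None   # identification information (claim 20)
    second_timestamp: Optional[float] = None  # timestamp (claims 20 and 21)


def build_response_frame(first_id: int, first_rx_time: float,
                         second_id: int, second_rx_time: float,
                         reward: float, relative: bool = True) -> ResponseFrame:
    """Build the response frame; the second message must arrive after the first."""
    if second_rx_time < first_rx_time:
        raise ValueError("second message must be received after the first")
    # Claim 21: either the absolute reception time of the second message,
    # or the difference between the two reception times (sign is an assumption).
    timestamp = second_rx_time - first_rx_time if relative else second_rx_time
    return ResponseFrame(first_id, reward, second_id, timestamp)


def collision_rewards(stations: Iterable[str], reward: float) -> Dict[str, float]:
    """Claim 18: stations whose simultaneous sends collided share one reward value."""
    return {station: reward for station in stations}


if __name__ == "__main__":
    frame = build_response_frame(1, 10.0, 2, 12.5, reward=-1.0)
    print(frame)
    print(collision_rewards(["sta1", "sta2"], -1.0))
```

Carrying the identifier or timestamp lets the receiving station match the reward to the action that produced the second message, even though the reward arrives in the response to an earlier message.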
PCT/CN2023/104247 2022-08-12 2023-06-29 Reinforcement learning training method and related device WO2024032228A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210968171.8 2022-08-12
CN202210968171.8A CN117651346A (en) 2022-08-12 2022-08-12 Training method for reinforcement learning and related device

Publications (1)

Publication Number Publication Date
WO2024032228A1 (en) 2024-02-15

Family

ID=89850674

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/104247 WO2024032228A1 (en) 2022-08-12 2023-06-29 Reinforcement learning training method and related device

Country Status (2)

Country Link
CN (1) CN117651346A (en)
WO (1) WO2024032228A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210153219A1 (en) * 2019-11-19 2021-05-20 Commissariat A L'energie Atomique Et Aux Energies Alternatives Method for associating user equipment in a cellular network via multi-agent reinforcement learning
CN113316154A (en) * 2021-05-26 2021-08-27 重庆邮电大学 Authorized and unauthorized D2D communication resource joint intelligent distribution method
CN113613332A (en) * 2021-07-14 2021-11-05 广东工业大学 Spectrum resource allocation method and system based on cooperative distributed DQN (deep Q-network) combined simulated annealing algorithm
CN113923794A (en) * 2021-11-12 2022-01-11 中国人民解放军国防科技大学 Distributed dynamic spectrum access method based on multi-agent reinforcement learning

Also Published As

Publication number Publication date
CN117651346A (en) 2024-03-05

Similar Documents

Publication Publication Date Title
JP7483931B2 (en) Link processing method, multi-link device, and computer-readable storage medium
US10638423B2 (en) Group wake-up and keep-alive indication
CN109315017A (en) A kind of communication means that realizing dual-card dual-standby dual-pass and terminal
US20140269468A1 (en) Systems and methods for wireless band switching
US10484309B2 (en) Channel access based on uplink virtual queues
JP7528272B2 (en) STR capability indication method and related device
US20190386793A1 (en) Access-category-based multi-user trigger frames
EP4333533A1 (en) Computing power resource scheduling method and related apparatus
EP3537621B1 (en) Multi-user mimo preference-indication signaling
EP4181619A1 (en) Multi-link setup in wireless communication system
US20230354276A1 (en) Time resource allocation and receiving method and related apparatus
US20230345536A1 (en) Channel access method and apparatus
WO2020133491A1 (en) Capability reporting method and terminal device
WO2024032228A1 (en) Reinforcement learning training method and related device
EP4443940A1 (en) Communication method, apparatus and system
CN110876186A (en) Power saving for non-trigger based ranging
WO2022001532A1 (en) Cell selection method and apparatus
CN117083815A (en) Panel state processing method, communication device and storage medium
CN115499936A (en) Channel access method and related device
WO2023207644A1 (en) Channel contention method and apparatus
WO2024131889A1 (en) Communication method and apparatus, and chip and module device
WO2024125498A1 (en) Model updating method and apparatus, and device and storage medium
WO2023237039A1 (en) Channel access method and related product
WO2024169586A1 (en) Communication method and apparatus
WO2024125509A1 (en) Model updating method and apparatus, device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23851455

Country of ref document: EP

Kind code of ref document: A1