CN113039834B

CN113039834B - Wireless communication device, wireless communication system, and computer-readable recording medium

Info

Publication number: CN113039834B
Application number: CN201880099516.2A
Authority: CN
Inventors: 小林卓矢; 泽健太郎; 横山阳介; 山内尚久
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2018-11-29
Filing date: 2018-11-29
Publication date: 2024-03-01
Anticipated expiration: 2038-11-29
Also published as: CN113039834A; WO2020110250A1; JPWO2020110250A1; JP6632778B1

Abstract

An action value acquisition unit (310) acquires an action value in reinforcement learning in which adjustment of an evaluation threshold, that is, a threshold for path selection, is treated as an action. A communication path control unit (320) updates, based on the acquired action value, an action value table indicating the action value of each pair of an evaluation threshold and an adjustment method. The communication path control unit then adjusts the evaluation threshold based on the updated action value table and selects a communication path using the adjusted evaluation threshold. A wireless communication unit (330) performs wireless communication via the selected communication path.

Description

Wireless communication device, wireless communication system, and computer-readable recording medium

Technical Field

The present invention relates to wireless communication control.

Background

With the growing interest in IoT (Internet of Things), applications of wireless multi-hop networks have been developed for monitoring factories, buildings, infrastructure, and the like.

In a wireless multi-hop network, a frame transmitted from a transmission source node is received by a relay node, and transmitted from the relay node to a transmission destination node. Therefore, the wireless multi-hop network is suitable for a system accommodating a large number of terminals in a wide area.

In addition, in a wireless multi-hop network, even if one communication path becomes unusable due to attenuation or shadowing, another communication path can be selected for communication. Therefore, the wireless multi-hop network is resilient to failures.

However, a complicated path control method is required in order to select an optimal relay path according to the radio wave environment around each node while satisfying user requirements such as arrival rate and response time.

As a conventional technique, a path selection method has been proposed which considers the quality of a communication path from a transmission source node to a local node.

Patent document 1 proposes the following scheme: the received signal strength is evaluated in three stages using two thresholds, a link cost corresponding to the evaluation is calculated, and the path with the smallest link cost is selected. In this way, the communication path with the highest received signal strength is selected as the best relay path.
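The kind of scheme attributed to patent document 1 can be sketched roughly as follows. The threshold values (-70 dBm and -85 dBm), the per-stage costs (1, 2, 4), the names `link_cost` and `select_path`, and the choice to sum link costs along a path are all assumptions made for illustration; none of them is taken from patent document 1.

```python
from typing import Dict, List

# Hypothetical thresholds (dBm); not taken from patent document 1.
UPPER_DBM = -70.0
LOWER_DBM = -85.0


def link_cost(rssi_dbm: float) -> int:
    """Grade a link in 3 stages using 2 thresholds and return its cost."""
    if rssi_dbm >= UPPER_DBM:
        return 1  # strong link (hypothetical cost)
    if rssi_dbm >= LOWER_DBM:
        return 2  # medium link (hypothetical cost)
    return 4      # weak link (hypothetical cost)


def select_path(paths: Dict[str, List[float]]) -> str:
    """Return the name of the path whose summed link cost is smallest."""
    return min(paths, key=lambda name: sum(link_cost(r) for r in paths[name]))


paths = {
    "via_B": [-60.0, -65.0],         # two strong links, total cost 2
    "via_C": [-90.0],                # one weak link, total cost 4
    "via_D": [-75.0, -80.0, -72.0],  # three medium links, total cost 6
}
best = select_path(paths)
```

With fixed thresholds like these, the outcome depends entirely on values chosen by hand, which is the limitation discussed in the Disclosure of Invention.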

Prior art literature

Patent literature

Patent document 1: Japanese Patent Laid-Open Publication No. 2011-30049

Non-patent literature

Non-patent document 1: IETF RFC6550, "IPv6 Routing Protocol for Low-Power and Lossy Networks"

Disclosure of Invention

In the method of patent document 1, the quality of the path is considered using a threshold value.

However, the system integrator must determine the threshold by on-site tuning to match the installation environment, so determining the threshold requires manual work.

In addition, the environment around a node differs greatly depending on where and when the node is installed. Therefore, the same threshold cannot be used for all nodes. If the same threshold is used for all nodes, then at a node where the fluctuation range of the received signal strength is large (or small), the number of relays becomes excessive (or insufficient) relative to the user's requirements. As a result, transmission delays and radio errors may occur.

The purpose of the present invention is to automatically and appropriately adjust a threshold value for path selection.

A wireless communication device of the present invention is provided with:

an action value acquisition unit that acquires an action value in reinforcement learning in which adjustment of an evaluation threshold, which is a threshold for path selection, is treated as an action;

a communication path control unit that updates, based on the acquired action value, an action value table indicating the action value of each pair of an evaluation threshold and an adjustment method, adjusts the evaluation threshold based on the updated action value table, and selects a communication path using the adjusted evaluation threshold; and

a wireless communication unit that performs wireless communication via the selected communication path.

According to the present invention, the threshold value for path selection can be automatically and appropriately adjusted.

Drawings

Fig. 1 is a block diagram of a radio communication system 100 according to embodiment 1.

Fig. 2 is a block diagram of a wireless communication apparatus 200 according to embodiment 1.

Fig. 3 is a block diagram of a wireless communication apparatus 300 according to embodiment 1.

Fig. 4 is a flowchart of a wireless communication method (path selection) in embodiment 1.

Fig. 5 is a flowchart of the host processing (S120) in embodiment 1.

Fig. 6 is a flowchart of the slave process (S130) in embodiment 1.

Fig. 7 is a diagram showing an action value table 391 in embodiment 1.

Fig. 8 is a flowchart of the threshold adjustment process (S134) in embodiment 1.

Fig. 9 is a timing chart showing an example of the operation of the wireless communication system 100 in embodiment 1.

Fig. 10 is another example of a flowchart of the host processing (S120) in embodiment 1.

Fig. 11 is another example of a flowchart of the slave process (S130) in embodiment 1.

Fig. 12 is another example of a flowchart of the wireless communication method (path selection) in embodiment 1.

Fig. 13 is a flowchart of the slave process (S120B) in embodiment 1.

Fig. 14 is a flowchart of a wireless communication method (interval adjustment) in embodiment 2.

Fig. 15 is a flowchart of the slave process (S230) in embodiment 2.

Fig. 16 is a diagram showing an action value table 392 in embodiment 2.

Fig. 17 is a flowchart of the interval adjustment process (S234) in embodiment 2.

Fig. 18 is another example of a flowchart of the slave process (S230) in embodiment 2.

Fig. 19 is another example of a flowchart of the wireless communication method (interval adjustment) in embodiment 2.

Fig. 20 is a flowchart of the slave process (S220B) in embodiment 2.

Fig. 21 is a flowchart of a wireless communication method (relay selection) in embodiment 3.

Fig. 22 is a flowchart of the slave process (S330) in embodiment 3.

Fig. 23 is a diagram showing an action value table 393 in embodiment 3.

Fig. 24 is a flowchart of the relay selection process (S334) in embodiment 3.

Fig. 25 is another example of a flowchart of the slave process (S330) in embodiment 3.

Fig. 26 is another example of a flowchart of the wireless communication method (relay selection) in embodiment 3.

Fig. 27 is a flowchart of the slave process (S320B) in embodiment 3.

Fig. 28 is a configuration diagram of a wireless communication apparatus 300 according to embodiment 4.

Fig. 29 is a flowchart of a wireless communication method (rate adjustment) in embodiment 4.

Fig. 30 is a flowchart of the slave process (S430) in embodiment 4.

Fig. 31 is a diagram showing an action value table 394 in embodiment 4.

Fig. 32 is a flowchart of the rate adjustment process (S434) in embodiment 4.

Fig. 33 is another example of a flowchart of the slave process (S430) in embodiment 4.

Fig. 34 is another example of a flowchart of the wireless communication method (rate adjustment) in embodiment 4.

Fig. 35 is a flowchart of the slave process (S420B) in embodiment 4.

Fig. 36 is a configuration diagram of a wireless communication apparatus 300 according to embodiment 5.

Fig. 37 is a flowchart of a wireless communication method (learning result confirmation) in embodiment 5.

Fig. 38 is a hardware configuration diagram of the wireless communication apparatus 200 in the embodiment.

Fig. 39 is a hardware configuration diagram of the wireless communication apparatus 300 in the embodiment.

(description of the reference numerals)

100: a wireless communication system; 200: a wireless communication device; 201: a processor; 202: a memory; 203: a wired interface; 204: a wireless interface; 205: a wireless antenna; 209: a processing circuit; 210: a wireless communication unit; 220: an information providing unit; 290: a storage unit; 300: a wireless communication device; 301: a processor; 302: a memory; 303: a wired interface; 304: a wireless interface; 305: a wireless antenna; 309: a processing circuit; 310: an action value acquisition unit; 320: a communication path control unit; 330: a wireless communication unit; 340: a transmission rate control unit; 350: a learning result confirmation unit; 390: a storage unit; 391, 392, 393, 394: action value tables.

Detailed Description

In the embodiments and drawings, the same reference numerals are given to the same or corresponding elements. Description of elements bearing the same reference numerals as elements already described is omitted or simplified as appropriate. Arrows in the figures mainly indicate the flow of data or the flow of processing.

Embodiment 1.

Referring to fig. 1 to 13, a method of applying reinforcement learning to adjust a threshold value for a communication path will be described.

* Description of the Structure

The structure of the wireless communication system 100 is described with reference to fig. 1.

The wireless communication system 100 includes a plurality of wireless communication apparatuses.

A wireless communication apparatus that operates as a "host" is referred to as a "wireless communication apparatus 200".

Each wireless communication apparatus that operates as a "slave" or a "relay" is referred to as a "wireless communication apparatus 300".

In fig. 1, a wireless communication apparatus 300A operates as a slave, and each of the wireless communication apparatuses (300B to 300G) operates as a relay.

The plurality of wireless communication apparatuses 300 constitute a multi-hop network.

The host (200) manages a multi-hop network.

The slave (300A) communicates with the host (200) via one or more relays (300B to 300G).

The structure of the wireless communication apparatus 200 is described with reference to fig. 2.

The wireless communication apparatus 200 is a computer including hardware such as a processor 201, a memory 202, a wired interface 203, a wireless interface 204, and a wireless antenna 205. These pieces of hardware are connected to each other via signal lines.

The processor 201 is an IC (Integrated Circuit) that performs arithmetic processing and controls other hardware. For example, the processor 201 is a CPU (Central Processing Unit), a DSP (Digital Signal Processor), or a GPU (Graphics Processing Unit).

The memory 202 is a storage device. For example, the memory 202 is a RAM (Random Access Memory), a ROM (Read Only Memory), an HDD (Hard Disk Drive), a flash memory, or a combination thereof.

The wired interface 203 is an interface for performing wired communication. A cable is connected to the wired interface 203. For example, the wired interface 203 is a communication chip or a NIC (Network Interface Card).

The wireless interface 204 is an interface for performing wireless communication. A wireless antenna 205 is connected to the wireless interface 204. For example, the wireless interface 204 is a communication chip or NIC.

The wireless antenna 205 is an antenna for wireless communication.

The wireless communication device 200 performs wireless communication using the wireless interface 204 and the wireless antenna 205.

The wireless communication device 200 includes elements such as a wireless communication unit 210 and an information providing unit 220. These elements are implemented in software.

The memory 202 stores a wireless communication program for causing a computer to function as the wireless communication unit 210 and the information providing unit 220. The memory 202 also stores an OS (Operating System).

The processor 201 executes the wireless communication program while executing the OS.

The data resulting from the execution of the wireless communication program is stored in the memory 202, a register within the processor 201, or a cache memory within the processor 201.

The memory 202 functions as a storage unit 290.

The wireless communication apparatus 200 may include a plurality of processors instead of the processor 201. The multiple processors share the role of processor 201.

The wireless communication program can be recorded (stored) in a computer-readable manner on a nonvolatile recording medium such as an optical disk or a flash memory.

The structure of the wireless communication apparatus 300 is described with reference to fig. 3.

The wireless communication apparatus 300 is a computer including hardware such as a processor 301, a memory 302, a wired interface 303, a wireless interface 304, and a wireless antenna 305. These pieces of hardware are connected to each other via signal lines.

The processor 301 is an IC that performs arithmetic processing, and controls other hardware. For example, the processor 301 is a CPU, DSP or GPU.

Memory 302 is a storage device. For example, the memory 302 is RAM, ROM, HDD, flash memory, or a combination thereof.

The wired interface 303 is an interface for performing wired communication. A cable is connected to the wired interface 303. For example, the wired interface 303 is a communication chip or a NIC.

The wireless interface 304 is an interface for performing wireless communication. A wireless antenna 305 is connected to the wireless interface 304. For example, the wireless interface 304 is a communication chip or NIC.

The wireless antenna 305 is an antenna for wireless communication.

The wireless communication device 300 performs wireless communication using the wireless interface 304 and the wireless antenna 305.

The wireless communication apparatus 300 includes elements such as an action value acquisition unit 310, a communication path control unit 320, and a wireless communication unit 330. These elements are implemented in software.

In the memory 302, a wireless communication program for causing a computer to function as the action value acquisition unit 310, the communication path control unit 320, and the wireless communication unit 330 is stored. Further, the memory 302 stores an OS.

The processor 301 executes the wireless communication program while executing the OS.

The data resulting from the execution of the wireless communication program is stored in the memory 302, a register within the processor 301, or a cache memory within the processor 301.

The memory 302 functions as a storage unit 390.

The wireless communication apparatus 300 may include a plurality of processors instead of the processor 301. The multiple processors share the role of processor 301.

* Description of actions

The operation of the wireless communication system 100 corresponds to a wireless communication method. In addition, the procedure of the wireless communication method corresponds to the procedure of the wireless communication program.

In the wireless communication system 100, reinforcement learning is applied for path control.

Reinforcement learning is a method in which an agent acts on an environment and, through that interaction, learns appropriate actions so as to maximize future rewards. Reinforcement learning is one type of machine learning.

The subject of the action is referred to as an "agent". In the wireless communication system 100, each wireless communication apparatus is an agent.

The object acted upon is referred to as the "environment".

What the agent does to the environment in the current state is referred to as an "action".

An index of the benefit obtained as a result of an action in the current state is referred to as a "reward".

The agent learns actions appropriate to the surrounding environment based on the reward.

Q-learning and TD learning are known as representative methods of reinforcement learning.

In embodiment 1, reinforcement learning is described using Q learning as a specific example. However, a method other than Q learning may be employed as reinforcement learning.

A wireless communication method (path selection) is described with reference to fig. 4.

The wireless communication method (path selection) is a method of performing wireless communication by applying reinforcement learning for path selection.

The threshold for path selection is referred to as an "evaluation threshold". In the path selection algorithm, an evaluation threshold is used.

The state in reinforcement learning for path selection is an evaluation threshold.

An action in reinforcement learning for path selection is adjustment (increase and decrease) of the evaluation threshold.

In step S110, wireless communication is performed between the host and the slave.

Specifically, the wireless communication unit 330 of the slave unit performs wireless communication with the master unit via the communication path selected in the previous path selection process (S130). In wireless communication, communication of a wireless frame is performed. The host is a communication counterpart of the slave.

In step S120, the host provides action value information for path selection to the slave.

The action value information for route selection is information for specifying an action value in reinforcement learning for route selection. Specifically, the action value information is a reward in reinforcement learning.

According to fig. 5, a procedure of the host processing (S120) is explained.

In step S121, the information providing unit 220 detects the providing timing.

The providing timing is the timing at which action value information for path selection is provided.

Specifically, the information providing unit 220 detects that a fixed period has elapsed since the last providing timing. The fixed period is determined as appropriate. For example, when the wireless communication system 100 is applied to a power condition monitoring system in a factory, the fixed period is set to 3 minutes.
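The timing check in step S121 can be sketched as follows. The class name `ProvisionTimer` and its method `is_due` are hypothetical; the 180-second period mirrors the 3-minute example above, and the current time is passed in explicitly so that the sketch behaves deterministically.

```python
class ProvisionTimer:
    """Reports when a fixed period has elapsed since the last providing timing."""

    def __init__(self, period_s: float, start_s: float = 0.0) -> None:
        self.period_s = period_s
        self.last_s = start_s

    def is_due(self, now_s: float) -> bool:
        """Return True, and restart the period, once period_s has elapsed."""
        if now_s - self.last_s >= self.period_s:
            self.last_s = now_s
            return True
        return False


timer = ProvisionTimer(period_s=180.0)  # 3 minutes, as in the factory example
```

In a running system the caller would pass a monotonic clock reading to `is_due` on each polling cycle.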

In step S122, the information providing unit 220 calculates a communication quality value of wireless communication between the host and the slave unit.

The communication quality value is a value indicating the communication quality of wireless communication.

Specifically, the information providing unit 220 calculates the PER and the transmission delay time. PER is an abbreviation for Packet Error Rate.

The PER and the transmission delay time are each calculated by a conventional method.
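A minimal sketch of step S122 follows. It assumes that the PER is the fraction of transmitted frames that were not received correctly and that the transmission delay time is the mean of the per-frame delays; the source does not state how either quantity is measured, and the function and parameter names are hypothetical.

```python
from typing import Sequence, Tuple


def communication_quality(
    frames_sent: int,
    frames_failed: int,
    delays_s: Sequence[float],
) -> Tuple[float, float]:
    """Return (PER, mean transmission delay in seconds) for one period."""
    if frames_sent <= 0:
        raise ValueError("frames_sent must be positive")
    if not delays_s:
        raise ValueError("delays_s must not be empty")
    per = frames_failed / frames_sent           # assumed definition of PER
    mean_delay = sum(delays_s) / len(delays_s)  # assumed definition of delay
    return per, mean_delay


per, delay = communication_quality(200, 10, [0.25, 0.75])
```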

In step S123, the information providing unit 220 calculates a reward in reinforcement learning for route selection based on the calculated communication quality value.

For example, the information providing unit 220 calculates the reward by calculation formula (1).

"r" is the reward.

"A" is the PER.

"B" is the transmission delay time.

"β" is a parameter value. The parameter value β is predetermined by the user. Specifically, the parameter value β is selected from the range of 0 to 1. For example, when only the transmission delay time is to be considered in order to select a communication path with a short transmission delay time, the user selects "0" as the parameter value β.

"δ" is a parameter value. The parameter value δ is predetermined by the user. Specifically, the parameter value δ is selected from the range of 0 to 1. For example, when only the transmission delay time is to be considered in order to select a communication path with a short transmission delay time, the user selects "1" as the parameter value δ.

[Math 1]

In step S124, the information providing unit 220 transmits the calculated reward to the slave unit. The transmitted reward is the action value information.

Specifically, the information providing unit 220 transmits a communication frame in which the reward is set to the slave unit. The transmitted communication frame reaches the slave unit via one or more relays.

Returning to fig. 4, step S130 is described.

In step S130, the slave unit selects a communication path based on the action value information for path selection.

After step S130, wireless communication is performed between the host and the slave via the selected communication path.

Specifically, the slave unit operates as follows.

The action value acquisition unit 310 receives the action value information for path selection and acquires an action value based on the received action value information.

The communication path control unit 320 updates the action value table for path selection based on the acquired action value. The action value table for path selection indicates the action value of each pair of an evaluation threshold and an adjustment method. Then, the communication path control unit 320 adjusts the evaluation threshold based on the updated action value table, and selects a communication path using the adjusted evaluation threshold.

The wireless communication unit 330 performs wireless communication with the host via the selected communication path.

The procedure of the slave process (S130) is described with reference to fig. 6.

In step S131, the action value acquisition unit 310 receives a reward in reinforcement learning for path selection. The received reward is the action value information.

Specifically, the action value acquisition unit 310 receives a communication frame in which the reward is set.

In step S132, the action value acquisition unit 310 calculates an action value in reinforcement learning for path selection based on the received reward.

Specifically, the action value acquisition unit 310 calculates the Q value in Q-learning. The calculated Q value is the action value.

For example, the action value acquisition unit 310 calculates the Q value by calculation formula (2).

"$S_t$" denotes the state of the environment at time t.

"$a_t$" denotes the action at time t.

"$Q(S_t, a_t)$" is the value of action $a_t$ in state $S_t$.

"$S_{t+1}$" denotes the state of the environment after action $a_t$. Action $a_t$ causes the state to transition from $S_t$ to $S_{t+1}$.

"$r_{t+1}$" denotes the reward obtained by the transition to state $S_{t+1}$.

"γ" is a parameter value called the discount rate. The discount rate γ is predetermined by the user. Specifically, the discount rate γ is selected from the range 0 < γ ≤ 1.

"$\max_a Q(S_{t+1}, a)$" is the maximum action value obtainable in state $S_{t+1}$, that is, the value of the action a whose value is largest in state $S_{t+1}$. $\max_a Q(S_{t+1}, a)$ is selected from the action value table.

"α" is a learning coefficient. The learning coefficient α is predetermined by the user. Specifically, the learning coefficient α is selected from the range 0 < α ≤ 1.

[Math 2]

$$Q(S_t, a_t) \leftarrow Q(S_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, a_t) \right) \quad \ldots (2)$$

By calculation formula (2), the action value $Q(S_t, a_t)$ of action $a_t$ in state $S_t$ is updated according to the reward $r_{t+1}$ obtained by taking action $a_t$ in state $S_t$.

If the evaluation value $r_{t+1} + \gamma \max_a Q(S_{t+1}, a)$, which combines the reward $r_{t+1}$ with the discounted value of the best action in the state following action $a_t$, is larger than the action value $Q(S_t, a_t)$ of action $a_t$ in state $S_t$, then $Q(S_t, a_t)$ becomes larger. Conversely, if it is smaller, $Q(S_t, a_t)$ becomes smaller. That is, the value of a certain action in a certain state is brought closer to the sum of the reward returned immediately as a result of that action and the value of the best action in the next state reached by that action.
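The update of calculation formula (2) can be sketched as a single function. The function name `q_update` and its parameter names are hypothetical; the range checks follow the ranges given above for the learning coefficient α and the discount rate γ.

```python
from typing import Iterable


def q_update(
    q_current: float,
    reward: float,
    next_q_values: Iterable[float],
    alpha: float,
    gamma: float,
) -> float:
    """Apply formula (2) once and return the new Q(S_t, a_t)."""
    if not (0.0 < alpha <= 1.0 and 0.0 < gamma <= 1.0):
        raise ValueError("alpha and gamma must lie in (0, 1]")
    # Target: reward plus discounted value of the best action in S_{t+1}.
    target = reward + gamma * max(next_q_values)
    # Move the current value toward the target by the learning coefficient.
    return q_current + alpha * (target - q_current)


new_q = q_update(
    q_current=0.0, reward=1.0, next_q_values=[0.0, 2.0, 1.0], alpha=0.5, gamma=0.5
)
```

In this example the target is 1.0 + 0.5 × 2.0 = 2.0, so the value moves halfway from 0.0 toward 2.0. When the current value is above the target, the same rule moves it downward.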

The action value table 391 is described with reference to fig. 7.

The action value table 391 is an action value table used in reinforcement learning for path selection.

The action value table 391 indicates the action value Q of each pair of an evaluation threshold and an adjustment method.

Specifically, the evaluation threshold is a value that is compared with the received signal strength of each communication path. The received signal strength is the signal strength at the time a frame is received. That is, the evaluation threshold is itself a signal strength. "dBm" is the unit of signal strength.

Specifically, the adjustment method is to raise the evaluation threshold, to lower the evaluation threshold, or to leave the evaluation threshold unchanged. That is, the adjustment method is an action with three alternatives.

$Q(S_n, \uparrow)$ is the action value of raising the evaluation threshold in state $S_n$.

$Q(S_n, \downarrow)$ is the action value of lowering the evaluation threshold in state $S_n$.

$Q(S_n, \rightarrow)$ is the action value of leaving the evaluation threshold unchanged in state $S_n$.

$\max_a Q(S_{t+1}, a)$ in formula (2) above is obtained from the action value table 391.

Specifically, the communication path control unit 320 extracts the three action values $Q(S_{t+1}, \uparrow)$, $Q(S_{t+1}, \downarrow)$, and $Q(S_{t+1}, \rightarrow)$ from the action value table 391. Then, the communication path control unit 320 selects the largest of the three extracted action values Q. The selected action value Q is $\max_a Q(S_{t+1}, a)$.
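One possible in-memory layout for the action value table 391 is sketched below. It assumes a dictionary keyed by the pair (evaluation threshold in dBm, adjustment method); the candidate thresholds, the zero initial values, and the names `Adjust`, `make_table`, and `max_q` are hypothetical and not taken from the source.

```python
from enum import Enum
from typing import Dict, Tuple


class Adjust(Enum):
    RAISE = "up"      # raise the evaluation threshold
    LOWER = "down"    # lower the evaluation threshold
    KEEP = "keep"     # leave the evaluation threshold unchanged


QTable = Dict[Tuple[int, Adjust], float]


def make_table(thresholds_dbm: Tuple[int, ...]) -> QTable:
    """Create a table with one entry per (threshold, adjustment method) pair."""
    return {(s, a): 0.0 for s in thresholds_dbm for a in Adjust}


def max_q(table: QTable, state_dbm: int) -> float:
    """Return the largest of the three action values stored for one state."""
    return max(table[(state_dbm, a)] for a in Adjust)


table = make_table((-90, -85, -80))  # hypothetical candidate thresholds
table[(-85, Adjust.RAISE)] = 0.4
table[(-85, Adjust.LOWER)] = -0.2
```

Here `max_q(table, -85)` plays the role of the maximum over the three extracted action values for the state whose evaluation threshold is -85 dBm.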

Returning to fig. 6, the description continues from step S133.

In step S133, the communication path control unit 320 updates the action value table 391 based on the calculated action value.

Specifically, the communication path control unit 320 updates the action value table 391 with the calculated action value $Q(S_t, a_t)$ as follows.

First, the communication path control unit 320 selects the action value $Q(S_t, a_t)$ from the action value table 391.

Then, the communication path control unit 320 overwrites the selected action value $Q(S_t, a_t)$ with the calculated action value $Q(S_t, a_t)$.

In step S134, the communication path control unit 320 adjusts the evaluation threshold.

Referring to fig. 8, the procedure of the threshold adjustment process (S134) will be described.

In step S1341, the communication path control unit 320 randomly selects whether or not to perform threshold adjustment based on the action value table 391.

For example, as in the ε-greedy method, the action (adjustment method) that maximizes the Q value may be selected with probability 1 − ε. "ε" is a parameter value. The value ε is predetermined by the user. Specifically, the value ε is selected from the range 0 ≤ ε ≤ 1. When the value ε is "0", the adjustment method that maximizes the Q value is always selected based on the action value table 391. When the value ε is "1", the adjustment method is always determined randomly.

Step S1341 allows results of new actions to be obtained, so reinforcement learning can be prevented from being trapped in a local solution.

If threshold adjustment based on the action value table 391 is selected, the process advances to step S1342.

If threshold adjustment based on the action value table 391 is not selected, the process advances to step S1343.

In step S1342, the communication path control unit 320 selects an adjustment method based on the action value table 391.

Specifically, the communication path control unit 320 selects the adjustment method as follows.

First, for each adjustment method, the communication path control unit 320 selects from the action value table 391 the action value of the pair of the previous evaluation threshold and that adjustment method. Thus, a plurality of action values corresponding to the plurality of adjustment methods are selected.

Next, the communication path control unit 320 selects the maximum action value from the plurality of selected action values.

Then, the communication path control unit 320 selects an adjustment method corresponding to the selected maximum action value.

In step S1343, the communication-path control unit 320 randomly selects an adjustment method.

In step S1344, the communication path control unit 320 adjusts the evaluation threshold according to the selected adjustment method.
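Steps S1341 to S1344 can be sketched as an ε-greedy choice followed by the adjustment itself. The 5 dB step size, the string labels for the adjustment methods, the function names, and the example Q values are assumptions for illustration; the source does not specify the adjustment step.

```python
import random
from typing import Dict, Tuple

ACTIONS = ("raise", "lower", "keep")
STEP_DB = 5  # hypothetical adjustment step in dB


def choose_adjustment(
    table: Dict[Tuple[int, str], float],
    threshold_dbm: int,
    epsilon: float,
    rng: random.Random,
) -> str:
    """S1341 to S1343: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)  # S1343: random adjustment method
    # S1342: adjustment method with the largest action value for this threshold
    return max(ACTIONS, key=lambda a: table[(threshold_dbm, a)])


def adjust_threshold(threshold_dbm: int, action: str) -> int:
    """S1344: apply the selected adjustment method."""
    if action == "raise":
        return threshold_dbm + STEP_DB
    if action == "lower":
        return threshold_dbm - STEP_DB
    return threshold_dbm


table = {(-85, "raise"): 0.4, (-85, "lower"): -0.2, (-85, "keep"): 0.1}
action = choose_adjustment(table, -85, epsilon=0.0, rng=random.Random(0))
new_threshold = adjust_threshold(-85, action)
```

With ε set to 0 the table is always followed, so the example raises the threshold from -85 dBm to -80 dBm. Passing the random generator in as a parameter keeps the exploration reproducible in tests.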

Returning to fig. 6, step S135 is described.

In step S135, the communication path control unit 320 selects a communication path using the adjusted evaluation threshold.

Specifically, the communication path control unit 320 executes a conventional path selection algorithm by using the adjusted evaluation threshold value to select a communication path.

For example, the communication path control unit 320 selects a communication path by a path selection algorithm such as RPL, which is standardized by the IETF (see non-patent document 1). IETF is an abbreviation for Internet Engineering Task Force. RPL is an abbreviation for IPv6 Routing Protocol for Low-Power and Lossy Networks.

The number of evaluation thresholds in embodiment 1 is equal to the number of thresholds used in the path selection algorithm.
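How an evaluation threshold might enter path selection can be sketched as follows. This is neither RPL nor the algorithm of the source, which leaves path selection to a conventional algorithm. The sketch assumes, purely for illustration, that candidate paths containing a link weaker than the threshold are excluded, that the remaining path with the fewest hops is chosen, and that the path with the strongest weakest link is used when no path qualifies. All names and signal strength values are hypothetical.

```python
from typing import Dict, List


def select_path(paths: Dict[str, List[float]], threshold_dbm: float) -> str:
    """Pick the fewest-hop path whose every link meets the threshold."""
    if not paths:
        raise ValueError("no candidate paths")
    usable = {n: links for n, links in paths.items() if min(links) >= threshold_dbm}
    if usable:
        return min(usable, key=lambda n: len(usable[n]))
    # Fallback (assumption): no path qualifies, so take the least-weak one.
    return max(paths, key=lambda n: min(paths[n]))


paths = {
    "direct": [-88.0],                # one weak link
    "via_B": [-70.0, -72.0],          # two medium links
    "via_D": [-60.0, -62.0, -65.0],   # three strong links
}
low = select_path(paths, threshold_dbm=-90.0)
high = select_path(paths, threshold_dbm=-75.0)
```

Under these assumptions a low threshold admits the short but weak direct path, while a higher threshold forces a path with more relays, which illustrates why the threshold affects the number of relays.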

A specific example of the operation of the wireless communication system 100 will be described with reference to fig. 9.

In step S1911, the communication path control unit 320 of the slave unit selects a communication path that is relayed by the wireless communication apparatus 300B.

In step S1912, the wireless communication unit 330 of the slave unit transmits APL data to the wireless communication apparatus 300B. The wireless communication apparatus 300B receives the APL data and transmits the received APL data to the host. Then, the wireless communication unit 210 of the host receives the APL data. APL is an abbreviation for application.

In step S1921, the period timer of the host expires.

In step S1922, the information providing unit 220 of the host calculates the PER and the transmission delay time.

In step S1923, the information providing unit 220 of the host calculates a reward in reinforcement learning for path selection.

In step S1924, the information providing unit 220 of the host transmits the calculated reward to the slave unit.

In step S1931, the action value acquisition unit 310 of the slave unit receives the reward. Then, the communication path control unit 320 of the slave unit updates the Q value set in the action value table 391.

In step S1932, the communication path control unit 320 of the slave unit refers to the action value table 391 to determine the next evaluation threshold.

In step S1933, the communication path control unit 320 of the slave unit selects a communication path using the determined evaluation threshold. Thereby, a communication path that is relayed by the wireless communication apparatus 300D is selected.

In step S1934, the wireless communication unit 330 of the slave unit transmits APL data to the wireless communication device 300D. The wireless communication apparatus 300D receives APL data and transmits the received APL data to the host. Then, the wireless communication unit 210 of the host receives the APL data.

Thereafter, the same processing as in step S1921 to step S1934 is repeated.

* Example of embodiment 1

The action value information may also be a communication quality value. The following describes a case where the action value information is a communication quality value.

The host processing (S120) is described with reference to fig. 10.

Step S121 and step S122 are as described with reference to fig. 5.

In step S123A, the information providing unit 220 transmits the communication quality value to the slave unit. The transmitted communication quality value is action value information. Step S123A corresponds to step S124 of fig. 5.

The procedure of the slave unit processing (S130) is described with reference to fig. 11.

In step S131A, the action value acquisition unit 310 receives the communication quality value from the host. The received communication quality value is action value information. Step S131A corresponds to step S131 of fig. 6.

In step S132A, the action value acquisition unit 310 calculates a reward in reinforcement learning for route selection based on the received communication quality value. The calculation method is the same as that in step S123 (refer to fig. 5).

Steps S133A to S136A are the same as steps S132 to S135 (see fig. 6).

The communication quality value may also be calculated by the slave unit. In this case, the information providing unit 220 of the host is not required. The case where the slave unit calculates the communication quality value will be described below.

A wireless communication method (path selection) is described with reference to fig. 12.

Step S110 is as described with reference to fig. 4.

In step S120B, the slave unit selects a communication path. Step S120B corresponds to step S130 (see fig. 4).

The procedure of the slave unit processing (S120B) is described with reference to fig. 13.

In step S121B, the action value acquisition unit 310 detects the acquisition timing. The acquisition timing corresponds to the supply timing in step S121 (see fig. 5).

In step S122B, the action value acquisition unit 310 calculates a communication quality value. The calculation method is the same as that in step S122 (refer to fig. 5).

In step S123B, the action value acquisition unit 310 calculates a reward in reinforcement learning for route selection based on the calculated communication quality value. The calculation method is the same as that in step S123 (refer to fig. 5).

Steps S124B to S127B are the same as steps S132 to S135 (see fig. 6).

* Effects of embodiment 1

In embodiment 1, learning is used to determine the threshold value, and the wireless device determines the threshold value automatically. Since manual parameter adjustment is not required, the personnel cost for the system integrator can be reduced. In addition, a threshold suited to the installation environment is determined for each wireless device. This reduces the number of wireless devices that relay excessively, which reduces the transmission delay. Alternatively, it reduces the number of wireless devices that relay too few times, which lowers the PER.

* Supplementary to embodiment 1

A supplementary note on reinforcement learning follows.

Basically, learning is started from a state in which the result caused by an action is not known at all. However, learning may be started in consideration of learning time. For example, learning can be started from a good start point by searching only values or the like that are actually possible. Alternatively, interpolation can be performed by function approximation for actions not being searched. The change range of the evaluation threshold may be fixed to-1 dBm or may be variable.

Embodiment 2.

Regarding the manner in which reinforcement learning is applied to adjust the transmission interval of the control frame, differences from embodiment 1 will be mainly described with reference to fig. 14 to 20.

* Description of the Structure

The configuration of the wireless communication system 100 is the same as that in embodiment 1 (see fig. 1 to 3).

* Description of actions

A wireless communication method (interval adjustment) will be described with reference to fig. 14.

The radio communication method (interval adjustment) is a method of performing radio communication by applying reinforcement learning to adjust the transmission interval of a control frame.

The control frame is a frame for communication to update the communication path.

The transmission interval of the control frame is a time interval in which the control frame is transmitted.

In RPL, communication paths are updated by periodically exchanging control frames between wireless communication devices. In the control frame, a rank indicating a path evaluation value is set. The rank is calculated based on information such as the number of hops and the received signal strength. The received signal strength is the signal strength at the time of receiving the frame.

In step S210, wireless communication is performed between the host and the slave.

For example, control frames are periodically exchanged between the host and the slave unit. Then, the communication path between the host and the slave unit is updated according to the exchanged control frames. The update method is a method in the related art such as RPL.

In step S220, the host provides the action value information for interval adjustment to the slave unit.

The action value information for interval adjustment is information for specifying an action value in reinforcement learning for interval adjustment. Specifically, the action value information is a reward in reinforcement learning.

The procedure of the host process (S220) is the same as that of the host process (S120).

In step S230, the slave unit adjusts the transmission interval of the control frame based on the action value information for interval adjustment.

After step S230, the control frame is transmitted at the adjusted transmission interval.

Specifically, the slave unit operates as follows.

The action value obtaining unit 310 receives action value information for interval adjustment, and obtains action value based on the received action value information.

The communication path control unit 320 updates the action value table for interval adjustment based on the acquired action value. The action value table for interval adjustment indicates the action value of each of the groups of the transmission interval and adjustment method. Then, the communication path control unit 320 adjusts the transmission interval of the control frame based on the updated action value table.

The wireless communication unit 330 transmits the control frame at the adjusted transmission interval.

The procedure of the slave unit processing (S230) is described with reference to fig. 15.

In step S231, the action value acquisition unit 310 receives a reward in reinforcement learning for interval adjustment. The received reward is action value information.

In step S232, the action value acquisition unit 310 calculates an action value in reinforcement learning for interval adjustment based on the received reward.

The calculation method is the same as that in step S132 (refer to fig. 6). However, the state in reinforcement learning is the transmission interval of the control frame, and the action in reinforcement learning is adjustment of the transmission interval. In addition, an action value table 392 is used instead of the action value table 391.

In step S233, the communication path control unit 320 updates the action value table 392 based on the calculated action value.

The update method is the same as the method in step S133 (refer to fig. 6).
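As a rough illustration of steps S232 and S233, the sketch below applies one update to an action value table keyed by transmission interval and adjustment method. It assumes the standard Q-learning update rule; that rule, the learning rate `alpha`, the discount factor `gamma`, and the method names are assumptions made for this example and are not taken from the embodiment.

```python
from typing import Dict, Tuple

METHODS = ("lengthen", "shorten", "keep")  # illustrative names for the three methods
QTable = Dict[Tuple[int, str], float]      # (interval in seconds, method) -> Q


def update_action_value(table: QTable, state: int, method: str, reward: float,
                        next_state: int, alpha: float = 0.1,
                        gamma: float = 0.9) -> float:
    """Apply one Q-learning update to Q(state, method) and return the new value."""
    # Best value reachable from the interval that resulted from the adjustment.
    best_next = max(table[(next_state, m)] for m in METHODS)
    old = table[(state, method)]
    table[(state, method)] = old + alpha * (reward + gamma * best_next - old)
    return table[(state, method)]
```

For example, starting from a table of zeros, a reward of 1.0 for lengthening the interval from 60 seconds to 120 seconds raises Q(S_60, lengthen) by `alpha` times the reward.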

Referring to fig. 16, an action value table 392 is illustrated.

Action value table 392 is an action value table used in reinforcement learning for interval adjustment.

Action value table 392 represents an action value Q of each of the groups of transmission intervals and adjustment methods.

The "sec" of the transmission interval means "second". The transmission interval may be other than 60sec, 120sec, or 180 sec.

Specific adjustment methods are to lengthen the transmission interval, shorten the transmission interval, or not change the transmission interval.

Q(S_n, ↑) is the action value in the case of lengthening the transmission interval in state S_n.

Q(S_n, ↓) is the action value in the case of shortening the transmission interval in state S_n.

Q(S_n, →) is the action value in the case of not changing the transmission interval in state S_n.

State S_n is the state in which the transmission interval of the control frame is n seconds.
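One possible in-memory layout for action value table 392 is a mapping from each group of transmission interval and adjustment method to its action value Q. The sketch below is illustrative only: the intervals 60, 120, and 180 seconds are the example values mentioned above, while the method names and the initial value 0.0 are assumptions.

```python
from itertools import product

INTERVALS_SEC = (60, 120, 180)              # example states S_60, S_120, S_180
METHODS = ("lengthen", "shorten", "keep")   # the three adjustment methods


def make_action_value_table(initial=0.0):
    """Build a table holding one Q value per (interval, method) group."""
    return {(s, m): initial for s, m in product(INTERVALS_SEC, METHODS)}


def best_method(table, interval_sec):
    """Return the adjustment method with the highest Q in the given state."""
    return max(METHODS, key=lambda m: table[(interval_sec, m)])
```

With three intervals and three methods the table holds nine entries. When several methods share the highest value, `best_method` returns the first of them in `METHODS` order.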

Returning to fig. 15, step S234 is described.

In step S234, the communication path control unit 320 adjusts the transmission interval of the control frame.

Referring to fig. 17, the procedure of the interval adjustment process (S234) will be described.

In step S2341, communication path control unit 320 randomly selects whether or not to perform interval adjustment based on action value table 392.

The selection method is the same as that in step S1341 (see fig. 8).

If the interval adjustment based on the action value table 392 is selected, the process advances to step S2342.

If it is selected not to perform the interval adjustment based on the action value table 392, the process advances to step S2343.

In step S2342, the communication path control unit 320 selects an adjustment method based on the action value table 392.

The selection method is the same as that in step S1342 (see fig. 8). However, the evaluation threshold is replaced with a transmission interval of the control frame, and the adjustment of the evaluation threshold is replaced with the adjustment of the transmission interval.

In step S2343, the communication path control unit 320 randomly selects an adjustment method.

In step S2344, communication path control unit 320 adjusts the transmission interval of the control frame in accordance with the selected adjustment method.
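Steps S2341 to S2344 resemble an epsilon-greedy policy, and can be sketched as follows under that assumption. The exploration probability `epsilon`, the 60-second step, the interval bounds, and the method names are hypothetical and serve only to make the example concrete.

```python
import random

METHODS = ("lengthen", "shorten", "keep")  # illustrative method names
STEP_SEC = 60                              # assumed size of one adjustment step


def select_method(table, interval_sec, epsilon, rng):
    """S2341 to S2343: follow the table, or pick a method at random."""
    if rng.random() < epsilon:
        return rng.choice(METHODS)  # S2343: random selection
    # S2342: method with the highest action value for the current interval
    return max(METHODS, key=lambda m: table[(interval_sec, m)])


def adjust_interval(interval_sec, method, lo=60, hi=180):
    """S2344: apply the selected method, keeping the interval within bounds."""
    if method == "lengthen":
        return min(interval_sec + STEP_SEC, hi)
    if method == "shorten":
        return max(interval_sec - STEP_SEC, lo)
    return interval_sec
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which is convenient when testing the table-based branch separately from the random branch.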

* Example of embodiment 2

As in the example of embodiment 1, the action value information may be a communication quality value. The following describes a case where the action value information is a communication quality value.

The host process (S220) is the same as the host process (S120) of fig. 10.

The procedure of the slave unit processing (S230) is described with reference to fig. 18.

In step S231A, the action value acquisition unit 310 receives a communication quality value from the host. The received communication quality value is action value information. Step S231A corresponds to step S231 of fig. 15.

In step S232A, the action value acquisition unit 310 calculates a reward in reinforcement learning for interval adjustment based on the received communication quality value. The calculation method is the same as that in step S123 (refer to fig. 5).

Steps S233A to S235A are the same as steps S232 to S234 (see fig. 15).

As in the example of embodiment 1, the slave unit may calculate the communication quality value. In this case, the information providing unit 220 of the host is not required. The case where the slave unit calculates the communication quality value will be described below.

A wireless communication method (interval adjustment) will be described with reference to fig. 19.

Step S210 is as described with reference to fig. 14.

In step S220B, the slave unit adjusts the transmission interval of the control frame. Step S220B corresponds to step S230 (see fig. 14).

The procedure of the slave unit processing (S220B) is described with reference to fig. 20.

In step S221B, the action value acquisition unit 310 detects the acquisition timing. The acquisition timing corresponds to the supply timing in step S121 (see fig. 5).

In step S222B, the action value acquisition unit 310 calculates a communication quality value. The calculation method is the same as that in step S122 (refer to fig. 5).

In step S223B, the action value acquisition unit 310 calculates a reward in reinforcement learning for interval adjustment based on the calculated communication quality value. The calculation method is the same as that in step S123 (refer to fig. 5).

Steps S224B to S226B are the same as steps S232 to S234 (see fig. 15).

* Effects of embodiment 2

With embodiment 2, the transmission interval of the control frame can be changed to a value suited to the surrounding environment of the slave unit, and the utilization efficiency of the communication band is optimized. Specifically, it is possible to reduce the number of wireless devices that cannot transmit the application data they intend to transmit because control frames occupy too much of the band. As a result, the transmission delay becomes small. Alternatively, it is possible to reduce the number of wireless devices whose control frame transmission interval is too long to track changes in the surrounding environment. As a result, radio errors can be reduced.

Embodiment 3.

Regarding the manner in which reinforcement learning is applied to select the first relay in the communication path, differences from embodiment 1 and embodiment 2 will be mainly described with reference to fig. 21 to 27.

* Description of the Structure

* Description of actions

A wireless communication method (relay selection) will be described with reference to fig. 21.

The wireless communication method (relay selection) is a method of performing wireless communication by applying reinforcement learning for relay selection.

In step S310, wireless communication is performed between the host and the slave.

Specifically, the wireless communication unit 330 of the slave unit performs wireless communication with the host via the relay selected in the previous relay selection process (S330).

In step S320, the host provides action value information for relay selection to the slave unit.

The action value information for relay selection is information for specifying an action value in reinforcement learning for relay selection. Specifically, the action value information is a reward in reinforcement learning.

The procedure of the host process (S320) is the same as that of the host process (S120).

In step S330, the slave unit selects a relay unit based on the action value information for relay unit selection.

After step S330, the slave unit performs wireless communication with the host via the selected relay.

Specifically, the slave unit operates as follows.

The action value obtaining unit 310 receives action value information for relay selection, and obtains action value based on the received action value information.

The communication path control unit 320 updates the action value table for relay selection based on the acquired action value. The action value table for relay selection indicates the action value of each of the groups of a previous relay and a next relay candidate. Then, the communication path control unit 320 selects the next relay based on the updated action value table.

The wireless communication unit 330 performs wireless communication with the host via the selected relay.

The procedure of the slave unit processing (S330) is described with reference to fig. 22.

In step S331, the action value acquisition unit 310 receives a reward in reinforcement learning for relay selection. The received reward is action value information.

In step S332, the action value acquisition unit 310 calculates an action value in reinforcement learning for relay selection based on the received reward.

The calculation method is the same as that in step S132 (refer to fig. 6). However, the state in reinforcement learning is the previous relay, and the action in reinforcement learning is the selection of the relay. In addition, an action value table 393 is used instead of the action value table 391.

In step S333, the communication path control unit 320 updates the action value table 393 based on the calculated action value.

The update method is the same as the method in step S133 (refer to fig. 6).

Referring to fig. 23, an action value table 393 is illustrated.

The action value table 393 is an action value table used in reinforcement learning for relay selection.

The action value table 393 shows the action value Q of each of the groups of a previous relay and a next relay candidate.

Q(S_X, a_Y) is the action value in the case of performing action a_Y in state S_X.

State S_X is the state in which the previous relay is the wireless communication device 300X.

Action a_Y is the action of selecting the wireless communication device 300Y as the relay.

Returning to fig. 22, step S334 is described.

In step S334, the communication path control unit 320 selects the next relay.

The procedure of the relay selection process (S334) is described with reference to fig. 24.

In step S3341, the communication path control unit 320 randomly selects whether or not to perform relay selection based on the action value table 393.

The selection method is the same as that in step S1341 (see fig. 8).

When the relay selection based on the action value table 393 is selected, the process advances to step S3342.

If it is selected not to perform the relay selection based on the action value table 393, the process advances to step S3343.

In step S3342, the communication path control unit 320 selects the next relay based on the action value table 393.

The selection method is the same as that in step S1342 (see fig. 8). However, the evaluation threshold is replaced with the previous relay, and the adjustment of the evaluation threshold is replaced with the selection of the relay.

In step S3343, the communication path control unit 320 randomly selects the next relay.
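The relay selection in steps S3341 to S3343 can be sketched as follows. As before, this assumes an epsilon-greedy policy with a hypothetical exploration probability `epsilon`; the table layout keyed by (previous relay, candidate) and the default value 0.0 for unseen groups are also assumptions of this example.

```python
import random


def select_next_relay(table, previous_relay, candidates, epsilon, rng):
    """Pick the next relay, given the previous relay as the state."""
    if not candidates:
        raise ValueError("at least one relay candidate is required")
    if rng.random() < epsilon:
        # S3343: random selection among the candidates
        return rng.choice(list(candidates))
    # S3342: candidate with the highest action value for this previous relay
    return max(candidates, key=lambda c: table.get((previous_relay, c), 0.0))
```

For instance, with Q(S_B, a_B) = 0.2 and Q(S_B, a_D) = 0.7 and `epsilon` set to 0, the function returns the wireless communication device 300D when the previous relay was 300B.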

* Example of embodiment 3

The host process (S320) is the same as the host process (S120) of fig. 10.

The procedure of the slave unit processing (S330) is described with reference to fig. 25.

In step S331A, the action value acquisition unit 310 receives a communication quality value from the host. The received communication quality value is action value information. Step S331A corresponds to step S331 of fig. 22.

In step S332A, the action value acquisition unit 310 calculates a reward in reinforcement learning for relay selection based on the received communication quality value. The calculation method is the same as that in step S123 (refer to fig. 5).

Step S333A and step S334A are the same as step S332 and step S333 (see fig. 22).

A wireless communication method (relay selection) will be described with reference to fig. 26.

Step S310 is as described with reference to fig. 21.

In step S320B, the slave unit selects a relay. Step S320B corresponds to step S330 (see fig. 21).

The procedure of the slave unit processing (S320B) is described with reference to fig. 27.

In step S321B, the action value acquisition unit 310 detects the acquisition timing. The acquisition timing corresponds to the supply timing in step S121 (see fig. 5).

In step S322B, the action value acquisition unit 310 calculates a communication quality value. The calculation method is the same as that in step S122 (refer to fig. 5).

In step S323B, the action value acquisition unit 310 calculates a reward in reinforcement learning for relay selection based on the calculated communication quality value. The calculation method is the same as that in step S123 (refer to fig. 5).

Steps S324B to S326B are the same as steps S332 to S334 (see fig. 22).

* Effects of embodiment 3

In embodiment 3, a connection destination (relay) suited to the environment can be determined without using a threshold value, taking into account factors other than the existing parameters. In addition, since a complicated path control algorithm need not be implemented, the memory required for the program is reduced, and the wireless communication device can be made smaller and less expensive.

Embodiment 4.

Regarding the manner in which reinforcement learning is applied to adjust the transmission rate, differences from embodiments 1 to 3 will be mainly described with reference to fig. 28 to 35.

* Description of the Structure

The configuration of the radio communication system 100 is the same as that in embodiment 1 (see fig. 1).

The configuration of the wireless communication apparatus 200 is the same as that in embodiment 1 (see fig. 2).

The structure of the wireless communication apparatus 300 is described with reference to fig. 28.

The wireless communication apparatus 300 further includes a transmission rate control unit 340 as an element. The transmission rate control unit 340 is implemented by software.

The wireless communication program further causes the computer to function as the transmission rate control unit 340.

* Description of actions

The wireless communication method (rate adjustment) is described with reference to fig. 29.

The radio communication method (rate adjustment) is a method for performing radio communication by applying reinforcement learning to adjust a transmission rate.

In step S410, wireless communication is performed between the host and the slave.

Specifically, the wireless communication unit 330 of the slave unit performs wireless communication with the host at the transmission rate adjusted in the previous rate adjustment process (S430).

In step S420, the host provides action value information for rate adjustment to the slave unit.

The action value information for rate adjustment is information for determining an action value in reinforcement learning for rate adjustment. Specifically, the action value information is a reward in reinforcement learning.

The process of step S420 is the same as that of step S120, step S220 or step S320.

In step S430, the slave unit adjusts the transmission rate according to the action value information for rate adjustment.

After step S430, the slave unit performs wireless communication with the host at the adjusted transmission rate.

Specifically, the slave unit operates as follows.

The action value obtaining unit 310 receives action value information for rate adjustment, and obtains action value based on the received action value information.

The communication path control unit 320 updates the action value table for rate adjustment based on the acquired action value. The action value table for rate adjustment indicates the action value of each of the groups of the transmission rate and adjustment method. Then, the communication path control unit 320 adjusts the transmission rate based on the updated action value table.

The wireless communication unit 330 performs wireless communication with the host at the adjusted transmission rate.

The procedure of the slave unit processing (S430) is described with reference to fig. 30.

In step S431, the action value acquisition unit 310 receives a reward in reinforcement learning for rate adjustment. The received reward is action value information.

In step S432, the action value acquisition unit 310 calculates an action value in reinforcement learning for rate adjustment based on the received reward.

The calculation method is the same as that in step S132 (refer to fig. 6). However, the state in reinforcement learning is the transmission rate, and the action in reinforcement learning is adjustment of the transmission rate. In addition, an action value table 394 is used instead of the action value table 391.

In step S433, the communication path control unit 320 updates the action value table 394 based on the calculated action value.

The update method is the same as the method in step S133 (refer to fig. 6).

Referring to fig. 31, action value table 394 is illustrated.

Action value table 394 is an action value table used in reinforcement learning for rate adjustment.

Action value table 394 represents an action value Q of each of the groups of the transmission rate and adjustment method.

Specific adjustment methods are to increase the transmission rate, decrease the transmission rate, or not change the transmission rate.

Q(S_n, ↑) is the action value in the case of increasing the transmission rate in state S_n.

Q(S_n, ↓) is the action value in the case of lowering the transmission rate in state S_n.

Q(S_n, →) is the action value in the case of not changing the transmission rate in state S_n.

State S_n is the state in which the transmission rate is n Mbps.

Returning to fig. 30, step S434 is described.

In step S434, the communication path control unit 320 adjusts the transmission rate.

According to fig. 32, the procedure of the rate adjustment process (S434) is explained.

In step S4341, the communication path control unit 320 randomly selects whether or not to perform rate adjustment based on the action value table 394.

The selection method is the same as that in step S1341 (see fig. 8).

If the rate adjustment based on the action value table 394 is selected, the process advances to step S4342.

If it is selected not to perform the rate adjustment based on the action value table 394, the process advances to step S4343.

In step S4342, the communication path control unit 320 selects an adjustment method based on the action value table 394.

The selection method is the same as that in step S1342 (see fig. 8). However, the evaluation threshold is replaced with the transmission rate, and the adjustment of the evaluation threshold is replaced with the adjustment of the transmission rate.

In step S4343, the communication path control unit 320 randomly selects an adjustment method.

In step S4344, the communication path control unit 320 adjusts the transmission rate according to the selected adjustment method.
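Step S4344 can be illustrated with a discrete rate ladder. The sketch below uses the eight transmission rates defined for IEEE 802.11a and assumes that one adjustment moves the rate by one position on that ladder; the one-step assumption and the method names are illustrative and are not taken from the embodiment.

```python
RATES_MBPS = (6, 9, 12, 18, 24, 36, 48, 54)  # IEEE 802.11a transmission rates


def adjust_rate(rate_mbps, method):
    """Apply the selected adjustment method and return the new rate in Mbps."""
    i = RATES_MBPS.index(rate_mbps)  # raises ValueError for an undefined rate
    if method == "increase":
        i = min(i + 1, len(RATES_MBPS) - 1)  # stay at the highest rate
    elif method == "decrease":
        i = max(i - 1, 0)                    # stay at the lowest rate
    elif method != "keep":
        raise ValueError(f"unknown adjustment method: {method}")
    return RATES_MBPS[i]
```

Clamping at both ends means that selecting "increase" at 54 Mbps or "decrease" at 6 Mbps leaves the rate unchanged rather than raising an error.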

* Example of embodiment 4

The host process (S420) is the same as the host process (S120) of fig. 10.

The procedure of the slave unit processing (S430) is described with reference to fig. 33.

In step S431A, the action value acquisition unit 310 receives a communication quality value from the host. The received communication quality value is action value information. Step S431A corresponds to step S431 of fig. 30.

In step S432A, the action value acquisition unit 310 calculates a reward in reinforcement learning for rate adjustment based on the received communication quality value. The calculation method is the same as that in step S123 (refer to fig. 5).

Steps S433A to S435A are the same as steps S432 to S434 (refer to fig. 30).

A wireless communication method (rate adjustment) is described with reference to fig. 34.

Step S410 is as described with reference to fig. 29.

In step S420B, the slave unit adjusts the transmission rate. Step S420B corresponds to step S430 (see fig. 29).

The procedure of the slave unit processing (S420B) is described with reference to fig. 35.

In step S421B, the action value acquisition unit 310 detects the acquisition timing. The acquisition timing corresponds to the supply timing in step S121 (see fig. 5).

In step S422B, the action value acquisition unit 310 calculates a communication quality value. The calculation method is the same as that in step S122 (refer to fig. 5).

In step S423B, the action value acquisition unit 310 calculates a reward in reinforcement learning for rate adjustment based on the calculated communication quality value. The calculation method is the same as that in step S123 (refer to fig. 5).

Steps S424B to S426B are the same as steps S432 to S434 (see fig. 30).

* Effects of embodiment 4

An adaptive modulation and coding technique (AMC: Adaptive Modulation and Coding), in which a modulation scheme and a coding scheme are adaptively changed according to communication quality, is widely used.

For example, the transmission rate of an IEEE 802.11a wireless LAN is determined by the modulation scheme and the coding rate. Eight transmission rates are defined: 54, 48, 36, 24, 18, 12, 9, and 6 Mbps. When the transmission rate is high, the signal is more susceptible to noise, so errors may occur. When the transmission rate is low, the bandwidth available for application data is narrow, so delays may occur. Thus, adjusting the transmission rate involves a trade-off. The transmission rate is adaptively determined based on the received signal strength of the signal received by the wireless device. However, the fluctuation range of the received signal strength differs depending on the installation environment of the wireless device. Therefore, errors may occur in some wireless devices.

An object of embodiment 4 is to apply reinforcement learning to the determination of the transmission rate so that an appropriate transmission rate is determined automatically for each installation environment of a wireless device.

The efficiency of communication band utilization can be optimized according to the installation environment of the wireless device. Specifically, the number of wireless devices that use a high transmission rate despite poor communication quality is reduced, which reduces radio errors. Alternatively, the number of wireless devices that use a low transmission rate despite good communication quality is reduced, which reduces the transmission delay.

Embodiment 5.

Regarding the manner in which the result of reinforcement learning is confirmed, differences from embodiments 1 to 4 will be mainly described with reference to fig. 36 and 37.

* Description of the Structure

The structure of the wireless communication apparatus 300 is described with reference to fig. 36.

The wireless communication apparatus 300 further includes elements such as a learning result confirmation unit 350. The learning result confirmation unit 350 is implemented by software.

The wireless communication program further causes the computer to function as the learning result confirmation unit 350.

* Description of actions

The wireless communication method (learning result confirmation) will be described with reference to fig. 37.

The wireless communication method (learning result confirmation) is a method of confirming the results of the reinforcement learning in each of embodiments 1 to 4.

In step S501, the learning result confirmation unit 350 acquires the communication quality value of the wireless communication system 100.

The communication quality value of the wireless communication system 100 is a value indicating the communication quality of the entire wireless communication system 100.

The learning result confirmation unit 350 acquires the communication quality value of the wireless communication system 100 as follows.

In the slave unit, the learning result confirmation unit 350 transmits a request frame to the host. The request frame is a frame for requesting the communication quality value of the wireless communication system 100.

In the host, the wireless communication unit 210 receives a request frame. Then, the information providing unit 220 calculates a communication quality value of the wireless communication system 100, and transmits a response frame to the slave unit. The response frame is a frame in which the communication quality value of the wireless communication system 100 is set. Specifically, the communication quality value is PER. PER is calculated in a conventional manner.

In the slave unit, the learning result confirmation unit 350 receives the response frame, and obtains the communication quality value of the wireless communication system 100 from the received response frame.

However, the learning result confirmation unit 350 may acquire the communication quality value of the wireless communication system 100 by another method. For example, the learning result confirmation unit 350 may calculate the communication quality value of the wireless communication system 100.

In step S502, the learning result confirmation unit 350 determines the influence of reinforcement learning on the communication quality of the wireless communication system 100 based on the acquired communication quality value.

For example, when the PER of the entire wireless communication system 100 is greater than 20 [%], the learning result confirmation unit 350 determines that the reinforcement learning negatively affects the communication quality of the wireless communication system 100.

If it is determined that reinforcement learning negatively affects the communication quality of the wireless communication system 100, the process proceeds to step S503.

If it is determined that reinforcement learning does not negatively affect the communication quality of the wireless communication system 100, the process ends.
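The determination in step S502 amounts to a threshold comparison. The sketch below uses the 20 [%] example value from the text; the constant and function names are hypothetical, and a real system might use a different limit or a different quality metric.

```python
PER_LIMIT_PERCENT = 20.0  # example limit taken from the description


def learning_has_negative_effect(per_percent: float,
                                 limit: float = PER_LIMIT_PERCENT) -> bool:
    """Return True when the system-wide PER exceeds the limit (go to S503)."""
    return per_percent > limit
```

Note that the comparison is strict: a PER of exactly 20 [%] is not treated as a negative effect, which matches the wording "greater than 20 [%]".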

In step S503, the learning result confirmation unit 350 initializes the action value tables (391 to 394) to be used.

That is, the learning result confirmation unit 350 returns each value set in the action value table to the default value. Default values refer to initial values of parameters. The initial value of the parameter is preset in the storage unit 390. When the power is turned on, the wireless communication apparatus 300 reads an initial value of the parameter and starts an initial operation.

When a plurality of action value tables (391 to 394) are used, the learning result confirmation unit 350 may initialize a part of the action value tables or may initialize all the action value tables.

After step S503, the slave unit may stop reinforcement learning or may continue reinforcement learning. That is, the update of the action value table may be stopped or continued.
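Step S503 and the options that follow it (initializing some or all tables, then stopping or continuing learning) can be sketched as below. The class `ActionValueStore`, the table names, and the table keys and values are all hypothetical and serve only to illustrate the reset to preset default values.

```python
import copy


class ActionValueStore:
    """Holds the working action value tables and their preset defaults."""

    def __init__(self, defaults):
        # defaults: {table name: {(state, adjustment method): action value}}
        self._defaults = copy.deepcopy(defaults)
        self.tables = copy.deepcopy(defaults)
        self.learning_enabled = True

    def initialize(self, names=None, keep_learning=True):
        """Reset the named tables (all tables if names is None) to defaults."""
        targets = list(self.tables) if names is None else list(names)
        for name in targets:
            self.tables[name] = copy.deepcopy(self._defaults[name])
        # Learning may be continued or stopped after the reset.
        self.learning_enabled = keep_learning


# Hypothetical default tables: every action value starts at 0.0.
defaults = {
    "path": {(50, "raise"): 0.0, (50, "lower"): 0.0},
    "rate": {(100, "raise"): 0.0, (100, "lower"): 0.0},
}
store = ActionValueStore(defaults)
store.tables["path"][(50, "raise")] = 0.8    # values learned so far
store.tables["rate"][(100, "lower")] = -0.3
store.initialize(names=["path"])             # partial initialization
```

Here only the "path" table is reset while the "rate" table keeps its learned value; calling `store.initialize(keep_learning=False)` would reset every table and stop further updates.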

* Effects of embodiment 5

With embodiment 5, when learning has a negative influence on the system, the system can be returned to the original state.

* Supplement to the embodiments

The hardware configuration of the wireless communication apparatus 200 is described with reference to fig. 38.

The wireless communication apparatus 200 includes a processing circuit 209.

The processing circuit 209 is hardware for realizing the wireless communication unit 210 and the information providing unit 220.

The processing circuit 209 may be dedicated hardware or may be the processor 201 executing a program stored in the memory 202.

In the case where the processing circuit 209 is dedicated hardware, the processing circuit 209 is, for example, a single circuit, a composite circuit, a programmed processor, a parallel programmed processor, an ASIC, an FPGA, or a combination thereof.

ASIC is an acronym for Application Specific Integrated Circuit, and FPGA is an acronym for Field Programmable Gate Array.

The wireless communication apparatus 200 may include a plurality of processing circuits instead of the processing circuit 209. The plurality of processing circuits share the role of the processing circuit 209.

In the wireless communication apparatus 200, a part of the functions may be realized by dedicated hardware, and the remaining functions may be realized by software or firmware.

In this way, the processing circuit 209 can be implemented in hardware, software, firmware, or a combination thereof.

The hardware configuration of the wireless communication apparatus 300 is described with reference to fig. 39.

The wireless communication apparatus 300 includes a processing circuit 309.

The processing circuit 309 is hardware for realizing the action value acquisition unit 310, the communication path control unit 320, the wireless communication unit 330, the transmission rate control unit 340, and the learning result confirmation unit 350.

The processing circuit 309 may be dedicated hardware or may be the processor 301 executing a program stored in the memory 302.

In the case where the processing circuit 309 is dedicated hardware, the processing circuit 309 is, for example, a single circuit, a composite circuit, a programmed processor, a parallel programmed processor, an ASIC, an FPGA, or a combination thereof.

The wireless communication apparatus 300 may include a plurality of processing circuits instead of the processing circuit 309. The plurality of processing circuits share the role of the processing circuit 309.

In the wireless communication apparatus 300, a part of the functions may be realized by dedicated hardware, and the remaining functions may be realized by software or firmware.

In this way, the processing circuit 309 can be implemented in hardware, software, firmware, or a combination thereof.

In each embodiment, each wireless communication apparatus (200, 300) may operate as any of a slave unit, a relay unit, and a master unit.

That is, the wireless communication apparatus 200 may include the elements (310 to 350) of the wireless communication apparatus 300, or the wireless communication apparatus 300 may include the elements (210 and 220) of the wireless communication apparatus 200.

For each of the elements (210, 220, 310 to 350) of the wireless communication apparatus 200 and the wireless communication apparatus 300, "unit" may be read as "process" or "step".

The embodiments are examples of preferred embodiments, and are not intended to limit the technical scope of the present invention. The embodiments may be implemented partially or in combination with other embodiments. The procedure described using the flowcharts and the like may be changed as appropriate.

Claims

1. A wireless communication device is provided with:

an action value acquisition unit that acquires an action value in reinforcement learning in which an adjustment of an evaluation threshold for path selection is performed;

2. The wireless communication device according to claim 1, wherein,

the communication partner calculates a reward in the reinforcement learning based on the communication quality of the wireless communication with the wireless communication device, and transmits the calculated reward,

the action value acquisition unit receives the transmitted reward and calculates the action value from the received reward.

3. The wireless communication device according to claim 1, wherein,

the action value acquisition unit calculates a reward in the reinforcement learning based on the communication quality of wireless communication with a communication partner, and calculates the action value based on the calculated reward.

4. The wireless communication device according to any one of claims 1 to 3, wherein,

the communication path control unit randomly selects whether to perform threshold adjustment based on the updated action value table,

when it is selected to perform threshold adjustment based on the updated action value table, the communication path control unit selects an adjustment method based on the updated action value table, adjusts the evaluation threshold according to the selected adjustment method,

when it is selected not to perform threshold adjustment based on the updated action value table, the communication path control unit randomly selects an adjustment method, and adjusts the evaluation threshold according to the selected adjustment method.

5. The wireless communication device according to any one of claims 1 to 3, wherein,

the wireless communication device includes a learning result confirmation unit that determines an influence of the reinforcement learning on a wireless communication system including the wireless communication device, based on a communication quality in the wireless communication system, and initializes the action value table when it is determined that the reinforcement learning has a negative influence on the wireless communication system.

6. The wireless communication device according to any one of claims 1 to 3, wherein,

the wireless communication device is provided with a transmission rate control unit,

the transmission rate control unit updates, based on the acquired action value, an action value table indicating the action value of each combination of a transmission rate and an adjustment method, and adjusts the transmission rate based on the updated action value table,

the wireless communication unit performs wireless communication at the adjusted transmission rate.

7. A wireless communication system having a plurality of wireless communication devices including the wireless communication device of any one of claims 1 to 6.

8. A computer-readable recording medium having recorded thereon a wireless communication program for causing a computer to execute:

an action value acquisition process of acquiring an action value in reinforcement learning in which an adjustment of an evaluation threshold for path selection is performed;

a communication path control process of updating, based on the acquired action value, an action value table indicating the action value of each combination of an evaluation threshold and an adjustment method, adjusting the evaluation threshold based on the updated action value table, and selecting a communication path using the adjusted evaluation threshold; and

a wireless communication process of performing wireless communication via the selected communication path.