WO2021064766A1 - Control device, method, and system - Google Patents
- Publication number
- WO2021064766A1 (PCT/JP2019/038454)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- network
- learning
- action
- control
- control device
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/16—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/50—Network service management, e.g. ensuring proper service fulfilment according to agreements
- H04L41/5003—Managing SLA; Interaction between SLA and QoS
- H04L41/5019—Ensuring fulfilment of SLA
- H04L41/5025—Ensuring fulfilment of SLA by proactively reacting to service quality change, e.g. by reconfiguration after service quality degradation or upgrade
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
Definitions
- the present invention relates to control devices, methods and systems.
- Moving image data is distributed from a server on a network, the moving image data is played back on a terminal, and a robot or the like installed in a factory or the like is remotely controlled from the server.
- Patent Document 1 describes that it is possible to estimate the quality of the display waiting time from which the influence of individual web pages is removed.
- the quality of the display waiting time of the web page in the area and the time zone is estimated based on the traffic measurement data in the arbitrary area and the time zone.
- For example, applying machine learning to games such as chess and to the control of robots and the like is being considered.
- maximization of the score in the game is set as a reward, and the performance of machine learning is evaluated.
- the realization of the target motion is set as a reward, and the performance of machine learning is evaluated.
- learning performance is discussed by the sum of immediate reward and episode-based reward.
- the question is what to set as a reward.
- In network control, there is no obvious score to maximize, unlike the case where machine learning is applied to a game.
- Even if the reward is set to maximize the throughput of the communication devices included in the network, that setting may not be appropriate for some services or applications.
- a main object of the present invention is to provide a control device, a method and a system that contribute to realizing efficient network control using machine learning.
- A control device is provided that includes a learning unit that learns an action for controlling a network and a storage unit that stores learning information generated by the learning unit.
- The learning unit determines the reward for an action performed on the network based on the stationarity of the network after the action is performed.
- A method is provided that includes a step of learning an action for controlling a network and a step of storing learning information generated by the learning.
- In the learning step, the reward for an action performed on the network is determined based on the stationarity of the network after the action is performed.
- A system is provided that includes a learning means for learning an action for controlling a network and a storage means for storing learning information generated by the learning means.
- The learning means determines the reward for an action performed on the network based on the stationarity of the network after the action is performed.
- control devices, methods and systems that contribute to realizing efficient network control using machine learning are provided.
- other effects may be produced in place of or in combination with the effect.
- the control device 100 includes a learning unit 101 and a storage unit 102 (see FIG. 1).
- the learning unit 101 learns actions for controlling the network.
- the storage unit 102 stores the learning information generated by the learning unit 101.
- the learning unit 101 acts on the network (step S01 in FIG. 2).
- the learning unit 101 determines the reward of the action performed on the network based on the stationarity of the network after the action is performed, and learns the action for controlling the network (step S02 in FIG. 2).
- Network stability is important for services and applications provided by the network.
- The control device 100 determines the reward based on the stationarity of the state obtained by the action (change of a control parameter) performed on the network. That is, the control device 100 regards a converged state, in which the network state is stable, as having high value during machine learning (reinforcement learning), and learns to control the network by giving a high reward in such a situation. As a result, efficient network control using machine learning is realized.
- FIG. 3 is a diagram showing an example of a schematic configuration of a communication network system according to the first embodiment.
- the communication network system includes a terminal 10, a control device 20, and a server 30.
- the terminal 10 is a device having a communication function.
- Examples of the terminal 10 include a WEB camera, a surveillance camera, a drone, a smartphone, a robot, and the like.
- the purpose is not to limit the terminal 10 to the above-mentioned WEB camera or the like.
- the terminal 10 can be any device having a communication function.
- the terminal 10 communicates with the server 30 via the control device 20.
- Various applications and services are provided by the terminal 10 and the server 30.
- For example, when the terminal 10 is a WEB camera, the server 30 analyzes the image data from the WEB camera and manages the materials of a factory or the like.
- When the terminal 10 is a drone, a control command is transmitted from the server 30 to the drone, and the drone transports luggage and the like.
- When the terminal 10 is a smartphone, video is distributed from the server 30 to the smartphone, and the user watches the video on the smartphone.
- the control device 20 is, for example, a communication device such as a proxy server or a gateway, and is a device that controls a network including a terminal 10 and a server 30.
- the control device 20 controls the network by changing the values of the TCP (Transmission Control Protocol) parameter group and the buffer control parameter group.
- For example, changing the flow window size is one way of controlling the TCP parameters.
- Examples of buffer control include changing parameters related to the minimum guaranteed bandwidth, the RED (Random Early Detection) loss rate, the loss start queue length, and the buffer length in the queue management of a plurality of buffers.
- Parameters that affect communication (traffic) between the terminal 10 and the server 30, such as the above TCP parameters and the parameters related to buffer control, are referred to as "control parameters".
- the control device 20 controls the network by changing the control parameters.
- The network control by the control device 20 may be performed when the control device 20 itself transfers packets, or may be performed by instructing the terminal 10 or the server 30 to change the control parameters.
- For example, the control device 20 controls the network by changing the flow window size of the TCP session formed with the terminal 10.
- the control device 20 may control the network by changing the size of a buffer for storing packets received from the server 30 or changing the cycle of reading packets from the buffer.
- The control device 20 uses "machine learning" to control the network. More specifically, the control device 20 controls the network based on a learning model obtained by reinforcement learning.
- For example, the control device 20 may control the network based on learning information (a Q table) obtained as a result of the reinforcement learning method called Q-learning.
- In reinforcement learning, the "agent" is trained so as to maximize the "value" in a given "environment".
- Here, the network including the terminal 10 and the server 30 is the "environment", and the control device 20, as the agent, is trained so as to optimize the state of the network.
- the state s indicates what kind of state the environment (network) is in.
- In a communication network system, the traffic state (for example, throughput and average packet arrival interval) is exemplified as the state s.
- the action a indicates an action that the agent (control device 20) can take with respect to the environment (network). For example, in the case of a communication network system, changing the setting of the TCP parameter group, turning on / off the function, and the like are exemplified as the action a.
- the reward r indicates how much evaluation can be obtained as a result of the agent (control device 20) executing the action a in a certain state s.
- For example, the reward is defined as positive if the throughput increases as a result of the control device 20 changing part of the TCP parameter group, and negative if the throughput decreases.
- In Q-learning, learning proceeds so as to maximize future value rather than the reward obtained at the present time (immediate reward); in the process, a Q table is constructed.
- The agent in Q-learning is trained so as to maximize the value (Q value, state-action value) of adopting an action a in a certain state s.
- The Q value (state-action value) is expressed as $Q(s, a)$.
- In Q-learning, it is premised that an action that causes a transition to a high-value state has the same value as the transition destination. On this premise, the Q value at the current time t can be expressed by the Q value at the next time t + 1 (see equation (1)).
$$Q(s_t, a_t) = E_{s_{t+1}}\left[ r_{t+1} + \gamma\, E_{a_{t+1}}\left[ Q(s_{t+1}, a_{t+1}) \right] \right] \quad (1)$$
- $E_{s_{t+1}}$ is the expected value with respect to the state $s_{t+1}$.
- $E_{a_{t+1}}$ is the expected value with respect to the action $a_{t+1}$.
- $\gamma$ is the discount rate.
- In Q-learning, the Q value is updated according to the result of adopting the action a in a certain state s. Specifically, the Q value is updated according to equation (2):
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right) \quad (2)$$
- $\alpha$ is a parameter called the learning rate and controls the update of the Q value.
- "max" in equation (2) is a function that outputs the maximum Q value over the actions a possible in the state $s_{t+1}$.
- As a method for the agent (control device 20) to select the action a, a method called ε-greedy can be adopted.
- In ε-greedy, an action is selected at random with probability ε, and the most valuable action is selected with probability 1 - ε.
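The ε-greedy selection described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `epsilon_greedy`, the dictionary-based Q table keyed by (state, action), and the example Q values are hypothetical and do not come from the source.

```python
import random

def epsilon_greedy(q_table, state, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore
    # Exploit: (state, action) pairs not yet in the table default to 0.0.
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

# Hypothetical Q values for state S1 (illustrative numbers only).
q_table = {("S1", "A1"): 0.2, ("S1", "A2"): 0.7, ("S1", "A3"): -0.1}
actions = ["A1", "A2", "A3"]

# With epsilon = 0 the choice is purely greedy.
greedy_choice = epsilon_greedy(q_table, "S1", actions, epsilon=0.0)
```

Passing a seeded `random.Random` instance as `rng` makes the exploration reproducible, which can be convenient when comparing learning runs.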
- a Q-table as shown in FIG. 4 is generated.
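A single Q-table update of the kind referred to as equation (2) might look like the sketch below. It assumes the standard tabular Q-learning rule; the learning rate, the discount rate, and the state and action labels are illustrative assumptions, not values taken from the source.

```python
from collections import defaultdict

def q_update(q, s, a, r_next, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    # Highest value reachable from the next state.
    best_next = max(q[(s_next, a2)] for a2 in actions)
    td_target = r_next + gamma * best_next
    q[(s, a)] += alpha * (td_target - q[(s, a)])
    return q[(s, a)]

q_table = defaultdict(float)      # every entry starts at 0.0
actions = ["A1", "A2", "A3"]

# Action A1 in state S1 led to a steady network (reward +1) and state S2.
first = q_update(q_table, "S1", "A1", r_next=1.0, s_next="S2", actions=actions)
# Action A2 in state S2 led to an unsteady network (reward -1) and state S1.
second = q_update(q_table, "S2", "A2", r_next=-1.0, s_next="S1", actions=actions)
```

The second update already reflects the value learned for S1, which is how future value propagates backward through the table.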
- the control device 20 may control the network based on a learning model obtained as a result of reinforcement learning using deep learning called DQN (Deep Q Network).
- In Q-learning, the action value function is expressed by the Q table, whereas in DQN the action value function is expressed by deep learning.
- the optimal action value function is calculated by an approximate function using a neural network.
- the optimal action value function is a function that outputs the value of performing a certain action a in a certain state s.
- the neural network includes an input layer, an intermediate layer (hidden layer), and an output layer.
- the input layer inputs the state s. There is a corresponding weight in the link of each node in the middle layer.
- the output layer outputs the value of action a.
- the nodes of the input layer correspond to the network states S1 to S3.
- the state of the network input to the input layer is weighted by the intermediate layer and output to the output layer.
- the nodes of the output layer correspond to the actions A1 to A3 that the control device 20 can take.
- Each node of the output layer outputs the value of the action value function $Q(s_t, a_t)$ corresponding to one of the actions A1 to A3.
- In DQN, the connection parameters (weights) between the nodes that output the action value function are learned.
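The layered structure described above (three input nodes for the states S1 to S3, a hidden layer, and three output nodes for the actions A1 to A3) can be illustrated with a small forward pass. This is a sketch only: the one-hot state encoding, the ReLU activation, the hidden-layer size, and all weight values are assumptions made for illustration, and training of the weights is not shown.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def dense(weights, bias, x):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def q_values(state_vec, params):
    """Forward pass: input layer -> hidden layer -> one Q value per action."""
    hidden = relu(dense(params["W1"], params["b1"], state_vec))
    return dense(params["W2"], params["b2"], hidden)

# Hypothetical weights (4 hidden nodes); in DQN these would be learned.
params = {
    "W1": [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.6, 0.1, 0.9], [0.2, 0.2, 0.2]],
    "b1": [0.0, 0.0, 0.0, 0.0],
    "W2": [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]],
    "b2": [0.0, 0.0, 0.0],
}
ACTIONS = ["A1", "A2", "A3"]

q = q_values([1.0, 0.0, 0.0], params)     # state S1, one-hot encoded
best_action = ACTIONS[q.index(max(q))]
```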
- The control device 20 has two operation modes.
- The first operation mode is a learning mode for calculating a learning model.
- When the control device 20 executes Q-learning, a Q table as shown in FIG. 4 is calculated.
- When the control device 20 executes reinforcement learning by DQN, weights as shown in FIG. 6 are calculated.
- the second operation mode is a control mode in which the network is controlled using the learning model calculated in the learning mode. Specifically, the control device 20 in the control mode calculates the current network state s and selects the most valuable action a among the actions a that can be taken in the case of the state s. The control device 20 executes an operation (network control) corresponding to the selected action a.
- FIG. 7 is a diagram showing an example of a processing configuration (processing module) of the control device 20 according to the first embodiment.
- the control device 20 includes a packet transfer unit 201, a feature amount calculation unit 202, a network control unit 203, a reinforcement learning execution unit 204, and a storage unit 205.
- the packet transfer unit 201 is a means for receiving a packet transmitted from the terminal 10 or the server 30 and transferring the received packet to the opposite device.
- the packet transfer unit 201 performs packet transfer according to the control parameters notified from the network control unit 203.
- the packet transfer unit 201 performs packet transfer with the notified flow window size.
- the packet transfer unit 201 delivers a copy of the received packet to the feature amount calculation unit 202.
- the feature amount calculation unit 202 is a means for calculating the feature amount that characterizes the communication traffic between the terminal 10 and the server 30.
- the feature amount calculation unit 202 extracts a traffic flow that is a target of network control from the acquired packet.
- the traffic flow that is the target of network control is a group consisting of packets having the same source IP (Internet Protocol) address, destination IP address, port number, and the like.
- the feature amount calculation unit 202 calculates the feature amount from the extracted traffic flow. For example, the feature amount calculation unit 202 calculates throughput, average packet arrival interval, packet loss rate, jitter, and the like as feature amounts. The feature amount calculation unit 202 stores the calculated feature amount in the storage unit 205 together with the calculation time. Since existing techniques can be used for calculation of throughput and the like and are obvious to those skilled in the art, detailed description thereof will be omitted.
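As a rough illustration of the flow extraction and feature calculation described above, the following sketch groups captured packets into flows and derives three features per flow. The packet record layout (`src`, `dst`, `sport`, `dport`, `t`, `size`), the measurement window, and the definition of jitter as the standard deviation of inter-arrival gaps are all assumptions; the source leaves these details to existing techniques.

```python
from collections import defaultdict
from statistics import mean, pstdev

def group_by_flow(packets):
    """Group packets that share source/destination address and port numbers."""
    flows = defaultdict(list)
    for p in packets:
        flows[(p["src"], p["dst"], p["sport"], p["dport"])].append(p)
    return flows

def flow_features(flow_packets, window_s):
    """Throughput, mean packet arrival interval, and jitter for one flow."""
    times = sorted(p["t"] for p in flow_packets)
    gaps = [b - a for a, b in zip(times, times[1:])]
    total_bits = 8 * sum(p["size"] for p in flow_packets)
    return {
        "throughput_bps": total_bits / window_s,
        "mean_arrival_interval_s": mean(gaps) if gaps else None,
        "jitter_s": pstdev(gaps) if gaps else None,
    }

# Hypothetical capture: four 1250-byte packets of one flow plus one other packet.
flow_key = ("10.0.0.1", "10.0.0.2", 5000, 443)
packets = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 5000, "dport": 443,
     "t": t, "size": 1250}
    for t in (0.0, 0.1, 0.2, 0.3)
]
packets.append({"src": "10.0.0.9", "dst": "10.0.0.2", "sport": 6000,
                "dport": 443, "t": 0.15, "size": 500})

flows = group_by_flow(packets)
features = flow_features(flows[flow_key], window_s=0.4)
```

In the device described here, each feature value would then be stored together with its calculation time so that a time series can be read back later.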
- the network control unit 203 is a means for controlling the network based on the behavior obtained from the learning model generated by the reinforcement learning execution unit 204.
- the network control unit 203 determines the control parameters to be notified to the packet transfer unit 201 based on the learning model obtained as a result of the reinforcement learning.
- the network control unit 203 is a module that mainly operates in the control mode.
- the network control unit 203 reads the latest (current time) feature amount from the storage unit 205.
- the network control unit 203 estimates (calculates) the state of the network to be controlled from the read feature amount.
- the network control unit 203 refers to a table (see FIG. 8) in which the feature amount F and the network state are associated with each other, and calculates the network state corresponding to the current feature amount F. Since the traffic is generated by the communication between the terminal 10 and the server 30, the network state can be regarded as the "traffic state". That is, in the disclosure of the present application, the "traffic state" and the “network state” can be interchanged with each other.
- The network control unit 203 refers to the Q table stored in the storage unit 205 and acquires the action with the highest value Q among the actions corresponding to the current network state. For example, in the example of FIG. 4, if the calculated traffic state is "state S1" and Q(S1, A1) is the largest of the values Q(S1, A1), Q(S1, A2), and Q(S1, A3), the action A1 is read out.
- Alternatively, the network control unit 203 inputs the current network state into a neural network as shown in FIG. 5 and acquires the most valuable action among the actions that can be taken.
- The network control unit 203 determines the control parameter according to the acquired action and sets it in (notifies it to) the packet transfer unit 201.
- For example, a table (see FIG. 9) in which actions and control contents are associated is stored in the storage unit 205, and the network control unit 203 determines the control parameter to be set in the packet transfer unit 201 by referring to the table.
- The network control unit 203 notifies the packet transfer unit 201 of the control parameter corresponding to the change content.
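The control-mode lookups described above (feature amount to state, state to best action, action to control content) can be sketched with three small tables. The table contents below are invented for illustration; the actual contents of FIG. 4, FIG. 8, and FIG. 9 are not reproduced in this text, and the names `STATE_TABLE`, `Q_TABLE`, `ACTION_TABLE`, and `flow_window_kb` are hypothetical.

```python
# Feature amount F (here: throughput in bit/s) -> network state, in the
# spirit of the table of FIG. 8. The ranges are illustrative assumptions.
STATE_TABLE = [
    (0.0, 10e6, "S1"),
    (10e6, 50e6, "S2"),
    (50e6, float("inf"), "S3"),
]

# Q table in the spirit of FIG. 4 (values are illustrative assumptions).
Q_TABLE = {
    "S1": {"A1": 0.8, "A2": 0.1, "A3": -0.3},
    "S2": {"A1": 0.0, "A2": 0.5, "A3": 0.2},
    "S3": {"A1": -0.4, "A2": 0.3, "A3": 0.6},
}

# Action -> control content, in the spirit of FIG. 9 (hypothetical values).
ACTION_TABLE = {
    "A1": {"flow_window_kb": 64},
    "A2": {"flow_window_kb": 128},
    "A3": {"flow_window_kb": 256},
}

def feature_to_state(feature):
    for low, high, state in STATE_TABLE:
        if low <= feature < high:
            return state
    raise ValueError(f"no state defined for feature value {feature}")

def control_step(feature):
    """Return (state, most valuable action, control parameters to apply)."""
    state = feature_to_state(feature)
    action = max(Q_TABLE[state], key=Q_TABLE[state].get)
    return state, action, ACTION_TABLE[action]

state, action, control = control_step(5e6)
```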
- the reinforcement learning execution unit 204 is a means for learning actions (control parameters) for controlling the network.
- The reinforcement learning execution unit 204 executes reinforcement learning by the Q-learning or DQN described above to generate a learning model.
- the reinforcement learning execution unit 204 is a module that mainly operates in the learning mode.
- the reinforcement learning execution unit 204 calculates the network state s at the current time t from the feature amount stored in the storage unit 205.
- The reinforcement learning execution unit 204 selects an action a from the actions possible in the calculated state s by a method such as the ε-greedy method described above.
- the reinforcement learning execution unit 204 notifies the packet transfer unit 201 of the control content (updated value of the control parameter) corresponding to the selected action.
- The reinforcement learning execution unit 204 determines the reward according to how the network changes in response to the above action. In doing so, it determines the reward of the action performed on the network based on the stationarity of the network after the action is performed.
- The reinforcement learning execution unit 204 determines the reward based on whether or not the network is in a steady state as a result of taking the action a.
- When determining the reward $r_{t+1}$ in equation (2) or equation (3), the reinforcement learning execution unit 204 gives a positive reward if the network is in a steady state (if the network is stable). If the state of the network is unsteady (if the network is unstable), it gives a negative reward.
- The reinforcement learning execution unit 204 determines the stationarity of the network by performing statistical processing on the time-series data of the network state, which fluctuates as a result of the action taken on the network.
- Specifically, the reinforcement learning execution unit 204 controls the network according to the action a selected by a method such as ε-greedy, and then reads the feature amounts covering a predetermined period up to the next time t + 1 (the time-series data of the feature amount). It calculates an evaluation index indicating whether or not the network state is steady by performing statistical processing on this time-series data.
- For example, the reinforcement learning execution unit 204 models the time-series data with an autoregressive (AR) model.
- In the AR model, each value of the time-series data x1, x2, ..., xN is represented as a weighted linear sum of past values, as shown in equation (4):
$$x(t) = c + \sum_{i=1}^{p} w_i\, x(t-i) + \varepsilon(t) \quad (4)$$
- $x(t)$ is the feature amount.
- $\varepsilon(t)$ is noise (white noise).
- $c$ is a constant that does not change with time.
- $w_i$ is the weight.
- $i$ is an index designating a past time.
- $p$ is an integer specifying how far back the predetermined period extends.
- The reinforcement learning execution unit 204 estimates the weights $w_i$ in equation (4) using the time-series data read from the storage unit 205. Specifically, it estimates the weights $w_i$ by a parameter estimation method such as the maximum likelihood method or the Yule-Walker method. Since known techniques can be used for these parameter estimation methods, detailed description thereof will be omitted.
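For readers unfamiliar with the Yule-Walker approach mentioned above, the sketch below estimates AR weights from a time series by solving the Yule-Walker equations with the Levinson-Durbin recursion. It is one textbook way of doing this, not necessarily the procedure used by the reinforcement learning execution unit 204; the function names and the simulated series are assumptions for illustration.

```python
import random

def autocovariance(x, max_lag):
    """Biased sample autocovariances r[0..max_lag]."""
    n = len(x)
    m = sum(x) / n
    d = [v - m for v in x]
    return [sum(d[t] * d[t - k] for t in range(k, n)) / n
            for k in range(max_lag + 1)]

def yule_walker(x, p):
    """Estimate (c, [w_1..w_p], noise variance) of an AR(p) model."""
    r = autocovariance(x, p)
    if r[0] == 0.0:
        raise ValueError("constant series: AR weights are undefined")
    w, err = [], r[0]
    for k in range(1, p + 1):
        # Levinson-Durbin step: extend the order k-1 solution to order k.
        acc = r[k] - sum(w[j] * r[k - 1 - j] for j in range(k - 1))
        kappa = acc / err
        w = [w[j] - kappa * w[k - 2 - j] for j in range(k - 1)] + [kappa]
        err *= 1.0 - kappa * kappa
    c = (sum(x) / len(x)) * (1.0 - sum(w))
    return c, w, err

# Simulated feature series x(t) = 2.0 + 0.6 x(t-1) + noise (illustration only).
rng = random.Random(0)
series = [5.0]
for _ in range(4999):
    series.append(2.0 + 0.6 * series[-1] + rng.gauss(0.0, 1.0))

c_hat, w_hat, noise_var = yule_walker(series, p=1)
```

With enough samples, the estimated weight should land near the 0.6 used in the simulation; with short windows the estimate becomes noticeably noisier.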
- The reinforcement learning execution unit 204 performs a unit root test on the AR model obtained from the time-series data. By performing the unit root test, it obtains the degree of stationarity of the time-series data, that is, a measure of how "steady" as opposed to "non-steady" the data is. Since the unit root test can be realized by an existing algorithm and is obvious to those skilled in the art, a detailed description thereof will be omitted.
- The reinforcement learning execution unit 204 applies threshold processing (for example, a process of determining whether the acquired value is equal to or less than a threshold value) to the degree of stationarity obtained by the unit root test, and determines whether the network is in a steady state. That is, it determines whether the state of the network is in a transient "non-steady state" on the way to a steady state or in a "steady state" that has converged around a specific value.
- Specifically, the reinforcement learning execution unit 204 determines that the network state is "steady" if the degree of stationarity is equal to or higher than the threshold value.
- It determines that the network state is "unsteady" if the degree of stationarity is smaller than the threshold value.
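The unit root test and threshold decision described above can be illustrated with a deliberately simplified Dickey-Fuller test. This sketch makes several assumptions that are not in the source: it uses a regression with a constant and no lagged difference terms, it takes the negated test statistic as the degree of stationarity, and it uses 2.86 as the threshold (the magnitude of the commonly tabulated asymptotic 5% critical value for this form of the test). A production system would more likely rely on an established statistics library.

```python
import math
import random

def dickey_fuller_stat(x):
    """t-statistic of rho in the regression dx(t) = a + rho * x(t-1) + e(t)."""
    lag = x[:-1]
    dx = [b - a for a, b in zip(x, x[1:])]
    n = len(dx)
    if n < 3:
        raise ValueError("need at least 4 samples")
    lag_mean = sum(lag) / n
    dx_mean = sum(dx) / n
    sxx = sum((u - lag_mean) ** 2 for u in lag)
    if sxx == 0.0:
        return float("-inf")          # constant series: treated as stationary
    sxy = sum((u - lag_mean) * (v - dx_mean) for u, v in zip(lag, dx))
    rho = sxy / sxx
    a = dx_mean - rho * lag_mean
    ssr = sum((v - a - rho * u) ** 2 for u, v in zip(lag, dx))
    se = math.sqrt(ssr / (n - 2) / sxx)
    if se == 0.0:
        return float("-inf") if rho < 0 else float("inf")
    return rho / se

def is_steady(x, threshold=2.86):
    """'Steady' when the degree of stationarity is at or above the threshold."""
    degree_of_stationarity = -dickey_fuller_stat(x)
    return degree_of_stationarity >= threshold

rng = random.Random(1)
# Throughput fluctuating around a fixed level (converged state).
converged = [100.0 + rng.gauss(0.0, 1.0) for _ in range(200)]
# Throughput still climbing (transient state).
transient = [0.5 * t + rng.gauss(0.0, 1.0) for t in range(200)]

reward_converged = 1.0 if is_steady(converged) else -1.0
reward_transient = 1.0 if is_steady(transient) else -1.0
```

Negating the statistic keeps the comparison in the same direction as the text: a larger degree of stationarity means a more clearly steady series.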
- FIG. 10 is a diagram showing an example of time-series data of feature quantities.
- When the reinforcement learning execution unit 204 performs a unit root test on the time-series data shown in FIG. 10A, the network state is determined to be "unsteady".
- In this case, the reinforcement learning execution unit 204 gives a negative reward (for example, -1) as the reward $r_{t+1}$ in equations (2) and (3), and updates the Q table or the weights.
- When the reinforcement learning execution unit 204 performs a unit root test on the time-series data shown in FIG. 10B, the network state is determined to be "steady".
- In this case, the reinforcement learning execution unit 204 gives a positive reward (for example, +1) as the reward $r_{t+1}$ in equations (2) and (3), and updates the Q table or the weights.
- The operation of the control device 20 according to the first embodiment in the control mode is summarized in the flowchart shown in FIG.
- the control device 20 acquires the packet and calculates the feature amount (step S101).
- the control device 20 identifies the state of the network based on the calculated feature amount (step S102).
- the control device 20 controls the network by the most valuable action according to the state of the network by using the learning model (step S103).
- The operation of the control device 20 according to the first embodiment in the learning mode is summarized in the flowchart shown in FIG.
- the control device 20 acquires the packet and calculates the feature amount (step S201).
- the control device 20 identifies the state of the network based on the calculated feature amount (step S202).
- the control device 20 selects an action that can be taken in the current network state by the ε-greedy method or the like (step S203).
- the control device 20 controls the network according to the selected action (step S204).
- the control device 20 determines the stationarity of the network using the time-series data of the feature amount (step S205).
- the control device 20 determines the reward based on the determination result (step S206), and updates the learning information (Q table, weight) (step S207).
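Steps S201 to S207 can be strung together as one learning iteration, as in the sketch below. The three callables (`observe_state`, `apply_action`, `is_steady`) stand in for the feature calculation, packet-transfer control, and stationarity decision of the device and are hypothetical hooks; the toy environment, the hyperparameters, and the tabular Q-learning update are assumptions made for illustration.

```python
import random

def learning_step(q, observe_state, apply_action, is_steady, actions,
                  epsilon=0.1, alpha=0.1, gamma=0.9, rng=random):
    """Run one learning-mode iteration and return (state, action, reward)."""
    s = observe_state()                                   # S201-S202
    if rng.random() < epsilon:                            # S203 (epsilon-greedy)
        a = rng.choice(actions)
    else:
        a = max(actions, key=lambda x: q.get((s, x), 0.0))
    apply_action(a)                                       # S204
    reward = 1.0 if is_steady() else -1.0                 # S205-S206
    s_next = observe_state()
    best_next = max(q.get((s_next, x), 0.0) for x in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (reward + gamma * best_next - old)   # S207
    return s, a, reward

# Toy environment: a single state, and only action A1 leaves the network steady.
env = {"last_action": None}

def observe_state():
    return "S1"

def apply_action(action):
    env["last_action"] = action

def network_is_steady():
    return env["last_action"] == "A1"

q_table = {}
for _ in range(50):
    learning_step(q_table, observe_state, apply_action, network_is_steady,
                  ["A1", "A2"], epsilon=0.0)

learned_value = q_table[("S1", "A1")]
```

With `epsilon=0.0` the loop never explores, so this run only shows the value of the steady-state action growing; a nonzero ε would be needed for the agent to try the alternatives.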
- The operation of the control device 20 will be specifically described for each type of terminal 10.
- the average packet arrival interval of packets transmitted from the drone to the server 30 is selected as an index (feature amount) indicating the state of the network.
- the server 30 transmits a control packet (packet including a control command) to the drone.
- the average packet arrival interval of the response packets (affirmative response, negative response) from the drone to the control packet is selected as the feature amount.
- the control device 20 determines control parameters and controls the network so that the packet transmission / reception interval between the server 30 and the drone is stable.
- As the control parameter, the packet read interval (packet transmission interval) from a buffer that stores the control packets acquired from the server 30 can be considered.
- the reinforcement learning execution unit 204 learns a parameter for reading a control packet from the buffer so that the average packet arrival interval of the response packet transmitted from the drone to the server 30 is stable.
- When the server 30 remotely controls a drone (control target), the packet size of the control packets and response packets is not very large. Therefore, in drone control, a situation in which the throughput is low but packet transmission and reception are stable is more valuable than one in which the throughput from the server 30 is high but packet transmission and reception are unstable (a lot of information can be sent at one time, but packet arrival varies).
- By appropriately selecting a feature amount that characterizes the network state (traffic state), for example the average packet arrival interval, the control device 20 can realize network control suited to an application such as remote control of a drone.
- In the above, the stationarity of the network has been described as the condition (criterion) for determining the reward $r_{t+1}$, but the reward $r_{t+1}$ may be determined by adding other criteria to the stationarity.
- Here, the case where the terminal 10 is a WEB camera is taken as an example to describe how items other than "network stationarity" are taken into consideration in determining the reward $r_{t+1}$.
- the throughput of traffic flowing from the WEB camera to the server 30 is selected as an index (feature amount) indicating the state of the network.
- the reinforcement learning execution unit 204 calculates the learning model so that the throughput from the WEB camera to the server 30 stabilizes in the vicinity of the target value.
- the flow window size of the TCP session formed between the terminal 10 and the server 30 is set in the control parameter, and the behavior that realizes the above target (throughput is stable at the target value) is learned.
- the reinforcement learning execution unit 204 determines the stationarity of the network using the time-series data of the feature amount (throughput) calculated by the feature amount calculation unit 202.
- the reinforcement learning execution unit 204 determines the reward $r_{t+1}$ according to the range of the feature amount (throughput). For example, if the target value is at least the threshold value TH21 and at most the threshold value TH22, the reinforcement learning execution unit 204 determines the reward $r_{t+1}$ according to the policy as shown in FIG.
- the network is controlled so that the throughput from the WEB camera is stable near the target value.
- A network state as shown in FIG. 14A (throughput stable near the target value) can be realized by the network control of the control device 20.
- By determining the reward $r_{t+1}$ in consideration of the throughput range, it is possible to avoid falling into a network state as shown in FIG. 14B.
- In FIG. 14B, the state of the network is eventually stable, but the throughput in the steady state deviates greatly from the target value.
- FIG. 13 shows a case where a positive reward is given if the throughput is within a predetermined range, but a positive reward may instead be given when the throughput is equal to or higher than a predetermined value (see FIG. 15).
- Alternatively, the reward $r_{t+1}$ may be determined as shown in FIG.
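One way to combine stationarity with a throughput range when deciding the reward is sketched below. The exact policies of the figures are not reproduced in this text, so the rule used here (+1 only when the series is steady and its mean lies in the allowed range, -1 otherwise), the use of the window mean as the throughput level, and the parameter names `th_low` and `th_high` are all assumptions.

```python
from statistics import mean

def reward_with_range(series, steady, th_low, th_high=None):
    """Reward that requires both stationarity and an acceptable throughput level.

    th_low / th_high play the role of the thresholds TH21 / TH22; passing
    th_high=None models the variant with only a lower bound.
    """
    if not steady:
        return -1.0
    level = mean(series)              # assumed: window mean as throughput level
    if level < th_low:
        return -1.0
    if th_high is not None and level > th_high:
        return -1.0
    return 1.0

# Steady near the target (the desirable case).
r_on_target = reward_with_range([9.8, 10.1, 10.0], steady=True,
                                th_low=9.0, th_high=11.0)
# Steady but far below the target (the case the range criterion rules out).
r_off_target = reward_with_range([2.0, 2.1, 1.9], steady=True,
                                 th_low=9.0, th_high=11.0)
```

An upper bound is one simple way to reflect resource limits, since very high throughput may only be reachable by committing memory that other terminals need.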
- the limit provided for the throughput may be determined in consideration of the resource (communication resource) of the control device 20. For example, when the flow window size is selected as the control parameter, it is considered that the throughput is stable at a high value if the window size is increased. However, in order to prepare a large flow window size, the memory (resource) consumption becomes large, and the resources that can be allocated to the other terminal 10 decrease.
- the control device 20 may determine the table update policy in consideration of the above-mentioned merits and demerits.
- In the above, the case where the stationarity of the network is determined from one feature amount has been described, but the stationarity of the network may be determined from a plurality of feature amounts.
- Here, the case where the terminal 10 is a smartphone is taken as an example to describe determining the stationarity of the network from a plurality of feature amounts.
- the feature amount calculation unit 202 calculates the throughput of traffic flowing from the server 30 to the smartphone and the average packet arrival interval.
- the reinforcement learning execution unit 204 determines the stationarity of the network from the two feature quantities. Specifically, the reinforcement learning execution unit 204 determines whether or not the throughput is stable based on the time-series data of the throughput. Similarly, the reinforcement learning execution unit 204 determines whether or not the average packet arrival interval is stable based on the time-series data of the average packet arrival interval.
- The reinforcement learning execution unit 204 determines that the network is in the steady state when both the throughput and the average packet arrival interval are in the steady state, and sets the reward r_{t+1} to a positive reward; in other cases, it gives a negative reward.
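The "all feature amounts must be steady" rule can be sketched as below. The per-series stability check here is a simple coefficient-of-variation threshold used as a stand-in for the unit root test of the first embodiment; that substitution, the 0.1 threshold, and all function names are assumptions made for illustration.

```python
from statistics import mean, pstdev


def is_stable(samples, max_cv=0.1):
    """Stand-in stability check: coefficient of variation below a threshold."""
    avg = mean(samples)
    if avg == 0:
        return pstdev(samples) == 0
    return pstdev(samples) / abs(avg) <= max_cv


def multi_feature_reward(feature_series, positive=1.0, negative=-1.0):
    """Positive reward only when every feature time series is steady.

    feature_series maps a feature name (e.g. "throughput",
    "avg_packet_interval") to its recent samples.
    """
    all_steady = all(is_stable(series) for series in feature_series.values())
    return positive if all_steady else negative
```

Any other per-series test can be swapped in for `is_stable` without changing the combining rule.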
- the control device 20 estimates the state of the network using the feature amount that characterizes the traffic flowing through the network.
- The control device 20 determines the reward for an action according to the time-series change of the state produced by that action (a change of the control parameter) performed on the network. Therefore, a high reward is given for the "network stability" required at the level of the service or application provided over the network, and network quality suited to the application or the like can be improved. That is, the disclosure of the present application regards a converged state, in which the network state is stable during reinforcement learning, as highly valuable, and the reward is determined so that the learner adapts to the environment (network) under that view.
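How such a stationarity-based reward feeds the learner can be sketched with a tabular Q-learning update. This assumes a Q-table learner, which the mention of a table update policy suggests but which this section does not spell out; the state labels, action names, learning rate, and discount factor are hypothetical.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """One standard Q-learning step: Q <- Q + alpha * (target - Q)."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# Hypothetical control actions on a flow window size.
ACTIONS = ("increase_window", "keep_window", "decrease_window")

q_table = defaultdict(float)
# The reward passed in would be +1 if the network became steady after the
# action and -1 otherwise.
q_update(q_table, "s0", "keep_window", 1.0, "s1", ACTIONS)
```

Because the reward depends on whether the state time series settles after the action, actions that lead to a converged network accumulate higher values in the table.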
- the state of the network is estimated from the feature amount (for example, throughput) that characterizes the traffic flowing through the network.
- A case where the network state is determined based on QoE (user experience quality) and QoC (control quality) at the terminal 10 will be described.
- the terminal 10 notifies the control device 20 of the image quality of the reproduced moving image, the bit rate, the number of interruptions (the number of times the buffer is emptied), the frame rate, and the like.
- The terminal 10 may transmit to the control device 20 the MOS (Mean Opinion Score) value defined in ITU-T (International Telecommunication Union Telecommunication Standardization Sector) Recommendation P.1203.
- the terminal 10 may notify the control device 20 of the initial waiting time until the page is displayed.
- the robot may notify the control device 20 of the reception interval of the control command, the work completion time, the number of successful works, and the like.
- the surveillance camera may notify the control device 20 of the authentication rate, the number of authentications, and the like of the monitoring target (for example, a human face, an object, etc.).
- The control device 20 may acquire a value indicating QoE at the terminal 10 (for example, the initial waiting time) from the terminal 10, determine the stationarity of the network based on that value, and determine the reward r_{t+1}. In doing so, the control device 20 may perform a unit root test on the time-series data of QoE acquired from the terminal 10, in the same manner as the method described in the first embodiment, and evaluate the stationarity of the network.
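A unit root test on a QoE time series can be sketched with a basic Dickey-Fuller regression (intercept, no lagged differences), written here with the standard library only. The source does not say which unit root test or significance level is used, so the test variant, the asymptotic 5% critical value of -2.86, and the function names are assumptions.

```python
import math

# Asymptotic 5% critical value of the Dickey-Fuller t-statistic for a
# regression with a constant and no trend (assumed significance level).
DF_CRITICAL_5PCT = -2.86


def dickey_fuller_t(series):
    """t-statistic of beta in: y[t] - y[t-1] = alpha + beta * y[t-1] + e[t]."""
    if len(series) < 4:
        raise ValueError("need at least 4 samples")
    lagged = series[:-1]
    diffs = [b - a for a, b in zip(series, series[1:])]
    n = len(diffs)
    mean_x = sum(lagged) / n
    mean_d = sum(diffs) / n
    sxx = sum((x - mean_x) ** 2 for x in lagged)
    if sxx == 0:
        # Lagged values are constant: steady only if nothing changed at all.
        return float("-inf") if all(d == 0 for d in diffs) else float("inf")
    sxy = sum((x - mean_x) * (d - mean_d) for x, d in zip(lagged, diffs))
    beta = sxy / sxx
    alpha = mean_d - beta * mean_x
    rss = sum((d - alpha - beta * x) ** 2 for x, d in zip(lagged, diffs))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    if std_err == 0:
        return float("-inf") if beta < 0 else float("inf")
    return beta / std_err


def is_stationary(series, critical=DF_CRITICAL_5PCT):
    """Reject the unit root (treat the series as steady) if t < critical."""
    return dickey_fuller_t(series) < critical
```

A series that fluctuates around a fixed level yields a strongly negative statistic and is judged steady, while a drifting series does not reject the unit root. For short series, finite-sample critical values are somewhat more negative than the asymptotic one used here.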
- control device 20 may estimate the value indicating the QoE from the traffic flowing between the terminal 10 and the server 30.
- the control device 20 may estimate the bit rate from the throughput and determine the stationarity of the network based on the estimated value.
- the method described in Reference 1 below may be used. [Reference 1]: International Publication No. 2019/044065
- The control device 20 may estimate the state of the network from the user experience quality (QoE) and the control quality (QoC), and give a high reward when the user experience quality and the like are stable. For example, consider the case where a user watches a moving image using a terminal. In this case, the disclosure of the present application judges that network quality is higher in a network environment in which the frame rate is constant, even if it is low, than in a network environment in which the frame rate changes frequently (an environment in which the frame rate is not stable). In other words, the control device 20 learns, by reinforcement learning, the control parameters that realize such high network quality.
- FIG. 16 is a diagram showing an example of the hardware configuration of the control device 20.
- the control device 20 can be configured by an information processing device (so-called computer), and includes the configuration illustrated in FIG.
- the control device 20 includes a processor 311, a memory 312, an input / output interface 313, a communication interface 314, and the like.
- the components such as the processor 311 are connected by an internal bus or the like so that they can communicate with each other.
- The control device 20 may include hardware not shown, and may omit the input/output interface 313 as necessary.
- The number of processors 311 and the like included in the control device 20 is not limited to the example of FIG. 16; for example, a plurality of processors 311 may be included in the control device 20.
- the processor 311 is a programmable device such as a CPU (Central Processing Unit), an MPU (Micro Processing Unit), and a DSP (Digital Signal Processor). Alternatively, the processor 311 may be a device such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit). The processor 311 executes various programs including an operating system (OS).
- the memory 312 is a RAM (Random Access Memory), a ROM (Read Only Memory), an HDD (Hard Disk Drive), an SSD (Solid State Drive), or the like.
- the memory 312 stores an OS program, an application program, and various data.
- the input / output interface 313 is an interface of a display device or an input device (not shown).
- the display device is, for example, a liquid crystal display or the like.
- the input device is, for example, a device that accepts user operations such as a keyboard and a mouse.
- the communication interface 314 is a circuit, module, or the like that communicates with another device.
- the communication interface 314 includes a NIC (Network Interface Card) and the like.
- the function of the control device 20 is realized by various processing modules.
- the processing module is realized, for example, by the processor 311 executing a program stored in the memory 312.
- the program can also be recorded on a computer-readable storage medium.
- The storage medium may be a non-transitory medium such as a semiconductor memory, a hard disk, a magnetic recording medium, or an optical recording medium. That is, the present invention can also be embodied as a computer program product.
- the program can be downloaded via a network or updated using a storage medium in which the program is stored.
- the processing module may be realized by a semiconductor chip.
- The terminal 10 and the server 30 can also be configured by an information processing device like the control device 20. Their basic hardware configuration does not differ from that of the control device 20, so its description is omitted.
- The control device 20 may be separated into a device that controls the network and a device that generates a learning model.
- the storage unit 205 that stores the learning information (learning model) may be realized by an external database server or the like. That is, the disclosure of the present application may be implemented as a system including learning means, control means, storage means and the like.
- The degree of network stability is calculated by performing a unit root test on the time-series data of the feature amount.
- However, the stationarity of the network may be calculated using other indicators.
- For example, the reinforcement learning execution unit 204 may calculate a standard deviation indicating the degree of variation in the data, and may determine that the network is in a steady state when “mean - standard deviation” is equal to or greater than a threshold value.
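The standard-deviation indicator can be sketched as follows. Whether the population or the sample standard deviation is intended is not stated, so the use of the population form is an assumption, as is the function name.

```python
from statistics import mean, pstdev


def is_steady_by_mean_minus_std(samples, threshold):
    """Steady if (mean - standard deviation) is at least the threshold."""
    return mean(samples) - pstdev(samples) >= threshold
```

Under this indicator, a series with a high average but large swings fails the check because the standard deviation pulls the value below the threshold.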
- The stationarity (stability) of the network is determined using one threshold value, but the degree of stationarity may be calculated in finer steps using a plurality of threshold values.
- For example, the stationarity of the network may be determined in four stages, such as “extremely stable”, “stable”, “unstable”, and “extremely unstable”.
- The reward may then be determined according to the degree of stationarity of the network.
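A graded reward over the four stages can be sketched with three thresholds on some instability score. The score itself (for example a coefficient of variation or a test p-value), the threshold values, and the reward assigned to each stage are all hypothetical choices for illustration.

```python
from bisect import bisect_right

LEVELS = ("extremely stable", "stable", "unstable", "extremely unstable")

# Hypothetical reward per stage: more stable networks earn more.
LEVEL_REWARDS = {
    "extremely stable": 2.0,
    "stable": 1.0,
    "unstable": -1.0,
    "extremely unstable": -2.0,
}


def stationarity_level(instability, thresholds=(0.02, 0.1, 0.3)):
    """Map an instability score (lower is steadier) to one of four stages."""
    return LEVELS[bisect_right(thresholds, instability)]


def graded_reward(instability, thresholds=(0.02, 0.1, 0.3)):
    """Reward determined by the stage the score falls into."""
    return LEVEL_REWARDS[stationarity_level(instability, thresholds)]
```

Three ascending thresholds split the score axis into four intervals, so adding a stage only requires one more threshold and one more entry in the reward table.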
- the terminal 10 may be a sensor device.
- the sensor device generates a communication pattern (communication traffic) according to the on / off model. That is, if the terminal 10 is a sensor device or the like, there may be cases where data (packets) flow through the network and cases where data (packets) do not flow (no communication state). Therefore, the control device 20 may determine the stationarity by the fluctuation pattern instead of performing the stationarity determination (unit root test) using the traffic (feature amount) time series data itself. The control device 20 may determine the stationarity of the network by using the time series data regarding the time interval in which the feature amount fluctuates.
- The control device 20 may take measures such as not reflecting the non-communication state in the reward. That is, the control device 20 may give a reward for reinforcement learning only when the network state is the “communication state”.
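For on/off traffic, the idea of judging stationarity from the fluctuation pattern and skipping the reward during the non-communication state can be sketched as below. The definition of a burst (throughput above an idle level), the coefficient-of-variation check on the intervals, and the use of `None` to mean "no reward" are assumptions, not details given in the source.

```python
from statistics import mean, pstdev


def burst_intervals(samples, idle_level=0.0):
    """Time between the starts of consecutive communication bursts.

    samples is a list of (timestamp, throughput) pairs in time order.
    """
    starts = []
    was_active = False
    for timestamp, value in samples:
        active = value > idle_level
        if active and not was_active:
            starts.append(timestamp)
        was_active = active
    return [b - a for a, b in zip(starts, starts[1:])]


def intervals_are_steady(intervals, max_cv=0.1):
    """Stand-in check: burst intervals vary little relative to their mean."""
    if len(intervals) < 2:
        return False
    avg = mean(intervals)
    return avg > 0 and pstdev(intervals) / avg <= max_cv


def gated_reward(samples, positive=1.0, negative=-1.0, idle_level=0.0):
    """Return None (no reward) while the latest sample is non-communicating."""
    if not samples or samples[-1][1] <= idle_level:
        return None
    intervals = burst_intervals(samples, idle_level)
    return positive if intervals_are_steady(intervals) else negative
```

Returning no reward during idle periods keeps the learner from being penalized for silence that is inherent to the sensor's traffic model.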
- The control device 20 may treat each terminal 10, or a group of a plurality of terminals 10, as the unit of control. Note that, even for the same terminal 10, different applications have different port numbers and the like and are treated as different flows.
- the control device 20 may apply the same control (change of control parameters) to packets transmitted from the same terminal 10.
- the control device 20 may, for example, treat terminals 10 of the same type as one group and apply the same control to packets transmitted from terminals 10 belonging to the same group.
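The three control granularities (per flow, per terminal, per group of terminals of the same type) can be sketched as a choice of grouping key. Identifying a terminal by its source IP address, the 5-tuple flow definition, and the `terminal_types` mapping are assumptions introduced for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Flow(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


def control_key(flow, granularity, terminal_types=None):
    """Key under which a flow shares one set of control parameters."""
    if granularity == "flow":
        return flow
    if granularity == "terminal":
        return flow.src_ip  # assumption: one source IP per terminal
    if granularity == "group":
        return terminal_types[flow.src_ip]
    raise ValueError(f"unknown granularity: {granularity}")


def group_flows(flows, granularity, terminal_types=None):
    """Partition flows into control targets at the chosen granularity."""
    groups = defaultdict(list)
    for flow in flows:
        groups[control_key(flow, granularity, terminal_types)].append(flow)
    return dict(groups)
```

Two applications on one terminal form two control targets at flow granularity but a single target at terminal granularity, so the same parameter change applies to both.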
- A control device (20, 100) in which the learning unit (101, 204) determines the reward for an action performed on the network based on the stationarity of the network after the action is performed.
- the control device (20, 100) according to Appendix 1, which gives a negative reward to the action performed on the network if the network after the action is performed is in an unsteady state.
- [Appendix 3] The control device (20, 100) according to Appendix 1 or 2, wherein the learning unit (101, 204) determines the stationarity of the network based on time-series data regarding the state of the network, which fluctuates as a result of taking an action on the network.
- [Appendix 4] The control device (20, 100) according to Appendix 3, wherein the learning unit (101, 204) estimates the state of the network from at least one of a feature amount characterizing the traffic flowing through the network, a user experience quality, and a control quality.
- [Appendix 5] The control device (20, 100) according to any one of Appendices 1 to 4, further comprising a control unit (203) that controls the network based on the action obtained from the learning model generated by the learning unit (101, 204).
- A method including: a step of learning actions for controlling a network; and a step of storing the learning information generated by the learning, wherein, in the learning step, the reward for an action performed on the network is determined based on the stationarity of the network after the action is performed.
- In the learning step, if the network after the action is performed is in a steady state, the action performed on the network is given a positive reward.
- The method according to Appendix 6 or 7, wherein, in the learning step, the stationarity of the network is determined based on time-series data regarding the state of the network, which fluctuates as a result of taking an action on the network.
- [Appendix 10] The method according to any one of Appendices 6 to 9, further comprising a step of controlling the network based on the action obtained from the learning model generated in the learning step.
- [Appendix 11] A system including: learning means (101, 204) for learning actions for controlling a network; and storage means (102, 205) for storing the learning information generated by the learning means, wherein the learning means (101, 204) determines the reward for an action performed on the network based on the stationarity of the network after the action is performed.
- The learning means (101, 204) gives a positive reward to the action performed on the network if the network after the action is performed is in a steady state.
- The system according to Appendix 11 or 12, wherein the learning means (101, 204) determines the stationarity of the network based on time-series data regarding the state of the network, which fluctuates as a result of taking an action on the network.
- The system according to Appendix 13, wherein the learning means (101, 204) estimates the state of the network from at least one of a feature amount characterizing the traffic flowing through the network, a user experience quality, and a control quality.
- [Appendix 15] The system according to any one of Appendices 11 to 14, further comprising control means (203) that controls the network based on the action obtained from the learning model generated by the learning means (101, 204).
- A program for executing: a process of learning actions for controlling a network; and a process of storing the learning information generated by the learning, wherein, in the learning process, the reward for an action performed on the network is determined based on the stationarity of the network after the action is performed.
Landscapes
- Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Environmental & Geological Engineering (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Quality & Reliability (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2021550731A JP7259978B2 (ja) | 2019-09-30 | 2019-09-30 | Control apparatus, method, and system |
| US17/641,920 US20220337489A1 (en) | 2019-09-30 | 2019-09-30 | Control apparatus, method, and system |
| PCT/JP2019/038454 WO2021064766A1 (ja) | 2019-09-30 | 2019-09-30 | Control apparatus, method, and system |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2019/038454 WO2021064766A1 (ja) | 2019-09-30 | 2019-09-30 | Control apparatus, method, and system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021064766A1 (ja) | 2021-04-08 |
Family
ID=75336997
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2019/038454 Ceased WO2021064766A1 (ja) | 2019-09-30 | 2019-09-30 | Control apparatus, method, and system |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US20220337489A1 |
| JP (1) | JP7259978B2 |
| WO (1) | WO2021064766A1 |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
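| CN115208518A (zh) * | 2022-07-15 | 2022-10-18 | 腾讯科技(深圳)有限公司 | Data transmission control method and apparatus, and computer-readable storage medium |
| WO2023228256A1 (ja) * | 2022-05-23 | 2023-11-30 | 日本電信電話株式会社 | Quality-of-experience degradation estimation device, machine learning method, quality-of-experience degradation estimation method, and program |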
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11875478B2 (en) * | 2020-08-28 | 2024-01-16 | Nvidia Corporation | | Dynamic image smoothing based on network conditions |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2009027303A (ja) * | 2007-07-18 | 2009-02-05 | Univ Of Electro-Communications | Communication device and communication method |
| JP2013106202A (ja) * | 2011-11-14 | 2013-05-30 | Fujitsu Ltd | Parameter setting device, computer program, and parameter setting method |
| JP2019041338A (ja) * | 2017-08-28 | 2019-03-14 | 日本電信電話株式会社 | Wireless communication system, wireless communication method, and centralized control station |
| US20190141113A1 (en) * | 2017-11-03 | 2019-05-09 | Salesforce.Com, Inc. | Simultaneous optimization of multiple tcp parameters to improve download outcomes for network-based mobile applications |
| WO2019176997A1 (ja) * | 2018-03-14 | 2019-09-19 | 日本電気株式会社 | Traffic analysis device, method, and program |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP5772345B2 (ja) * | 2011-07-25 | 2015-09-02 | 富士通株式会社 | Parameter setting device, computer program, and parameter setting method |
| CN114884738A (zh) * | 2017-11-17 | 2022-08-09 | 华为技术有限公司 | Method and apparatus for identifying encrypted data streams |
| US11509703B2 (en) * | 2018-09-26 | 2022-11-22 | Vmware, Inc. | System and method for widescale adaptive bitrate selection |
| KR101990326B1 (ko) * | 2018-11-28 | 2019-06-18 | 한국인터넷진흥원 | Reinforcement learning method with automatic discount rate adjustment |
| US11360757B1 (en) * | 2019-06-21 | 2022-06-14 | Amazon Technologies, Inc. | Request distribution and oversight for robotic devices |
2019
- 2019-09-30 JP JP2021550731A patent/JP7259978B2/ja active Active
- 2019-09-30 WO PCT/JP2019/038454 patent/WO2021064766A1/ja not_active Ceased
- 2019-09-30 US US17/641,920 patent/US20220337489A1/en not_active Abandoned
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
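| WO2023228256A1 (ja) * | 2022-05-23 | 2023-11-30 | 日本電信電話株式会社 | Quality-of-experience degradation estimation device, machine learning method, quality-of-experience degradation estimation method, and program |
| CN115208518A (zh) * | 2022-07-15 | 2022-10-18 | 腾讯科技(深圳)有限公司 | Data transmission control method and apparatus, and computer-readable storage medium |
| CN115208518B (zh) * | 2022-07-15 | 2025-01-21 | 腾讯科技(深圳)有限公司 | Data transmission control method and apparatus, and computer-readable storage medium |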
Also Published As
| Publication number | Publication date |
|---|---|
| US20220337489A1 (en) | 2022-10-20 |
| JP7259978B2 (ja) | 2023-04-18 |
| JPWO2021064766A1 | 2021-04-08 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10608901B2 (en) | | System and method for applying machine learning algorithms to compute health scores for workload scheduling |
| CN111090631B (zh) | | Information sharing method and apparatus in a distributed environment, and electronic device |
| CN113887748B (zh) | | Online federated learning task allocation method and apparatus, and federated learning method and system |
| JP7251647B2 (ja) | | Control device, control method, and system |
| JP7259978B2 (ja) | | Control apparatus, method, and system |
| CN110247795B (zh) | | Intent-based cloud-network resource service chain orchestration method and system |
| CN112667400A (zh) | | Edge-cloud resource scheduling method, apparatus, and system managed and controlled by an edge autonomous center |
| CN114090108A (zh) | | Computing power task execution method and apparatus, electronic device, and storage medium |
| CN106850289A (zh) | | Service composition method combining Gaussian process and reinforcement learning |
| JP7251646B2 (ja) | | Control apparatus, method, and system |
| CN110233763B (zh) | | Virtual network embedding algorithm based on temporal difference learning |
| CN116781343A (zh) | | Terminal trustworthiness evaluation method, apparatus, system, device, and medium |
| CN111211984A (zh) | | Method and apparatus for optimizing a CDN network, and electronic device |
| Harris Jr et al. | | Ddt: a reinforcement learning approach to dynamic flow timeout assignment in software defined networks |
| CN116192766B (zh) | | Method and apparatus for adjusting data transmission rate and training a congestion control model |
| Liu et al. | | Optimizing lightweight neural networks for efficient mobile edge computing |
| Gomez et al. | | Federated intelligence for active queue management in inter-domain congestion |
| JP2022009740A (ja) | | Control system and control method |
| Shaio et al. | | A reinforcement learning approach to congestion control of high-speed multimedia networks |
| CN119211101B (zh) | | Intelligent decision-making method and system applied to cross-domain communication networking, and electronic device |
| Samani et al. | | Scaleip: a hybrid autoscaling of voip services based on deep reinforcement learning |
| Zhou et al. | | Safety in DRL-Based Congestion Control: A Framework Empowered by Expert Refinement |
| Shen et al. | | Federated Multi-Agent Reinforcement Learning for Heterogeneous Action Spaces |
| US20250252347A1 (en) | | Client selection for asynchronous federated learning |
| Li et al. | | Noninvasive real-time traffic and congestion control algorithm based on policy |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19947594; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2021550731; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 19947594; Country of ref document: EP; Kind code of ref document: A1 |