CN114202066B - Network control method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN114202066B (application CN202210154404.0A)
- Authority
- CN
- China
- Prior art keywords
- network
- moment
- width learning
- learning network
- experience
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/16—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- Neurology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The application provides a network control method and apparatus, an electronic device, and a storage medium, relating to the technical field of computer networks. The method comprises the following steps: acquiring the network state of a fine-grained data plane at the current moment; performing online training on a first width learning network by using an experience library that stores local network-environment historical data and a second width learning network; processing the network state at the current moment with the first width learning network after online training to obtain the optimal execution action corresponding to that state; and packaging the optimal execution action at the current moment into a control rule data packet and then issuing it. By training the first width learning network online, the method can respond to network changes in real time and rapidly handle network emergencies.
Description
Technical Field
The present application relates to the field of computer network technologies, and in particular, to a network control method and apparatus, an electronic device, and a storage medium.
Background
The deep Q network (DQN) used in reinforcement learning suffers from poor learning and convergence and slow speed when its network structure is simple. Although the traditional width learning system model can converge rapidly by adding nodes incrementally to improve training accuracy, it still only outputs a classification result Y for input data X through training; it therefore remains a supervised machine learning method, whose mechanism limits its applicability and prevents it from extending well to unsupervised and weakly supervised learning scenarios. The convergence of the training effect and the versatility of the training method greatly affect the performance of the algorithm.
In a traditional network controller, the monitoring function is separated from the function of managing network devices and issuing rules: a network administrator must first analyze the monitored data and then issue rules. As a result, resolving a network emergency takes a significant amount of time.
Disclosure of Invention
In view of this, the present application provides a network control method, an apparatus, an electronic device, and a storage medium, which address the technical problem that existing network controllers can resolve network emergencies only after consuming a large amount of time.
In a first aspect, an embodiment of the present application provides a network control method, including:
acquiring the network state of a fine-grained data plane at the current moment;
performing online training on the first width learning network by using an experience library for storing local network environment historical data and a second width learning network;
processing the network state at the current moment by utilizing the first width learning network after online training to obtain the optimal execution action corresponding to the network state at the current moment;
and packaging the optimal execution action at the current moment into a control rule data packet, and then issuing the control rule data packet.
Further, the experience library stores experiences at a plurality of consecutive moments, each experience comprising: the network state at that moment, the execution action at that moment, the reward at that moment, and the network state at the next moment.
Further, obtaining the network state of the fine-grained data plane at the current moment includes: carrying out normalized extraction and format unification on the network state information at the current moment.
Further, the second width learning network and the first width learning network have the same structure;
performing online training on the first width learning network by using the pre-established experience library and the second width learning network includes:

putting the experience e_{t-1} = (s_{t-1}, a_{t-1}, r_{t-1}, s_t) into the experience library, wherein s_{t-1} is the preprocessed network state at the previous moment, a_{t-1} is the execution action at the previous moment, s_t is the preprocessed network state at the current moment, and r_{t-1} is the reward obtained from the network environment after taking action a_{t-1} and entering the next network state s_t;

randomly selecting P-1 experiences from the experience library, which together with the experience e_{t-1} form P experiences as experience samples;

inputting the next-moment network state s'_p of the p-th experience sample into the second width learning network to obtain the maximum value evaluation max_a Q(s'_p, a; θ_T), wherein θ_T represents the weight parameters of the second width learning network, s'_p is the network state at the next moment in the p-th experience, a represents a possible execution action of the second width learning network, and Q(s'_p, a; θ_T) is the value evaluation corresponding to taking action a;

calculating the target value y_p corresponding to the network state and execution action at the moment of the p-th experience sample:

y_p = r_p + γ · max_a Q(s'_p, a; θ_T)

wherein γ ≤ 1 is the discount factor and r_p is the reward of the p-th experience sample at that moment, 1 ≤ p ≤ P;

and taking the network states and execution actions of all P experience samples as input samples of the first width learning network, taking the target values y_p as expected outputs, and training the first width learning network by a weight calculation method based on ridge regression.
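A minimal numerical sketch of the training step above, under assumed dimensions and with a random feature mapping standing in for the width network's mapped and enhancement node layers (this is an illustration, not the patent's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

P, state_dim, n_actions, hidden = 32, 4, 3, 16
gamma = 0.9  # the discount factor, gamma <= 1

# Assumed random feature mapping standing in for the width network's mapped
# and enhancement nodes; only the output weights are solved by ridge regression.
W_feat = rng.normal(size=(state_dim + n_actions, hidden))

def features(states, actions):
    onehot = np.eye(n_actions)[actions]          # encode the action
    return np.tanh(np.concatenate([states, onehot], axis=1) @ W_feat)

# A batch of P experience samples (s, a, r, s'), randomly generated here.
s = rng.normal(size=(P, state_dim))
a = rng.integers(0, n_actions, size=P)
r = rng.normal(size=P)
s_next = rng.normal(size=(P, state_dim))

# Output weights of the second (target) width network, theta_T, assumed given.
W_T = rng.normal(size=(hidden,))

# y_p = r_p + gamma * max_a Q(s'_p, a; theta_T)
q_next = np.stack([features(s_next, np.full(P, act)) @ W_T
                   for act in range(n_actions)], axis=1)
y = r + gamma * q_next.max(axis=1)

# Ridge-regression solve for the first network's output weights:
# W_E = (A^T A + lam * I)^(-1) A^T y
A = features(s, a)
lam = 1e-2
W_E = np.linalg.solve(A.T @ A + lam * np.eye(hidden), A.T @ y)
print(W_E.shape)
```

A closed-form solve like this, rather than gradient descent, is what makes the width-learning variant of the DQN update fast.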
Further, the method further comprises: and randomly generating initial weight parameters of the first width learning network, and assigning the initial weight parameters of the first width learning network to the second width learning network.
Further, processing the preprocessed network state at the current moment by using the first width learning network after online training to obtain the optimal execution action corresponding to the current state includes:

processing the preprocessed network state information s_t at the current moment with the online-trained first width learning network, and outputting the optimal execution action a_t at the current moment:

a_t = argmax_a Q(s_t, a; θ_E) if the random factor is not less than the dynamic threshold ε; otherwise a random action

wherein θ_E represents the weight parameters of the first width learning network after online training, ε is a dynamic threshold, and the random factor is a random number in [0, 1]; a represents a possible execution action of the first width learning network, Q(s_t, a; θ_E) is the value evaluation corresponding to taking action a, and argmax_a Q(s_t, a; θ_E) is the execution action that maximizes the value evaluation.
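The ε-greedy selection step can be sketched as follows, assuming the Q values have already been computed; the schedule of the dynamic threshold itself is not shown:

```python
import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy(q_values, epsilon):
    """argmax_a Q with probability 1 - epsilon, otherwise a random action."""
    if rng.random() < epsilon:            # random factor drawn from [0, 1]
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

q = np.array([0.1, 0.7, 0.3])             # assumed Q values for three actions
best = epsilon_greedy(q, epsilon=0.0)     # epsilon = 0: always exploit
explore = epsilon_greedy(q, epsilon=1.0)  # epsilon = 1: always explore
print(best, explore)
```

In practice the threshold ε is typically decayed over time so that the controller explores early and exploits once the value estimates stabilize.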
Further, the method further comprises: and periodically acquiring the network parameters of the first width learning network, and updating the network parameters of the second width learning network.
In a second aspect, an embodiment of the present application provides a network control apparatus, including:
the acquisition unit is used for acquiring the network state of the fine-grained data plane at the current moment;
the online training unit is used for performing online training on the first width learning network by utilizing an experience library for storing local network environment historical data and the second width learning network;
the optimal execution action acquisition unit is used for processing the network state at the current moment by utilizing the first width learning network completed by online training to obtain the optimal execution action corresponding to the network state at the current moment;
and the issuing unit is used for packaging the optimal execution action at the current moment into a control rule data packet and then issuing the control rule data packet.
In a third aspect, an embodiment of the present application provides an electronic device, including: the network control system comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the network control method of the embodiment of the application.
In a fourth aspect, the present application provides a computer-readable storage medium, where computer instructions are stored, and when executed by a processor, the computer instructions implement the network control method of the present application.
By training the first width learning network online, the method and the device can respond to network changes in real time and quickly handle network emergencies.
Drawings
In order to more clearly illustrate the detailed description of the present application or the technical solutions in the prior art, the drawings needed to be used in the detailed description of the present application or the prior art description will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic diagram of a conventional network control architecture according to an embodiment of the present application;
fig. 2 is a schematic diagram of a network control architecture based on width reinforcement learning according to an embodiment of the present disclosure;
FIG. 3 is a diagram illustrating a width reinforcement learning system according to an embodiment of the present disclosure;
fig. 4 is a flowchart of a network control method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a process for in-band telemetry provided by an embodiment of the present application;
fig. 6 is a functional block diagram of a network control apparatus according to an embodiment of the present application;
fig. 7 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First, technical terms related to the embodiments of the present application will be briefly described.
1. Deep reinforcement learning
Reinforcement learning is a branch of machine learning. Compared with the classic problems of supervised and unsupervised learning, its most distinctive characteristic is learning through interaction: the agent continuously acquires knowledge from the rewards or punishments it receives while interacting with the environment, so as to adapt to that environment. In essence, reinforcement learning is guided by reward signals and can therefore be considered a type of weakly supervised learning. The mechanism of the value-based DQN reinforcement learning method is outlined in detail here.
DQN is an algorithm combining deep learning and reinforcement learning. Its motivation is that the storage space of the traditional reinforcement learning Q-learning algorithm is limited: for the large number of states in a complex environment, a Q table that stores the oversized state space and records how good each state is cannot be constructed. A neural network is therefore introduced, forming the deep Q network. Specifically, DQN combines a convolutional neural network (CNN), whose input is raw image data (the state), with Q-learning, whose output is a value estimate (Q value) for each action.
DQN has two neural network structures, referred to as EvalNet and TargetNet. Because the correlation between successive experiences is large during actual training, two neural networks with the same structure but different update frequencies are constructed to address this problem. Specifically, the output of EvalNet is used to evaluate the value function of the current state-action pair, while TargetNet supplies the training targets. Based on these two networks, the DQN algorithm updates EvalNet in every round by gradient descent, whereas TargetNet is updated periodically at a certain frequency by directly copying EvalNet's neural network parameters. In addition, compared with the traditional Q-learning method, DQN also stores experience data in a database during learning, extracts data from it by uniform random sampling, and trains the neural network with the extracted data.
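The EvalNet/TargetNet interplay and uniform experience sampling can be sketched as follows; the one-parameter "networks" and the placeholder update step are assumptions, illustrating only the update schedule, not a real DQN:

```python
import random
from collections import deque

random.seed(0)

# Hypothetical one-parameter "networks": plain lists standing in for EvalNet
# and TargetNet, showing the update pattern rather than real learning.
eval_params = [0.0]
target_params = list(eval_params)

replay = deque(maxlen=100)                     # the experience database
for t in range(50):
    replay.append((t, t % 3, 1.0, t + 1))      # (state, action, reward, next state)

K = 10                                         # TargetNet update period
for step in range(1, 31):
    batch = random.sample(list(replay), 8)     # uniform random sampling
    eval_params[0] += 0.01 * len(batch)        # placeholder "gradient step"
    if step % K == 0:
        target_params = list(eval_params)      # periodic hard copy of EvalNet
print(target_params[0])
```

Decoupling the target network's update frequency from the eval network's is what stabilizes training when consecutive experiences are strongly correlated.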
2. Width learning network
The width learning network (Broad Learning) is a novel neural network architecture that offers an effective solution to the huge computational cost of current deep learning methods. Moreover, if the network needs to be expanded, a width learning network can be extended on top of the original architecture, avoiding the high overhead of retraining that a traditional deep network would incur. In addition, the width learning architecture can be regarded as consisting of a random vector functional-link neural network and a related derivation mechanism.
The above describes the most basic width learning network. When it cannot meet the performance requirements of a specific training task, the width learning network can incrementally extend the number of enhancement nodes or mapping nodes to improve training accuracy. Note that during incremental expansion, the original neural network architecture and weights need not change: the updated input matrix is decomposed into the original input matrix and the corresponding newly added part, and the updated neural network weights are obtained through a pseudo-inverse operation. Therefore, unlike the retraining of a deep learning network, the structure of a width learning network can be continuously and dynamically adjusted through rapid incremental expansion, which gives it extremely high extensibility and greatly reduces training overhead.
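A minimal sketch of this expansion, assuming a tanh random projection for the mapped nodes and, for brevity, recomputing the ridge solution instead of performing the true incremental pseudo-inverse update:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
X = rng.normal(size=(n, d))
Y = (X @ rng.normal(size=d) > 0).astype(float)

def ridge_weights(H, Y, lam=1e-2):
    # Output weights via the regularized pseudo-inverse:
    # W = (H^T H + lam * I)^(-1) H^T Y
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ Y)

# Mapped feature nodes: a fixed random projection, RVFLNN-style.
W1 = rng.normal(size=(d, 8))
H = np.tanh(X @ W1)
W_out = ridge_weights(H, Y)

# Incremental expansion: append new enhancement-node columns. The existing
# mapping W1 and architecture are untouched; only the output weights are
# re-solved. (True broad learning updates the pseudo-inverse incrementally;
# recomputation here keeps the sketch short.)
W2 = rng.normal(size=(d, 8))
H2 = np.concatenate([H, np.tanh(X @ W2)], axis=1)
W_out2 = ridge_weights(H2, Y)
print(W_out.shape, W_out2.shape)
```

Because only the output-weight solve is repeated, expansion cost grows with the number of added columns rather than with the whole network, which is the source of the low training overhead claimed above.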
3. In-band telemetry (INT) method
In-band telemetry is a new network monitoring technology. The basic idea is to record, hop by hop, the information of the network devices that forward a data packet into the packet's header bits, and to extract that information for the complete source-to-destination path at the end point, forming a fine-grained network sensing capability.
Common traditional network measurement methods include the Ping protocol, the IP Measurement Protocol (IPMP), and MPLS packet-loss/delay measurement. These methods actively inject special protocol packets into the network to gather statistics, which can cause large network overhead, and they can only measure coarse-grained performance metrics such as packet loss rate, delay, and TTL. With the rise of software-defined networking, another class of network measurement technology appeared, in which a controller obtains information about devices inside the network directly from the network's periphery. This approach yields a global view of the state, but because it transfers network state information between the controller and the network devices through a large amount of data exchange, it generates higher overhead; at the same time, reading information directly from the controller achieves only coarse-grained network telemetry and cannot acquire packet-level network state information.
The network information measurement by adopting the in-band telemetry mode can fully utilize the data packet which is already transmitted in the network, and when the data packet passes through one network device, the state information related to the network device is added on the data packet. The newly added network state information is extracted before the data packet is transmitted to the destination node. The in-band remote measurement method is equivalent to further expanding the sensing capability of the network while ensuring the basic transmission capability of a data packet, so that the method is a high-efficiency and low-overhead network management and control method.
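As a minimal sketch of the hop-by-hop idea above, using an assumed dict-based packet rather than real INT header bits:

```python
# Toy illustration (assumed format, not the real INT header layout): each
# switch appends its metadata to the packet, and the hop before the
# destination strips it off for reporting, restoring the original packet.

def switch_forward(packet, switch_id, queue_depth, ts):
    packet = dict(packet)                       # copy, then append telemetry
    packet["int_stack"] = packet.get("int_stack", []) + [
        {"switch": switch_id, "queue_depth": queue_depth, "ts": ts}
    ]
    return packet

def sink_extract(packet):
    packet = dict(packet)
    telemetry = packet.pop("int_stack", [])     # reported to the controller
    return packet, telemetry

pkt = {"src": "h1", "dst": "h2", "payload": b"data"}
for hop, (sw, q) in enumerate([("s1", 3), ("s2", 10), ("s3", 1)]):
    pkt = switch_forward(pkt, sw, q, ts=hop)

restored, report = sink_extract(pkt)
print(len(report), restored == {"src": "h1", "dst": "h2", "payload": b"data"})
```

The point of the pattern is that measurement rides on packets already in flight, so per-hop state is gathered without injecting extra probe traffic.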
4. Network control method
Network control refers to fine-grained, differentiated management and control of each device terminal in a network; most common network control schemes are implemented with a centralized controller or a CPU integrated in the network devices. As shown in fig. 1, a common centralized controller interacts with the network devices through a southbound interface: it can directly read device information and issue network rules based on the information obtained, realizing a control mode that integrates high awareness with fast response. However, traditional network management and control is limited to coarse-grained sensing and control of network devices and basic network resources. When sudden situations such as network congestion or single-point failure occur, the centralized control unit cannot promptly adapt and issue timely, effective control rules in response to changes in the underlying network resources and devices, and manual adjustment and correction may even be needed. Even if the network administrator responds quickly, the network incurs very high burst-handling and response times, which seriously degrades transmission performance so that data packets cannot be delivered on time as required. By deploying a width reinforcement learning system in the centralized controller, the present method makes network management and control efficient and adaptive.
After introducing the technical terms related to the present application, the design ideas of the embodiments of the present application will be briefly described below.
The monitoring function of the traditional centralized controller is separated from the function of managing and issuing the rules of the network equipment, and a network administrator needs to analyze the rules and then issue the rules. This takes a significant amount of time to be able to resolve the network emergency.
To solve the above technical problems, the present application first provides a width reinforcement learning system deployed in a centralized controller. It acquires and collects fine-grained network state information through in-band telemetry, transmits the collected state information to the centralized controller, and has the controller select an action according to the real-time network state, i.e., issue different packet transmission and forwarding rules, forming a fast-response, dynamically regulated strategy for different network states. The specific control architecture of the system is shown in fig. 2.
First, in order to comprehensively utilize the intelligent decision-making capability of the DQN method and the fast convergence capability of the width learning method and further improve the training accuracy of the algorithm, the structure of the width reinforcement learning system designed in the embodiment of the present application is shown in fig. 3.
The width reinforcement learning system can be regarded as the combination of the DQN reinforcement learning algorithm and a width learning network: it adopts the EvalNet-TargetNet and experience-library mechanisms of the DQN algorithm to complement each method's weaknesses, but replaces the deep neural network part with a width neural network. Specifically, the width reinforcement learning system includes: the environment, the experience library and training pool, the E-BLS network, and the T-BLS network:
environment: a network environment where agents (centralized controllers) interact. In particular, the agent obtains the current state from the environments t It is transmitted to E-BLS network to obtain actiona t I.e. specific network management rules output by the centralized controller, the network environment changes to a new state according to the actions t+1The agent receives a single step rewardr t 。
Experience base and training pool: the experience base stores data for training the E-BLS network, but the data is stored continuously in chronological order. Therefore, during training, a batch of experience (P in number) needs to be randomly selected from the training data, and is recorded asAnd put it into a training pool. In addition, the training pool willInputting into T-BLS network to obtain value evaluation Q value corresponding to optimal action valueTo facilitate subsequent calculations。
E-BLS network: employing breadth learning network characterizationAndthe relationship between them. Generating a value assessment value by interacting with an environmentAnd returns to the agent for optimal action. In addition, it also periodically synchronizes parameter updates with the T-BLS network.
T-BLS network: also, a wide learning network is employed, which is consistent with the structure of the E-BLS network but with a different update frequency, periodically updating neural network parameters from the E-BLS network. It is also responsible for receiving input from the training poolAnd generating a Q value corresponding to the optimal action value.
The application applies the width reinforcement learning system to the generation of the network control strategy and deploys the network control strategy on the controller. The information acquired by using the in-band telemetry technology is collected and used as the environment of the system, the network control strategy is generated by using the width reinforcement learning method and then is issued to the network equipment, and a network control mechanism integrating network state perception, controller intelligent decision and control rule issuing is formed. Specifically, the network management and control closed loop is formed by the following three processes:
network state awareness: the method comprises the steps of acquiring and collecting fine granularity of network states through the existing related in-band telemetry technology, and unifying normalized extraction and formatting of real-time network information of a data plane to form information data which is easy to understand by a neural network and is used for inputting into a width reinforcement learning system.
And (3) intelligently deciding by the controller: due to the fact that the network in a real scene has the problems of flow fluctuation and node instability, the online learning strategy is adopted for real-time feedback adjustment, and a dynamic control decision scheme is formed. Therefore, the breadth-enhanced learning system uses the extracted network information data as the input of the state value, firstly carries out the online training and updating of the E-BLS network, and then carries out the online training and updating according to the current state through the E-BLS networks t Selecting the optimal action corresponding to the maximum Q value outputa t 。
And (3) control rule issuing: the action value output by the width reinforcement learning system is not the rule which is finally issued, and the action value is packaged into a control rule data packet which can be identified by the switch and then issued.
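The control-rule issuing step can be sketched as follows; the packet field layout (type byte, rule id, length) and the JSON body are assumptions for illustration, not the switch-recognizable format used in the patent:

```python
import json
import struct

# Hypothetical encoding: pack the chosen action into a control-rule packet
# with a 1-byte type, a 2-byte rule id, and a 2-byte body length, followed
# by a JSON body. A real deployment would use the switch's own rule format.
def package_rule(action, rule_id):
    body = json.dumps({"action": action, "rule": rule_id}).encode()
    return struct.pack("!BHH", 0x01, rule_id, len(body)) + body

pkt = package_rule(action=2, rule_id=7)
kind, rid, length = struct.unpack("!BHH", pkt[:5])
print(kind, rid, json.loads(pkt[5:5 + length]))
```

Whatever encoding is chosen, the essential point is that the learned action index is translated into a self-describing packet the data plane can parse without consulting the controller again.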
For the in-network control scenario, the application improves and combines the width learning and deep reinforcement learning techniques, deploying a width learning neural network system within the DQN framework. After this improvement, the method trains rapidly and converges well, can produce higher reward values, and forms a robust in-network control strategy that can be dynamically regulated for different network environment states.
The breadth-enhanced learning network architecture provided by the application has high convergence speed and robust training effect, can be widely applied to various machine learning problems, and particularly can be combined with a decision control method under a network environment to enable a centralized network controller to form intelligent network strategy self-learning and self-adaptive capacity; in addition, because the E-BLS network always carries out on-line training and updating, the network change can be responded in real time, and the network emergency can be quickly dealt with.
According to the network control method, real-time state information is obtained by adopting a fine-grained in-band remote measurement method, adaptive adjustment and issuing of network control rules are realized through real-time and rapid training of width reinforcement learning and intelligent control, efficient adjustment and adaptation of the network control rules as required are realized, and finally, the intelligent robust network control method is constructed.
After introducing the application scenario and the design concept of the embodiment of the present application, the following describes a technical solution provided by the embodiment of the present application.
As shown in fig. 4, an embodiment of the present application provides a network control method, including:
step 101: acquiring the network state of a fine-grained data plane at the current moment;
specifically, as shown in fig. 5, a basic processing flow of in-band telemetry adopted in the present application is that a sending end is set to send a data packet according to a certain periodic rate, telemetry information is added to a packet header bit of the data packet every time the data packet passes through one switch, the data packet is analyzed in a previous hop of a receiving end, the acquired telemetry information is transmitted to a controller, and the data packet is restored to an initial state.
As a possible implementation, obtaining the network state of the fine-grained data plane at the current moment includes: carrying out normalized extraction and format unification on the network state information at the current moment.
Step 102: performing online training on the first width learning network by using an experience library for storing local network environment historical data and a second width learning network;
the experience base stores experiences at a plurality of successive time instants, the experiences comprising: the network state at the time, the execution action at the time, the reward at the time and the network state at the next time.
In this embodiment, the first width learning network is the E-BLS network, the second width learning network is the T-BLS network, and the E-BLS and T-BLS networks have the same structure. At the initial moment, the initial weight parameters θ_E of the E-BLS network are randomly generated and assigned to the T-BLS network;
the method specifically comprises the following steps:
step 201: judging whether the experience library reaches the maximum capacity or not, if so, deleting the earliest experience stored in the experience library according to a time sequence; otherwise, go to step 203;
step 202: will experiencePutting the obtained product into an experience library, wherein,the network state at the last moment after preprocessing,in order to perform the action at the last moment,the network state at the current moment after preprocessing,performing actions for an agentEntering the network state of the next momentLater, the reward obtained from the network environment;
Step 203: randomly select P-1 experiences from the experience library, which together with the experience (s_{t-1}, a_{t-1}, r_t, s_t) form P experience samples;
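Steps 201 through 203 can be sketched with a bounded replay buffer: the oldest entry is evicted at capacity, and the newest experience is always part of the P training samples. Capacity and P values below are illustrative assumptions:

```python
# Sketch of the experience library (steps 201-203). collections.deque with
# maxlen silently drops the oldest item, which matches step 201's eviction.
import random
from collections import deque

class ExperienceLibrary:
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)

    def add(self, s_prev, a_prev, r, s_cur):
        self.buf.append((s_prev, a_prev, r, s_cur))   # step 202

    def sample(self, P):
        """Step 203: P-1 random past experiences plus the newest one."""
        newest = self.buf[-1]
        rest = random.sample(list(self.buf)[:-1], P - 1)
        return rest + [newest]

lib = ExperienceLibrary(capacity=100)
for t in range(150):                        # overfill: oldest 50 evicted
    lib.add(t, 0, 1.0, t + 1)
batch = lib.sample(P=8)
assert len(lib.buf) == 100 and len(batch) == 8
assert batch[-1] == (149, 0, 1.0, 150)      # newest experience is included
```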
Step 204: input the next-moment network state s'_p of the p-th experience sample into the T-BLS network to obtain the value evaluation corresponding to the optimal action, max_{a'} Q_T(s'_p, a'; θ_T), where θ_T denotes the weight parameters of the T-BLS network; s'_p is the next-moment network state in the p-th experience; a' denotes a possible execution action of the T-BLS network; and Q_T(s'_p, a'; θ_T) is the value evaluation corresponding to taking action a';
Step 205: calculate the target value y_p corresponding to the network state s_p and execution action a_p of the p-th experience sample:

y_p = r_p + γ · max_{a'} Q_T(s'_p, a'; θ_T)

where γ is the discount factor and r_p is the reward of the p-th experience sample at that moment, 1 ≤ p ≤ P;
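The target computation of steps 204 and 205 can be sketched as below, with a linear map standing in for the T-BLS value function (the network form and dimensions are assumptions):

```python
# Sketch of y_p = r_p + γ · max_{a'} Q_T(s'_p, a'; θ_T) over a batch of P
# samples, using a linear stand-in for the T-BLS.
import numpy as np

def td_targets(rewards, next_states, theta_T, gamma=0.9):
    """rewards: (P,); next_states: (P, n_features); theta_T: (n_features, n_actions)."""
    q_next = next_states @ theta_T            # Q_T for every possible action a'
    return rewards + gamma * q_next.max(axis=1)   # max over a', per sample

theta_T = np.array([[1.0, 2.0],
                    [0.5, 0.0]])
s_next = np.array([[1.0, 1.0]])               # Q_T = [1.5, 2.0] -> max is 2.0
y = td_targets(np.array([1.0]), s_next, theta_T)
assert np.isclose(y[0], 1.0 + 0.9 * 2.0)      # y_1 = 2.8
```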
Step 206: take the network state and execution action at that moment in all P experience samples as input samples of the E-BLS network, take the target values as the expected output, and train the E-BLS network using a weight calculation method based on ridge regression.
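A ridge-regression weight solve as used in broad learning systems has the closed form W = (AᵀA + λI)⁻¹AᵀY, where A stacks the hidden-node outputs of the P input samples and Y stacks the targets; the sketch below assumes this form, with the regularization strength λ chosen arbitrarily:

```python
# Closed-form ridge regression for the output weights (step 206 sketch).
import numpy as np

def ridge_weights(A, Y, lam=1e-3):
    """Solve W = (AᵀA + λI)⁻¹ AᵀY without forming an explicit inverse."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ Y)

rng = np.random.default_rng(1)
A = rng.standard_normal((32, 8))        # 32 samples, 8 hidden-node outputs
W_true = rng.standard_normal((8, 2))
Y = A @ W_true                          # noiseless targets for the check
W = ridge_weights(A, Y)
assert np.allclose(W, W_true, atol=1e-2)   # recovered up to λ shrinkage
```

Because this is a single linear solve rather than iterative backpropagation, it is consistent with the low training complexity claimed for the E-BLS network.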
The T-BLS network is updated with period K by copying the weights of the E-BLS network (θ_T ← θ_E). Likewise, the E-BLS network also supports incremental expansion of feature mapping nodes and enhancement nodes. In general, the training complexity of the E-BLS network is low, and incremental learning also ensures good scalability.
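The period-K update θ_T ← θ_E can be sketched as follows (K and the "training step" are illustrative; a numpy vector stands in for the network weights):

```python
# Sketch of the periodic target-network sync: the T-BLS is frozen between
# syncs and refreshed by a straight copy of the E-BLS weights every K steps.
import numpy as np

K = 5
theta_E = np.zeros(3)
theta_T = theta_E.copy()
for step in range(1, 13):
    theta_E += 1.0                      # stand-in for one E-BLS training step
    if step % K == 0:
        theta_T = theta_E.copy()        # θ_T ← θ_E

assert theta_E.tolist() == [12.0, 12.0, 12.0]
assert theta_T.tolist() == [10.0, 10.0, 10.0]   # last sync was at step 10
```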
Step 103: processing the network state at the current moment by utilizing the first width learning network after online training to obtain the optimal execution action corresponding to the current state;
When the amount of experience in the experience library is sufficient to train the first width learning network online, the first width learning network uses the preprocessed network state information s_t at the current moment to calculate the value evaluation Q_E and selects the optimal action accordingly; the ε-greedy method is adopted to ensure that action selection trades off between random exploration and optimal decision making. The optimal execution action a_t at the current moment is:
a_t = a random action a, if the random factor < ε; otherwise, a_t = argmax_a Q_E(s_t, a; θ_E)

where θ_E denotes the weight parameters of the first width learning network after online training; Q_E(s_t, a; θ_E) is the value evaluation of taking action a; ε is the dynamic threshold, and the random factor is a random number in [0, 1]. Whether the selection of the execution action a is biased toward randomness or toward maximizing Q_E is determined mainly by the change of the threshold; argmax_a Q_E(s_t, a; θ_E) is the execution action that maximizes Q_E(s_t, a; θ_E).
In the present embodiment, as the algorithm iterates, the dynamic threshold is gradually reduced from 0.95 to 0.05. This means that in the early stage of execution, action selection is biased toward randomness, which allows the algorithm to fully explore optimal actions in the solution space while avoiding local optimal solutions; as the algorithm is iteratively updated, action selection tends toward deterministic decisions based on the maximized Q value, so that the algorithm eventually stabilizes and makes robust action decisions.
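The ε-greedy selection with the decaying dynamic threshold can be sketched as below; the patent gives only the endpoints 0.95 → 0.05, so the linear decay schedule here is an assumption:

```python
# Sketch of ε-greedy action selection with a dynamic threshold that decays
# linearly from 0.95 to 0.05 over the run.
import random
import numpy as np

def select_action(q_values, step, total_steps, eps_start=0.95, eps_end=0.05):
    eps = max(eps_end, eps_start - (eps_start - eps_end) * step / total_steps)
    if random.random() < eps:                   # early on: mostly explore
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))             # later: greedy on Q_E

random.seed(0)                                  # makes the demo deterministic
q = np.array([0.1, 0.9, 0.3])
a = select_action(q, step=10_000, total_steps=10_000)   # eps has decayed to 0.05
assert a == 1                                   # greedy pick of argmax Q_E
```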
Step 104: packaging the optimal execution action at the current moment into a control rule data packet, and then issuing the control rule data packet;
based on the foregoing embodiments, an embodiment of the present application provides a network control apparatus, and referring to fig. 6, a network control apparatus 300 according to an embodiment of the present application at least includes:
an obtaining unit 301, configured to obtain a network state of a fine-grained data plane at a current time;
an online training unit 302, configured to perform online training on a first width learning network by using an experience base storing local network environment historical data and a second width learning network;
an optimal execution action obtaining unit 303, configured to complete processing of the network state at the current time by using the first width learning network after online training, so as to obtain an optimal execution action corresponding to the current state;
the issuing unit 304 is configured to encapsulate the optimal execution action at the current time into a control rule data packet, and then issue the control rule data packet.
It should be noted that the network control apparatus 300 provided in the embodiment of the present application solves the technical problem on a principle similar to that of the network control method provided in the embodiment of the present application; therefore, for the implementation of the network control apparatus 300, reference may be made to the implementation of the network control method, and repeated details are not described again.
As shown in fig. 7, the electronic device 400 provided in the embodiment of the present application at least includes: a processor 401, a memory 402, and a computer program stored on the memory 402 and executable on the processor 401, wherein when the processor 401 executes the computer program, the network control method provided by the embodiment of the present application is implemented.
The electronic device 400 provided by the embodiment of the present application may further include a bus 403 that connects different components (including the processor 401 and the memory 402). Bus 403 represents one or more of any of several types of bus structures, including a memory bus, a peripheral bus, a local bus, and so forth.
The memory 402 may include readable media in the form of volatile memory, such as a random access memory (RAM) 4021 and/or a cache memory 4022, and may further include a read-only memory (ROM) 4023.
It should be noted that the electronic device 400 shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments.
The embodiment of the present application further provides a computer-readable storage medium, where computer instructions are stored, and when the computer instructions are executed by a processor, the computer instructions implement the network control method provided by the embodiment of the present application.
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
Claims (7)
1. A network control method, comprising:
acquiring the network state of a fine-grained data plane at the current moment;
performing online training on the first width learning network by using an experience library for storing local network environment historical data and a second width learning network;
processing the network state at the current moment by utilizing the first width learning network after online training to obtain the optimal execution action corresponding to the network state at the current moment;
packaging the optimal execution action at the current moment into a control rule data packet, and then issuing the control rule data packet;
the experience base stores experiences at a plurality of successive time instants, the experiences comprising: the network state of the moment, the execution action of the moment, the reward of the moment and the network state of the next moment;
the second width learning network and the first width learning network have the same structure;
wherein the performing online training on the first width learning network by using the experience library storing local network environment historical data and the second width learning network comprises:
putting the experience (s_{t-1}, a_{t-1}, r_t, s_t) into the experience library, where s_{t-1} is the preprocessed network state at the previous moment, a_{t-1} is the execution action at the previous moment, s_t is the preprocessed network state at the current moment, and r_t is the reward obtained from the network environment after performing action a_{t-1} and entering the network state s_t;
randomly selecting P-1 experiences from the experience library, which together with the experience (s_{t-1}, a_{t-1}, r_t, s_t) form P experience samples;
inputting the next-moment network state s'_p of the p-th experience sample into the second width learning network to obtain the maximum value evaluation max_{a'} Q_T(s'_p, a'; θ_T), where θ_T denotes the weight parameters of the second width learning network; s'_p is the next-moment network state in the p-th experience; a' denotes a possible execution action of the second width learning network; and Q_T(s'_p, a'; θ_T) is the value evaluation corresponding to taking action a';
calculating the target value y_p corresponding to the network state s_p and execution action a_p of the p-th experience sample:

y_p = r_p + γ · max_{a'} Q_T(s'_p, a'; θ_T)

where γ is the discount factor and r_p is the reward of the p-th experience sample at that moment, 1 ≤ p ≤ P;
taking the network state and execution action at that moment in all P experience samples as input samples of the first width learning network, taking the target values as the expected output, and training the first width learning network using a weight calculation method based on ridge regression;
and wherein the processing the preprocessed network state at the current moment by using the first width learning network after online training to obtain the optimal execution action corresponding to the current state comprises:
processing the preprocessed network state information s_t at the current moment by using the first width learning network after online training, and outputting the optimal execution action a_t at the current moment:

a_t = a random action a, if the random factor < ε; otherwise, a_t = argmax_a Q_E(s_t, a; θ_E)

wherein θ_E denotes the weight parameters of the first width learning network after online training; ε is the dynamic threshold, and the random factor is a random number in [0, 1]; a denotes a possible execution action of the first width learning network; Q_E(s_t, a; θ_E) is the value evaluation corresponding to taking action a; and argmax_a Q_E(s_t, a; θ_E) is the execution action that maximizes Q_E(s_t, a; θ_E).
2. The network control method according to claim 1, wherein the obtaining the network status of the fine-grained data plane at the current time comprises: and carrying out normalized extraction and format unification on the network state information at the current moment.
3. The network control method of claim 1, wherein the method further comprises: and randomly generating initial weight parameters of the first width learning network, and assigning the initial weight parameters of the first width learning network to the second width learning network.
4. The network control method of claim 1, wherein the method further comprises: and periodically acquiring the network parameters of the first width learning network, and updating the network parameters of the second width learning network.
5. A network control apparatus, comprising:
the acquisition unit is used for acquiring the network state of the fine-grained data plane at the current moment;
the online training unit is used for performing online training on the first width learning network by utilizing an experience library for storing local network environment historical data and the second width learning network;
the optimal execution action acquisition unit is used for processing the network state at the current moment by using the first width learning network after online training, to obtain the optimal execution action corresponding to the network state at the current moment;
the issuing unit is used for packaging the optimal execution action at the current moment into a control rule data packet and then issuing the control rule data packet;
the experience base stores experiences at a plurality of successive time instants, the experiences comprising: the network state of the moment, the execution action of the moment, the reward of the moment and the network state of the next moment;
the second width learning network and the first width learning network have the same structure;
the online training unit is specifically configured to:
putting the experience (s_{t-1}, a_{t-1}, r_t, s_t) into the experience library, where s_{t-1} is the preprocessed network state at the previous moment, a_{t-1} is the execution action at the previous moment, s_t is the preprocessed network state at the current moment, and r_t is the reward obtained from the network environment after performing action a_{t-1} and entering the network state s_t;
randomly selecting P-1 experiences from the experience library, which together with the experience (s_{t-1}, a_{t-1}, r_t, s_t) form P experience samples;
inputting the next-moment network state s'_p of the p-th experience sample into the second width learning network to obtain the maximum value evaluation max_{a'} Q_T(s'_p, a'; θ_T), where θ_T denotes the weight parameters of the second width learning network; s'_p is the next-moment network state in the p-th experience; a' denotes a possible execution action of the second width learning network; and Q_T(s'_p, a'; θ_T) is the value evaluation corresponding to taking action a';
calculating the target value y_p corresponding to the network state s_p and execution action a_p of the p-th experience sample:

y_p = r_p + γ · max_{a'} Q_T(s'_p, a'; θ_T)

where γ is the discount factor and r_p is the reward of the p-th experience sample at that moment, 1 ≤ p ≤ P;
taking the network state and execution action at that moment in all P experience samples as input samples of the first width learning network, taking the target values as the expected output, and training the first width learning network using a weight calculation method based on ridge regression;
the optimal execution action acquisition unit is specifically configured to:
processing the preprocessed network state information s_t at the current moment by using the first width learning network after online training, and outputting the optimal execution action a_t at the current moment:

a_t = a random action a, if the random factor < ε; otherwise, a_t = argmax_a Q_E(s_t, a; θ_E)

wherein θ_E denotes the weight parameters of the first width learning network after online training; ε is the dynamic threshold, and the random factor is a random number in [0, 1]; a denotes a possible execution action of the first width learning network; Q_E(s_t, a; θ_E) is the value evaluation corresponding to taking action a; and argmax_a Q_E(s_t, a; θ_E) is the execution action that maximizes Q_E(s_t, a; θ_E).
6. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the network control method according to any of claims 1-4 when executing the computer program.
7. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the network control method of any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210154404.0A CN114202066B (en) | 2022-02-21 | 2022-02-21 | Network control method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114202066A CN114202066A (en) | 2022-03-18 |
CN114202066B true CN114202066B (en) | 2022-04-26 |
Family
ID=80645716
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210154404.0A Active CN114202066B (en) | 2022-02-21 | 2022-02-21 | Network control method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114202066B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111010294A (en) * | 2019-11-28 | 2020-04-14 | 国网甘肃省电力公司电力科学研究院 | Electric power communication network routing method based on deep reinforcement learning |
CN112491714A (en) * | 2020-11-13 | 2021-03-12 | 安徽大学 | Intelligent QoS route optimization method and system based on deep reinforcement learning in SDN environment |
CN112564118A (en) * | 2020-11-23 | 2021-03-26 | 广西大学 | Distributed real-time voltage control method capable of expanding quantum deep width learning |
CN113328938A (en) * | 2021-05-25 | 2021-08-31 | 电子科技大学 | Network autonomous intelligent management and control method based on deep reinforcement learning |
US11121788B1 (en) * | 2020-06-08 | 2021-09-14 | Wuhan University | Channel prediction method and system for MIMO wireless communication system |
-
2022
- 2022-02-21 CN CN202210154404.0A patent/CN114202066B/en active Active
Non-Patent Citations (1)
Title |
---|
Deep Q-Learning Aided Networking, Caching, and Computing Resources Allocation in Software-Defined Satellite-Terrestrial Networks; Chao Qiu et al.; IEEE Transactions on Vehicular Technology; June 2019; Vol. 68, No. 6 *
Also Published As
Publication number | Publication date |
---|---|
CN114202066A (en) | 2022-03-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | A multi-agent reinforcement learning approach for efficient client selection in federated learning | |
Wang et al. | Machine learning for networking: Workflow, advances and opportunities | |
CN112491714B (en) | Intelligent QoS route optimization method and system based on deep reinforcement learning in SDN environment | |
Yao et al. | AI routers & network mind: A hybrid machine learning paradigm for packet routing | |
WO2020077682A1 (en) | Service quality evaluation model training method and device | |
CN112422443B (en) | Adaptive control method, storage medium, equipment and system of congestion algorithm | |
US11507887B2 (en) | Model interpretability using proxy features | |
US20230145097A1 (en) | Autonomous traffic (self-driving) network with traffic classes and passive and active learning | |
CN116743635B (en) | Network prediction and regulation method and network regulation system | |
US20230112534A1 (en) | Artificial intelligence planning method and real-time radio access network intelligence controller | |
CN111556173B (en) | Service chain mapping method based on reinforcement learning | |
WO2023045565A1 (en) | Network management and control method and system thereof, and storage medium | |
Sun et al. | Accelerating convergence of federated learning in MEC with dynamic community | |
Huang et al. | Intelligent traffic control for QoS optimization in hybrid SDNs | |
CN108880909B (en) | Network energy saving method and device based on reinforcement learning | |
Chen et al. | Deep learning-based traffic prediction for energy efficiency optimization in software-defined networking | |
Wei et al. | GRL-PS: Graph embedding-based DRL approach for adaptive path selection | |
Zheng et al. | Enabling robust DRL-driven networking systems via teacher-student learning | |
Zeng et al. | Multi-agent reinforcement learning for adaptive routing: A hybrid method using eligibility traces | |
CN114202066B (en) | Network control method and device, electronic equipment and storage medium | |
Sneha et al. | Prediction of network congestion at router using machine learning technique | |
Hu et al. | Clustered data sharing for Non-IID federated learning over wireless networks | |
CN115022231A (en) | Optimal path planning method and system based on deep reinforcement learning | |
Zheng et al. | Leveraging domain knowledge for robust deep reinforcement learning in networking | |
WO2021064769A1 (en) | System, method, and control device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||