CN114202066B - Network control method and device, electronic equipment and storage medium - Google Patents

Network control method and device, electronic equipment and storage medium

Info

Publication number
CN114202066B
CN114202066B (application CN202210154404.0A)
Authority
CN
China
Prior art keywords
network
moment
width learning
learning network
experience
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210154404.0A
Other languages
Chinese (zh)
Other versions
CN114202066A (en)
Inventor
姚海鹏
吴桐
王尊梁
苏波
买天乐
忻向军
吴巍
张尼
吴小华
王山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tianchi Network Co ltd
Beijing University of Posts and Telecommunications
Original Assignee
Beijing Tianchi Network Co ltd
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tianchi Network Co ltd, Beijing University of Posts and Telecommunications filed Critical Beijing Tianchi Network Co ltd
Priority to CN202210154404.0A priority Critical patent/CN114202066B/en
Publication of CN114202066A publication Critical patent/CN114202066A/en
Application granted granted Critical
Publication of CN114202066B publication Critical patent/CN114202066B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Neurology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application provides a network control method and apparatus, an electronic device and a storage medium, relating to the technical field of computer networks. The method comprises the following steps: acquiring the network state of a fine-grained data plane at the current moment; performing online training on a first width learning network by using an experience library storing local network environment historical data and a second width learning network; processing the network state at the current moment by using the first width learning network after online training to obtain the optimal execution action corresponding to the network state at the current moment; and encapsulating the optimal execution action at the current moment into a control rule data packet and then issuing the control rule data packet. By training the first width learning network online, the method can respond to network changes in real time and quickly handle network emergencies.

Description

Network control method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer network technologies, and in particular, to a network control method and apparatus, an electronic device, and a storage medium.
Background
The deep reinforcement learning network (DQN, Deep Q Network) suffers from poor learning and convergence performance and slow speed because of its simple network structure. Although the traditional width learning system model can achieve rapid convergence by adding nodes incrementally so as to improve training accuracy, it still only outputs a classification result Y for input data X through training; it therefore remains a supervised machine learning method whose application range is limited by its mechanism, and it cannot be extended well to application scenarios of unsupervised learning and weakly supervised learning. The convergence of the training effect and the versatility of the training method greatly affect the performance of the algorithm.
The monitoring function of a traditional network controller is separated from its function of managing network devices and issuing rules: a network administrator must analyze the monitored information before issuing rules. As a result, resolving a network emergency takes a significant amount of time.
Disclosure of Invention
In view of this, the present application provides a network control method, an apparatus, an electronic device, and a storage medium, which address the technical problem that existing network controllers can resolve network emergencies only after consuming a large amount of time.
In a first aspect, an embodiment of the present application provides a network control method, including:
acquiring the network state of a fine-grained data plane at the current moment;
performing online training on the first width learning network by using an experience library for storing local network environment historical data and a second width learning network;
processing the network state at the current moment by utilizing the first width learning network after online training to obtain the optimal execution action corresponding to the network state at the current moment;
and packaging the optimal execution action at the current moment into a control rule data packet, and then issuing the control rule data packet.
Further, the experience base stores experiences at a plurality of consecutive moments, each experience comprising: the network state at a moment, the execution action at that moment, the reward at that moment, and the network state at the next moment.
Further, obtaining the network state of the fine-grained data plane at the current moment includes: performing normalized extraction and format unification on the network state information at the current moment.
Further, the second width learning network and the first width learning network have the same structure;
utilizing a pre-established experience base and a second width learning network to carry out online training on the first width learning network comprises the following steps:
putting the experience (s_{t-1}, a_{t-1}, r_{t-1}, s_t) into the experience library, wherein s_{t-1} is the preprocessed network state at the previous moment, a_{t-1} is the execution action at the previous moment, s_t is the preprocessed network state at the current moment, and r_{t-1} is the reward obtained from the network environment after taking action a_{t-1} and entering the next network state s_t;
randomly selecting P-1 experiences from the experience library, which together with the experience (s_{t-1}, a_{t-1}, r_{t-1}, s_t) form P experience samples;
inputting the network state at the next moment in the p-th experience sample into the second width learning network to obtain the maximum value evaluation value max_{a'} Q(s'_p, a'; θ_T), wherein θ_T represents the weight parameter of the second width learning network, s'_p is the network state at the next moment in the p-th experience, a' represents a possible execution action of the second width learning network, and Q(s'_p, a'; θ_T) is the value evaluation value corresponding to taking action a';
calculating the target value y_p corresponding to the network state and execution action (s_p, a_p) at the moment of the p-th experience sample:

y_p = r_p + γ · max_{a'} Q(s'_p, a'; θ_T),

wherein γ is a discount factor, r_p is the reward at the moment of the p-th experience sample, and 1 ≤ p ≤ P;
and taking the network state and the execution action at the moment in all P experience samples as input samples of the first width learning network, taking the target values as the expected output, and training the first width learning network by adopting a weight calculation method based on ridge regression.
Further, the method further comprises: randomly generating initial weight parameters of the first width learning network, and assigning the initial weight parameters of the first width learning network to the second width learning network.
Further, processing the preprocessed network state at the current moment by using the first width learning network after online training to obtain the optimal execution action corresponding to the current state includes:
processing the preprocessed network state information s_t at the current moment with the first width learning network after online training, and outputting the optimal execution action a_t at the current moment:

a_t = a random execution action a, if the random factor is smaller than the dynamic threshold ε; otherwise, a_t = argmax_a Q(s_t, a; θ_E),

wherein θ_E represents the weight parameter of the first width learning network after online training, ε is the dynamic threshold, the random factor is a random number in [0,1], a represents a possible execution action of the second width learning network, Q(s_t, a; θ_E) is the value evaluation value corresponding to taking action a, and argmax_a Q(s_t, a; θ_E) is the execution action corresponding to the maximum value of Q(s_t, a; θ_E).
Further, the method further comprises: periodically acquiring the network parameters of the first width learning network, and updating the network parameters of the second width learning network.
In a second aspect, an embodiment of the present application provides a network control apparatus, including:
the acquisition unit is used for acquiring the network state of the fine-grained data plane at the current moment;
the online training unit is used for performing online training on the first width learning network by utilizing an experience library for storing local network environment historical data and the second width learning network;
the optimal execution action acquisition unit is used for processing the network state at the current moment by utilizing the first width learning network completed by online training to obtain the optimal execution action corresponding to the network state at the current moment;
and the issuing unit is used for packaging the optimal execution action at the current moment into a control rule data packet and then issuing the control rule data packet.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and capable of running on the processor, wherein the processor implements the network control method of the embodiments of the present application when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, where computer instructions are stored, and when executed by a processor, the computer instructions implement the network control method of the present application.
By training the first width learning network online, the method and the device can respond to network changes in real time and quickly handle network emergencies.
Drawings
In order to more clearly illustrate the detailed description of the present application or the technical solutions in the prior art, the drawings needed to be used in the detailed description of the present application or the prior art description will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic diagram of a conventional network control architecture according to an embodiment of the present application;
fig. 2 is a schematic diagram of a network control architecture based on width reinforcement learning according to an embodiment of the present disclosure.
FIG. 3 is a diagram illustrating a width reinforcement learning system according to an embodiment of the present disclosure;
fig. 4 is a flowchart of a network control method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a process for in-band telemetry provided by an embodiment of the present application;
fig. 6 is a functional block diagram of a network control apparatus according to an embodiment of the present application;
fig. 7 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First, technical terms related to the embodiments of the present application will be briefly described.
1. Deep reinforcement learning
Reinforcement learning is a branch of machine learning. Compared with the classic problems of supervised learning and unsupervised learning, its greatest characteristic is learning through interaction: the agent continuously learns knowledge according to the rewards or punishments obtained in its interaction with the environment so as to adapt to the environment. In essence, reinforcement learning is guided by reward scores and can therefore be considered a type of weakly supervised learning. An overview of the mechanism of the value-based DQN reinforcement learning method is given here.
DQN is an algorithm combining deep learning and reinforcement learning. Its motivation is that the storage space of the traditional reinforcement learning Q-learning algorithm is limited: for the large number of states in a complex environment, it cannot construct a Q table large enough to store the oversized state space and represent how good each state is. Therefore, a neural network is introduced to form a deep Q network. Specifically, DQN combines a Convolutional Neural Network (CNN), whose input is raw image data (as the State), with Q-Learning, whose output is a value estimate (Q value) for each Action.
DQN has two sets of neural network structures, referred to as EvalNet and TargetNet, respectively. Because the correlation between experiences collected earlier and later in the actual training process is large, two neural networks with the same structure but different update frequencies are constructed to alleviate this problem. Specifically, the output of EvalNet is used to evaluate the value function of the current state-action pair, while the output of TargetNet is used to compute the training target. Based on these two neural networks, the DQN algorithm updates EvalNet in every round by gradient descent, while TargetNet is updated periodically at a certain frequency by directly copying the neural network parameters of EvalNet. In addition, compared with the traditional Q-learning method, DQN stores the data collected by the agent during learning in a database, extracts data from the database by uniform random sampling, and trains the neural network with the extracted data.
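For illustration only, the following minimal Python sketch shows the dual-network pattern described above: EvalNet updated every round by a gradient-style step, TargetNet refreshed periodically by copying EvalNet's parameters, and training driven by uniformly sampled experience tuples. The linear Q-model, dimensions and learning rate are assumptions chosen for brevity, not the structure used in this application.

```python
import numpy as np

# Illustrative DQN-style update with two linear "networks": Q(s) = s @ W.
rng = np.random.default_rng(0)
state_dim, n_actions, gamma, lr = 4, 3, 0.9, 0.01
W_eval = rng.normal(size=(state_dim, n_actions))    # EvalNet parameters
W_target = W_eval.copy()                            # TargetNet starts as a copy

def dqn_step(batch):
    """Update EvalNet from a uniformly sampled batch of (s, a, r, s_next) tuples."""
    for s, a, r, s_next in batch:
        target = r + gamma * np.max(s_next @ W_target)   # TargetNet provides the bootstrap value
        td_error = target - (s @ W_eval)[a]              # EvalNet evaluates the taken action
        W_eval[:, a] += lr * td_error * s                # gradient-descent-style correction

def sync_target():
    """Periodic TargetNet update: copy EvalNet parameters."""
    W_target[:] = W_eval                                 # in-place copy of the weights
```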
2. Width learning network
The width learning network (Broad Learning) is a novel neural network architecture that provides an effective solution to the huge computational cost of current deep learning methods. Meanwhile, if the network needs to be expanded, the width learning network can be extended on the original network architecture, avoiding the high overhead caused by retraining a traditional deep learning network. In addition, the width learning architecture can be regarded as being built on a random vector functional-link neural network together with a related derivation mechanism.
The above describes the most basic width learning network. When the basic width learning network cannot meet the performance requirements of a specific training task, it can further incrementally expand the number of enhancement nodes or mapping nodes, thereby improving training accuracy. Note that during incremental expansion the original neural network architecture and weights do not need to be changed: the updated input matrix is decomposed into the original input matrix and the corresponding newly added part, and the updated neural network weights are obtained through a pseudo-inverse operation. Therefore, unlike the retraining of a deep learning network, the neural network structure of the width learning network can be continuously and dynamically adjusted through rapid incremental expansion, which gives it extremely high extensibility and greatly reduces training overhead.
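As a rough illustration of the width learning idea (random feature-mapping nodes, random enhancement nodes, and output weights obtained in closed form), the sketch below fits a basic width learning model with ridge regression. The node counts, tanh activation and regularization constant are assumptions, and incremental node expansion is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

def bls_fit(X, Y, n_map=20, n_enh=40, lam=1e-3):
    """Basic width (broad) learning fit: random mapped features Z, random
    enhancement features H, and output weights W solved by ridge regression."""
    Wz = rng.normal(size=(X.shape[1], n_map))          # random mapping weights (kept fixed)
    Z = np.tanh(X @ Wz)                                # feature-mapping nodes
    Wh = rng.normal(size=(n_map, n_enh))               # random enhancement weights (kept fixed)
    H = np.tanh(Z @ Wh)                                # enhancement nodes
    A = np.hstack([Z, H])                              # combined hidden representation
    # Ridge-regression (regularised pseudo-inverse) solution for the output weights.
    W = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)
    return (Wz, Wh, W)

def bls_predict(model, X):
    Wz, Wh, W = model
    Z = np.tanh(X @ Wz)
    H = np.tanh(Z @ Wh)
    return np.hstack([Z, H]) @ W
```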
3. In-band telemetry (INT) method
In-band telemetry is a new network monitoring technology. Its basic idea is to record and add, hop by hop, the information of the network devices a data packet traverses to the packet header bits, and to extract the related information of the complete source-to-destination path at the end point, forming fine-grained network sensing capability.
Common traditional network measurement methods include the Ping protocol, the IP Measurement Protocol (IPMP), the MPLS packet loss/delay measurement protocol, etc. They actively inject special protocol packets into the network to collect network information, which may result in large network overhead, and they can only measure coarse-grained network performance indicators such as packet loss rate, delay and TTL. With the rise of software-defined networking, another type of network measurement technology has appeared that obtains information about devices inside the network directly from the periphery through a controller. This approach can obtain a perception of the global state, but because it transmits network state information between the controller and the network devices through a large amount of data exchange, it generates larger overhead; at the same time, directly reading information from the controller can only achieve coarse-grained network telemetry and cannot acquire packet-level network state information.
Measuring network information in an in-band telemetry manner can make full use of the data packets already being transmitted in the network: every time a data packet passes through a network device, the state information related to that device is appended to the packet. The newly added network state information is extracted before the data packet reaches the destination node. In-band telemetry thus further extends the sensing capability of the network while preserving the basic transmission capability of the data packet, making it an efficient and low-overhead network management and control method.
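A schematic sketch of the hop-by-hop append/extract behaviour described above is given below; the telemetry fields (switch id, queue depth, hop latency) and the stack representation are illustrative assumptions rather than the INT header format of any particular device.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Packet:
    payload: bytes
    int_stack: List[dict] = field(default_factory=list)   # per-hop telemetry records

def switch_forward(pkt: Packet, switch_id: str, queue_depth: int, hop_latency_us: int) -> Packet:
    """Each switch on the path pushes its own state onto the packet's INT stack."""
    pkt.int_stack.append({
        "switch": switch_id,
        "queue_depth": queue_depth,
        "hop_latency_us": hop_latency_us,
    })
    return pkt

def last_hop_extract(pkt: Packet):
    """The hop before the receiver strips the telemetry and restores the packet."""
    telemetry = pkt.int_stack
    pkt.int_stack = []            # packet returns to its original form
    return telemetry, pkt
```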
4. Network control method
Network control refers to fine-grained and differentiated management and control of each device terminal in a network; most common network control schemes are implemented based on a centralized controller or a CPU integrated in the network devices. As shown in fig. 1, a common centralized controller interacts with network devices through a southbound interface: it can directly read related information of the network devices and, at the same time, issue network rules according to the obtained information, realizing a network control mode that integrates high awareness and fast response. However, conventional network management and control technology is limited to coarse-grained sensing and control of network devices and basic network resources. When sudden situations such as network congestion or single-point failure occur, the centralized control unit cannot promptly adapt to the information changes of the underlying network resources and devices and issue timely, effective control rules, and manual adjustment and correction may even be needed. Even if a network administrator responds quickly, the network incurs extremely long burst-handling and response times, which seriously degrades transmission performance so that data packets cannot be delivered on time as required. By deploying the width reinforcement learning system in the centralized controller, the present application makes network management and control efficient and self-adaptive.
After introducing the technical terms related to the present application, the design ideas of the embodiments of the present application will be briefly described below.
The monitoring function of a traditional centralized controller is separated from its function of managing network devices and issuing rules: a network administrator must analyze the monitored information before issuing rules. As a result, resolving a network emergency takes a significant amount of time.
In order to solve the above technical problems, the present application first provides a width reinforcement learning system deployed in a centralized controller. Fine-grained network state information is acquired and collected through in-band telemetry and transmitted to the centralized controller, which selects an action according to the real-time network state, i.e., issues different data packet transmission and forwarding rules, forming a fast-response, dynamically regulated strategy for different network states. The specific control architecture of the system is shown in fig. 2.
First, in order to comprehensively utilize the intelligent decision-making capability of the DQN method and the fast convergence capability of the width learning method and further improve the training accuracy of the algorithm, the structure of the width reinforcement learning system designed in the embodiment of the present application is shown in fig. 3.
The width reinforcement learning system can be regarded as the combination of the DQN reinforcement learning algorithm and the width learning network: it adopts the EvalNet-TargetNet structure and the experience base of the DQN algorithm so that the two approaches complement each other, but uses a width neural network to replace the neural network part. Specifically, the width reinforcement learning system includes the environment, the experience base and training pool, the E-BLS network, and the T-BLS network:
Environment: the network environment with which the agent (the centralized controller) interacts. In particular, the agent obtains the current state s_t from the environment and transmits it to the E-BLS network to obtain the action a_t, i.e., the specific network management rule output by the centralized controller; the network environment changes to a new state s_{t+1} according to the action, and the agent receives a single-step reward r_t.
Experience base and training pool: the experience base stores the data used to train the E-BLS network, stored continuously in chronological order. During training, a batch of experiences (P in number) therefore needs to be randomly selected from the stored data, recorded as (s_p, a_p, r_p, s'_p) with 1 ≤ p ≤ P, and put into the training pool. In addition, the training pool inputs s'_p into the T-BLS network to obtain the value evaluation Q value corresponding to the optimal action, max_{a'} Q(s'_p, a'; θ_T), to facilitate the subsequent calculation of the target value y_p.
E-BLS network: a width learning network is employed to characterize the relationship between the state-action pair (s_t, a_t) and its value evaluation. It generates the value evaluation value Q_E by interacting with the environment and returns the optimal action to the agent. In addition, it periodically synchronizes parameter updates with the T-BLS network.
T-BLS network: likewise a width learning network, consistent in structure with the E-BLS network but with a different update frequency; it periodically updates its neural network parameters from the E-BLS network. It is also responsible for receiving the input s'_p from the training pool and generating the Q value corresponding to the optimal action value.
The application applies the width reinforcement learning system to the generation of the network control strategy and deploys it on the controller. The information acquired by using the in-band telemetry technology is collected and used as the environment of the system, the network control strategy is generated by the width reinforcement learning method and then issued to the network equipment, forming a network control mechanism that integrates network state perception, intelligent controller decision-making and control rule issuing. Specifically, the network management and control closed loop is formed by the following three processes:
Network state awareness: fine-grained network state is acquired and collected through existing in-band telemetry technology, and the real-time network information of the data plane is normalized, extracted and unified in format to form information data that is easy for a neural network to understand and is used as input to the width reinforcement learning system.
Controller intelligent decision: because the network in a real scene has problems of traffic fluctuation and node instability, an online learning strategy is adopted for real-time feedback adjustment, forming a dynamic control decision scheme. Therefore, the width reinforcement learning system takes the extracted network information data as the state input, first carries out online training and updating of the E-BLS network, and then, according to the current state s_t, selects through the E-BLS network the optimal action a_t corresponding to the maximum output Q value.
Control rule issuing: the action value output by the width reinforcement learning system is not the rule that is finally issued; the action value is encapsulated into a control rule data packet that can be identified by the switch and then issued.
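As a purely illustrative example of this last step, the sketch below packs a selected action into a fixed-size rule packet and parses it back; the header constant, field widths and field names are hypothetical and do not represent the actual rule format recognized by the switches.

```python
import struct

# Hypothetical control-rule encoding: layout and identifiers are illustrative assumptions.
RULE_HEADER = 0xC0DE

def encapsulate_rule(flow_id: int, action_id: int, out_port: int) -> bytes:
    """Pack the selected action into a fixed-size rule packet a switch could parse."""
    return struct.pack("!HIHH", RULE_HEADER, flow_id, action_id, out_port)

def parse_rule(data: bytes):
    header, flow_id, action_id, out_port = struct.unpack("!HIHH", data)
    assert header == RULE_HEADER
    return {"flow_id": flow_id, "action_id": action_id, "out_port": out_port}
```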
According to the present application, the width learning technology and the deep reinforcement learning technology are improved and combined for the in-network control scenario, and a width learning neural network system is deployed within the DQN framework. After this improvement, the method trains rapidly and converges well, can produce a higher reward value, forms a robust in-network control strategy, and can dynamically regulate and control different network environment states.
The width reinforcement learning network architecture provided by the present application has a high convergence speed and a robust training effect, can be widely applied to various machine learning problems, and in particular can be combined with decision control methods in a network environment so that the centralized network controller forms intelligent, self-learning and self-adaptive network strategy capabilities. In addition, because the E-BLS network is continuously trained and updated online, it can respond to network changes in real time and quickly handle network emergencies.
According to the network control method, real-time state information is obtained by adopting a fine-grained in-band remote measurement method, adaptive adjustment and issuing of network control rules are realized through real-time and rapid training of width reinforcement learning and intelligent control, efficient adjustment and adaptation of the network control rules as required are realized, and finally, the intelligent robust network control method is constructed.
After introducing the application scenario and the design concept of the embodiment of the present application, the following describes a technical solution provided by the embodiment of the present application.
As shown in fig. 4, an embodiment of the present application provides a network control method, including:
step 101: acquiring the network state of a fine-grained data plane at the current moment;
specifically, as shown in fig. 5, a basic processing flow of in-band telemetry adopted in the present application is that a sending end is set to send a data packet according to a certain periodic rate, telemetry information is added to a packet header bit of the data packet every time the data packet passes through one switch, the data packet is analyzed in a previous hop of a receiving end, the acquired telemetry information is transmitted to a controller, and the data packet is restored to an initial state.
As a possible implementation, obtaining the network state of the fine-grained data plane at the current moment includes: performing normalized extraction and format unification on the network state information at the current moment.
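A minimal sketch of such preprocessing is shown below; the telemetry field names and value ranges are assumptions used only to illustrate normalized extraction into a fixed-order state vector.

```python
import numpy as np

# Illustrative preprocessing: field names and ranges are assumptions.
FIELDS = [("queue_depth", 0.0, 64.0), ("hop_latency_us", 0.0, 1000.0), ("link_util", 0.0, 1.0)]

def preprocess_state(telemetry: dict) -> np.ndarray:
    """Min-max normalise each telemetry field and emit a unified-format vector."""
    vec = []
    for name, lo, hi in FIELDS:
        raw = float(telemetry.get(name, lo))
        vec.append((min(max(raw, lo), hi) - lo) / (hi - lo))
    return np.array(vec)
```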
Step 102: performing online training on the first width learning network by using an experience library for storing local network environment historical data and a second width learning network;
the experience base stores experiences at a plurality of successive time instants, the experiences comprising: the network state at the time, the execution action at the time, the reward at the time and the network state at the next time.
In this embodiment, the first width learning network is an E-BLS network, the second width learning network is a T-BLS network, and the E-BLS network and the T-BLS network have the same structure. At the initial moment, the initial weight parameter θ_E of the E-BLS network is randomly generated and assigned to the T-BLS network;
the method specifically comprises the following steps:
step 201: judging whether the experience library reaches the maximum capacity or not, if so, deleting the earliest experience stored in the experience library according to a time sequence; otherwise, go to step 203;
step 202: will experience
Figure P_220221083818544_544632001
Putting the obtained product into an experience library, wherein,
Figure P_220221083818575_575875002
the network state at the last moment after preprocessing,
Figure P_220221083818591_591509003
in order to perform the action at the last moment,
Figure P_220221083818622_622787004
the network state at the current moment after preprocessing,
Figure P_220221083818655_655428005
performing actions for an agent
Figure P_220221083818687_687205006
Entering the network state of the next moment
Figure P_220221083818702_702821007
Later, the reward obtained from the network environment;
step 203: randomly selecting P-1 experiences from a library of experiences, and experience
Figure P_220221083818734_734114001
Forming P experiences as experience samples;
step 204: inputting the next network state of the moment in the p-th empirical sample into the T-BLS network to obtain a value evaluation value corresponding to the optimal action value
Figure P_220221083818875_875692001
θ T A weight parameter representing the T-BLS network;
Figure P_220221083818891_891318002
the network state of the next moment in the p-th experience is obtained;
Figure P_220221083818922_922582003
indicating possible actions to be performed by the T-BLS network,
Figure P_220221083818938_938188004
to take an action
Figure P_220221083818969_969437005
A corresponding value assessment value;
step 205: calculating the network state and performing actions at the moment of the p-th empirical sample
Figure P_220221083818985_985078001
Corresponding target value
Figure P_220221083819016_016290002
Figure P_220221083819048_048511001
Wherein the content of the first and second substances,γis a factor of the number of the first and second,
Figure P_220221083819080_080269001
the reward of the p-th experience sample at the moment is less than or equal to 1pP
Step 206: and taking the network state and the execution action of the time in all the P experience samples as input samples of the E-BLS network, taking the target value as expected output, and training the E-BLS network by adopting a weight calculation method based on ridge regression.
The T-BLS network takes K as its update period and completes the update by copying the weights of the E-BLS network (θ_T ← θ_E). Likewise, the E-BLS network also supports incremental expansion of feature-mapping nodes and enhancement nodes. In general, the training complexity of the E-BLS network is low, and incremental learning also ensures good extensibility.
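The following sketch illustrates one training round in the spirit of steps 204–206: targets are built from a fixed T-BLS value estimate and the E-BLS output weights are then solved in closed form by ridge regression. The state-action featurisation, the sharing of random feature weights between the two networks, and all dimensions are assumptions made only to obtain a runnable example.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, lam = 0.9, 1e-3
N_ACTIONS, STATE_DIM, N_MAP, N_ENH = 3, 4, 16, 32

def features(s, a, Wz, Wh):
    """Assumed featurisation of (state, action): mapped + enhancement nodes of [s, onehot(a)]."""
    x = np.concatenate([s, np.eye(N_ACTIONS)[a]])
    z = np.tanh(x @ Wz)
    return np.concatenate([z, np.tanh(z @ Wh)])

Wz = rng.normal(size=(STATE_DIM + N_ACTIONS, N_MAP))   # random mapping weights (shared, fixed)
Wh = rng.normal(size=(N_MAP, N_ENH))                   # random enhancement weights (shared, fixed)
W_T = rng.normal(size=(N_MAP + N_ENH,))                # T-BLS output weights (fixed this round)

def q_t(s, a):
    return features(s, a, Wz, Wh) @ W_T                # T-BLS value estimate

def train_e_bls(batch):
    """Build targets y_p = r_p + gamma * max_a' Q_T(s'_p, a') and solve the
    E-BLS output weights in closed form by ridge regression."""
    A = np.stack([features(s, a, Wz, Wh) for s, a, _, _ in batch])
    y = np.array([r + gamma * max(q_t(s2, a2) for a2 in range(N_ACTIONS))
                  for _, _, r, s2 in batch])
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)   # W_E

# Example: a tiny batch of (state, action, reward, next_state) experiences.
batch = [(rng.normal(size=STATE_DIM), int(rng.integers(N_ACTIONS)),
          float(rng.normal()), rng.normal(size=STATE_DIM)) for _ in range(8)]
W_E = train_e_bls(batch)
```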
Step 103: processing the network state at the current moment by using the first width learning network after online training to obtain the optimal execution action corresponding to the current state;
When the amount of experience in the experience library is sufficient for online training of the first width learning network, the first width learning network uses the preprocessed network state information s_t at the current moment to calculate the value evaluation value Q_E and selects the optimal action accordingly, adopting the ε-greedy method to ensure that actions are traded off between random exploration and optimal decision-making. The optimal execution action a_t at the current moment is:

a_t = a random execution action a, if the random factor is smaller than the dynamic threshold ε; otherwise, a_t = argmax_a Q_E(s_t, a; θ_E),

wherein θ_E represents the weight parameter of the first width learning network after online training; Q_E(s_t, a; θ_E) is the value evaluation value of taking action a; ε is the dynamic threshold, and the random factor is a random number in [0,1]; whether the execution action a is selected randomly or by maximizing the Q_E value is determined primarily by the change of the threshold; argmax_a Q_E(s_t, a; θ_E) is the execution action corresponding to the maximum Q_E value.
In the present embodiment, as the algorithm iterates continuously, the dynamic threshold is gradually reduced from 0.95 to 0.05. This means that in the early stage of execution the action selection is biased toward randomness, which helps the algorithm fully explore the solution space for the optimal action while avoiding falling into a local optimum; as the algorithm is iteratively updated, action selection increasingly tends toward the deterministic decision based on the maximized Q value, so that the algorithm eventually becomes stable and can make robust action decisions.
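A small sketch of this ε-greedy selection with a decaying dynamic threshold follows; the linear decay schedule is an assumption (the embodiment only states that the threshold falls from 0.95 to 0.05 as iterations proceed).

```python
import numpy as np

rng = np.random.default_rng(3)

def select_action(q_values: np.ndarray, step: int, total_steps: int,
                  eps_start: float = 0.95, eps_end: float = 0.05) -> int:
    """Epsilon-greedy choice with a dynamic threshold decaying from 0.95 to 0.05."""
    eps = eps_start - (eps_start - eps_end) * min(step / total_steps, 1.0)
    if rng.random() < eps:                      # early on: explore the action space
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))             # later: exploit the maximal Q value
```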
Step 104: packaging the optimal execution action at the current moment into a control rule data packet, and then issuing the control rule data packet;
based on the foregoing embodiments, an embodiment of the present application provides a network control apparatus, and referring to fig. 6, a network control apparatus 300 according to an embodiment of the present application at least includes:
an obtaining unit 301, configured to obtain a network state of a fine-grained data plane at a current time;
an online training unit 302, configured to perform online training on a first width learning network by using an experience base storing local network environment historical data and a second width learning network;
an optimal execution action obtaining unit 303, configured to complete processing of the network state at the current time by using the first width learning network after online training, so as to obtain an optimal execution action corresponding to the current state;
the issuing unit 304 is configured to encapsulate the optimal execution action at the current time into a control rule data packet, and then issue the control rule data packet.
It should be noted that the principle of the network control apparatus 300 provided in the embodiment of the present application for solving the technical problem is similar to that of the network control method provided in the embodiment of the present application, and therefore, for implementation of the network control apparatus 300 provided in the embodiment of the present application, reference may be made to implementation of the network control method provided in the embodiment of the present application, and repeated details are not repeated.
As shown in fig. 7, an electronic device 400 provided in the embodiment of the present application at least includes: a processor 401, a memory 402, and a computer program stored on the memory 402 and capable of running on the processor 401, wherein the processor 401 implements the network control method provided by the embodiments of the present application when executing the computer program.
The electronic device 400 provided by the embodiment of the present application may further include a bus 403 that connects different components (including the processor 401 and the memory 402). Bus 403 represents one or more of any of several types of bus structures, including a memory bus, a peripheral bus, a local bus, and so forth.
The Memory 402 may include readable media in the form of volatile Memory, such as Random Access Memory (RAM) 4021 and/or cache Memory 4022, and may further include a Read Only Memory (ROM) 4023.
Memory 402 may also include a program tool 4024 having a set of (at least one) program modules 4025, the program modules 4025 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Electronic device 400 may also communicate with one or more external devices 404 (e.g., keyboard, remote control, etc.), with one or more devices that enable a user to interact with electronic device 400 (e.g., cell phone, computer, etc.), and/or with any devices that enable electronic device 400 to communicate with one or more other electronic devices 400 (e.g., router, modem, etc.). This communication may be through an Input/Output (I/O) interface 405. Also, the electronic device 400 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public Network, such as the internet) via the Network adapter 406. As shown in FIG. 7, the network adapter 406 communicates with the other modules of the electronic device 400 via the bus 403. It should be understood that although not shown in FIG. 7, other hardware and/or software modules may be used in conjunction with electronic device 400, including but not limited to: microcode, device drivers, Redundant processors, external disk drive Arrays, disk array (RAID) subsystems, tape drives, and data backup storage subsystems, to name a few.
It should be noted that the electronic device 400 shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments.
The embodiment of the present application further provides a computer-readable storage medium, where computer instructions are stored, and when the computer instructions are executed by a processor, the computer instructions implement the network control method provided by the embodiment of the present application.
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (7)

1. A network control method, comprising:
acquiring the network state of a fine-grained data plane at the current moment;
performing online training on the first width learning network by using an experience library for storing local network environment historical data and a second width learning network;
processing the network state at the current moment by utilizing the first width learning network after online training to obtain the optimal execution action corresponding to the network state at the current moment;
packaging the optimal execution action at the current moment into a control rule data packet, and then issuing the control rule data packet;
the experience base stores experiences at a plurality of successive time instants, the experiences comprising: the network state of the moment, the execution action of the moment, the reward of the moment and the network state of the next moment;
the second width learning network and the first width learning network have the same structure;
performing online training on the first width learning network by using an experience library for storing local network environment historical data and a second width learning network comprises the following steps:
putting the experience (s_{t-1}, a_{t-1}, r_{t-1}, s_t) into the experience library, wherein s_{t-1} is the preprocessed network state at the previous moment, a_{t-1} is the execution action at the previous moment, s_t is the preprocessed network state at the current moment, and r_{t-1} is the reward obtained from the network environment after taking action a_{t-1} and entering the next network state s_t;
randomly selecting P-1 experiences from the experience library, which together with the experience (s_{t-1}, a_{t-1}, r_{t-1}, s_t) form P experience samples;
inputting the network state at the next moment in the p-th experience sample into the second width learning network to obtain the maximum value evaluation value max_{a'} Q(s'_p, a'; θ_T), wherein θ_T represents the weight parameter of the second width learning network, s'_p is the network state at the next moment in the p-th experience, a' represents a possible execution action of the second width learning network, and Q(s'_p, a'; θ_T) is the value evaluation value corresponding to taking action a';
calculating the target value y_p corresponding to the network state and execution action (s_p, a_p) at the moment of the p-th experience sample:

y_p = r_p + γ · max_{a'} Q(s'_p, a'; θ_T),

wherein γ is a discount factor, r_p is the reward at the moment of the p-th experience sample, and 1 ≤ p ≤ P;
taking the network state and the execution action at the moment in all P experience samples as input samples of the first width learning network, taking the target values as the expected output, and training the first width learning network by a weight calculation method based on ridge regression;
processing the preprocessed network state at the current moment by using the first width learning network after online training to obtain the optimal execution action corresponding to the current state comprises:
processing the preprocessed network state information s_t at the current moment with the first width learning network after online training, and outputting the optimal execution action a_t at the current moment:

a_t = a random execution action a, if the random factor is smaller than the dynamic threshold ε; otherwise, a_t = argmax_a Q(s_t, a; θ_E),

wherein θ_E represents the weight parameter of the first width learning network after online training, ε is the dynamic threshold, the random factor is a random number in [0,1], a represents a possible execution action of the second width learning network, Q(s_t, a; θ_E) is the value evaluation value corresponding to taking action a, and argmax_a Q(s_t, a; θ_E) is the execution action corresponding to the maximum value of Q(s_t, a; θ_E).
2. The network control method according to claim 1, wherein obtaining the network state of the fine-grained data plane at the current moment comprises: performing normalized extraction and format unification on the network state information at the current moment.
3. The network control method of claim 1, wherein the method further comprises: randomly generating initial weight parameters of the first width learning network, and assigning the initial weight parameters of the first width learning network to the second width learning network.
4. The network control method of claim 1, wherein the method further comprises: periodically acquiring the network parameters of the first width learning network, and updating the network parameters of the second width learning network.
5. A network control apparatus, comprising:
the acquisition unit is used for acquiring the network state of the fine-grained data plane at the current moment;
the online training unit is used for performing online training on the first width learning network by utilizing an experience library for storing local network environment historical data and the second width learning network;
the optimal execution action acquisition unit is used for processing the network state at the current moment by utilizing the first width learning network completed by online training to obtain the optimal execution action corresponding to the network state at the current moment;
the issuing unit is used for packaging the optimal execution action at the current moment into a control rule data packet and then issuing the control rule data packet;
the experience base stores experiences at a plurality of successive time instants, the experiences comprising: the network state of the moment, the execution action of the moment, the reward of the moment and the network state of the next moment;
the second width learning network and the first width learning network have the same structure;
the online training unit is specifically configured to:
will experience
Figure P_220325160938299_299351001
Putting the obtained product into an experience library, wherein,
Figure P_220325160938330_330587002
the network state at the last moment after preprocessing,
Figure P_220325160938361_361866003
in order to perform the action at the last moment,
Figure P_220325160938377_377493004
the network state at the current moment after preprocessing,
Figure P_220325160938410_410682005
to take an action
Figure P_220325160938426_426326006
Entering the network state of the next moment
Figure P_220325160938457_457548007
Later, the reward obtained from the network environment;
randomly selecting P-1 experiences from a library of experiences, and experience
Figure P_220325160938473_473183001
Forming P experiences as experience samples;
inputting the next network state of the moment in the p-th empirical sample into the second width learning network to obtain the maximum value evaluation value
Figure P_220325160938488_488779001
Figure P_220325160938520_520055002
A weight parameter representing the second breadth learning network;
Figure P_220325160938551_551295003
the network state of the next moment in the p-th experience is obtained;
Figure P_220325160938566_566944004
representing possible actions to be performed by the second width learning network,
Figure P_220325160938627_627007005
to take an action
Figure P_220325160938658_658249006
A corresponding value assessment value;
calculating the network state and performing actions at the moment of the p-th empirical sample
Figure P_220325160938705_705138001
Corresponding target value
Figure P_220325160938721_721231002
Figure P_220325160938752_752487001
Wherein the content of the first and second substances,γis a factor of the number of the first and second,
Figure P_220325160938817_817437001
the reward of the p-th experience sample at the moment is less than or equal to 1pP
Taking the network state and the execution action of the time in all P experience samples as input samples of a first width learning network, taking a target value as expected output, and training the first width learning network by adopting a weight calculation method based on ridge regression;
the optimal execution action acquisition unit is specifically configured to:
the first width learning network after on-line training is used for preprocessing the network state information at the current moment
Figure P_220325160938848_848671001
Processing and outputting the optimal execution action at the current time
Figure P_220325160938879_879943002
Figure P_220325160938895_895551001
Wherein the content of the first and second substances,θ E a weight parameter representing a first width learning network on-line training is completed,
Figure P_220325160938973_973680001
for dynamic threshold, the random factor is [0,1 ]]A random number in between;arepresenting possible actions to be performed by the second width learning network,
Figure P_220325160939006_006390002
to take an actionaA corresponding value assessment value;
Figure P_220325160939037_037661003
is that make
Figure P_220325160939068_068913004
And obtaining the execution action corresponding to the maximum value.
6. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the network control method according to any of claims 1-4 when executing the computer program.
7. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the network control method of any one of claims 1-4.
CN202210154404.0A 2022-02-21 2022-02-21 Network control method and device, electronic equipment and storage medium Active CN114202066B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210154404.0A CN114202066B (en) 2022-02-21 2022-02-21 Network control method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210154404.0A CN114202066B (en) 2022-02-21 2022-02-21 Network control method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114202066A CN114202066A (en) 2022-03-18
CN114202066B true CN114202066B (en) 2022-04-26

Family

ID=80645716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210154404.0A Active CN114202066B (en) 2022-02-21 2022-02-21 Network control method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114202066B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111010294A (en) * 2019-11-28 2020-04-14 国网甘肃省电力公司电力科学研究院 Electric power communication network routing method based on deep reinforcement learning
CN112491714A (en) * 2020-11-13 2021-03-12 安徽大学 Intelligent QoS route optimization method and system based on deep reinforcement learning in SDN environment
CN112564118A (en) * 2020-11-23 2021-03-26 广西大学 Distributed real-time voltage control method capable of expanding quantum deep width learning
CN113328938A (en) * 2021-05-25 2021-08-31 电子科技大学 Network autonomous intelligent management and control method based on deep reinforcement learning
US11121788B1 (en) * 2020-06-08 2021-09-14 Wuhan University Channel prediction method and system for MIMO wireless communication system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111010294A (en) * 2019-11-28 2020-04-14 国网甘肃省电力公司电力科学研究院 Electric power communication network routing method based on deep reinforcement learning
US11121788B1 (en) * 2020-06-08 2021-09-14 Wuhan University Channel prediction method and system for MIMO wireless communication system
CN112491714A (en) * 2020-11-13 2021-03-12 安徽大学 Intelligent QoS route optimization method and system based on deep reinforcement learning in SDN environment
CN112564118A (en) * 2020-11-23 2021-03-26 广西大学 Distributed real-time voltage control method capable of expanding quantum deep width learning
CN113328938A (en) * 2021-05-25 2021-08-31 电子科技大学 Network autonomous intelligent management and control method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Deep Q-Learning Aided Networking, Caching, and Computing Resources Allocation in Software-Defined Satellite-Terrestrial Networks; Chao Qiu et al.; IEEE Transactions on Vehicular Technology; Vol. 68, No. 6; 30 June 2019; full text *

Also Published As

Publication number Publication date
CN114202066A (en) 2022-03-18

Similar Documents

Publication Publication Date Title
Zhang et al. A multi-agent reinforcement learning approach for efficient client selection in federated learning
Wang et al. Machine learning for networking: Workflow, advances and opportunities
CN112491714B (en) Intelligent QoS route optimization method and system based on deep reinforcement learning in SDN environment
Yao et al. AI routers & network mind: A hybrid machine learning paradigm for packet routing
WO2020077682A1 (en) Service quality evaluation model training method and device
CN112422443B (en) Adaptive control method, storage medium, equipment and system of congestion algorithm
US11507887B2 (en) Model interpretability using proxy features
US20230145097A1 (en) Autonomous traffic (self-driving) network with traffic classes and passive and active learning
CN116743635B (en) Network prediction and regulation method and network regulation system
US20230112534A1 (en) Artificial intelligence planning method and real-time radio access network intelligence controller
CN111556173B (en) Service chain mapping method based on reinforcement learning
WO2023045565A1 (en) Network management and control method and system thereof, and storage medium
Sun et al. Accelerating convergence of federated learning in MEC with dynamic community
Huang et al. Intelligent traffic control for QoS optimization in hybrid SDNs
CN108880909B (en) Network energy saving method and device based on reinforcement learning
Chen et al. Deep learning-based traffic prediction for energy efficiency optimization in software-defined networking
Wei et al. GRL-PS: Graph embedding-based DRL approach for adaptive path selection
Zheng et al. Enabling robust DRL-driven networking systems via teacher-student learning
Zeng et al. Multi-agent reinforcement learning for adaptive routing: A hybrid method using eligibility traces
CN114202066B (en) Network control method and device, electronic equipment and storage medium
Sneha et al. Prediction of network congestion at router using machine learning technique
Hu et al. Clustered data sharing for Non-IID federated learning over wireless networks
CN115022231A (en) Optimal path planning method and system based on deep reinforcement learning
Zheng et al. Leveraging domain knowledge for robust deep reinforcement learning in networking
WO2021064769A1 (en) System, method, and control device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant