CN114202066B - Network control method and device, electronic equipment and storage medium - Google Patents

Network control method and device, electronic equipment and storage medium

Info

Publication number
CN114202066B
CN114202066B (application CN202210154404.0A)
Authority
CN
China
Prior art keywords
network
moment
width learning
learning network
experience
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210154404.0A
Other languages
Chinese (zh)
Other versions
CN114202066A (en)
Inventor
姚海鹏
吴桐
王尊梁
苏波
买天乐
忻向军
吴巍
张尼
吴小华
王山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tianchi Network Co ltd
Beijing University of Posts and Telecommunications
Original Assignee
Beijing Tianchi Network Co ltd
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tianchi Network Co ltd, Beijing University of Posts and Telecommunications filed Critical Beijing Tianchi Network Co ltd
Priority to CN202210154404.0A priority Critical patent/CN114202066B/en
Publication of CN114202066A publication Critical patent/CN114202066A/en
Application granted granted Critical
Publication of CN114202066B publication Critical patent/CN114202066B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Neurology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application provides a network control method and apparatus, an electronic device and a storage medium, relating to the technical field of computer networks. The method comprises the following steps: acquiring the network state of a fine-grained data plane at the current moment; performing online training on a first width learning network by using an experience library storing local network environment historical data and a second width learning network; processing the network state at the current moment by using the first width learning network after online training to obtain the optimal execution action corresponding to the network state at the current moment; and encapsulating the optimal execution action at the current moment into a control rule data packet and then issuing the control rule data packet. By training the first width learning network online, the method can respond to network changes in real time and quickly handle network emergencies.

Description

Network control method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer network technologies, and in particular, to a network control method and apparatus, an electronic device, and a storage medium.
Background
The deep reinforcement learning network (DQN, Deep Q Network) suffers from poor learning and convergence performance and slow speed because of its simple network structure. Although the traditional width learning system model can achieve rapid convergence by adding nodes incrementally so as to improve training accuracy, it still only outputs a classification result Y for input data X through training; it therefore remains a supervised machine learning method whose application range is limited by its mechanism, and it cannot be extended well to application scenarios of unsupervised learning and weakly supervised learning. The convergence of the training effect and the versatility of the training method greatly affect the performance of the algorithm.
The monitoring function of a traditional network controller is separated from its function of managing network devices and issuing rules: a network administrator must analyze the monitored information before issuing rules. As a result, resolving a network emergency takes a significant amount of time.
Disclosure of Invention
In view of this, the present application provides a network control method, an apparatus, an electronic device, and a storage medium, which address the technical problem that existing network controllers can resolve network emergencies only after consuming a large amount of time.
In a first aspect, an embodiment of the present application provides a network control method, including:
acquiring the network state of a fine-grained data plane at the current moment;
performing online training on the first width learning network by using an experience library for storing local network environment historical data and a second width learning network;
processing the network state at the current moment by utilizing the first width learning network after online training to obtain the optimal execution action corresponding to the network state at the current moment;
and packaging the optimal execution action at the current moment into a control rule data packet, and then issuing the control rule data packet.
Further, the experience base stores experiences at a plurality of consecutive moments, each experience comprising: the network state at a moment, the execution action at that moment, the reward at that moment, and the network state at the next moment.
Further, obtaining the network state of the fine-grained data plane at the current moment includes: performing normalized extraction and format unification on the network state information at the current moment.
Further, the second width learning network and the first width learning network have the same structure;
utilizing a pre-established experience base and a second width learning network to carry out online training on the first width learning network comprises the following steps:
putting the experience (s_{t-1}, a_{t-1}, r_{t-1}, s_t) into the experience library, wherein s_{t-1} is the preprocessed network state at the previous moment, a_{t-1} is the execution action at the previous moment, s_t is the preprocessed network state at the current moment, and r_{t-1} is the reward obtained from the network environment after taking action a_{t-1} and entering the next network state s_t;
randomly selecting P-1 experiences from the experience library, which together with the experience (s_{t-1}, a_{t-1}, r_{t-1}, s_t) form P experience samples;
inputting the network state at the next moment in the p-th experience sample into the second width learning network to obtain the maximum value evaluation value max_{a'} Q(s'_p, a'; θ_T), wherein θ_T represents the weight parameter of the second width learning network, s'_p is the network state at the next moment in the p-th experience, a' represents a possible execution action of the second width learning network, and Q(s'_p, a'; θ_T) is the value evaluation value corresponding to taking action a';
calculating the target value y_p corresponding to the network state and execution action (s_p, a_p) at the moment of the p-th experience sample:

y_p = r_p + γ · max_{a'} Q(s'_p, a'; θ_T),

wherein γ is a discount factor, r_p is the reward at the moment of the p-th experience sample, and 1 ≤ p ≤ P;
and taking the network state and the execution action at the moment in all P experience samples as input samples of the first width learning network, taking the target values as the expected output, and training the first width learning network by adopting a weight calculation method based on ridge regression.
Further, the method further comprises: randomly generating initial weight parameters of the first width learning network, and assigning the initial weight parameters of the first width learning network to the second width learning network.
Further, processing the preprocessed network state at the current moment by using the first width learning network after online training to obtain the optimal execution action corresponding to the current state includes:
processing the preprocessed network state information s_t at the current moment with the first width learning network after online training, and outputting the optimal execution action a_t at the current moment:

a_t = a random execution action a, if the random factor is smaller than the dynamic threshold ε; otherwise, a_t = argmax_a Q(s_t, a; θ_E),

wherein θ_E represents the weight parameter of the first width learning network after online training, ε is the dynamic threshold, the random factor is a random number in [0,1], a represents a possible execution action of the second width learning network, Q(s_t, a; θ_E) is the value evaluation value corresponding to taking action a, and argmax_a Q(s_t, a; θ_E) is the execution action corresponding to the maximum value of Q(s_t, a; θ_E).
Further, the method further comprises: periodically acquiring the network parameters of the first width learning network, and updating the network parameters of the second width learning network.
In a second aspect, an embodiment of the present application provides a network control apparatus, including:
the acquisition unit is used for acquiring the network state of the fine-grained data plane at the current moment;
the online training unit is used for performing online training on the first width learning network by utilizing an experience library for storing local network environment historical data and the second width learning network;
the optimal execution action acquisition unit is used for processing the network state at the current moment by utilizing the first width learning network completed by online training to obtain the optimal execution action corresponding to the network state at the current moment;
and the issuing unit is used for packaging the optimal execution action at the current moment into a control rule data packet and then issuing the control rule data packet.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and capable of running on the processor, wherein the processor implements the network control method of the embodiments of the present application when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, where computer instructions are stored, and when executed by a processor, the computer instructions implement the network control method of the present application.
By training the first width learning network online, the method and the device can respond to network changes in real time and quickly handle network emergencies.
Drawings
In order to more clearly illustrate the detailed description of the present application or the technical solutions in the prior art, the drawings needed to be used in the detailed description of the present application or the prior art description will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic diagram of a conventional network control architecture according to an embodiment of the present application;
fig. 2 is a schematic diagram of a network control architecture based on width reinforcement learning according to an embodiment of the present disclosure.
FIG. 3 is a diagram illustrating a width reinforcement learning system according to an embodiment of the present disclosure;
fig. 4 is a flowchart of a network control method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a process for in-band telemetry provided by an embodiment of the present application;
fig. 6 is a functional block diagram of a network control apparatus according to an embodiment of the present application;
fig. 7 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First, technical terms related to the embodiments of the present application will be briefly described.
1. Deep reinforcement learning
Reinforcement learning is a branch of machine learning. Compared with the classic problems of supervised learning and unsupervised learning, its greatest characteristic is learning through interaction: the agent continuously learns knowledge according to the rewards or punishments obtained in its interaction with the environment so as to adapt to the environment. In essence, reinforcement learning is guided by reward scores and can therefore be considered a type of weakly supervised learning. An overview of the mechanism of the value-based DQN reinforcement learning method is given here.
DQN is an algorithm combining deep learning and reinforcement learning. Its motivation is that the storage space of the traditional reinforcement learning Q-learning algorithm is limited: for the large number of states in a complex environment, it cannot construct a Q table large enough to store the oversized state space and represent how good each state is. Therefore, a neural network is introduced to form a deep Q network. Specifically, DQN combines a Convolutional Neural Network (CNN), whose input is raw image data (as the State), with Q-Learning, whose output is a value estimate (Q value) for each Action.
DQN has two sets of neural network structures, referred to as EvalNet and TargetNet, respectively. Because the correlation between experiences collected earlier and later in the actual training process is large, two neural networks with the same structure but different update frequencies are constructed to alleviate this problem. Specifically, the output of EvalNet is used to evaluate the value function of the current state-action pair, while the output of TargetNet is used to compute the training target. Based on these two neural networks, the DQN algorithm updates EvalNet in every round by gradient descent, while TargetNet is updated periodically at a certain frequency by directly copying the neural network parameters of EvalNet. In addition, compared with the traditional Q-learning method, DQN stores the data collected by the agent during learning in a database, extracts data from the database by uniform random sampling, and trains the neural network with the extracted data.
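For illustration only, the following minimal Python sketch shows the dual-network pattern described above: EvalNet updated every round by a gradient-style step, TargetNet refreshed periodically by copying EvalNet's parameters, and training driven by uniformly sampled experience tuples. The linear Q-model, dimensions and learning rate are assumptions chosen for brevity, not the structure used in this application.

```python
import numpy as np

# Illustrative DQN-style update with two linear "networks": Q(s) = s @ W.
rng = np.random.default_rng(0)
state_dim, n_actions, gamma, lr = 4, 3, 0.9, 0.01
W_eval = rng.normal(size=(state_dim, n_actions))    # EvalNet parameters
W_target = W_eval.copy()                            # TargetNet starts as a copy

def dqn_step(batch):
    """Update EvalNet from a uniformly sampled batch of (s, a, r, s_next) tuples."""
    for s, a, r, s_next in batch:
        target = r + gamma * np.max(s_next @ W_target)   # TargetNet provides the bootstrap value
        td_error = target - (s @ W_eval)[a]              # EvalNet evaluates the taken action
        W_eval[:, a] += lr * td_error * s                # gradient-descent-style correction

def sync_target():
    """Periodic TargetNet update: copy EvalNet parameters."""
    W_target[:] = W_eval                                 # in-place copy of the weights
```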
2. Width learning network
The width learning network (Broad Learning) is a novel neural network architecture that provides an effective solution to the huge computational cost of current deep learning methods. Meanwhile, if the network needs to be expanded, the width learning network can be extended on the original network architecture, avoiding the high overhead caused by retraining a traditional deep learning network. In addition, the width learning architecture can be regarded as being built on a random vector functional-link neural network together with a related derivation mechanism.
The above describes the most basic width learning network. When the basic width learning network cannot meet the performance requirements of a specific training task, it can further incrementally expand the number of enhancement nodes or mapping nodes, thereby improving training accuracy. Note that during incremental expansion the original neural network architecture and weights do not need to be changed: the updated input matrix is decomposed into the original input matrix and the corresponding newly added part, and the updated neural network weights are obtained through a pseudo-inverse operation. Therefore, unlike the retraining of a deep learning network, the neural network structure of the width learning network can be continuously and dynamically adjusted through rapid incremental expansion, which gives it extremely high extensibility and greatly reduces training overhead.
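As a rough illustration of the width learning idea (random feature-mapping nodes, random enhancement nodes, and output weights obtained in closed form), the sketch below fits a basic width learning model with ridge regression. The node counts, tanh activation and regularization constant are assumptions, and incremental node expansion is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

def bls_fit(X, Y, n_map=20, n_enh=40, lam=1e-3):
    """Basic width (broad) learning fit: random mapped features Z, random
    enhancement features H, and output weights W solved by ridge regression."""
    Wz = rng.normal(size=(X.shape[1], n_map))          # random mapping weights (kept fixed)
    Z = np.tanh(X @ Wz)                                # feature-mapping nodes
    Wh = rng.normal(size=(n_map, n_enh))               # random enhancement weights (kept fixed)
    H = np.tanh(Z @ Wh)                                # enhancement nodes
    A = np.hstack([Z, H])                              # combined hidden representation
    # Ridge-regression (regularised pseudo-inverse) solution for the output weights.
    W = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)
    return (Wz, Wh, W)

def bls_predict(model, X):
    Wz, Wh, W = model
    Z = np.tanh(X @ Wz)
    H = np.tanh(Z @ Wh)
    return np.hstack([Z, H]) @ W
```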
3. In-band telemetry (INT) method
In-band telemetry is a new network monitoring technology. Its basic idea is to record and add, hop by hop, the information of the network devices a data packet traverses to the packet header bits, and to extract the related information of the complete source-to-destination path at the end point, forming fine-grained network sensing capability.
Common traditional network measurement methods include the Ping protocol, the IP Measurement Protocol (IPMP), the MPLS packet loss/delay measurement protocol, etc. They actively inject special protocol packets into the network to collect network information, which may result in large network overhead, and they can only measure coarse-grained network performance indicators such as packet loss rate, delay and TTL. With the rise of software-defined networking, another type of network measurement technology has appeared that obtains information about devices inside the network directly from the periphery through a controller. This approach can obtain a perception of the global state, but because it transmits network state information between the controller and the network devices through a large amount of data exchange, it generates larger overhead; at the same time, directly reading information from the controller can only achieve coarse-grained network telemetry and cannot acquire packet-level network state information.
Measuring network information in an in-band telemetry manner can make full use of the data packets already being transmitted in the network: every time a data packet passes through a network device, the state information related to that device is appended to the packet. The newly added network state information is extracted before the data packet reaches the destination node. In-band telemetry thus further extends the sensing capability of the network while preserving the basic transmission capability of the data packet, making it an efficient and low-overhead network management and control method.
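A schematic sketch of the hop-by-hop append/extract behaviour described above is given below; the telemetry fields (switch id, queue depth, hop latency) and the stack representation are illustrative assumptions rather than the INT header format of any particular device.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Packet:
    payload: bytes
    int_stack: List[dict] = field(default_factory=list)   # per-hop telemetry records

def switch_forward(pkt: Packet, switch_id: str, queue_depth: int, hop_latency_us: int) -> Packet:
    """Each switch on the path pushes its own state onto the packet's INT stack."""
    pkt.int_stack.append({
        "switch": switch_id,
        "queue_depth": queue_depth,
        "hop_latency_us": hop_latency_us,
    })
    return pkt

def last_hop_extract(pkt: Packet):
    """The hop before the receiver strips the telemetry and restores the packet."""
    telemetry = pkt.int_stack
    pkt.int_stack = []            # packet returns to its original form
    return telemetry, pkt
```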
4. Network control method
Network control refers to fine-grained and differentiated management and control of each device terminal in a network; most common network control schemes are implemented based on a centralized controller or a CPU integrated in the network devices. As shown in fig. 1, a common centralized controller interacts with network devices through a southbound interface: it can directly read related information of the network devices and, at the same time, issue network rules according to the obtained information, realizing a network control mode that integrates high awareness and fast response. However, conventional network management and control technology is limited to coarse-grained sensing and control of network devices and basic network resources. When sudden situations such as network congestion or single-point failure occur, the centralized control unit cannot promptly adapt to the information changes of the underlying network resources and devices and issue timely, effective control rules, and manual adjustment and correction may even be needed. Even if a network administrator responds quickly, the network incurs extremely long burst-handling and response times, which seriously degrades transmission performance so that data packets cannot be delivered on time as required. By deploying the width reinforcement learning system in the centralized controller, the present application makes network management and control efficient and self-adaptive.
After introducing the technical terms related to the present application, the design ideas of the embodiments of the present application will be briefly described below.
The monitoring function of a traditional centralized controller is separated from its function of managing network devices and issuing rules: a network administrator must analyze the monitored information before issuing rules. As a result, resolving a network emergency takes a significant amount of time.
In order to solve the above technical problems, the present application first provides a width reinforcement learning system deployed in a centralized controller. Fine-grained network state information is acquired and collected through in-band telemetry and transmitted to the centralized controller, which selects an action according to the real-time network state, i.e., issues different data packet transmission and forwarding rules, forming a fast-response, dynamically regulated strategy for different network states. The specific control architecture of the system is shown in fig. 2.
First, in order to comprehensively utilize the intelligent decision-making capability of the DQN method and the fast convergence capability of the width learning method and further improve the training accuracy of the algorithm, the structure of the width reinforcement learning system designed in the embodiment of the present application is shown in fig. 3.
The width reinforcement learning system can be regarded as the combination of the DQN reinforcement learning algorithm and the width learning network: it adopts the EvalNet-TargetNet structure and the experience base of the DQN algorithm so that the two approaches complement each other, but uses a width neural network to replace the neural network part. Specifically, the width reinforcement learning system includes the environment, the experience base and training pool, the E-BLS network, and the T-BLS network:
Environment: the network environment with which the agent (the centralized controller) interacts. In particular, the agent obtains the current state s_t from the environment and transmits it to the E-BLS network to obtain the action a_t, i.e., the specific network management rule output by the centralized controller; the network environment changes to a new state s_{t+1} according to the action, and the agent receives a single-step reward r_t.
Experience base and training pool: the experience base stores the data used to train the E-BLS network, stored continuously in chronological order. During training, a batch of experiences (P in number) therefore needs to be randomly selected from the stored data, recorded as (s_p, a_p, r_p, s'_p) with 1 ≤ p ≤ P, and put into the training pool. In addition, the training pool inputs s'_p into the T-BLS network to obtain the value evaluation Q value corresponding to the optimal action, max_{a'} Q(s'_p, a'; θ_T), to facilitate the subsequent calculation of the target value y_p.
E-BLS network: a width learning network is employed to characterize the relationship between the state-action pair (s_t, a_t) and its value evaluation. It generates the value evaluation value Q_E by interacting with the environment and returns the optimal action to the agent. In addition, it periodically synchronizes parameter updates with the T-BLS network.
T-BLS network: likewise a width learning network, consistent in structure with the E-BLS network but with a different update frequency; it periodically updates its neural network parameters from the E-BLS network. It is also responsible for receiving the input s'_p from the training pool and generating the Q value corresponding to the optimal action value.
The application applies the width reinforcement learning system to the generation of the network control strategy and deploys it on the controller. The information acquired by using the in-band telemetry technology is collected and used as the environment of the system, the network control strategy is generated by the width reinforcement learning method and then issued to the network equipment, forming a network control mechanism that integrates network state perception, intelligent controller decision-making and control rule issuing. Specifically, the network management and control closed loop is formed by the following three processes:
Network state awareness: fine-grained network state is acquired and collected through existing in-band telemetry technology, and the real-time network information of the data plane is normalized, extracted and unified in format to form information data that is easy for a neural network to understand and is used as input to the width reinforcement learning system.
Controller intelligent decision: because the network in a real scene has problems of traffic fluctuation and node instability, an online learning strategy is adopted for real-time feedback adjustment, forming a dynamic control decision scheme. Therefore, the width reinforcement learning system takes the extracted network information data as the state input, first carries out online training and updating of the E-BLS network, and then, according to the current state s_t, selects through the E-BLS network the optimal action a_t corresponding to the maximum output Q value.
Control rule issuing: the action value output by the width reinforcement learning system is not the rule that is finally issued; the action value is encapsulated into a control rule data packet that can be identified by the switch and then issued.
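As a purely illustrative example of this last step, the sketch below packs a selected action into a fixed-size rule packet and parses it back; the header constant, field widths and field names are hypothetical and do not represent the actual rule format recognized by the switches.

```python
import struct

# Hypothetical control-rule encoding: layout and identifiers are illustrative assumptions.
RULE_HEADER = 0xC0DE

def encapsulate_rule(flow_id: int, action_id: int, out_port: int) -> bytes:
    """Pack the selected action into a fixed-size rule packet a switch could parse."""
    return struct.pack("!HIHH", RULE_HEADER, flow_id, action_id, out_port)

def parse_rule(data: bytes):
    header, flow_id, action_id, out_port = struct.unpack("!HIHH", data)
    assert header == RULE_HEADER
    return {"flow_id": flow_id, "action_id": action_id, "out_port": out_port}
```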
According to the present application, the width learning technology and the deep reinforcement learning technology are improved and combined for the in-network control scenario, and a width learning neural network system is deployed within the DQN framework. After this improvement, the method trains rapidly and converges well, can produce a higher reward value, forms a robust in-network control strategy, and can dynamically regulate and control different network environment states.
The width reinforcement learning network architecture provided by the present application has a high convergence speed and a robust training effect, can be widely applied to various machine learning problems, and in particular can be combined with decision control methods in a network environment so that the centralized network controller forms intelligent, self-learning and self-adaptive network strategy capabilities. In addition, because the E-BLS network is continuously trained and updated online, it can respond to network changes in real time and quickly handle network emergencies.
According to the network control method, real-time state information is obtained by adopting a fine-grained in-band remote measurement method, adaptive adjustment and issuing of network control rules are realized through real-time and rapid training of width reinforcement learning and intelligent control, efficient adjustment and adaptation of the network control rules as required are realized, and finally, the intelligent robust network control method is constructed.
After introducing the application scenario and the design concept of the embodiment of the present application, the following describes a technical solution provided by the embodiment of the present application.
As shown in fig. 4, an embodiment of the present application provides a network control method, including:
step 101: acquiring the network state of a fine-grained data plane at the current moment;
specifically, as shown in fig. 5, a basic processing flow of in-band telemetry adopted in the present application is that a sending end is set to send a data packet according to a certain periodic rate, telemetry information is added to a packet header bit of the data packet every time the data packet passes through one switch, the data packet is analyzed in a previous hop of a receiving end, the acquired telemetry information is transmitted to a controller, and the data packet is restored to an initial state.
As a possible implementation, obtaining the network state of the fine-grained data plane at the current moment includes: performing normalized extraction and format unification on the network state information at the current moment.
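A minimal sketch of such preprocessing is shown below; the telemetry field names and value ranges are assumptions used only to illustrate normalized extraction into a fixed-order state vector.

```python
import numpy as np

# Illustrative preprocessing: field names and ranges are assumptions.
FIELDS = [("queue_depth", 0.0, 64.0), ("hop_latency_us", 0.0, 1000.0), ("link_util", 0.0, 1.0)]

def preprocess_state(telemetry: dict) -> np.ndarray:
    """Min-max normalise each telemetry field and emit a unified-format vector."""
    vec = []
    for name, lo, hi in FIELDS:
        raw = float(telemetry.get(name, lo))
        vec.append((min(max(raw, lo), hi) - lo) / (hi - lo))
    return np.array(vec)
```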
Step 102: performing online training on the first width learning network by using an experience library for storing local network environment historical data and a second width learning network;
the experience base stores experiences at a plurality of successive time instants, the experiences comprising: the network state at the time, the execution action at the time, the reward at the time and the network state at the next time.
In this embodiment, the first width learning network is an E-BLS network, the second width learning network is a T-BLS network, and the E-BLS network and the T-BLS network have the same structure. At the initial moment, the initial weight parameter θ_E of the E-BLS network is randomly generated and assigned to the T-BLS network;
the method specifically comprises the following steps:
step 201: judging whether the experience library reaches the maximum capacity or not, if so, deleting the earliest experience stored in the experience library according to a time sequence; otherwise, go to step 203;
step 202: will experience
Figure P_220221083818544_544632001
Putting the obtained product into an experience library, wherein,
Figure P_220221083818575_575875002
the network state at the last moment after preprocessing,
Figure P_220221083818591_591509003
in order to perform the action at the last moment,
Figure P_220221083818622_622787004
the network state at the current moment after preprocessing,
Figure P_220221083818655_655428005
performing actions for an agent
Figure P_220221083818687_687205006
Entering the network state of the next moment
Figure P_220221083818702_702821007
Later, the reward obtained from the network environment;
step 203: randomly selecting P-1 experiences from a library of experiences, and experience
Figure P_220221083818734_734114001
Forming P experiences as experience samples;
step 204: inputting the next network state of the moment in the p-th empirical sample into the T-BLS network to obtain a value evaluation value corresponding to the optimal action value
Figure P_220221083818875_875692001
θ T A weight parameter representing the T-BLS network;
Figure P_220221083818891_891318002
the network state of the next moment in the p-th experience is obtained;
Figure P_220221083818922_922582003
indicating possible actions to be performed by the T-BLS network,
Figure P_220221083818938_938188004
to take an action
Figure P_220221083818969_969437005
A corresponding value assessment value;
step 205: calculating the network state and performing actions at the moment of the p-th empirical sample
Figure P_220221083818985_985078001
Corresponding target value
Figure P_220221083819016_016290002
Figure P_220221083819048_048511001
Wherein the content of the first and second substances,γis a factor of the number of the first and second,
Figure P_220221083819080_080269001
the reward of the p-th experience sample at the moment is less than or equal to 1pP
Step 206: and taking the network state and the execution action of the time in all the P experience samples as input samples of the E-BLS network, taking the target value as expected output, and training the E-BLS network by adopting a weight calculation method based on ridge regression.
The T-BLS network takes K as its update period and completes the update by copying the weights of the E-BLS network (θ_T ← θ_E). Likewise, the E-BLS network also supports incremental expansion of feature-mapping nodes and enhancement nodes. In general, the training complexity of the E-BLS network is low, and incremental learning also ensures good extensibility.
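The following sketch illustrates one training round in the spirit of steps 204–206: targets are built from a fixed T-BLS value estimate and the E-BLS output weights are then solved in closed form by ridge regression. The state-action featurisation, the sharing of random feature weights between the two networks, and all dimensions are assumptions made only to obtain a runnable example.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, lam = 0.9, 1e-3
N_ACTIONS, STATE_DIM, N_MAP, N_ENH = 3, 4, 16, 32

def features(s, a, Wz, Wh):
    """Assumed featurisation of (state, action): mapped + enhancement nodes of [s, onehot(a)]."""
    x = np.concatenate([s, np.eye(N_ACTIONS)[a]])
    z = np.tanh(x @ Wz)
    return np.concatenate([z, np.tanh(z @ Wh)])

Wz = rng.normal(size=(STATE_DIM + N_ACTIONS, N_MAP))   # random mapping weights (shared, fixed)
Wh = rng.normal(size=(N_MAP, N_ENH))                   # random enhancement weights (shared, fixed)
W_T = rng.normal(size=(N_MAP + N_ENH,))                # T-BLS output weights (fixed this round)

def q_t(s, a):
    return features(s, a, Wz, Wh) @ W_T                # T-BLS value estimate

def train_e_bls(batch):
    """Build targets y_p = r_p + gamma * max_a' Q_T(s'_p, a') and solve the
    E-BLS output weights in closed form by ridge regression."""
    A = np.stack([features(s, a, Wz, Wh) for s, a, _, _ in batch])
    y = np.array([r + gamma * max(q_t(s2, a2) for a2 in range(N_ACTIONS))
                  for _, _, r, s2 in batch])
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)   # W_E

# Example: a tiny batch of (state, action, reward, next_state) experiences.
batch = [(rng.normal(size=STATE_DIM), int(rng.integers(N_ACTIONS)),
          float(rng.normal()), rng.normal(size=STATE_DIM)) for _ in range(8)]
W_E = train_e_bls(batch)
```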
Step 103: processing the network state at the current moment by using the first width learning network after online training to obtain the optimal execution action corresponding to the current state;
When the amount of experience in the experience library is sufficient for online training of the first width learning network, the first width learning network uses the preprocessed network state information s_t at the current moment to calculate the value evaluation value Q_E and selects the optimal action accordingly, adopting the ε-greedy method to ensure that actions are traded off between random exploration and optimal decision-making. The optimal execution action a_t at the current moment is:

a_t = a random execution action a, if the random factor is smaller than the dynamic threshold ε; otherwise, a_t = argmax_a Q_E(s_t, a; θ_E),

wherein θ_E represents the weight parameter of the first width learning network after online training; Q_E(s_t, a; θ_E) is the value evaluation value of taking action a; ε is the dynamic threshold, and the random factor is a random number in [0,1]; whether the execution action a is selected randomly or by maximizing the Q_E value is determined primarily by the change of the threshold; argmax_a Q_E(s_t, a; θ_E) is the execution action corresponding to the maximum Q_E value.
In the present embodiment, as the algorithm iterates continuously, the dynamic threshold is gradually reduced from 0.95 to 0.05. This means that in the early stage of execution the action selection is biased toward randomness, which helps the algorithm fully explore the solution space for the optimal action while avoiding falling into a local optimum; as the algorithm is iteratively updated, action selection increasingly tends toward the deterministic decision based on the maximized Q value, so that the algorithm eventually becomes stable and can make robust action decisions.
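A small sketch of this ε-greedy selection with a decaying dynamic threshold follows; the linear decay schedule is an assumption (the embodiment only states that the threshold falls from 0.95 to 0.05 as iterations proceed).

```python
import numpy as np

rng = np.random.default_rng(3)

def select_action(q_values: np.ndarray, step: int, total_steps: int,
                  eps_start: float = 0.95, eps_end: float = 0.05) -> int:
    """Epsilon-greedy choice with a dynamic threshold decaying from 0.95 to 0.05."""
    eps = eps_start - (eps_start - eps_end) * min(step / total_steps, 1.0)
    if rng.random() < eps:                      # early on: explore the action space
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))             # later: exploit the maximal Q value
```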
Step 104: packaging the optimal execution action at the current moment into a control rule data packet, and then issuing the control rule data packet;
based on the foregoing embodiments, an embodiment of the present application provides a network control apparatus, and referring to fig. 6, a network control apparatus 300 according to an embodiment of the present application at least includes:
an obtaining unit 301, configured to obtain a network state of a fine-grained data plane at a current time;
an online training unit 302, configured to perform online training on a first width learning network by using an experience base storing local network environment historical data and a second width learning network;
an optimal execution action obtaining unit 303, configured to complete processing of the network state at the current time by using the first width learning network after online training, so as to obtain an optimal execution action corresponding to the current state;
the issuing unit 304 is configured to encapsulate the optimal execution action at the current time into a control rule data packet, and then issue the control rule data packet.
It should be noted that the principle of the network control apparatus 300 provided in the embodiment of the present application for solving the technical problem is similar to that of the network control method provided in the embodiment of the present application, and therefore, for implementation of the network control apparatus 300 provided in the embodiment of the present application, reference may be made to implementation of the network control method provided in the embodiment of the present application, and repeated details are not repeated.
As shown in fig. 7, an electronic device 400 provided in the embodiment of the present application at least includes: a processor 401, a memory 402, and a computer program stored on the memory 402 and capable of running on the processor 401, wherein the processor 401 implements the network control method provided by the embodiments of the present application when executing the computer program.
The electronic device 400 provided by the embodiment of the present application may further include a bus 403 that connects different components (including the processor 401 and the memory 402). Bus 403 represents one or more of any of several types of bus structures, including a memory bus, a peripheral bus, a local bus, and so forth.
The Memory 402 may include readable media in the form of volatile Memory, such as Random Access Memory (RAM) 4021 and/or cache Memory 4022, and may further include a Read Only Memory (ROM) 4023.
Memory 402 may also include a program tool 4024 having a set of (at least one) program modules 4025, the program modules 4025 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Electronic device 400 may also communicate with one or more external devices 404 (e.g., keyboard, remote control, etc.), with one or more devices that enable a user to interact with electronic device 400 (e.g., cell phone, computer, etc.), and/or with any devices that enable electronic device 400 to communicate with one or more other electronic devices 400 (e.g., router, modem, etc.). This communication may be through an Input/Output (I/O) interface 405. Also, the electronic device 400 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public Network, such as the internet) via the Network adapter 406. As shown in FIG. 7, the network adapter 406 communicates with the other modules of the electronic device 400 via the bus 403. It should be understood that although not shown in FIG. 7, other hardware and/or software modules may be used in conjunction with electronic device 400, including but not limited to: microcode, device drivers, Redundant processors, external disk drive Arrays, disk array (RAID) subsystems, tape drives, and data backup storage subsystems, to name a few.
It should be noted that the electronic device 400 shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments.
The embodiment of the present application further provides a computer-readable storage medium, where computer instructions are stored, and when the computer instructions are executed by a processor, the computer instructions implement the network control method provided by the embodiment of the present application.
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (7)

1. A network control method, comprising:
acquiring the network state of a fine-grained data plane at the current moment;
performing online training on the first width learning network by using an experience library for storing local network environment historical data and a second width learning network;
processing the network state at the current moment by utilizing the first width learning network after online training to obtain the optimal execution action corresponding to the network state at the current moment;
packaging the optimal execution action at the current moment into a control rule data packet, and then issuing the control rule data packet;
the experience base stores experiences at a plurality of successive time instants, the experiences comprising: the network state of the moment, the execution action of the moment, the reward of the moment and the network state of the next moment;
the second width learning network and the first width learning network have the same structure;
performing online training on the first width learning network by using an experience library for storing local network environment historical data and a second width learning network comprises the following steps:
putting the experience (s_{t-1}, a_{t-1}, r_{t-1}, s_t) into the experience library, wherein s_{t-1} is the preprocessed network state at the previous moment, a_{t-1} is the execution action at the previous moment, s_t is the preprocessed network state at the current moment, and r_{t-1} is the reward obtained from the network environment after taking action a_{t-1} and entering the next network state s_t;
randomly selecting P-1 experiences from the experience library, which together with the experience (s_{t-1}, a_{t-1}, r_{t-1}, s_t) form P experience samples;
inputting the network state at the next moment in the p-th experience sample into the second width learning network to obtain the maximum value evaluation value max_{a'} Q(s'_p, a'; θ_T), wherein θ_T represents the weight parameter of the second width learning network, s'_p is the network state at the next moment in the p-th experience, a' represents a possible execution action of the second width learning network, and Q(s'_p, a'; θ_T) is the value evaluation value corresponding to taking action a';
calculating the target value y_p corresponding to the network state and execution action (s_p, a_p) at the moment of the p-th experience sample:

y_p = r_p + γ · max_{a'} Q(s'_p, a'; θ_T),

wherein γ is a discount factor, r_p is the reward at the moment of the p-th experience sample, and 1 ≤ p ≤ P;
taking the network state and the execution action at the moment in all P experience samples as input samples of the first width learning network, taking the target values as the expected output, and training the first width learning network by a weight calculation method based on ridge regression;
processing the preprocessed network state at the current moment by using the first width learning network after online training to obtain the optimal execution action corresponding to the current state comprises:
processing the preprocessed network state information s_t at the current moment with the first width learning network after online training, and outputting the optimal execution action a_t at the current moment:

a_t = a random execution action a, if the random factor is smaller than the dynamic threshold ε; otherwise, a_t = argmax_a Q(s_t, a; θ_E),

wherein θ_E represents the weight parameter of the first width learning network after online training, ε is the dynamic threshold, the random factor is a random number in [0,1], a represents a possible execution action of the second width learning network, Q(s_t, a; θ_E) is the value evaluation value corresponding to taking action a, and argmax_a Q(s_t, a; θ_E) is the execution action corresponding to the maximum value of Q(s_t, a; θ_E).
2. The network control method according to claim 1, wherein obtaining the network state of the fine-grained data plane at the current moment comprises: performing normalized extraction and format unification on the network state information at the current moment.
3. The network control method of claim 1, wherein the method further comprises: randomly generating initial weight parameters of the first width learning network, and assigning the initial weight parameters of the first width learning network to the second width learning network.
4. The network control method of claim 1, wherein the method further comprises: periodically acquiring the network parameters of the first width learning network, and updating the network parameters of the second width learning network.
5. A network control apparatus, comprising:
the acquisition unit is used for acquiring the network state of the fine-grained data plane at the current moment;
the online training unit is used for performing online training on the first width learning network by utilizing an experience library for storing local network environment historical data and the second width learning network;
the optimal execution action acquisition unit is used for processing the network state at the current moment by utilizing the first width learning network completed by online training to obtain the optimal execution action corresponding to the network state at the current moment;
the issuing unit is used for packaging the optimal execution action at the current moment into a control rule data packet and then issuing the control rule data packet;
the experience base stores experiences at a plurality of successive time instants, the experiences comprising: the network state of the moment, the execution action of the moment, the reward of the moment and the network state of the next moment;
the second width learning network and the first width learning network have the same structure;
the online training unit is specifically configured to:
will experience
Figure P_220325160938299_299351001
Putting the obtained product into an experience library, wherein,
Figure P_220325160938330_330587002
the network state at the last moment after preprocessing,
Figure P_220325160938361_361866003
in order to perform the action at the last moment,
Figure P_220325160938377_377493004
the network state at the current moment after preprocessing,
Figure P_220325160938410_410682005
to take an action
Figure P_220325160938426_426326006
Entering the network state of the next moment
Figure P_220325160938457_457548007
Later, the reward obtained from the network environment;
randomly selecting P-1 experiences from a library of experiences, and experience
Figure P_220325160938473_473183001
Forming P experiences as experience samples;
inputting the next network state of the moment in the p-th empirical sample into the second width learning network to obtain the maximum value evaluation value
Figure P_220325160938488_488779001
Figure P_220325160938520_520055002
A weight parameter representing the second breadth learning network;
Figure P_220325160938551_551295003
the network state of the next moment in the p-th experience is obtained;
Figure P_220325160938566_566944004
representing possible actions to be performed by the second width learning network,
Figure P_220325160938627_627007005
to take an action
Figure P_220325160938658_658249006
A corresponding value assessment value;
calculating the network state and performing actions at the moment of the p-th empirical sample
Figure P_220325160938705_705138001
Corresponding target value
Figure P_220325160938721_721231002
Figure P_220325160938752_752487001
Wherein the content of the first and second substances,γis a factor of the number of the first and second,
Figure P_220325160938817_817437001
the reward of the p-th experience sample at the moment is less than or equal to 1pP
Taking the network state and the execution action of the time in all P experience samples as input samples of a first width learning network, taking a target value as expected output, and training the first width learning network by adopting a weight calculation method based on ridge regression;
the optimal execution action acquisition unit is specifically configured to:
the first width learning network after on-line training is used for preprocessing the network state information at the current moment
Figure P_220325160938848_848671001
Processing and outputting the optimal execution action at the current time
Figure P_220325160938879_879943002
Figure P_220325160938895_895551001
Wherein the content of the first and second substances,θ E a weight parameter representing a first width learning network on-line training is completed,
Figure P_220325160938973_973680001
for dynamic threshold, the random factor is [0,1 ]]A random number in between;arepresenting possible actions to be performed by the second width learning network,
Figure P_220325160939006_006390002
to take an actionaA corresponding value assessment value;
Figure P_220325160939037_037661003
is that make
Figure P_220325160939068_068913004
And obtaining the execution action corresponding to the maximum value.
6. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the network control method according to any of claims 1-4 when executing the computer program.
7. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the network control method of any one of claims 1-4.
CN202210154404.0A 2022-02-21 2022-02-21 Network control method and device, electronic equipment and storage medium Active CN114202066B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210154404.0A CN114202066B (en) 2022-02-21 2022-02-21 Network control method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210154404.0A CN114202066B (en) 2022-02-21 2022-02-21 Network control method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114202066A CN114202066A (en) 2022-03-18
CN114202066B true CN114202066B (en) 2022-04-26

Family

ID=80645716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210154404.0A Active CN114202066B (en) 2022-02-21 2022-02-21 Network control method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114202066B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111010294A (en) * 2019-11-28 2020-04-14 国网甘肃省电力公司电力科学研究院 Electric power communication network routing method based on deep reinforcement learning
CN112491714A (en) * 2020-11-13 2021-03-12 安徽大学 Intelligent QoS route optimization method and system based on deep reinforcement learning in SDN environment
CN112564118A (en) * 2020-11-23 2021-03-26 广西大学 Distributed real-time voltage control method capable of expanding quantum deep width learning
CN113328938A (en) * 2021-05-25 2021-08-31 电子科技大学 Network autonomous intelligent management and control method based on deep reinforcement learning
US11121788B1 (en) * 2020-06-08 2021-09-14 Wuhan University Channel prediction method and system for MIMO wireless communication system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111010294A (en) * 2019-11-28 2020-04-14 国网甘肃省电力公司电力科学研究院 Electric power communication network routing method based on deep reinforcement learning
US11121788B1 (en) * 2020-06-08 2021-09-14 Wuhan University Channel prediction method and system for MIMO wireless communication system
CN112491714A (en) * 2020-11-13 2021-03-12 安徽大学 Intelligent QoS route optimization method and system based on deep reinforcement learning in SDN environment
CN112564118A (en) * 2020-11-23 2021-03-26 广西大学 Distributed real-time voltage control method capable of expanding quantum deep width learning
CN113328938A (en) * 2021-05-25 2021-08-31 电子科技大学 Network autonomous intelligent management and control method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Deep Q-Learning Aided Networking, Caching, and Computing Resources Allocation in Software-Defined Satellite-Terrestrial Networks; Chao Qiu et al.; IEEE Transactions on Vehicular Technology; Vol. 68, No. 6; 30 June 2019; full text *

Also Published As

Publication number Publication date
CN114202066A (en) 2022-03-18

Similar Documents

Publication Publication Date Title
Zhang et al. A multi-agent reinforcement learning approach for efficient client selection in federated learning
Wang et al. Machine learning for networking: Workflow, advances and opportunities
CN112491714B (en) Intelligent QoS route optimization method and system based on deep reinforcement learning in SDN environment
Yao et al. AI routers & network mind: A hybrid machine learning paradigm for packet routing
WO2020077682A1 (en) Service quality evaluation model training method and device
CN112422443B (en) Adaptive control method, storage medium, equipment and system of congestion algorithm
US11507887B2 (en) Model interpretability using proxy features
US20230145097A1 (en) Autonomous traffic (self-driving) network with traffic classes and passive and active learning
CN116743635B (en) Network prediction and regulation method and network regulation system
US20230112534A1 (en) Artificial intelligence planning method and real-time radio access network intelligence controller
CN111556173B (en) Service chain mapping method based on reinforcement learning
WO2023045565A1 (en) Network management and control method and system thereof, and storage medium
Sun et al. Accelerating convergence of federated learning in MEC with dynamic community
Huang et al. Intelligent traffic control for QoS optimization in hybrid SDNs
CN108880909B (en) Network energy saving method and device based on reinforcement learning
Chen et al. Deep learning-based traffic prediction for energy efficiency optimization in software-defined networking
Wei et al. GRL-PS: Graph embedding-based DRL approach for adaptive path selection
Zheng et al. Enabling robust DRL-driven networking systems via teacher-student learning
Zeng et al. Multi-agent reinforcement learning for adaptive routing: A hybrid method using eligibility traces
CN114202066B (en) Network control method and device, electronic equipment and storage medium
Sneha et al. Prediction of network congestion at router using machine learning technique
Hu et al. Clustered data sharing for Non-IID federated learning over wireless networks
CN115022231A (en) Optimal path planning method and system based on deep reinforcement learning
Zheng et al. Leveraging domain knowledge for robust deep reinforcement learning in networking
WO2021064769A1 (en) System, method, and control device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant