CN115102906A - Load balancing method based on deep reinforcement learning drive - Google Patents

Load balancing method based on deep reinforcement learning drive

Info

Publication number
CN115102906A
CN115102906A
Authority
CN
China
Prior art keywords
wise
network
link
node
edge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210700058.1A
Other languages
Chinese (zh)
Inventor
吴立军
曾祥云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202210700058.1A priority Critical patent/CN115102906A/en
Publication of CN115102906A publication Critical patent/CN115102906A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 47/00 Traffic control in data switching networks
    • H04L 47/10 Flow control; Congestion control
    • H04L 47/12 Avoiding congestion; Recovering from congestion
    • H04L 47/125 Avoiding congestion; Recovering from congestion by balancing the load, e.g. traffic engineering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/12 Discovery or management of network topologies
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/14 Network analysis or design
    • H04L 41/145 Network analysis or design involving simulating, designing, planning or modelling of a network

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a load balancing method driven by deep reinforcement learning, which comprises the steps of obtaining a network topology structure, switch characteristics and data link characteristics; generating node-wise graph characteristics according to the network topology structure and the switch characteristics; generating edge-wise graph characteristics according to the network topology structure and the data link characteristics; and constructing and training a BRGCN model, and acquiring a routing path according to the node-wise graph characteristics and the edge-wise graph characteristics by using the trained BRGCN model. The invention combines deep reinforcement learning with the graph convolutional neural network and the recurrent neural network and applies the combination to the load balancing method, so that the model can make decisions according to state information, takes the structure and topological relations of the network into account as decision factors, gains the ability to process time-series information, and thereby achieves better performance.

Description

Load balancing method based on deep reinforcement learning drive
Technical Field
The invention relates to the technical field of network optimization of an SDN data center, in particular to a load balancing method based on deep reinforcement learning drive.
Background
Since the beginning of the 21st century, information technology has developed ever faster, the number of internet users worldwide has increased remarkably, the amount of data in the network has grown explosively, and the internet has entered a new era. In particular, over the last 10 years, the various activities of an ever-growing number of internet users have produced information data such as voice, text and pictures on an increasingly large scale. The development of the internet and the growth of its user base have also given rise to a large number of internet companies that provide various application services to users, and the traffic and resources consumed by the operation of these services depend on the support of data center networks. Because the dynamics of a data center network introduce large uncertainty, a load balancing algorithm routes the flows in the network over appropriately selected paths so that the load on the network's data links is balanced and the stability of the network is ensured.
In the SDN architecture, the controller can detect network information in real time; by deploying SDN in a data center network, the network can be managed according to its global information. Owing to the characteristics of SDN and the multi-path structure of the data center network, the controller can select an appropriate routing path for each flow according to the global state of the network. Therefore, after SDN is introduced into the data center network, load balancing algorithms have considerably more room for improvement. However, owing to the cost of computing routing paths and the dynamic nature of network traffic, load balancing algorithms in such networks still suffer from problems such as slow response and heavy computation.
Accordingly, a technical solution is desired to overcome or at least alleviate at least one of the above-mentioned drawbacks of the prior art.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a load balancing method based on deep reinforcement learning driving.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that:
a deep reinforcement learning driven load balancing method comprises the following steps:
s1, acquiring a network topology structure, switch characteristics and data link characteristics;
s2, generating node-wise graph characteristics according to the network topology structure and the switch characteristics;
s3, generating edge-wise graph characteristics according to the network topology structure and the data link characteristics;
s4, constructing and training a BRGCN model, and acquiring a routing path according to the node-wise graph characteristics and the edge-wise graph characteristics by using the trained BRGCN model.
Optionally, in step S1:
acquiring the network topology structure specifically comprises obtaining the connection relationships between the switches and the data links;
acquiring the switch characteristics specifically comprises obtaining the load information and flow table utilization information of each switch, together with one-hot codes of the source switch address and the destination switch address;
acquiring the data link characteristics specifically comprises obtaining the delay, packet loss rate and link utilization information of the data links in the network.
Optionally, step S2 specifically includes:
and constructing a node-wise graph structure by taking the switches as nodes and taking data links between the switches as edges.
Optionally, step S3 specifically includes:
and constructing an edge-wise graph structure by taking the data links as nodes and the public switches between the data links as edges.
Optionally, the BRGCN network model constructed in step S4 specifically includes:
a node-wise recurrent graph neural network in which three graph convolutional neural network layers and three recurrent neural network layers are arranged alternately;
an edge-wise recurrent graph neural network in which three graph convolutional neural network layers and three recurrent neural network layers are arranged alternately;
and the outputs of the node-wise recurrent graph neural network and the edge-wise recurrent graph neural network are concatenated at the end and connected to a layer of fully connected neural network.
Optionally, the training of the BRGCN network model in step S4 specifically includes:
acquiring a network state after the routing path is executed;
calculating rewards according to the network state after the routing path is executed;
storing the network state and the reward after the routing path is executed to an experience pool;
and selecting a group of experiences from the experience pool by adopting a random continuous sampling strategy to calculate action values, and updating the model parameters by using a loss function based on the action values.
Optionally, the calculation manner of the reward is:
[The reward formula is presented as an image in the original publication; its variables are defined below.]
wherein lr_i denotes the utilization of the i-th link, Ave_lr denotes the mean of the current link utilizations, Cor_lr denotes the link utilization correction coefficient, α denotes the link utilization evaluation weight coefficient, Tol_ade denotes the mean change of the link average delay, Met_ade denotes the link average delay at the current time, Mle_ade denotes the link average delay at the previous time, Cor_ade denotes the link average delay correction coefficient, β denotes the link average delay evaluation weight coefficient, Tol_apl denotes the mean change of the link average packet loss rate, Met_apl denotes the link average packet loss rate at the current time, Mle_apl denotes the link average packet loss rate at the previous time, Cor_apl denotes the link average packet loss rate correction coefficient, γ denotes the link average packet loss rate evaluation weight coefficient, Tlo_al denotes the mean change of the link average load, Met_al denotes the link average load at the current time, Mle_al denotes the link average load at the previous time, Cor_al denotes the link average load correction coefficient, θ denotes the link average load evaluation weight coefficient, and L_num denotes the number of links.
Optionally, the action value is calculated by:
y_i = r_i + δ·Q′(s′_i, a_max(Q(s′_i, a_i, ω)); ω′)
wherein y_i denotes the action value, r_i denotes the reward, δ denotes the attenuation factor, Q denotes the Q network, Q′ denotes the target network, s′_i denotes the network state at the next time, a_i denotes the action, ω denotes the parameters of the Q network, and ω′ denotes the parameters of the target network.
Optionally, the action-value-based loss function is specifically:
L(ω) = (1/M) Σ_{i=1}^{M} (y_i − Q(s_i, a_i, ω))²
wherein M denotes the number of samples, y_i denotes the action value, Q denotes the Q network, s_i denotes the state at the current time, a_i denotes the action, and ω denotes the parameters of the Q network.
Optionally, the obtaining of the routing path according to the node-wise graph feature and the edge-wise graph feature by using the trained BRGCN network model in step S4 specifically includes:
constructing a node-wise feature matrix by utilizing the switch features, constructing a node-wise adjacency matrix and a node-wise degree matrix according to the node-wise graph structure, constructing an edge-wise feature matrix by utilizing the data link features, and constructing an edge-wise adjacency matrix and an edge-wise degree matrix according to the edge-wise graph structure;
obtaining an action value table from the trained BRGCN network model according to the node-wise feature matrix, the node-wise adjacency matrix, the node-wise degree matrix, the edge-wise feature matrix, the edge-wise adjacency matrix and the edge-wise degree matrix;
and selecting an action according to the action value table by using a greedy strategy as the routing path of the flow.
The invention has the following beneficial effects:
1. The method combines deep reinforcement learning with the graph convolutional neural network and the recurrent neural network and applies them to the load balancing algorithm, so that the model can make decisions according to state information, takes the structure and topological relations of the network into account as decision factors, gains the ability to process time-series information, and thereby achieves better performance.
2. For the optimization target of load balancing, the method uses the variance of the link utilizations as the main reward factor, which maintains the long-term stability of the network state better than using the change in the maximum link utilization.
3. The method uses a more comprehensive network state, including multidimensional characteristics of the switches and data links as well as the topological structure of the network, ensuring that the model makes decisions based on a more complete view of the state.
Drawings
Fig. 1 is a schematic flowchart of a deep reinforcement learning-driven load balancing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a node-wise graph structure and an edge-wise graph structure in an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a BRGCN network model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a node-wise recurrent graph neural network structure according to an embodiment of the present invention;
fig. 5 is an exemplary structural diagram of an electronic device in the embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of the embodiments; to those skilled in the art, various changes may be made without departing from the spirit and scope of the invention as defined in the appended claims, and everything produced using the inventive concept of the present invention is protected.
As shown in fig. 1, an embodiment of the present invention provides a deep reinforcement learning driven load balancing method, including the following steps S1 to S4:
s1, acquiring a network topology structure, switch characteristics and data link characteristics;
in an alternative embodiment of the present invention, in step S1:
the method specifically comprises the steps of obtaining a connection relation between a switch and a data link;
the method specifically comprises the steps of obtaining load information of the switch, flow table utilization rate information, and one-hot codes of a source switch address and a destination switch address;
the obtaining of the data link characteristics specifically includes obtaining delay, packet loss rate, and link utilization information of the data link in the network.
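As a concrete illustration of how such a state could be assembled before being handed to the model, the following sketch builds a per-switch and a per-link feature matrix with NumPy. The feature layout (in particular, encoding the source/destination one-hot codes as one indicator component per switch) and all numeric values are assumptions made for illustration; they are not prescribed by the patent.

```python
# Illustrative sketch (assumed layout): per-switch and per-link feature matrices.
import numpy as np

num_switches, num_links = 6, 6
switch_load = np.random.rand(num_switches)       # switch load information
flow_table_util = np.random.rand(num_switches)   # flow table utilization
src, dst = 0, 5                                  # current flow's source/destination switch

one_hot_src = np.eye(num_switches)[src]          # one-hot code of source switch address
one_hot_dst = np.eye(num_switches)[dst]          # one-hot code of destination switch address

# Node-wise feature matrix: one row per switch.
node_features = np.column_stack([switch_load, flow_table_util, one_hot_src, one_hot_dst])

# Edge-wise feature matrix: one row per data link [delay, packet loss rate, utilization].
link_features = np.random.rand(num_links, 3)

print(node_features.shape, link_features.shape)  # (6, 4) (6, 3)
```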
S2, generating node-wise graph characteristics according to the network topology structure and the switch characteristics;
in an optional embodiment of the present invention, step S2 specifically includes:
and constructing a node-wise graph structure by taking the switches as nodes and taking data links between the switches as edges.
S3, generating edge-wise graph characteristics according to the network topology structure and the data link characteristics;
in an optional embodiment of the present invention, step S3 specifically includes:
and constructing an edge-wise graph structure by taking the data links as nodes and the public switches between the data links as edges.
As shown in FIG. 2, taking the structure on the left as an example, each node a–f represents a switch and each edge A–F represents a data link between switches. The modeled node-wise structure is shown in the left diagram and the edge-wise structure in the right diagram. The node-wise structure diagram (left) has 6 nodes a–f and 6 edges A–F. When the edge-wise structure (right) is constructed, the original edges A–F are regarded as the nodes of the new graph, and the common switches between them are regarded as the edges of the new graph, so a switch shared by several links gives rise to several edges of the new graph. The new graph therefore has 6 nodes A–F, corresponding to the 6 edges of the original graph; meanwhile, it has 10 edges c1–c6, d and e1–e3, which correspond respectively to the shared nodes c, d and e of the original graph, each new edge representing the mapping of a pair of original edges that share a common node.
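Since the edge-wise structure described above is exactly the line graph of the node-wise structure, it can be derived mechanically. The following is a minimal sketch using networkx; the switch names and the topology are placeholders chosen only to show the construction.

```python
# Minimal sketch: node-wise graph (switches = nodes, links = edges) and its
# edge-wise counterpart obtained as the line graph, in which two links are
# adjacent whenever they share a common switch.
import networkx as nx

node_wise = nx.Graph()
node_wise.add_edges_from([
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("c", "d"), ("c", "e"), ("d", "e"),
])

edge_wise = nx.line_graph(node_wise)   # nodes of edge_wise are the links above

print(sorted(edge_wise.nodes()))
print(edge_wise.number_of_nodes(), edge_wise.number_of_edges())
```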
S4, constructing and training a BRGCN model, and acquiring a routing path according to the node-wise graph characteristics and the edge-wise graph characteristics by using the trained BRGCN model.
In an optional embodiment of the present invention, the BRGCN network model constructed in step S4 specifically includes:
a node-wise recurrent graph neural network in which three graph convolutional neural network layers and three recurrent neural network layers are arranged alternately;
an edge-wise recurrent graph neural network in which three graph convolutional neural network layers and three recurrent neural network layers are arranged alternately;
and the outputs of the node-wise recurrent graph neural network and the edge-wise recurrent graph neural network are concatenated at the end and connected to a layer of fully connected neural network.
As shown in FIG. 3 and FIG. 4, the deep neural network in the BRGCN network model is constructed by combining graph convolutional neural networks and recurrent neural networks. The node-wise recurrent graph neural network has the same structure as the edge-wise recurrent graph neural network: each comprises three graph convolutional layers and three recurrent layers, with one recurrent layer connected behind each graph convolutional layer.
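To make the two-branch structure concrete, the following is a rough PyTorch sketch of a BRGCN-style network: each branch alternates three graph convolutional layers with three recurrent (GRU) layers, the two branch outputs are concatenated, and a fully connected layer produces the action value table. Layer widths, the choice of GRU cells, the mean pooling over nodes and the normalization used in the graph convolution are all assumptions; the patent does not give these details.

```python
# Rough sketch (assumed hyperparameters) of a BRGCN-style two-branch model in PyTorch.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution: relu(D^-1/2 (A+I) D^-1/2 X W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = torch.diag(a_hat.sum(1).pow(-0.5))
        return torch.relu(self.lin(d_inv_sqrt @ a_hat @ d_inv_sqrt @ x))

class RecurrentGCNBranch(nn.Module):
    """Three GCN layers, each followed by a GRU that runs over the time axis."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        dims = [in_dim, hid_dim, hid_dim]
        self.gcns = nn.ModuleList([GCNLayer(d, hid_dim) for d in dims])
        self.grus = nn.ModuleList([nn.GRU(hid_dim, hid_dim) for _ in dims])

    def forward(self, x_seq, adj):
        # x_seq: (T, N, F) -- a short history of graph feature snapshots.
        h = x_seq
        for gcn, gru in zip(self.gcns, self.grus):
            h = torch.stack([gcn(h[t], adj) for t in range(h.size(0))])
            h, _ = gru(h)                     # recurrence over the T snapshots
        return h[-1]                          # per-node embedding at the latest step

class BRGCN(nn.Module):
    def __init__(self, node_dim, edge_dim, hid_dim, num_actions):
        super().__init__()
        self.node_branch = RecurrentGCNBranch(node_dim, hid_dim)
        self.edge_branch = RecurrentGCNBranch(edge_dim, hid_dim)
        self.head = nn.Linear(2 * hid_dim, num_actions)

    def forward(self, node_x, node_adj, edge_x, edge_adj):
        n = self.node_branch(node_x, node_adj).mean(dim=0)   # pool over switches
        e = self.edge_branch(edge_x, edge_adj).mean(dim=0)   # pool over links
        return self.head(torch.cat([n, e]))                  # action value table
```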
The step S4 of training the BRGCN network model specifically includes:
Firstly, acquiring the network state after the routing path is executed; the network state comprises a node-wise feature matrix, a node-wise adjacency matrix and a node-wise degree matrix, which together serve as the input of the node-wise recurrent graph neural network. The network state further comprises an edge-wise feature matrix, an edge-wise adjacency matrix and an edge-wise degree matrix, which together serve as the input of the edge-wise recurrent graph neural network.
Then calculating reward according to the network state after executing the routing path; the calculation mode of the reward is as follows:
[The reward formula is presented as an image in the original publication; its variables are defined below.]
wherein lr_i denotes the utilization of the i-th link, Ave_lr denotes the mean of the current link utilizations, Cor_lr denotes the link utilization correction coefficient, α denotes the link utilization evaluation weight coefficient, Tol_ade denotes the mean change of the link average delay, Met_ade denotes the link average delay at the current time, Mle_ade denotes the link average delay at the previous time, Cor_ade denotes the link average delay correction coefficient, β denotes the link average delay evaluation weight coefficient, Tol_apl denotes the mean change of the link average packet loss rate, Met_apl denotes the link average packet loss rate at the current time, Mle_apl denotes the link average packet loss rate at the previous time, Cor_apl denotes the link average packet loss rate correction coefficient, γ denotes the link average packet loss rate evaluation weight coefficient, Tlo_al denotes the mean change of the link average load, Met_al denotes the link average load at the current time, Mle_al denotes the link average load at the previous time, Cor_al denotes the link average load correction coefficient, θ denotes the link average load evaluation weight coefficient, and L_num denotes the number of links.
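Because the reward formula itself is only available as an image in the published text, the following sketch merely combines the terms defined above in one plausible way: the variance of the link utilizations as the main penalty, plus weighted, coefficient-corrected changes of the average delay, average packet loss rate and average load. The sign convention, the placement of the correction coefficients and the default weights are guesses for illustration only.

```python
# Assumed reward shape, built only from the variables defined above;
# the exact combination used in the patent is not reproduced here.
import numpy as np

def reward(lr, met_ade, mle_ade, met_apl, mle_apl, met_al, mle_al,
           cor_lr=1.0, cor_ade=1.0, cor_apl=1.0, cor_al=1.0,
           alpha=0.4, beta=0.2, gamma=0.2, theta=0.2):
    lr = np.asarray(lr)                        # per-link utilization, length L_num
    util_var = np.mean((lr - lr.mean()) ** 2)  # variance of link utilization
    tol_ade = met_ade - mle_ade                # change of link average delay
    tol_apl = met_apl - mle_apl                # change of link average packet loss rate
    tlo_al = met_al - mle_al                   # change of link average load
    return -(alpha * util_var / cor_lr
             + beta * tol_ade / cor_ade
             + gamma * tol_apl / cor_apl
             + theta * tlo_al / cor_al)
```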
Then storing the network state and the reward after the routing path is executed to an experience pool;
And finally, a group of experiences is selected from the experience pool by adopting a random continuous sampling strategy to calculate action values, and the model parameters are updated using a loss function based on the action values. The action value is calculated as follows:
y_i = r_i + δ·Q′(s′_i, a_max(Q(s′_i, a_i, ω)); ω′)
wherein y_i denotes the action value, r_i denotes the reward, δ denotes the attenuation factor, Q denotes the Q network, Q′ denotes the target network, s′_i denotes the network state at the next time, a_i denotes the action, ω denotes the parameters of the Q network, and ω′ denotes the parameters of the target network.
The loss function based on the action value is specifically as follows:
L(ω) = (1/M) Σ_{i=1}^{M} (y_i − Q(s_i, a_i, ω))²
wherein M denotes the number of samples, y_i denotes the action value, Q denotes the Q network, s_i denotes the state at the current time, a_i denotes the action, and ω denotes the parameters of the Q network.
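A minimal sketch of one parameter update following these two formulas is given below: the target y_i uses the Q network to pick the next action and the target network to evaluate it, and the loss is the mean squared difference over the sampled batch. The state representation (a tuple of the six matrices), the optimizer and the batch handling are assumptions.

```python
# Sketch of one update step with the action-value target and mean-squared loss (PyTorch).
import torch

def update(q_net, target_net, optimizer, batch, delta=0.95):
    # batch: list of (state, action, reward, next_state); each state is the tuple
    # of tensors the BRGCN-style model expects as input.
    losses = []
    for s, a, r, s_next in batch:
        with torch.no_grad():
            a_max = q_net(*s_next).argmax()               # action selected by Q
            y = r + delta * target_net(*s_next)[a_max]    # evaluated by target Q'
        losses.append((y - q_net(*s)[a]) ** 2)
    loss = torch.stack(losses).mean()                      # (1/M) * sum of squared errors
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```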
In step S4, acquiring a routing path according to the node-wise graph feature and the edge-wise graph feature using the trained BRGCN network model specifically includes:
constructing a node-wise feature matrix by utilizing the switch feature, constructing a node-wise adjacency matrix and a node-wise degree matrix according to a node-wise graph structure, constructing an edge-wise feature matrix by utilizing the data link feature, and constructing an edge-wise adjacency matrix and an edge-wise degree matrix according to an edge-wise graph structure;
obtaining an action value table from the trained BRGCN model according to the node-wise feature matrix, the node-wise adjacency matrix, the node-wise degree matrix, the edge-wise feature matrix, the edge-wise adjacency matrix and the edge-wise degree matrix. As shown in FIG. 3 and FIG. 4, specifically, the node-wise feature matrix, the node-wise adjacency matrix and the node-wise degree matrix are used as the input of the node-wise recurrent graph neural network, and the edge-wise feature matrix, the edge-wise adjacency matrix and the edge-wise degree matrix are used as the input of the edge-wise recurrent graph neural network. After processing by the two recurrent graph neural networks, the network state is turned into two groups of features, namely the outputs of the two branches; the two groups of features are concatenated, and finally the action value table is output through the fully connected neural network.
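For illustration, a forward pass of the BRGCN-style sketch given earlier could look as follows; the tensor shapes (a history of T = 4 snapshots of a 6-switch, 6-link network) and the identity adjacency matrices are placeholders.

```python
# Example forward pass reusing the BRGCN class from the sketch above.
import torch

model = BRGCN(node_dim=4, edge_dim=3, hid_dim=32, num_actions=8)
node_x = torch.rand(4, 6, 4)    # (T, switches, switch features)
edge_x = torch.rand(4, 6, 3)    # (T, links, link features)
node_adj = torch.eye(6)         # placeholder node-wise adjacency matrix
edge_adj = torch.eye(6)         # placeholder edge-wise adjacency matrix

action_values = model(node_x, node_adj, edge_x, edge_adj)
print(action_values.shape)      # torch.Size([8]) -- one value per candidate path
```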
An action is then selected according to the action value table by a greedy strategy as the routing path of the flow. The greedy strategy is specifically as follows: if the greediness is e, the model selects and executes the action with the highest value in the action value table with probability e, and randomly selects an action to execute with probability 1−e.
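The selection rule just described can be sketched in a few lines; the default greediness value is an assumption.

```python
# Greedy strategy sketch: with probability e execute the highest-valued action,
# otherwise execute a randomly chosen action.
import random

def select_action(action_values, e=0.9):
    if random.random() < e:
        return max(range(len(action_values)), key=lambda i: action_values[i])
    return random.randrange(len(action_values))
```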
After the SDN controller is started, the network is initialized first, and basic information of all switches and links in the network and a network topology structure formed by the physical devices are obtained.
According to the topological structure of the network, the switches are regarded as nodes and the data links among the switches as edges to model the node-wise graph structure; the data links are regarded as nodes and the common switches among the data links as edges to model the edge-wise graph structure.
The invention uses the SDN controller to constantly monitor the state of the whole network: query messages are sent to the switches through the SDN communication protocol, the switches return reply messages, and the switch characteristics and data link characteristics of the network are obtained by parsing the reply messages.
Each time a new flow arrives at the network, the SDN controller sends the current network state to the BRGCN network model. After the model analyzes the network state, it decides a reasonable routing path, and the current flow is then routed according to that path.
The deep-reinforcement-learning-driven load balancing method of the present application, which combines the graph convolutional neural network and the recurrent neural network, is further described below by way of an example; it should be understood that the example does not constitute any limitation to the present application.
In this embodiment, the hardware platform is implemented by a Dell Precision T7920 tower workstation, and is programmed using Java and Python languages.
We use Java to implement the SDN controller, which performs state detection, routing control, data processing and reward calculation for the entire network. Whenever a new flow arrives at the network, the controller sends the latest network state to the model and then routes the flow according to the routing path decided by the model. After the flow is routed, the controller collects the new network state, calculates the reward, and sends the (state, action, reward, new state) tuple to the experience pool as an experience.
We implemented the deep reinforcement learning model using Python. During operation, the model receives requests from the controller together with the state data of the network, which comprise a node-wise feature matrix, a node-wise adjacency matrix, a node-wise degree matrix, an edge-wise feature matrix, an edge-wise adjacency matrix and an edge-wise degree matrix. The state data are input into the two recurrent graph neural networks for processing, and after concatenation and the fully connected layer an action value table is output; the model then decides a routing path from the action value table by the greedy strategy and sends it to the SDN controller for execution. The controller collects the new state after routing, calculates the reward, and sends the experience to the model's experience pool. Once the stored experiences reach a certain scale, the model adopts the random continuous sampling strategy to obtain a group of experiences and update the parameters of the neural network.
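The experience pool and the random continuous sampling strategy used in this embodiment could be organized as in the sketch below, where a random starting index is drawn and a consecutive slice of stored experiences is returned. The buffer capacity and batch size are assumptions, and this is only one reading of "random continuous sampling".

```python
# Sketch of an experience pool with random continuous sampling.
import random
from collections import deque

class ExperiencePool:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # bounded pool of experiences

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample_continuous(self, batch_size=32):
        # Draw a random starting index, then return a consecutive group of experiences.
        if len(self.buffer) < batch_size:
            return []
        start = random.randrange(len(self.buffer) - batch_size + 1)
        return [self.buffer[start + i] for i in range(batch_size)]
```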
The load balancing method driven by deep reinforcement learning of the present application can realize intelligent routing of flows in an SDN data center network and achieve load balancing; compared with methods such as ECMP, it improves the maximum link utilization index by more than 20% and is also superior to ECMP-like methods on indexes such as delay and packet loss rate.
The application also provides an electronic device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the computer program to implement the load balancing method based on deep reinforcement learning driving.
The present application also provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the load balancing method based on deep reinforcement learning driving can be implemented.
Fig. 5 is an exemplary block diagram of an electronic device capable of implementing the deep reinforcement learning driven load balancing method according to an embodiment of the present application.
As shown in fig. 5, the electronic device includes an input device 501, an input interface 502, a central processor 503, a memory 504, an output interface 505, and an output device 506. The input interface 502, the central processing unit 503, the memory 504 and the output interface 505 are connected to each other through a bus 507, and the input device 501 and the output device 506 are connected to the bus 507 through the input interface 502 and the output interface 505, respectively, and further connected to other components of the electronic device. Specifically, the input device 501 receives input information from the outside and transmits the input information to the central processor 503 through the input interface 502; the central processor 503 processes input information based on computer-executable instructions stored in the memory 504 to generate output information, temporarily or permanently stores the output information in the memory 504, and then transmits the output information to the output device 506 through the output interface 505; the output device 506 outputs the output information to the outside of the electronic device for use by the user.
That is, the electronic device shown in fig. 5 may also be implemented to include: a memory storing computer executable instructions; and one or more processors that, when executing the computer executable instructions, may implement the deep reinforcement learning driven load balancing method described in conjunction with fig. 1.
In one embodiment, the electronic device shown in fig. 5 may be implemented to include: a memory 504 configured to store executable program code; one or more processors 503 configured to execute the executable program code stored in the memory 504 to perform the load balancing method based on deep reinforcement learning driving in the above-described embodiments.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Furthermore, it will be obvious that the term "comprising" does not exclude other elements or steps. A plurality of units, modules or devices recited in the device claims may also be implemented by one unit or overall device by software or hardware.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks identified in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The Processor in this embodiment may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable gate array (FPGA) or other Programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may be used to store computer programs and/or modules, and the processor may implement various functions of the apparatus/terminal device by running or executing the computer programs and/or modules stored in the memory, as well as by invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
In this embodiment, the module/unit integrated with the apparatus/terminal device may be stored in a computer-readable storage medium if it is implemented in the form of a software functional unit and sold or used as a separate product. Based on such understanding, all or part of the flow in the method according to the embodiments of the present invention may also be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of the embodiments of the method. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying computer program code, recording medium, USB flash disk, removable hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution media, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in the jurisdiction. Although the present application has been described with reference to the preferred embodiments, it is not intended to limit the present application, and those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application.
The foregoing is merely a preferred embodiment of this invention, which is intended to be illustrative, not limiting; it will be appreciated by those skilled in the art that many changes, modifications and equivalents can be made thereto within the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A load balancing method based on deep reinforcement learning drive is characterized by comprising the following steps:
s1, acquiring a network topology structure, switch characteristics and data link characteristics;
s2, generating node-wise graph characteristics according to the network topology structure and the switch characteristics;
s3, generating edge-wise graph characteristics according to the network topology structure and the data link characteristics;
s4, constructing and training a BRGCN model, and acquiring a routing path according to the node-wise graph characteristics and the edge-wise graph characteristics by using the trained BRGCN model.
2. The load balancing method based on deep reinforcement learning driving according to claim 1, wherein in step S1:
acquiring the network topology structure specifically comprises obtaining the connection relationships between the switches and the data links;
acquiring the switch characteristics specifically comprises obtaining the load information and flow table utilization information of each switch, together with one-hot codes of the source switch address and the destination switch address;
acquiring the data link characteristics specifically comprises obtaining the delay, packet loss rate and link utilization information of the data links in the network.
3. The load balancing method based on deep reinforcement learning driving according to claim 1, wherein the step S2 specifically includes:
and constructing a node-wise graph structure by taking the switches as nodes and taking data links between the switches as edges.
4. The load balancing method based on deep reinforcement learning driving according to claim 1, wherein the step S3 specifically includes:
and taking the data links as nodes and the public switch among the data links as edges to construct an edge-wise graph structure.
5. The load balancing method based on deep reinforcement learning driving as claimed in claim 1, wherein the BRGCN network model constructed in step S4 specifically includes:
a node-wise recurrent graph neural network in which three graph convolutional neural network layers and three recurrent neural network layers are arranged alternately;
an edge-wise recurrent graph neural network in which three graph convolutional neural network layers and three recurrent neural network layers are arranged alternately;
and the outputs of the node-wise recurrent graph neural network and the edge-wise recurrent graph neural network are concatenated at the end and connected to a layer of fully connected neural network.
6. The load balancing method based on deep reinforcement learning driving as claimed in claim 1, wherein the training of the BRGCN network model in step S4 specifically includes:
acquiring a network state after the routing path is executed;
calculating rewards according to the network state after the routing path is executed;
storing the network state and the reward after the routing path is executed to an experience pool;
and selecting a group of experiences from the experience pool by adopting a random continuous sampling strategy to calculate action values, and updating the model parameters using a loss function based on the action values.
7. The deep reinforcement learning drive-based load balancing method according to claim 6, wherein the reward is calculated by:
[The reward formula is presented as an image in the original publication; its variables are defined below.]
wherein lr_i denotes the utilization of the i-th link, Ave_lr denotes the mean of the current link utilizations, Cor_lr denotes the link utilization correction coefficient, α denotes the link utilization evaluation weight coefficient, Tol_ade denotes the mean change of the link average delay, Met_ade denotes the link average delay at the current time, Mle_ade denotes the link average delay at the previous time, Cor_ade denotes the link average delay correction coefficient, β denotes the link average delay evaluation weight coefficient, Tol_apl denotes the mean change of the link average packet loss rate, Met_apl denotes the link average packet loss rate at the current time, Mle_apl denotes the link average packet loss rate at the previous time, Cor_apl denotes the link average packet loss rate correction coefficient, γ denotes the link average packet loss rate evaluation weight coefficient, Tlo_al denotes the mean change of the link average load, Met_al denotes the link average load at the current time, Mle_al denotes the link average load at the previous time, Cor_al denotes the link average load correction coefficient, θ denotes the link average load evaluation weight coefficient, and L_num denotes the number of links.
8. The load balancing method based on deep reinforcement learning driving according to claim 6, wherein the action value is calculated by:
y_i = r_i + δ·Q′(s′_i, a_max(Q(s′_i, a_i, ω)); ω′)
wherein y_i denotes the action value, r_i denotes the reward, δ denotes the attenuation factor, Q denotes the Q network, Q′ denotes the target network, s′_i denotes the network state at the next time, a_i denotes the action, ω denotes the parameters of the Q network, and ω′ denotes the parameters of the target network.
9. The deep reinforcement learning drive-based load balancing method according to claim 6, wherein the action value-based loss function is specifically:
L(ω) = (1/M) Σ_{i=1}^{M} (y_i − Q(s_i, a_i, ω))²
wherein M denotes the number of samples, y_i denotes the action value, Q denotes the Q network, s_i denotes the state at the current time, a_i denotes the action, and ω denotes the parameters of the Q network.
10. The load balancing method based on deep reinforcement learning driving as claimed in claim 1, wherein the step S4 of obtaining the routing path according to the node-wise graph feature and the edge-wise graph feature by using the trained BRGCN network model specifically includes:
constructing a node-wise feature matrix by utilizing the switch feature, constructing a node-wise adjacency matrix and a node-wise degree matrix according to a node-wise graph structure, constructing an edge-wise feature matrix by utilizing the data link feature, and constructing an edge-wise adjacency matrix and an edge-wise degree matrix according to an edge-wise graph structure;
obtaining an action value table from the trained BRGCN model according to the node-wise feature matrix, the node-wise adjacency matrix, the node-wise degree matrix, the edge-wise feature matrix, the edge-wise adjacency matrix and the edge-wise degree matrix;
and selecting an action according to the action value table by a greedy strategy as the routing path of the flow.
CN202210700058.1A 2022-06-20 2022-06-20 Load balancing method based on deep reinforcement learning drive Pending CN115102906A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210700058.1A CN115102906A (en) 2022-06-20 2022-06-20 Load balancing method based on deep reinforcement learning drive

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210700058.1A CN115102906A (en) 2022-06-20 2022-06-20 Load balancing method based on deep reinforcement learning drive

Publications (1)

Publication Number Publication Date
CN115102906A true CN115102906A (en) 2022-09-23

Family

ID=83292787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210700058.1A Pending CN115102906A (en) 2022-06-20 2022-06-20 Load balancing method based on deep reinforcement learning drive

Country Status (1)

Country Link
CN (1) CN115102906A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020254924A1 (en) * 2019-06-16 2020-12-24 Way2Vat Ltd. Systems and methods for document image analysis with cardinal graph convolutional networks
US20220124543A1 (en) * 2021-06-30 2022-04-21 Oner Orhan Graph neural network and reinforcement learning techniques for connection management
CN113572697A (en) * 2021-07-20 2021-10-29 电子科技大学 Load balancing method based on graph convolution neural network and deep reinforcement learning
US20220027792A1 (en) * 2021-10-08 2022-01-27 Intel Corporation Deep neural network model design enhanced by real-time proxy evaluation feedback
CN114254738A (en) * 2021-12-16 2022-03-29 中国人民解放军战略支援部队信息工程大学 Double-layer evolvable dynamic graph convolution neural network model construction method and application
CN114567598A (en) * 2022-02-25 2022-05-31 重庆邮电大学 Load balancing method and device based on deep learning and cross-domain cooperation

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
WILE SEHERY et al.: "Load balancing in data center networks with folded-Clos architectures", 《PROCEEDINGS OF THE 2015 1ST IEEE CONFERENCE ON NETWORK SOFTWARIZATION (NETSOFT)》, 4 June 2015 (2015-06-04) *
XING LV et al.: "Traffic Network Resilience Analysis Based On The GCN-RNN Prediction Model", 《2019 INTERNATIONAL CONFERENCE ON QUALITY, RELIABILITY, RISK, MAINTENANCE, AND SAFETY ENGINEERING (QR2MSE)》, 5 March 2020 (2020-03-05) *
ZENG X et al.: "Deep reinforcement learning with graph convolutional networks for load balancing in sdn-based data center networks", 《2021 18TH INTERNATIONAL COMPUTER CONFERENCE ON WAVELET ACTIVE MEDIA TECHNOLOGY AND INFORMATION PROCESSING (ICCWAMTIP)》, 19 January 2022 (2022-01-19) *
ZHAO L et al.: "A temporal graph convolutional network for traffic prediction", 《IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS》, 22 August 2019 (2019-08-22) *
TONG Zonghe; YUAN Lining; WANG Yang: "Theory and Application of Graph Convolutional Neural Networks", Information Technology and Informatization, no. 02, 28 February 2020 (2020-02-28) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116155819A (en) * 2023-04-20 2023-05-23 北京邮电大学 Method and device for balancing load in intelligent network based on programmable data plane
CN116155819B (en) * 2023-04-20 2023-07-14 北京邮电大学 Method and device for balancing load in intelligent network based on programmable data plane

Similar Documents

Publication Publication Date Title
CN113572697B (en) Load balancing method based on graph convolution neural network and deep reinforcement learning
CN109635989B (en) Social network link prediction method based on multi-source heterogeneous data fusion
US9672465B2 (en) Solving vehicle routing problems using evolutionary computing techniques
US7537523B2 (en) Dynamic player groups for interest management in multi-character virtual environments
JPH0693680B2 (en) Route selection method in data communication network
CN110619082B (en) Project recommendation method based on repeated search mechanism
CN113194034A (en) Route optimization method and system based on graph neural network and deep reinforcement learning
JP2013190218A (en) Route search method, route search device, and program
Liu et al. Effects of information heterogeneity in Bayesian routing games
CN115529316A (en) Micro-service deployment method based on cloud computing center network architecture
CN115102906A (en) Load balancing method based on deep reinforcement learning drive
CN105634974A (en) Route determining method and apparatus in software-defined networking
CN116310667B (en) Self-supervision visual characterization learning method combining contrast loss and reconstruction loss
Du et al. GAQ-EBkSP: a DRL-based urban traffic dynamic rerouting framework using fog-cloud architecture
CN112566093A (en) Terminal relation identification method and device, computer equipment and storage medium
CN113886073A (en) Edge data processing method, system, device and medium
CN111191778B (en) Deep learning network processing method, device and compiler
CN111770152B (en) Edge data management method, medium, edge server and system
CN113726692B (en) Virtual network mapping method and device based on generation of countermeasure network
CN115865607A (en) Distributed training computing node management method and related device
CN114422453B (en) Method, device and storage medium for online planning of time-sensitive stream
CN113905066B (en) Networking method of Internet of things, networking device of Internet of things and electronic equipment
CN109543725A (en) A kind of method and device obtaining model parameter
CN106600062A (en) Method for calculating the shortest path of single source in multi-region-cross complex network diagram
Araújo et al. Lagrangian relaxation for maximum service in multicast routing with QoS constraints

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination