CN111526055A - Route planning method and device and electronic equipment - Google Patents

Route planning method and device and electronic equipment Download PDF

Info

Publication number
CN111526055A
CN111526055A (application CN202010330122.2A)
Authority
CN
China
Prior art keywords
router
target
target router
action
destination address
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010330122.2A
Other languages
Chinese (zh)
Inventor
姚海鹏 (Yao Haipeng)
袁鑫 (Yuan Xin)
买天乐 (Mai Tianle)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202010330122.2A
Publication of CN111526055A
Legal status: Pending

Links

Images

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 45/00: Routing or path finding of packets in data switching networks
    • H04L 45/02: Topology update or discovery
    • H04L 45/08: Learning-based routing, e.g. using neural networks or artificial intelligence
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/08: Configuration management of networks or network elements
    • H04L 41/0803: Configuration setting
    • H04L 41/0823: Configuration setting characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability
    • H04L 41/0836: Configuration setting to enhance reliability, e.g. reduce downtime
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/14: Network analysis or design
    • H04L 41/145: Network analysis or design involving simulating, designing, planning or modelling of a network
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 45/00: Routing or path finding of packets in data switching networks
    • H04L 45/74: Address processing for routing
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 47/00: Traffic control in data switching networks
    • H04L 47/10: Flow control; Congestion control
    • H04L 47/12: Avoiding congestion; Recovering from congestion

Abstract

The invention provides a route planning method, a route planning device, and electronic equipment. The method comprises the following steps: acquiring a destination address route of a target router and the adjacent routers of the target router; inputting the destination address route of the target router and the adjacent routers of the target router into a pre-trained deep learning model, and obtaining, based on the pre-trained deep learning model, a metric value corresponding to each executable action of the target router, wherein the executable actions comprise a next-hop router of the target router and/or the respective paths from the target router to the destination address route; and determining a target execution action of the target router based on the metric values. The invention improves the reliability of route planning.

Description

Route planning method and device and electronic equipment
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a route planning method and apparatus, and an electronic device.
Background
With the rapid development of the Internet, the number of users has grown quickly and new network applications keep emerging, so network traffic has increased sharply. The resulting network congestion has become a bottleneck restricting network development and applications; information congestion is a main factor affecting network service quality, so effectively solving the congestion problem is of great significance for improving network performance. SDN (Software-Defined Networking), as a novel network architecture, separates forwarding from control, and its centralized control brings great convenience to network management. How to find a suitable forwarding path for a data packet, and how to fully and efficiently utilize each data link in the SDN, are hot topics of current research. However, the existing route planning technology calculates the metric value based on a reinforcement learning algorithm, which is difficult to apply to high-dimensional and continuous state spaces. Therefore, the existing route planning method suffers from low reliability.
Disclosure of Invention
The embodiment of the invention aims to provide a route planning method, a route planning device and electronic equipment, which can improve the reliability of route planning.
In a first aspect, an embodiment of the present invention provides a route planning method, including: acquiring a destination address route of a target router and the adjacent routers of the target router; inputting the destination address route of the target router and the adjacent routers of the target router into a pre-trained deep learning model, and obtaining, based on the pre-trained deep learning model, a metric value corresponding to each executable action of the target router, wherein the executable action comprises a next hop router of the target router and/or respective paths of the target router to the destination address route; and determining a target execution action for the target router based on the metric value.
In an alternative embodiment, the deep learning model comprises a Seq2Seq model; the executable action comprises a next hop router of the target router; the step of obtaining a metric value corresponding to each executable action of the target router based on the deep learning model obtained by the pre-training comprises the following steps: taking each adjacent router of the target router as a next hop router of the target router; and determining a metric value generated from the target router to each next-hop router based on the Seq2Seq model obtained by pre-training.
In an alternative embodiment, the executable action includes respective paths of the destination router to the destination address route; the step of obtaining a metric value corresponding to each executable action of the target router based on the deep learning model obtained by the pre-training comprises the following steps: and determining the metric values generated by all paths from the target router to the destination address route based on the Seq2Seq model obtained by pre-training.
In an alternative embodiment, the training process of the Seq2Seq model includes: inputting a target training sample into a Seq2Seq model, and carrying out iterative training on the Seq2Seq model based on the target training sample until the training is finished to obtain the trained Seq2Seq model; the target training samples comprise samples marked with metric values generated from the target router to each next-hop router and/or samples marked with metric values generated from each path from the target router to the destination address route, and the metric values marked by the target training samples are environment rewards to actions obtained in advance based on a reinforcement learning algorithm.
In an optional embodiment, the step of determining the target execution action of the target router based on the metric value includes: determining a corresponding target execution action when the metric value is maximum based on a preset first greedy strategy equation, wherein the preset first greedy strategy equation is as follows:
$$a_t = \arg\max_{a_t} Q(s_t, a_t)$$
wherein $Q(s_t, a_t)$ is the metric value, $a_t$ is the target execution action, and $s_t$ is the current network state of the target router.
In an optional embodiment, the step of determining the target execution action of the target router based on the metric value includes: determining a corresponding target execution action when the metric value is maximum based on a preset second greedy strategy equation, wherein the preset second greedy strategy equation is as follows:
$$\pi(a_t \mid s_t) = \frac{\exp\big(Q(s_t, a_t)/\tau_n\big)}{\sum_{a} \exp\big(Q(s_t, a)/\tau_n\big)}$$
wherein $\tau_n$ is a temperature parameter.
In an alternative embodiment, the calculation of the temperature parameter is as follows:
[Equation image not reproduced in the source text: the formula computing the temperature parameter $\tau_n$ from $num_n$, $T$, $\tau_0$ and $\tau_T$.]
wherein $num_n$ is the number of dynamic flows in the period $(\tau_n, \tau_{n-1}]$, $T$ is the time to achieve convergence, and $\tau_0$ and $\tau_T$ are the initial and final values, respectively.
In a second aspect, an embodiment of the present invention provides a route planning apparatus, including: a state acquisition module, configured to acquire a destination address route of a target router and the adjacent routers of the target router; a metric value determining module, configured to input the destination address route of the target router and the adjacent routers of the target router into a pre-trained deep learning model, and obtain, based on the pre-trained deep learning model, a metric value corresponding to each executable action of the target router, wherein the executable action comprises a next hop router of the target router and/or respective paths of the target router to the destination address route; and an action determining module, configured to determine a target execution action of the target router based on the metric value.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, where the memory stores a computer program operable on the processor, and the processor executes the computer program to implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer-readable medium, wherein the computer-readable medium stores computer-executable instructions that, when invoked and executed by a processor, cause the processor to implement the method of the first aspect.
The embodiments of the invention provide a route planning method, a route planning device, and electronic equipment. In the method, the destination address route of a target router and the adjacent routers of the target router are acquired first; the destination address route of the target router and the adjacent routers of the target router are then input into a pre-trained deep learning model, and the metric values corresponding to each executable action of the target router (including a next-hop router of the target router and/or each path from the target router to the destination address route) are obtained based on the pre-trained deep learning model; finally, the target execution action of the target router is determined based on the metric values. By inputting the destination address route of the target router to be planned and the adjacent routers of the target router into the deep learning model, the method obtains the metric values corresponding to all executable actions of the target router, can be applied to the novel SDN network architecture, overcomes the difficulty that a traditional reinforcement learning algorithm is hard to apply to high-dimensional and continuous state spaces, and improves the reliability of route planning.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a route planning method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of Seq2Seq model identification according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a route planning apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In view of the problem that the existing route planning method is low in reliability, embodiments of the present invention provide a route planning method, a device, and an electronic device, which can be applied to improve the reliability of route planning, and the embodiments of the present invention are described in detail below.
An embodiment of the present invention provides a route planning method, which may be executed by an electronic device such as a mobile terminal or a computer, and mainly includes the following steps S102 to S106, referring to a flow chart of the route planning method shown in fig. 1:
step S102, the destination address route of the target router and the adjacent router of the target router are obtained.
The target router may be any router in the SDN network architecture that needs to plan a packet forwarding path. The destination address route is the packet forwarding destination of the target router, that is, the address to which the target router forwards the packet. The adjacent routers include all neighboring routers of the target router.
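As an illustrative aside (not part of the patent text), the state gathered in step S102 can be sketched as a small data structure; the field names below are assumptions chosen for readability, not terms from the patent:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RoutingState:
    """State gathered in step S102 (field names are illustrative assumptions)."""
    source_router: int                # router the packet arrived from
    target_router: int                # router currently planning the forwarding
    destination: int                  # destination address route
    neighbors: List[int] = field(default_factory=list)  # all adjacent routers
```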
Step S104, inputting the destination address route of the target router and the adjacent routers of the target router into the pre-trained deep learning model, and obtaining, based on the pre-trained deep learning model, the metric values corresponding to each executable action of the target router.
The executable actions include a next-hop router of the target router and/or the respective paths from the target router to the destination address route. The metric value may also be referred to as a reward, i.e., the reward fed back by the environment after the state transition. In another embodiment, the current state of the target router may also be input into the pre-trained deep learning model, where the current state of the target router includes the source router, the target router, and the destination address route; the deep learning model then obtains, from the source router, the target router, and the destination address route, the reward corresponding to each packet forwarding path from the target router to the destination address route. The reward corresponding to each packet forwarding path from the target router to the destination address route may also be obtained in advance with a reinforcement learning algorithm, from the source router, the target router, and the destination address route.
Step S106, determining the target execution action of the target router based on the metric value.
Since each executable action of the target router has a corresponding metric value, the executable action with the maximum metric value can be taken as the target execution action of the target router. In this way, after the target router forwards to the next-hop router, or forwards along the path to the destination address route, according to the target execution action, it obtains the maximum metric value, i.e., the best reward fed back by the environment.
According to the route planning method provided by this embodiment, the destination address route of the target router to be planned and the adjacent routers of the target router are input into the deep learning model, so that the metric values corresponding to each executable action of the target router can be obtained. The method can be applied to the novel SDN network architecture, overcomes the difficulty that a traditional reinforcement learning algorithm is hard to apply to high-dimensional and continuous state spaces, and improves the reliability of route planning.
In order to accurately obtain the metric values corresponding to the executable actions of the target router, this embodiment provides an implementation manner of obtaining the metric values corresponding to the executable actions of the target router based on a deep learning model obtained by training in advance, and the following steps (1) to (2) may be specifically referred to for execution:
step (1): and taking each adjacent router of the target router as the next hop router of the target router.
The deep learning model includes a Seq2Seq model. The Seq2Seq model is an encoder (Encode)-decoder (Decode) architecture that generates state-action values as output from the input current state. When the executable actions include a next-hop router of the target router, each adjacent router of the target router may be treated as a candidate next-hop router in order to determine the next-hop router of the target router, except that the source router cannot be treated as the next-hop router, in order to prevent the target router from jumping back (i.e., returning the packet to the source router).
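A minimal sketch of this candidate-selection rule, reusing the hypothetical `RoutingState` structure introduced above; the only logic is the stated exclusion of the source router:

```python
from typing import List

def candidate_next_hops(state: RoutingState) -> List[int]:
    """Each adjacent router is a candidate next hop, except the source
    router, which is excluded to prevent the packet from jumping back."""
    return [r for r in state.neighbors if r != state.source_router]
```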
Step (2): and determining the weighing values generated from the target router to each next-hop router based on a Seq2Seq model obtained by pre-training.
After the current state of the target router (the source router, the target router, and the destination address route) is input into the Seq2Seq model (see the Seq2Seq model diagram in fig. 2), the vector of each router is obtained from the routers' mapping matrix $M_{n \times m}$; the mapping matrix $M_{n \times m}$ itself is given as an equation image that is not reproduced in the source text. For example, the vectors h1, h2, and h3 of the source router, the target router, and the destination address route are obtained from $M_{n \times m}$; the current state vector c of the target router is generated from h1, h2, and h3; and from the current state vector c together with each adjacent router of the target router (via an equation image not reproduced in the source text), $y_1$ can be obtained. $y_1$ serves as the metric value (reward) $Q(s_t, a_t)$ generated from the target router to each next-hop router. The Seq2Seq model is obtained by training on samples labeled with the metric values generated from the target router to each next-hop router.
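Because the equation images for the mapping matrix and the output $y_1$ are not reproduced in the text, the sketch below only mirrors the described data flow (embedding lookup, state vector c, one score per candidate next hop) with assumed combination and scoring functions; it is not the patent's actual network:

```python
import numpy as np

n_routers, emb_dim = 16, 8
rng = np.random.default_rng(0)
M = rng.normal(size=(n_routers, emb_dim))       # mapping matrix M_{n x m}
W_c = rng.normal(size=(3 * emb_dim, emb_dim))   # combines h1, h2, h3 into c (assumed form)
w_q = rng.normal(size=(2 * emb_dim,))           # scores (c, neighbor) pairs (assumed form)

def q_values(state: RoutingState) -> dict:
    """Metric value Q(s_t, a) for every candidate next hop a."""
    h1, h2, h3 = M[state.source_router], M[state.target_router], M[state.destination]
    c = np.tanh(np.concatenate([h1, h2, h3]) @ W_c)    # current state vector c
    return {a: float(np.concatenate([c, M[a]]) @ w_q)  # y_1, i.e. Q(s_t, a)
            for a in candidate_next_hops(state)}
```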
In another embodiment, the executable actions include the respective paths from the target router to the destination address route, and the metric value generated by each path from the target router to the destination address route can likewise be determined based on the pre-trained Seq2Seq model. In this embodiment, from the current state vector c of the target router and each adjacent router of the target router (via an equation image not reproduced in the source text), the Seq2Seq model obtains the metric value (reward) $Q(s_t, a_t)$ generated by each packet forwarding path from the target router to the destination address route. Here the Seq2Seq model is obtained by training on samples labeled with the metric values generated by each path from the target router to the destination address route.
In a specific embodiment, the training process of the Seq2Seq model includes:
inputting target training samples into the Seq2Seq model, and performing iterative training on the Seq2Seq model based on the target training samples until the training is finished, to obtain the trained Seq2Seq model. The target training samples comprise samples labeled with the metric values generated from the target router to each next-hop router and/or samples labeled with the metric values generated by each path from the target router to the destination address route; the metric values labeling the target training samples are the environment's rewards for the actions (the executable actions of the target router), obtained in advance based on a reinforcement learning algorithm. The criterion for ending the training of the Seq2Seq model may be that the number of training iterations reaches a preset number.
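A hedged PyTorch sketch of this supervised training step follows: predicted metric values are regressed onto the RL-derived reward labels. The small feed-forward head stands in for the full Seq2Seq encoder-decoder, whose exact architecture the text does not spell out; all names here are illustrative assumptions. A typical optimizer would be `torch.optim.Adam(model.parameters(), lr=1e-3)`.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Stand-in for the Seq2Seq Q estimator (architecture assumed)."""
    def __init__(self, n_routers: int, emb_dim: int = 8):
        super().__init__()
        self.emb = nn.Embedding(n_routers, emb_dim)   # plays the role of M_{n x m}
        self.head = nn.Sequential(nn.Linear(4 * emb_dim, 32),
                                  nn.ReLU(), nn.Linear(32, 1))

    def forward(self, src, tgt, dst, nxt):
        h = torch.cat([self.emb(src), self.emb(tgt),
                       self.emb(dst), self.emb(nxt)], dim=-1)
        return self.head(h).squeeze(-1)               # predicted Q(s_t, a_t)

def train_step(model, optimizer, batch):
    """One iteration: batch = (src, tgt, dst, nxt, reward) tensors, where
    reward is the environment feedback labeled via reinforcement learning."""
    src, tgt, dst, nxt, reward = batch
    loss = nn.functional.mse_loss(model(src, tgt, dst, nxt), reward)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```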
In order to obtain the target execution action of the target router, this embodiment provides a specific implementation manner for determining the target execution action of the target router based on the metric value, and the following manner one and manner two may be specifically referred to for execution:
the first method is as follows: and determining a corresponding target execution action when the weighing value is maximum based on a preset first greedy strategy equation, wherein the preset first greedy strategy equation is as follows:
$$a_t = \arg\max_{a_t} Q(s_t, a_t)$$
wherein $Q(s_t, a_t)$ is the metric value, $a_t$ is the target execution action, and $s_t$ is the current network state of the target router (i.e., the current state of the target router), which includes the source router, the target router, and the destination address route. $a_t$ may be selected as the next hop that the target router will pass through on the way to the destination address route, and $Q(s_t, a_t)$ is the reward fed back by the environment in the current state of the target router. For the current network state of the target router, the preset first greedy strategy equation is used to obtain the target execution action $a_t$ corresponding to the maximum metric value $Q(s_t, a_t)$. The maximum reward fed back by the environment may be obtained when the target router forwards the packet according to the target execution action.
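Given the per-action metric values (e.g. the dictionary returned by the hypothetical `q_values` sketch above), manner one reduces to a one-line argmax:

```python
def greedy_action(q: dict) -> int:
    """Target execution action: the candidate with the maximum metric value."""
    return max(q, key=q.get)
```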
Manner two: determining the corresponding target execution action when the metric value is maximum, based on a preset second greedy strategy equation, wherein the preset second greedy strategy equation is as follows:
$$\pi(a_t \mid s_t) = \frac{\exp\big(Q(s_t, a_t)/\tau_n\big)}{\sum_{a} \exp\big(Q(s_t, a)/\tau_n\big)}$$
wherein $\tau_n$ is a temperature parameter, calculated as follows:
[Equation image not reproduced in the source text: the formula computing the temperature parameter $\tau_n$ from $num_n$, $T$, $\tau_0$ and $\tau_T$.]
wherein $num_n$ is the number of dynamic flows in the period $(\tau_n, \tau_{n-1}]$, $T$ is the time to achieve convergence, and $\tau_0$ and $\tau_T$ are the initial and final values, respectively. In order to let the routing simultaneously learn the dynamics of the current network and retain the use of already-learned knowledge, the greedy strategy is improved by adding the temperature parameter $\tau_n$. The temperature parameter $\tau_n$ is a time-varying parameter used to balance the exploration of unknown actions against the exploitation of the existing policy: when $\tau_n$ is large, the probabilities of taking each action are almost the same; as $\tau_n$ keeps decreasing toward small values, the selection of the target execution action gradually approaches a greedy strategy. The temperature parameter thus allows more possible next-hop routes to be explored when the network is highly dynamic and, once many network conditions have been learned and the network tends to be static, lets a better next-hop route be chosen according to prior experience. The temperature parameter $\tau_n$ fluctuates with the arrival and departure of flows in the network: the more the flows fluctuate, the more new alternative routes need to be explored; the smaller the fluctuation, the better the benefit brought by taking the best known target execution action (routing policy).
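A sketch of manner two, under the assumption that the second greedy strategy equation is the standard Boltzmann (softmax) policy; the temperature schedule itself is given only as an unreproduced equation image, so $\tau_n$ is passed in as a precomputed value:

```python
import numpy as np

def boltzmann_action(q: dict, tau: float, rng=np.random.default_rng()) -> int:
    """Sample an action from the softmax distribution over metric values.
    Large tau: near-uniform exploration; small tau: near-greedy exploitation."""
    actions = list(q)
    scores = np.array([q[a] for a in actions]) / tau
    scores -= scores.max()                      # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return int(rng.choice(actions, p=probs))
```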
In the route planning method provided by this embodiment, on the one hand, the Q value in the conventional Q-learning algorithm is estimated with the deep learning Seq2Seq model, which solves the problem that the conventional Q-learning algorithm is difficult to apply to high-dimensional and continuous state spaces; compared with the Q-learning algorithm, computing the Q values of different states with the deep learning Seq2Seq model also avoids the inconvenience of storing the Q values in a Q table, as Q-learning does. On the other hand, when the target execution action is determined, the temperature parameter is introduced and the fluctuation of network flows is taken into account, so the resulting route planning strategy can bring better network benefits.
Corresponding to the above route planning method, this embodiment provides a route planning device, referring to the schematic structural diagram of the route planning device shown in fig. 3, the device includes:
and a state obtaining module 31, configured to obtain a destination address route of the target router and a router adjacent to the target router.
A metric value determining module 32, configured to input the destination address route of the target router and the neighboring routers of the target router into a pre-trained deep learning model, and obtain, based on the pre-trained deep learning model, a metric value corresponding to each executable action of the target router; wherein the executable action includes a next hop router of the target router and/or respective paths of the target router to the destination address route.
And an action determining module 33, configured to determine a target execution action of the target router based on the metric value.
According to the route planning device provided by this embodiment, the destination address route of the target router to be planned and the adjacent routers of the target router are input into the deep learning model, so that the metric values corresponding to each executable action of the target router can be obtained. The device can be applied to the novel SDN (Software-Defined Network) architecture, overcomes the difficulty that a traditional reinforcement learning algorithm is hard to apply to high-dimensional and continuous state spaces, and improves the reliability of route planning.
In one embodiment, the deep learning model includes a Seq2Seq model, and the executable action includes a next-hop router of the target router; the metric value determining module 32 is further configured to use each neighboring router of the target router as a next-hop router of the target router, and to determine the metric values generated from the target router to each next-hop router based on the pre-trained Seq2Seq model.
In one embodiment, the executable actions include respective paths from the target router to the destination address route; the metric value determining module 32 is further configured to determine, based on a pre-trained Seq2Seq model, metric values generated by paths from the target router to the destination address route.
In one embodiment, the training process of the Seq2Seq model includes: inputting a target training sample into the Seq2Seq model, and performing iterative training on the Seq2Seq model based on the target training sample until the training is finished to obtain the trained Seq2Seq model; the target training samples comprise samples marked with metric values generated from the target router to each next-hop router and/or samples marked with metric values generated from each path from the target router to the destination address route, and the metric values marked by the target training samples are environment rewards to actions obtained in advance based on a reinforcement learning algorithm.
In an embodiment, the action determining module 33 is further configured to determine, based on a preset first greedy policy equation, a corresponding target execution action when the metric value is maximum, where the preset first greedy policy equation is:
$$a_t = \arg\max_{a_t} Q(s_t, a_t)$$
wherein $Q(s_t, a_t)$ is the metric value, $a_t$ is the target execution action, and $s_t$ is the current network state of the target router.
In an embodiment, the action determining module 33 is further configured to determine, based on a preset second greedy policy equation, a corresponding target execution action when the metric value is maximum, where the preset second greedy policy equation is:
$$\pi(a_t \mid s_t) = \frac{\exp\big(Q(s_t, a_t)/\tau_n\big)}{\sum_{a} \exp\big(Q(s_t, a)/\tau_n\big)}$$
wherein $\tau_n$ is a temperature parameter.
In one embodiment, the calculation formula of the temperature parameter is:
[Equation image not reproduced in the source text: the formula computing the temperature parameter $\tau_n$ from $num_n$, $T$, $\tau_0$ and $\tau_T$.]
wherein $num_n$ is the number of dynamic flows in the period $(\tau_n, \tau_{n-1}]$, $T$ is the time to achieve convergence, and $\tau_0$ and $\tau_T$ are the initial and final values, respectively.
On the one hand, the route planning apparatus provided in this embodiment estimates the Q value in the conventional Q-learning algorithm with the deep learning Seq2Seq model, solving the problem that the conventional Q-learning algorithm is difficult to apply to high-dimensional and continuous state spaces; compared with the Q-learning algorithm, computing the Q values of different states with the deep learning Seq2Seq model avoids the inconvenience of storing the Q values in a Q table. On the other hand, when the target execution action is determined, the temperature parameter is introduced and the fluctuation of network flows is taken into account, so the resulting route planning strategy can bring better network benefits.
The device provided by the embodiment has the same implementation principle and technical effect as the foregoing embodiment, and for the sake of brief description, reference may be made to the corresponding contents in the foregoing method embodiment for the portion of the embodiment of the device that is not mentioned.
An embodiment of the present invention provides an electronic device, as shown in a schematic structural diagram of the electronic device shown in fig. 4, where the electronic device includes a processor 41 and a memory 42, where a computer program operable on the processor is stored in the memory, and when the processor executes the computer program, the steps of the method provided in the foregoing embodiment are implemented.
Referring to fig. 4, the electronic device further includes: a bus 44 and a communication interface 43, and the processor 41, the communication interface 43 and the memory 42 are connected by the bus 44. The processor 41 is arranged to execute executable modules, such as computer programs, stored in the memory 42.
The Memory 42 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 43 (which may be wired or wireless), using the Internet, a wide area network, a local area network, a metropolitan area network, etc.
The bus 44 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 4, but that does not indicate only one bus or one type of bus.
The memory 42 is configured to store a program, and the processor 41 executes the program after receiving an execution instruction, and the method executed by the apparatus defined by the flow process disclosed in any of the foregoing embodiments of the present invention may be applied to the processor 41, or implemented by the processor 41.
The processor 41 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware or by instructions in the form of software in the processor 41. The processor 41 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like. It may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The various methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 42, and the processor 41 reads the information in the memory 42 and completes the steps of the above method in combination with its hardware.
Embodiments of the present invention provide a computer-readable medium, wherein the computer-readable medium stores computer-executable instructions, which, when invoked and executed by a processor, cause the processor to implement the method of the above-mentioned embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A route planning method, comprising:
acquiring a destination address route of a target router and an adjacent router of the target router;
inputting the destination address route of the target router and the adjacent routers of the target router into a pre-trained deep learning model, and obtaining, based on the pre-trained deep learning model, a metric value corresponding to each executable action of the target router; wherein the executable action comprises a next hop router of the target router and/or respective paths of the target router to the destination address route;
determining a target execution action for the target router based on the metric value.
2. The method of claim 1, wherein the deep learning model comprises a Seq2Seq model; the executable action comprises a next hop router of the target router;
the step of obtaining a metric value corresponding to each executable action of the target router based on the deep learning model obtained by the pre-training comprises the following steps:
taking each adjacent router of the target router as a next hop router of the target router;
and determining a metric value generated from the target router to each next-hop router based on the Seq2Seq model obtained by pre-training.
3. The method of claim 2, wherein the executable actions include respective paths of the target router to the destination address route;
the step of obtaining the metric values corresponding to each executable action of the target router based on the deep learning model obtained by pre-training comprises the following steps:
and determining the metric values generated by all paths from the target router to the destination address route based on the Seq2Seq model obtained by pre-training.
4. The method of claim 3, wherein the training process of the Seq2Seq model comprises:
inputting a target training sample into a Seq2Seq model, and carrying out iterative training on the Seq2Seq model based on the target training sample until the training is finished to obtain the trained Seq2Seq model; the target training samples comprise samples marked with metric values generated from the target router to each next-hop router and/or samples marked with metric values generated from each path from the target router to the destination address route, and the metric values marked by the target training samples are environment rewards to actions obtained in advance based on a reinforcement learning algorithm.
5. The method of claim 1, wherein the step of determining the target execution action of the target router based on the metric value comprises:
determining a corresponding target execution action when the metric value is maximum based on a preset first greedy strategy equation, wherein the preset first greedy strategy equation is as follows:
$$a_t = \arg\max_{a_t} Q(s_t, a_t)$$
wherein $Q(s_t, a_t)$ is the metric value, $a_t$ is the target execution action, and $s_t$ is the current network state of the target router.
6. The method of claim 5, wherein the step of determining the target execution action of the target router based on the metric value comprises:
determining a corresponding target execution action when the metric value is maximum based on a preset second greedy strategy equation, wherein the preset second greedy strategy equation is as follows:
$$\pi(a_t \mid s_t) = \frac{\exp\big(Q(s_t, a_t)/\tau_n\big)}{\sum_{a} \exp\big(Q(s_t, a)/\tau_n\big)}$$
wherein $\tau_n$ is a temperature parameter.
7. The method of claim 6, wherein the temperature parameter is calculated by:
[Equation image not reproduced in the source text: the formula computing the temperature parameter $\tau_n$ from $num_n$, $T$, $\tau_0$ and $\tau_T$.]
wherein $num_n$ is the number of dynamic flows in the period $(\tau_n, \tau_{n-1}]$, $T$ is the time to achieve convergence, and $\tau_0$ and $\tau_T$ are the initial and final values, respectively.
8. A route planning apparatus, comprising:
the state acquisition module is used for acquiring a destination address route of a target router and an adjacent router of the target router;
the metric value determining module is used for inputting a destination address route of the target router and an adjacent router of the target router into a pre-trained deep learning model, and obtaining a metric value corresponding to each executable action of the target router based on the pre-trained deep learning model; wherein the executable action comprises a next hop router of the target router and/or respective paths of the target router to the destination address route;
and the action determining module is used for determining a target execution action of the target router based on the metric value.
9. An electronic device comprising a memory and a processor, wherein the memory stores a computer program operable on the processor, and wherein the processor implements the method of any of claims 1-7 when executing the computer program.
10. A computer-readable medium having stored thereon computer-executable instructions that, when invoked and executed by a processor, cause the processor to implement the method of any one of claims 1-7.
CN202010330122.2A 2020-04-23 2020-04-23 Route planning method and device and electronic equipment Pending CN111526055A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010330122.2A CN111526055A (en) 2020-04-23 2020-04-23 Route planning method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010330122.2A CN111526055A (en) 2020-04-23 2020-04-23 Route planning method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN111526055A (en) 2020-08-11

Family

ID=71904914

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010330122.2A Pending CN111526055A (en) 2020-04-23 2020-04-23 Route planning method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111526055A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117033005A (en) * 2023-10-07 2023-11-10 之江实验室 Deadlock-free routing method and device, storage medium and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108900419A (en) * 2018-08-17 2018-11-27 北京邮电大学 Route decision method and device based on deeply study under SDN framework
CN110890985A (en) * 2019-11-27 2020-03-17 北京邮电大学 Virtual network mapping method and model training method and device thereof

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108900419A (en) * 2018-08-17 2018-11-27 北京邮电大学 Route decision method and device based on deeply study under SDN framework
CN110890985A (en) * 2019-11-27 2020-03-17 北京邮电大学 Virtual network mapping method and model training method and device thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JUSTIN A. BOYAN: "Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach", Proceedings of the 6th International Conference on Neural Information Processing Systems *
买天乐 (Mai Tianle): "Research on Routing Scheduling Mechanisms Based on Deep Reinforcement Learning" (基于深度强化学习的路由调度机制研究), China Masters' Theses Full-text Database, Information Science and Technology *
刘辰屹 (Liu Chenyi) et al.: "A Survey of Machine-Learning-Based Intelligent Routing Algorithms" (基于机器学习的智能路由算法综述), Journal of Computer Research and Development (计算机研究与发展) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117033005A (en) * 2023-10-07 2023-11-10 之江实验室 Deadlock-free routing method and device, storage medium and electronic equipment
CN117033005B (en) * 2023-10-07 2024-01-26 之江实验室 Deadlock-free routing method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN108900419B (en) Routing decision method and device based on deep reinforcement learning under SDN framework
Talebi et al. Stochastic online shortest path routing: The value of feedback
EP2453612B1 (en) Bus control device
Wei et al. TRUST: A TCP throughput prediction method in mobile networks
WO2021052379A1 (en) Data stream type identification method and related devices
Lücking et al. Which is the worst-case Nash equilibrium?
CN108028805A (en) A kind of system and method for control flow equalization in band in software defined network
US20180324082A1 (en) Weight setting using inverse optimization
US20220366280A1 (en) Generating confidence scores for machine learning model predictions
CN113612692B (en) Centralized optical on-chip network self-adaptive route planning method based on DQN algorithm
CN113572697A (en) Load balancing method based on graph convolution neural network and deep reinforcement learning
CN111526055A (en) Route planning method and device and electronic equipment
Sivakumar et al. Prediction of traffic load in wireless network using time series model
Danielis et al. Dynamic flow migration for delay constrained traffic in software-defined networks
Suzuki et al. Multi-agent deep reinforcement learning for cooperative computing offloading and route optimization in multi cloud-edge networks
CN111340192A (en) Network path allocation model training method, path allocation method and device
CN114422453B (en) Method, device and storage medium for online planning of time-sensitive stream
CN108093083B (en) Cloud manufacturing task scheduling method and device and terminal
Zhang et al. Network performance reliability evaluation based on network reduction
JP2003037649A (en) Method for estimating contents distribution end time, recording medium and program
CN107710701A (en) Constraint separation path computing
CN111917657B (en) Method and device for determining flow transmission strategy
RU2757781C1 (en) Method for stable data routing in a virtual communication network
CN113839794B (en) Data processing method, device, equipment and storage medium
CN111083051B (en) Path planning method and device based on multiple intelligent agents and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200811