CN113055203A

CN113055203A - Method and device for recovering abnormity of SDN control plane

Info

Publication number: CN113055203A
Application number: CN201911370291.2A
Authority: CN
Inventors: 秦可; 刁拥浩; 高莉
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Chongqing Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Chongqing Co Ltd
Priority date: 2019-12-26
Filing date: 2019-12-26
Publication date: 2021-06-29
Anticipated expiration: 2039-12-26
Also published as: CN113055203B

Abstract

The invention discloses an anomaly recovery method and device for an SDN control plane. The method comprises: acquiring a first address of each SDN switch in a data plane of the SDN network and a second address of a server directly connected with each SDN switch; sending heartbeat detection requests to the servers according to the second addresses and judging whether there is at least one server that is directly connected with an SDN switch and does not return heartbeat information; if so, and the abnormality is not caused by a physical network element fault, updating a historical abnormal node information table according to the request results of the heartbeat detection requests returned by the servers; matching an anomaly recovery strategy according to the abnormal features of all abnormal servers in the updated abnormal node information table, the abnormal features comprising the first address of the SDN switch directly connected with the abnormal server and the abnormal reporting time of the abnormal server; and executing anomaly recovery processing on the SDN control plane by using the anomaly recovery strategy. The scheme of the invention can therefore effectively recover from abnormalities in the policy orchestration or scheduling of the SDN control plane.

Description

Method and device for recovering abnormity of SDN control plane

Technical Field

The invention relates to the technical field of cloud computing virtual networks, in particular to an abnormal recovery method and device for an SDN control plane.

Background

A Software Defined Network (SDN) is a network virtualization architecture that has emerged in recent years and is mainly used for cloud resource pool networking. Its core idea is to separate the control plane from the data plane of network devices, thereby enabling flexible control of network traffic.

The core of the SDN control plane is composed of one or more SDN controllers, which are the brains of the SDN network. On the one hand, the SDN controller performs centralized management, state monitoring and forwarding decisions for the underlying network switching equipment through a southbound interface protocol, so as to process and schedule the traffic of the data plane. Policy making is one of the core technologies on the southbound side. The switch flow table generation algorithm is a key factor influencing the intelligence level of the SDN controller: the controller needs to formulate a corresponding forwarding strategy and generate corresponding flow table entries according to the transmission requirements of different layers. On the other hand, the SDN controller opens multiple levels of programmability to upper-layer applications through a northbound interface, allowing network users to flexibly formulate various network policies according to specific application scenarios.

The core of the SDN data plane consists of a plurality of SDN switches, which may be physical switches or virtual switches. They are the executors of SDN policies and are mainly responsible for data processing, forwarding and state collection.

The SDN network is the most central part of a cloud resource pool, and its reliability directly determines the stability of the resource pool. High-availability solutions for the SDN network are mainly based on the following two ideas:

In the first idea, the control plane is formed by a plurality of SDN controllers networked in a dual-active or master-standby mode, which prevents the failure of a single SDN controller (or of a related physical link) from causing a global failure of the SDN network. In the second idea, multiple SDN switches are deployed, the same or different hosts are attached to different SDN switches, and host virtualization migration technology is added, which prevents the failure of a single SDN switch (or of a related physical link) from partially paralyzing the SDN network.

The above prior-art approaches all follow the high-availability design thinking of traditional networks: they strengthen the overall reliability of the SDN network through network element redundancy, which improves the stability of the cloud resource pool in scenarios such as the unavailability of a single SDN component or the interruption of a related physical link. These schemes do improve cloud resource pool stability to a certain extent, but high-availability design based on traditional networks still has certain limitations, specifically as follows:

The core of the SDN network is the orchestration and scheduling of control policies, which is also a key factor affecting the intelligence level of the controller. When the policy orchestration and scheduling of the control plane become abnormal, the SDN network faces global paralysis, and many similar major faults have occurred in the industry. In such cases the SDN components often operate normally and the physical links remain available, so existing technical schemes can hardly achieve SDN high availability in this scenario.

Disclosure of Invention

In view of the above, the present invention is proposed to provide an anomaly recovery method and apparatus for SDN control plane that overcomes or at least partially solves the above problems.

According to an aspect of the present invention, there is provided an anomaly recovery method for an SDN control plane, including:

acquiring a first address of each SDN switch in a data plane of the SDN network and a second address of a server directly connected with each SDN switch;

sending heartbeat detection requests to the servers according to the second address and judging whether at least one server which does not return heartbeat information and is directly connected to the SDN switch exists or not, if so, further judging whether a physical network element fault exists in the SDN network or not;

if no physical network element fault exists, updating a historical abnormal node information table according to a request result of the heartbeat detection request returned by each server;

matching an abnormal recovery strategy according to the abnormal features of all abnormal servers in the updated abnormal node information table, wherein the abnormal features comprise a first address of an SDN switch directly connected with the abnormal server and the abnormal reporting time of the abnormal server; and executing exception recovery processing on the SDN control plane by using the exception recovery strategy.

According to another aspect of the present invention, an apparatus for recovering an exception of an SDN control plane is provided, including:

the data node detection engine is suitable for sending heartbeat detection requests to all the servers according to the second address and judging whether at least one server which does not return heartbeat information and is directly connected to the SDN switch exists;

the data plane management module is suitable for acquiring a first address of each SDN switch in a data plane of the SDN network and a second address of a server directly connected with each SDN switch; if at least one server which is directly connected with the SDN switch and does not return heartbeat information exists, further judging whether a physical network element fault exists in the SDN network; if no physical network element fault exists, updating a historical abnormal node information table according to a request result of the heartbeat detection request returned by each server;

the control state analysis module is suitable for matching an abnormal recovery strategy according to the abnormal characteristics of all the abnormal servers in the updated abnormal node information table, wherein the abnormal characteristics comprise a first address of an SDN (software defined network) switch directly connected with the abnormal server and the abnormal reporting time of the abnormal server;

and the self-healing execution module is suitable for executing the exception recovery processing on the SDN control plane by utilizing the exception recovery strategy.

According to yet another aspect of the present invention, there is provided a computing device comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;

the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the exception recovery method of the SDN control plane.

According to a further aspect of the present invention, a computer storage medium is provided, where at least one executable instruction is stored, and the executable instruction causes a processor to perform an operation corresponding to the above-mentioned method for recovering an exception of an SDN control plane.

According to the method and the device for recovering from abnormalities of the SDN control plane provided by the invention, the first address of each SDN switch in the SDN network and the second address of the server directly connected with each SDN switch are acquired, and a heartbeat detection request is sent to each second address to determine whether the forwarding function of each SDN switch is abnormal, and thus whether the SDN network is abnormal. When an abnormality is found and is not caused by a physical network element fault, the historical abnormal node information table is updated and a recovery strategy of corresponding degree is matched according to the abnormal features of all abnormal servers in the table, so that abnormalities in the policy orchestration and scheduling of the control plane can be effectively recovered. The scheme therefore provides the SDN network with a self-healing capability when the policy orchestration and scheduling of the SDN control plane are abnormal, achieving high availability of the SDN network.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

Fig. 1 shows a flowchart of an embodiment of an anomaly recovery method for an SDN control plane according to the invention;

Fig. 2 shows an architecture diagram of an anomaly recovery device of the SDN control plane;

Fig. 3 shows a flowchart of an anomaly recovery method for an SDN control plane according to another embodiment of the invention;

Fig. 4 shows a schematic diagram of data transfer between a data synchronization driver and an SDN service management module in a specific example;

Fig. 5 shows a data synchronization diagram between a data node probe engine and a data plane management module in a specific embodiment;

Fig. 6 shows a schematic diagram of the implementation process of service-plane-level self-healing measures;

Fig. 7 shows an abnormal node information table and a resource policy issuing record of anomaly recovery in a specific example;

Fig. 8 shows a general flowchart of an anomaly recovery scheme implemented by the anomaly recovery device of the SDN control plane;

Fig. 9 shows a schematic structural diagram of an embodiment of an anomaly recovery device for an SDN control plane according to the invention;

Fig. 10 shows a schematic structural diagram of an embodiment of a computing device of the invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

Fig. 1 shows a flowchart of an embodiment of the method for recovering an exception of an SDN control plane, which is applied to an exception recovery device of an SDN control plane, the device being dedicated to detecting and recovering a policy orchestration and scheduling exception of the SDN control plane, and a specific structure of the device may be referred to in the following description of an embodiment of the device. As shown in fig. 1, the method comprises the steps of:

Step S110: acquiring a first address of each SDN switch in a data plane of the SDN network and a second address of a server directly connected with each SDN switch.

In the SDN network, each SDN switch of the data plane has at least one server directly connected with it.

Step S120: sending heartbeat detection requests to the servers according to the second addresses, judging whether the SDN network is abnormal according to the request results of the heartbeat detection requests returned by the servers, and if so, further judging whether a physical network element fault exists in the SDN network.

After the second address of a server is obtained, a heartbeat detection request can be sent to that address to check the heartbeat of the server. If the forwarding function of the SDN switch is normal, the server directly connected with it returns heartbeat information; otherwise, no heartbeat information can be returned.

The scheme of this embodiment mainly addresses abnormalities in which data forwarding of the service network fails for non-physical reasons, that is, abnormalities in the policy orchestration and scheduling of the SDN control plane. Such abnormalities are usually caused by application plane data errors, application scheduling template mismatches, controller policy-making errors, functional failures of the northbound or southbound interface, flow table operation abnormalities, and the like, and in these cases a physical network element fault has usually not occurred. On this basis, when at least one server does not return heartbeat information, it is determined that an abnormality exists in the SDN network, and it is further determined whether the abnormality is caused by a physical network element fault.
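The branching described in step S120 can be illustrated with a minimal sketch. The names `NetworkState` and `classify` are hypothetical and do not appear in the source, and how a physical network element fault is detected is not specified here, so it is passed in as a plain boolean.

```python
from enum import Enum


class NetworkState(Enum):
    NORMAL = "normal"
    PHYSICAL_FAULT = "physical_fault"
    CONTROL_PLANE_ANOMALY = "control_plane_anomaly"


def classify(heartbeats: dict[str, bool], physical_fault: bool) -> NetworkState:
    """Classify one detection round.

    heartbeats maps each server's second address to whether it returned
    heartbeat information. physical_fault is an assumed external input.
    """
    if all(heartbeats.values()):
        return NetworkState.NORMAL
    # At least one server stayed silent, so an abnormality exists.
    if physical_fault:
        return NetworkState.PHYSICAL_FAULT
    # No physical fault: attribute the abnormality to the policy
    # orchestration and scheduling of the control plane.
    return NetworkState.CONTROL_PLANE_ANOMALY
```

Only the `CONTROL_PLANE_ANOMALY` outcome would lead on to the table update of step S130.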

Step S130: if no physical network element fault exists, updating the historical abnormal node information table according to the request results of the heartbeat detection requests returned by the servers.

If no physical network element fault exists, the policy orchestration and scheduling of the SDN control plane are abnormal, and the historical abnormal node information table is updated. This table records the relevant information of servers previously determined to be abnormal (that is, servers that did not return heartbeat information). Updating means adding the relevant information of newly abnormal servers to the table and deleting from it the servers that have returned to normal.

Step S140: matching an abnormal recovery strategy according to the abnormal features of all abnormal servers in the updated abnormal node information table, wherein the abnormal features comprise a first address of an SDN switch directly connected with the abnormal server and the abnormal reporting time of the abnormal server; and executing the exception recovery processing on the SDN control plane by using the exception recovery strategy.

The second address in the exception feature may point to a unique exception server, the first address may reflect whether all exception servers are directly connected to the same SDN switch, and the exception report time may reflect the earliest time for finding an exception of the corresponding server. Based on the abnormal features, the range of the SDN switch related to the abnormality and the time for starting the abnormality can be determined, that is, the severity of the abnormality can be determined, and then an appropriate abnormality recovery strategy can be matched, so that the SDN network is subjected to abnormality recovery processing, and the normal operation of the SDN network is recovered.
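As a rough illustration of how the abnormal features yield the scope and start time of an abnormality, the following sketch derives the set of affected switches and the earliest reporting time. The type `AbnormalFeature` and the function `summarize` are hypothetical names, and the mapping from this summary to a concrete recovery strategy is deliberately left out because this passage does not define it.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AbnormalFeature:
    server_addr: str       # second address of the abnormal server
    switch_addr: str       # first address of its directly connected SDN switch
    reported_at: datetime  # abnormal reporting time


def summarize(features):
    """Return (affected switch addresses, earliest reporting time)."""
    if not features:
        return set(), None
    switches = {f.switch_addr for f in features}
    earliest = min(f.reported_at for f in features)
    return switches, earliest
```

A single element in the returned set would indicate that all abnormal servers hang off the same SDN switch, while several elements suggest a wider scope.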

According to the method for recovering from abnormalities of the SDN control plane provided by this embodiment, the first address of each SDN switch in the SDN network and the second address of the server directly connected with each SDN switch are acquired, and a heartbeat detection request is sent to each second address to determine whether the forwarding function of each SDN switch is abnormal, and thus whether the SDN network is abnormal. When an abnormality is found and is not caused by a physical network element fault, the historical abnormal node information table is updated and a recovery strategy of corresponding degree is matched according to the abnormal features of all abnormal servers in the table, so that abnormalities in the policy orchestration and scheduling of the control plane can be effectively recovered. The scheme of this embodiment therefore provides the SDN network with a self-healing capability when the policy orchestration and scheduling of the SDN control plane are abnormal, achieving high availability of the SDN network.

Before further embodiments of the anomaly recovery method for the SDN control plane are described, the structure of the SDN network and of the anomaly recovery device is introduced, so that the corresponding embodiments can then be described in detail with reference to that device. Fig. 2 shows an architecture diagram of an anomaly recovery device of the SDN control plane. As shown in Fig. 2, the dark parts represent the native functions of the SDN network (different SDN solutions differ slightly in module naming and in how functions are divided among modules), and the light parts represent the constituent modules of the SDN control plane self-healing device (hereinafter referred to as the self-healing device, that is, the anomaly recovery device of the SDN control plane; the same applies in the following figures). Communication within and among the application plane, the control plane and the data plane of the SDN network takes place over a service network; communication between each plane and the self-healing device takes place over a management network. The functions of the modules of the self-healing device are briefly described below:

Data synchronization driver (application plane): monitors the SDN service model data of the application plane (host information, QoS policies, ACL policies, state policies, layer-2 network data, layer-3 network data, and the like), forms a service log, and transmits the service log to the SDN service management module in near real time;

Application self-healing engine (control plane): implements a self-healing scheme for the SDN application;

Controller self-healing engine (control plane): implements different self-healing schemes for the SDN controller;

Switch self-healing engine (data plane): implements different self-healing schemes for the SDN switch;

Data node probe engine (data plane): reports the basic information of its node to the data plane management module, obtains the information of other nodes, probes the forwarding function of the SDN switch toward other nodes through heartbeat messages, and reports the probe results to the data plane management module;

SDN service management module: receives and stores data from the data synchronization driver, communicates with the SDN controller through the northbound interface, and takes over from the SDN controller in abnormal conditions;

Data plane management module: receives information from the data node probe engines and feeds the abnormal condition of the control plane back to the control state analysis module;

Control state analysis module: analyzes the information fed back by the data plane management module, formulates a self-healing strategy, and issues it to the self-healing execution module;

Self-healing execution module: notifies the relevant modules to execute the corresponding schemes according to the self-healing strategy;

Alarm module: sends alarm information to the administrator by mail, short message, voice, and the like.

The following describes a preferred embodiment of the present invention in detail with reference to the above-mentioned exception recovery apparatus for SDN control plane:

Fig. 3 is a flowchart illustrating an anomaly recovery method for an SDN control plane according to another embodiment of the present invention. The method is applied to an anomaly recovery device of an SDN control plane, and the device is dedicated to detecting and recovering from policy orchestration and scheduling abnormalities of the SDN control plane. As shown in Fig. 3, the method comprises the following steps:

Step S310: synchronizing the SDN service model data of the application plane in the SDN network.

According to the implementation principle of the SDN network, when an SDN service request occurs (such as adding a virtual machine, changing a network card configuration, setting an ACL policy, or setting a QoS policy), the application plane submits the requested network behavior to the SDN controller through a predetermined service template. The SDN controller abstracts the network behavior into a forwarding model, forms a flow table containing elements such as MAC information, MPLS labels, routing information and ACL access control information, and issues the flow table to each SDN switch; the SDN switches then forward traffic according to the flow table.

In this embodiment, calls of SDN service templates in the application plane are synchronized to keep the SDN network running normally through the synchronized data in case of an abnormal condition.

Specifically, call requests to the SDN service templates in the application plane of the SDN network are monitored; a call request comprises host information, a QoS policy, an ACL policy, a state policy, layer-2 network data and/or layer-3 network data. A sequence number and a timestamp are added to each call request, which is then synchronized to a temporary table. According to the sequence numbers and timestamps, the synchronized data in the temporary table that is older than N + M is periodically parsed item by item, where N is the aging time of the SDN flow table and M is the estimated time from the aging of the flow table to the discovery of the SDN abnormality. The parsing results are updated into the data table, and the corresponding data is deleted from the temporary table. The data in the data table can be used to restore the processes and data in the SDN network to their state at the time N + M earlier, when the SDN network was still operating normally.

When the anomaly recovery device of the SDN control plane is used to execute the scheme of this embodiment, the data synchronization driver arranged in the application plane monitors calls to the SDN service templates, adds a sequence number and a timestamp to each network behavior request, and asynchronously transmits the request to the SDN service management module. Fig. 4 shows a schematic diagram of data transfer between the data synchronization driver and the SDN service management module in a specific example. As shown in Fig. 4, the first piece of synchronized data indicates that at 12:22:34 on October 1, 2019, the application APP1 initiated an SDN service request to create a subnet numbered 20010 for the host VM1, with the network segment 192.168.1.0/24. The SDN service management module then receives the synchronized data and stores it in the temporary table; it periodically parses, item by item and in sequence-number order, the synchronized data in the temporary table that is older than N + M, updates the results into the corresponding data table, and deletes the corresponding data from the temporary table. N and M can be set as needed. In certain control plane failure scenarios the SDN controller cannot issue flow tables to the SDN switches, yet the correct flow tables already on the SDN switches can still provide forwarding service until they age out; in the most extreme case, the SDN abnormality is not discovered until the flow table has aged. The N + M setting here ensures the effectiveness of the self-healing strategy.
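The temporary-table mechanism with the N + M delay can be sketched as follows. `SyncRecord` and `ServiceStore` are hypothetical names, times are plain seconds, and "parsing" is reduced to moving a record into the data table, since the actual parsing logic is not described in this passage.

```python
from dataclasses import dataclass


@dataclass
class SyncRecord:
    seq: int          # sequence number added by the data synchronization driver
    timestamp: float  # time of the call request, in seconds
    request: dict     # network behavior request (contents are illustrative)


class ServiceStore:
    def __init__(self, n_seconds: float, m_seconds: float):
        # N: flow table aging time; M: estimated delay from flow table
        # aging to discovery of the abnormality.
        self.window = n_seconds + m_seconds
        self.temp_table: list[SyncRecord] = []
        self.data_table: list[SyncRecord] = []

    def receive(self, record: SyncRecord) -> None:
        self.temp_table.append(record)

    def parse_due(self, now: float) -> int:
        """Move records older than N + M into the data table, in order."""
        cutoff = now - self.window
        due = sorted(
            (r for r in self.temp_table if r.timestamp <= cutoff),
            key=lambda r: (r.seq, r.timestamp),
        )
        self.data_table.extend(due)  # placeholder for real parsing
        self.temp_table = [r for r in self.temp_table if r.timestamp > cutoff]
        return len(due)
```

Because records younger than N + M never reach the data table, the data table only reflects requests made at least N + M ago, which is what lets it serve as a known-good restore point.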

Step S320: the method includes the steps of obtaining a first address of each SDN switch in a data plane of the SDN network and a second address of a server directly connected with each SDN switch.

When the abnormal recovery device of the SDN control plane is used for executing the scheme of the embodiment, a data node detection engine is deployed on a server directly connected with each SDN switch, and when the data node detection engine is started, a second address of the server where the data node detection engine is located and a first address of the SDN switch directly connected with the server where the data node detection engine is located are reported to a data plane management module through a management network. The data node detection engine can periodically perform routing detection, and actively reports the routing detection to the data plane management module when the SDN switch address directly connected with the data node detection engine is found to be switched.

Step S330: sending heartbeat detection requests to the servers according to the second addresses, and judging whether there is at least one server that is directly connected to an SDN switch and does not return heartbeat information; if so, further judging whether a physical network element fault exists in the SDN network.

Specifically, if it is determined that no physical network element fault exists, the policy orchestration and scheduling of the SDN control plane are abnormal, and step S340 and the subsequent steps are executed to perform anomaly recovery. Otherwise, if a physical network element fault exists in the SDN network, abnormal alarm information is sent, the servers that did not return heartbeat information are marked as out of service, and the full list of node IP addresses is pushed to all data nodes.

When the scheme of this embodiment is executed by the anomaly recovery device of the SDN control plane, the data plane management module determines, according to the node information (that is, the first address and the second address) reported by the data node probe engines, whether a data node (which may be understood as a server, the same below) has been added or withdrawn; if so, it actively pushes a node list containing the second addresses of all nodes to all data nodes. Each data node probe engine sends heartbeat detection requests to the probe engines of all other data nodes according to the node list provided by the data plane management module and receives the feedback. If the heartbeat feedback of a data node is not received within a set time, the forwarding function of the SDN switch directly connected with that data node is abnormal; if it is received within the set time, the forwarding function is normal. On this basis, information on whether each data node is normal is reported to the data plane management module; it mainly comprises the second address of the server, the marked state of the server, and/or the time at which the abnormality was found. Accordingly, the data plane management module can derive the abnormality information that needs to be updated from the information reported by the data node probe engines.

Fig. 5 shows a data synchronization diagram between a data node probe engine and the data plane management module in a specific embodiment. As shown in Fig. 5, the data node probe engine reports its IP (the second address of the server) and the IP of the primary SDN (the first address of the SDN switch to which the server is directly connected) to the data plane management module. The data plane management module returns to each data node probe engine a node list consisting of the second addresses of the servers reported by all data node probe engines, so that each probe engine can send heartbeat probe requests to the other probe engines according to the node list and determine whether the forwarding function of the SDN switches directly connected with the other servers is abnormal.
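The node-list synchronization of Fig. 5 might be sketched as below. The class `DataPlaneRegistry` and the injected `push` callable are hypothetical; the real transport over the management network is not described in this passage and is abstracted away.

```python
class DataPlaneRegistry:
    """Tracks reported nodes and pushes the full node list on membership change."""

    def __init__(self, push):
        self._nodes = {}   # second address -> first address of attached switch
        self._push = push  # assumed callable(target_addr, node_list)

    def report(self, server_addr: str, switch_addr: str) -> None:
        is_new = server_addr not in self._nodes
        # Always store the latest switch address, which may have changed.
        self._nodes[server_addr] = switch_addr
        if is_new:
            self._broadcast()

    def withdraw(self, server_addr: str) -> None:
        if self._nodes.pop(server_addr, None) is not None:
            self._broadcast()

    def _broadcast(self) -> None:
        node_list = sorted(self._nodes)
        for addr in node_list:
            self._push(addr, node_list)
```

Pushing only on joins and withdrawals, rather than on every report, mirrors the text's condition that the list is pushed when a node is added or withdrawn.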

Step S340: if no physical network element fault exists, updating the historical abnormal node information table according to the request results of the heartbeat detection requests returned by the servers.

Specifically, for any server which does not return heartbeat information, whether the server is included in a historical abnormal node information table or not is judged; if the abnormal node information table does not contain the heartbeat information, the abnormal characteristic of the server is added to the abnormal node information table, if the heartbeat information is not returned by the server and is not recorded in the historical abnormal node information table, the server is indicated to be a newly added abnormality, and the newly added abnormality is added to the abnormal node information table so as to be matched with a corresponding recovery strategy in the subsequent process. On the contrary, if the server does not return the heartbeat information, but the abnormal characteristic of the server is recorded in the abnormal node information table, the abnormal condition is the condition that the history is generated but is not recovered, and the repeated recording is not needed. And/or, judging whether the server is contained in a historical abnormal node information table or not aiming at any server returning heartbeat information; if the abnormal feature of the server is found, the abnormal feature of the server is deleted from the abnormal node information table, if the server returns heartbeat information, the forwarding function of the SDN switch directly connected with the server is normal, meanwhile, if the abnormal feature of the server is recorded in the abnormal node information table, the server is marked as abnormal in the historical detection process, and therefore the condition that the forwarding function of the switch directly connected with the server is recovered from the abnormal condition to be normal can be determined, and the abnormal feature of the server is deleted from the abnormal node information table. 
Through the specific judgment and updating mode, the accuracy of the abnormal characteristic record in the abnormal node information table can be ensured.
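The add-and-delete rule above can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the dictionary layout, the function name update_abnormal_table, and the switch_of lookup are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class AbnormalFeature:
    switch_addr: str       # first address: the directly connected SDN switch
    report_time: datetime  # abnormality reporting time


def update_abnormal_table(table, heartbeat_ok, switch_of, now):
    """Update the abnormal node information table in place.

    table:        dict mapping server address -> AbnormalFeature (hypothetical layout)
    heartbeat_ok: dict mapping server address -> True if heartbeat info was returned
    switch_of:    dict mapping server address -> address of its directly connected switch
    """
    for server, ok in heartbeat_ok.items():
        if not ok and server not in table:
            # Newly added abnormality: record it once, with the current time.
            table[server] = AbnormalFeature(switch_of[server], now)
        elif ok and server in table:
            # Marked abnormal earlier and now answering again: delete the record.
            del table[server]
        # Otherwise nothing changes: still abnormal (already recorded) or still normal.
    return table
```

Because an already recorded server is skipped, its original reporting time is preserved, which is what later allows the earliest abnormal reporting time to be used for strategy matching.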

When the scheme of the embodiment is executed by using the abnormality recovery device of the SDN control plane, the data plane management module, after receiving from the data node detection engine the information on whether each server is abnormal, deletes from the historical abnormal node information table the information of any server that has recovered to normal, and adds to the table the information of any newly added abnormal server (in the embodiment, a server whose directly connected SDN switch has an abnormal forwarding function is referred to as an abnormal server). The information of an abnormal server includes the second address of the abnormal server, the first address of the SDN switch to which the abnormal server is directly connected, and the abnormality reporting time of the abnormal server. Optionally, the data node detection engine pre-filters the detection results: if a node from which no heartbeat information is received is already marked as abnormal, the engine does not report it to the data plane management module again, so as to avoid repeated recording in the abnormal node information table; and if heartbeat information is received from a data node marked as abnormal, the engine reports this to the data plane management module so that the data plane management module removes the record of that data node from the abnormal node information table.

In addition, in some optional embodiments of the present invention, if it is determined that no physical network element fault exists, it is detected whether the abnormality is the first abnormality entered from the normal state of the SDN network, and if so, the analysis processing is stopped. When no physical network element fault exists, the policy arrangement and scheduling of the SDN control plane is considered to be abnormal; to ensure that the normal running state before the abnormality can be restored, the analysis of the synchronous data in the temporary table needs to be stopped, so that the analyzed data is only data monitored during normal running of the SDN network. Generally, the analysis is stopped when the SDN network changes from the normal running state to the state of abnormal policy arrangement and scheduling of the SDN control plane (i.e., when an abnormality is found for the first time), and it is resumed only after all data nodes have recovered to normal, that is, after the SDN network has recovered to normal. For example, if the SDN network is normal at second 0 and server 1 does not return heartbeat information when probed at second 1, the detected abnormality is a first-entry abnormality; if server 2 does not return heartbeat information when probed at second 2, that abnormality is not a first-entry abnormality, and no abnormality detected afterwards is a first-entry abnormality until the SDN network has recovered to normal.

When the above optional embodiment is executed by using the abnormality recovery device of the SDN control plane, the control state analysis module, on receiving from the data plane management module an abnormal node information table that reports a first-entry abnormality, immediately notifies the SDN service management module to stop analyzing the synchronous data in the temporary table.

Step S350: matching an exception recovery strategy according to the exception characteristics of all exception servers in the updated exception node information table, wherein the exception characteristics comprise the first address of the SDN switch directly connected with the exception server and the exception reporting time of the exception server.

Specifically, it is judged whether the first addresses of the SDN switches directly connected to all the abnormal servers in the updated abnormal node information table are the same. If they are the same, the forwarding abnormality lies in a single SDN switch and the policy arrangement and scheduling abnormality of the SDN control plane is relatively light; otherwise the abnormality is more serious. Among the abnormal reporting times of all the abnormal servers, the one farthest from the current time (the earliest abnormal reporting time) is matched against preset time intervals to determine the time interval to which it belongs. The later that interval, the shorter the abnormality has lasted and the lighter its degree; for example, an abnormality first reported 5 minutes ago is more serious than one first reported 1 minute ago. An abnormal recovery strategy is then matched according to the judgment result of whether the first addresses are the same and the time interval to which the earliest abnormal reporting time belongs, that is, according to the abnormality severity indicated by these two results. The abnormal recovery strategy generally comprises controlling a northbound application interface, a northbound control interface, a southbound data plane interface, an SDN controller and/or an SDN switch to restart, and/or emptying a forwarding table, so as to realize targeted and accurate recovery.
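The two inputs to strategy matching can be derived from the table as in the following sketch. The tuple layout of the records and the function name summarize_abnormal_features are assumptions made for illustration only.

```python
from datetime import datetime, timedelta


def summarize_abnormal_features(records, now):
    """Reduce the abnormal node information table to the two matching inputs.

    records: list of (switch_addr, report_time) tuples, one per abnormal server
             (a hypothetical layout of the table).
    Returns (same_switch, elapsed), or None when the table is empty:
      same_switch: True if every abnormal server hangs off the same SDN switch
      elapsed:     time between the earliest abnormal reporting time and now
    """
    if not records:
        return None
    same_switch = len({addr for addr, _ in records}) == 1
    earliest = min(report_time for _, report_time in records)
    return same_switch, now - earliest
```

The elapsed value is what gets compared with the preset time intervals; the comparison itself depends on the thresholds chosen for a deployment.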

When the scheme of the embodiment is executed by using an abnormal recovery device of an SDN control plane, the data plane management module feeds back an updated abnormal node information table to the control state analysis module; after receiving the abnormal node information, the control state analysis module records and analyzes all current abnormal characteristics, including the judgment of whether the first address is the same or not, the matching of a time interval and the matching of a corresponding abnormal recovery strategy, then informs the self-healing execution module to implement self-healing (in this document, self-healing is recovery), and simultaneously sends out a related abnormal alarm to an administrator through the alarm module.

In addition, besides the data plane management module actively feeding back the updated abnormal node information table to the control state analysis module whenever an abnormal node is newly added and/or an abnormal node recovers, the control state analysis module periodically scans the abnormal node information table and updates the self-healing strategy according to the current time and the abnormal reporting times in the table. For example, as time passes, the time interval to which the earliest abnormal reporting time belongs changes, and the self-healing strategy matched with it is updated accordingly, which improves the accuracy with which the control state analysis module matches the self-healing strategy. After receiving feedback from the data plane management module that an abnormal node has recovered to normal (the feedback may pass from the data node detection engine to the data plane management module and then on to the control state analysis module), the control state analysis module updates the execution result of the corresponding self-healing strategy in the self-healing strategy issuing record to "recovered".

Step S360: if the time interval to which the earliest abnormal reporting time belongs is a first set time interval, regardless of whether the first addresses are the same or different, controlling a northbound control interface of an SDN controller in the SDN network to point to the data table, and providing an analysis result in the data table to the SDN controller through the northbound control interface.

The first set time interval is the interval farthest from the current time. If the earliest abnormal reporting time falls in the first set time interval, the service plane of the SDN network needs to be recovered, and this is done by using the synchronous data continuously synchronized in step S310 to take over the function of the service plane, so as to ensure normal operation after recovery.

Specifically, if the time interval to which the earliest abnormal reporting time belongs is the first set time interval, regardless of whether the first addresses are the same or different, the northbound control interface of the SDN controller is disconnected from each SDN application of the SDN application plane and is pointed to the data table storing the analyzed data, so that the normal data from before the N + M time can be provided to the SDN controller from the data table, thereby ensuring normal operation of the SDN network at this time.

When the abnormal recovery device of the SDN control plane is used to execute the scheme of the embodiment, the control state analysis module, after matching an abnormal recovery strategy, notifies the self-healing execution module to execute it; the self-healing execution module notifies the SDN service management module to start the SDN northbound application interface and, through the controller self-healing engine, points the northbound control interface of the SDN controller to the SDN service management module. This abnormal recovery strategy is the service plane first-level self-healing measure referred to below. Fig. 6 shows a schematic diagram of the implementation process of the service plane first-level self-healing measure. As shown in fig. 6, the SDN northbound application interface of the SDN service management module is started, and the northbound control interface of the SDN controller is pointed to the SDN service management module.

It should be noted that although step S360 describes the execution of only one exception recovery strategy, in actual implementation many different exception recovery strategies can be matched from the judgment result of whether the first addresses are the same and the time interval to which the earliest abnormal reporting time belongs, and the recovery manners differ accordingly. The following lists several possible judgment results and time interval results in actual implementation, together with the correspondingly matched exception recovery strategies:

case one, the case where the first addresses are the same, which in turn includes the following sub-cases:

in sub-case 1, the time interval to which the earliest abnormal reporting time belongs is within the set time T2, and the matched abnormal recovery strategy is a data plane secondary self-healing measure at this time;

in sub-case 2, the time interval to which the earliest abnormal reporting time belongs is beyond the set time T2 but within the set time T1, and the matched abnormal recovery strategy is a data plane first-level self-healing measure at this time;

in sub-case 3, the time interval to which the earliest abnormal reporting time belongs is beyond the set time T1 but within the set time T0, and the matched abnormal recovery strategy is a service plane secondary self-healing measure at this time;

in sub-case 4, the time interval to which the earliest abnormal reporting time belongs is the time exceeding the set time T0, and the matched abnormal recovery strategy is a first-level self-healing measure of the service plane at this time;

case two, the case where the first addresses are not the same, which in turn includes the following sub-cases:

in the sub-case 1, the time interval to which the earliest abnormal reporting time belongs is within the set time T3, and the matched abnormal recovery strategy is a control plane three-level self-healing measure at this time;

in sub-case 2, the time interval to which the earliest abnormal reporting time belongs is beyond the set time T3 but within the set time T2, and the matched abnormal recovery strategy is a control plane secondary self-healing measure at this time;

in sub-case 3, the time interval to which the earliest abnormal reporting time belongs is beyond the set time T2 but within the set time T1, and the matched abnormal recovery strategy is a control plane first-level self-healing measure at this time;

in sub-case 4, the time interval to which the earliest abnormal reporting time belongs is beyond the set time T1 but within the set time T0, and the matched abnormal recovery strategy is a service plane secondary self-healing measure at this time;

in sub-case 5, the time interval to which the earliest abnormal reporting time belongs is the time exceeding the set time T0, and the matched abnormal recovery policy is a first-level self-healing measure of the service plane.

In the above cases, T3< T2< T1< T0, and the value of each set time can be set according to the actual situation, for example, T3 is 0.5 minute, T2 is 1 minute, T1 is 5 minutes, and T0 is 10 minutes; and, in the above multiple cases, the sub-case 4 of the case one and the sub-case 5 of the case two are the cases in the step S360, that is, the most serious abnormal case, the specific content of the abnormal recovery policy may be referred to the description of the above step S360, and the contents of the remaining several abnormal recovery measures are specifically as follows:

strategy one, data surface two-stage self-healing: controlling an SDN switch corresponding to a first address contained in the abnormal node information table to restart a southbound data plane interface and emptying a forwarding table; when the recovery is performed by using the abnormal recovery device of the SDN control plane, the self-healing execution module informs a switch self-healing engine of the SDN switch corresponding to the first address contained in the abnormal node information table to restart a southbound data plane interface and clear a forwarding table;

strategy two, first-level self-healing of the data surface: controlling the SDN switch corresponding to the first address contained in the abnormal node information table to restart; when recovery is performed by using an abnormal recovery device of an SDN control plane, a self-healing execution module restarts an SDN switch corresponding to a first address contained in an abnormal node information table through a management network;

strategy three, control plane three-level self-healing: controlling the southbound control interfaces and the northbound control interfaces of all SDN controllers to restart, and emptying a forwarding table; when the recovery is carried out by using an abnormal recovery device of an SDN control plane, a self-healing execution module informs the controller self-healing engines of all controllers to restart the southbound control interface and the northbound control interface and to clear the forwarding table;

strategy four, control plane two-level self-healing: controlling the southbound data plane interfaces of all SDN switches to restart, and emptying a forwarding table; when the recovery is carried out by using an abnormal recovery device of an SDN control plane, a self-healing execution module informs all switch self-healing engines to restart the southbound data plane interfaces and to clear the forwarding table;

strategy five, control plane first-level self-healing: controlling all SDN controllers and all SDN switches to restart; when the recovery is carried out by using an abnormal recovery device of the SDN control plane, the self-healing execution module restarts all SDN controllers and all SDN switches through the management network;

strategy six, service plane secondary self-healing: controlling the restart of the northbound application interface; when the recovery is carried out by using an abnormal recovery device of an SDN control plane, a self-healing execution module informs an application self-healing engine to restart a northbound application interface;

strategy seven, service plane first-level self-healing: see fig. 6 and the corresponding description above, which are not repeated here.

By the exception recovery strategy, the application self-healing engine, the controller self-healing engine, the switch self-healing engine and the self-healing execution module are used for realizing targeted recovery.
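The correspondence between cases, sub-cases and strategies listed above can be written as a small lookup. This is a sketch under stated assumptions: the thresholds are the example values given in the text (T3 = 0.5 minute, T2 = 1 minute, T1 = 5 minutes, T0 = 10 minutes), the function name match_policy is hypothetical, and treating "within" as inclusive of the boundary is a choice made here because the text does not say how an exact boundary value is handled.

```python
from datetime import timedelta

# Example thresholds from the text; T3 < T2 < T1 < T0.
T3 = timedelta(minutes=0.5)
T2 = timedelta(minutes=1)
T1 = timedelta(minutes=5)
T0 = timedelta(minutes=10)

# Case one: all abnormal servers share one SDN switch.
SAME_SWITCH = [
    (T2, "data plane secondary"),
    (T1, "data plane first-level"),
    (T0, "service plane secondary"),
]
# Case two: the abnormal servers are attached to different SDN switches.
DIFFERENT_SWITCHES = [
    (T3, "control plane three-level"),
    (T2, "control plane secondary"),
    (T1, "control plane first-level"),
    (T0, "service plane secondary"),
]
# Beyond T0 in either case (the situation handled by step S360).
FALLBACK = "service plane first-level"


def match_policy(same_switch, elapsed):
    """Return the self-healing measure for the given features.

    same_switch: result of the first-address comparison
    elapsed:     time since the earliest abnormal reporting time
    """
    ladder = SAME_SWITCH if same_switch else DIFFERENT_SWITCHES
    for limit, policy in ladder:
        if elapsed <= limit:  # assumption: the boundary counts as "within"
            return policy
    return FALLBACK
```

For instance, match_policy(True, timedelta(seconds=27)) selects the data plane secondary measure, while the same elapsed time with different switches selects the control plane three-level measure, reflecting that an abnormality spread over several switches is treated as a control plane problem from the outset.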

In addition, when the embodiment is executed by using the abnormality recovery device of the SDN control plane, the control state analysis module, after matching any self-healing strategy, checks the historical issuing and execution record of that strategy; if the record shows that the strategy has already been issued but recovery has not yet occurred, the self-healing execution module is not notified to implement the strategy again.

Meanwhile, it should be emphasized that if all data nodes recover to normal before implementation of the service plane first-level self-healing measure has been notified, the SDN service management module is notified to resume the analysis of the synchronous data in the temporary table. If the control state analysis module has notified implementation of the service plane first-level self-healing measure, then regardless of whether the network recovers to normal, the analysis of the synchronous data in the temporary table is not resumed automatically; it can be resumed manually after the cause has been determined and eliminated. In this way, the correctness of the data analyzed by the SDN service management module can be ensured.
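The pause and resume rule for the analysis of the temporary table can be viewed as a small state machine. The sketch below is one possible reading of the rule, and the class and method names are invented for the example.

```python
class AnalysisGate:
    """Tracks whether analysis of the temporary table is allowed to run."""

    def __init__(self):
        self.analyzing = True            # normal operation: analysis runs
        self.first_level_issued = False  # service plane first-level measure notified?

    def on_table_update(self, abnormal_count):
        """Called whenever the abnormal node information table changes."""
        if abnormal_count > 0:
            # An abnormality is present: analysis stays (or becomes) paused.
            self.analyzing = False
        elif not self.first_level_issued:
            # All nodes recovered before the first-level measure: auto-resume.
            self.analyzing = True

    def on_service_plane_first_level(self):
        """Called when the service plane first-level measure is notified."""
        self.first_level_issued = True
        self.analyzing = False

    def manual_resume(self):
        """Operator action after the cause has been found and eliminated."""
        self.first_level_issued = False
        self.analyzing = True
```

Keeping the first-level flag separate from the abnormal count is what prevents an automatic resume once the controller has been switched over to the data table.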

To facilitate understanding of the implementation of the present embodiment, a specific example is described below. Fig. 7 shows a schematic diagram of the abnormal node information table and the self-healing strategy issuing record in a specific example. As shown in fig. 7, besides the matching and issuing that is triggered when the data plane management module reports an updated abnormal node information table, the control state analysis module also scans the abnormal node information table periodically (for example, at second 0 of every minute) and performs matching and issuing. Assume that T2 is 1 minute, T1 is 5 minutes, and T0 is 10 minutes. When the abnormal node 10.10.1.3 reports an abnormality, the control state analysis module finds that there is currently only one abnormal node record; the abnormality has just been found for the first time, so the time interval to which it belongs is necessarily within 1 minute, and a data plane secondary self-healing strategy is issued. Subsequently, when the abnormal node 10.10.1.8 reports an abnormality, the two abnormal nodes are found to belong to the same SDN switch, and the interval from the earliest abnormal reporting time of the abnormal nodes of that SDN switch, 12:20:18, to the current time, 12:20:45, is 27 seconds, which does not exceed 1 minute; a data plane secondary self-healing strategy is matched, but because the same strategy has been issued before and no recovery information has been received, it is not issued again. The control state analysis module then scans and matches once at second 0 of each of the following 8 minutes, namely 12:21:00, 12:22:00, 12:23:00, 12:24:00, 12:25:00, 12:26:00, 12:27:00 and 12:28:00. At 12:22:00, the earliest abnormal reporting time 12:20:18 is found to be more than 1 minute but within 5 minutes before the current time, so a data plane first-level self-healing strategy is issued; at 12:26:00, it is found to be more than 5 minutes but within 10 minutes before the current time, so a service plane secondary self-healing strategy is issued. The remaining entries of the self-healing strategy issuing record in the figure are obtained by analogy.
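The timeline of this example can be replayed with a short script. It is a sketch restricted to the same-switch case with the thresholds assumed in the example; the date is arbitrary, the function names are hypothetical, and the sketch simplifies by assuming that no recovery feedback arrives during the replay, so an issued strategy is never issued twice.

```python
from datetime import datetime, timedelta

T2, T1, T0 = timedelta(minutes=1), timedelta(minutes=5), timedelta(minutes=10)


def match_same_switch(elapsed):
    """Strategy for the case where all abnormal nodes share one SDN switch."""
    if elapsed <= T2:
        return "data plane secondary"
    if elapsed <= T1:
        return "data plane first-level"
    if elapsed <= T0:
        return "service plane secondary"
    return "service plane first-level"


def replay(earliest_report, trigger_times):
    """Return the (time, strategy) pairs that are actually issued."""
    issued = []
    outstanding = set()  # strategies issued and not yet reported as recovered
    for now in trigger_times:
        policy = match_same_switch(now - earliest_report)
        if policy not in outstanding:
            outstanding.add(policy)
            issued.append((now, policy))
    return issued


day = datetime(2019, 12, 26)  # arbitrary date, only the clock time matters


def at(hour, minute, second):
    return day.replace(hour=hour, minute=minute, second=second)


# Two abnormality reports, then the eight periodic scans 12:21:00 .. 12:28:00.
triggers = [at(12, 20, 18), at(12, 20, 45)] + [at(12, m, 0) for m in range(21, 29)]
issued = replay(at(12, 20, 18), triggers)
```

Under these assumptions, issued holds three entries, at 12:20:18, 12:22:00 and 12:26:00, which agrees with the narrative above: the matches at 12:20:45 and at the other scan times repeat a strategy that is still outstanding and are therefore suppressed.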

Fig. 8 shows the general flow of the abnormality recovery scheme implemented by the abnormality recovery apparatus of the SDN control plane. As shown in fig. 8, the data node detection engine initiates probing of the forwarding function of the SDN switches and, if a functional abnormality exists, reports it to the data plane management module; if the data plane management module judges that the abnormality is not a physical network element abnormality, it reports the abnormal node information table to the control state analysis module, which analyzes the abnormal characteristics in the table, matches a corresponding abnormality recovery strategy, and then informs the self-healing execution module to execute the abnormality recovery strategy.

It should be noted that, in the above embodiment, the description is made by dividing the abnormality recovery apparatus of the SDN control plane into a plurality of modules to respectively execute each step of the abnormality recovery method of the SDN control plane, but in actual implementation, the division is not limited to the above, and alternatively, one or more modules may be merged or split.

According to the method for recovering the abnormality of the SDN control plane provided by the embodiment, at least the following technical effects can be achieved. Firstly, the prior art scheme is based on the high-availability design idea of a traditional network and can only solve the high-availability problem of the SDN in situations such as an unavailable single SDN component or an interrupted physical link; the embodiment performs high-availability design on the core characteristics of the SDN, can solve the high-availability problem when the policy arrangement and scheduling of the control plane is abnormal, and fills a gap in the field. Secondly, the prior art cannot detect abnormalities in the policy arrangement and scheduling of the control plane; the embodiment realizes such detection based on flow table effectiveness and can find an abnormality of the SDN control plane in time. Thirdly, the prior art cannot provide automatic recovery in the scenario where the SDN network is unavailable due to data errors; the embodiment, in combination with the flow table aging principle, realizes an automatic disaster tolerance switching capability of the SDN network based on the time axis and can ensure that the SDN network is automatically recovered to an available state within a specific time.

Fig. 9 is a schematic structural diagram illustrating an embodiment of an anomaly recovery apparatus for an SDN control plane according to the present invention.

As shown in fig. 9, the apparatus includes:

the data node detection engine 910 is adapted to send heartbeat detection requests to the servers according to the second address and determine whether at least one server which does not return heartbeat information and is directly connected to the SDN switch exists;

a data plane management module 920, adapted to obtain a first address of each SDN switch in a data plane of the SDN network and a second address of a server directly connected to each SDN switch; if at least one server which is directly connected with the SDN switch and does not return heartbeat information exists, further judging whether a physical network element fault exists in the SDN network; if no physical network element fault exists, updating a historical abnormal node information table according to a request result of the heartbeat detection request returned by each server;

a control state analysis module 930, adapted to match an exception recovery policy according to the exception characteristics of all the exception servers in the updated exception node information table, where the exception characteristics include a first address of the SDN switch to which the exception server is directly connected, and an exception reporting time of the exception server;

and a self-healing execution module 940 adapted to execute the exception recovery processing on the SDN control plane by using the exception recovery policy.

In an optional manner, the data node probe engine is further adapted to:

for any server which does not return heartbeat information, judging whether the server is contained in the historical abnormal node information table; and/or, for any server which returns heartbeat information, judging whether the server is contained in the historical abnormal node information table;

the data plane management module is further adapted to: if the historical abnormal node information table does not contain a server which does not return heartbeat information, the abnormal characteristics of the server are added to the abnormal node information table; and/or if the historical abnormal node information table contains a server returning heartbeat information, deleting the abnormal characteristics of the server from the abnormal node information table.

In an optional manner, the apparatus further comprises: the warning module is suitable for sending abnormal warning information if the physical network element fault exists in the SDN network;

the data plane management module is further adapted to: and if the physical network element fault exists in the SDN, marking the server which does not return the heartbeat information to quit the service.

In an alternative, the control state analysis module is further adapted to:

judging whether the first addresses of the SDN switches directly connected with all the abnormal servers in the updated abnormal node information table are the same or not; matching the abnormal reporting time farthest from the current time among the abnormal reporting times of all the abnormal servers with a preset time interval, and determining the time interval to which the earliest abnormal reporting time belongs;

and matching an exception recovery strategy according to the judgment result whether the first addresses are the same and the time interval to which the earliest exception reporting time belongs.

In an optional manner, the apparatus further comprises: the data synchronization driver is suitable for monitoring a call request of an SDN service module in an application plane of the SDN network; adding a sequence number and a time stamp to the calling request and then synchronizing the sequence number and the time stamp to a temporary table;

the SDN service management module is suitable for periodically analyzing and processing the synchronous data before N + M time in the temporary table one by one according to the sequence of the sequence numbers and the time stamps of the synchronous data in the temporary table, wherein N represents the aging time of a flow table of the SDN network, and M represents the estimated time of the flow table until the flow table is aged to find SDN abnormity; and updating the analysis result into the data table and deleting the corresponding data in the temporary table.
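The periodic analysis of data older than N + M can be sketched as follows. The row layout, the function name analyze_due_rows and the analyze callback are assumptions for illustration; the text does not specify how a call request is analyzed, so the callback is only a placeholder.

```python
from datetime import datetime, timedelta


def analyze_due_rows(temp_table, data_table, now, n, m, analyze):
    """Analyze rows synchronized more than N + M ago and move their results.

    temp_table: list of dicts with 'seq', 'ts' and 'request' keys (hypothetical layout)
    data_table: list receiving the analysis results
    n:          flow table aging time of the SDN network
    m:          estimated time from flow table aging until an abnormality is found
    analyze:    placeholder callable turning one request into an analysis result
    Returns the number of rows processed in this round.
    """
    cutoff = now - (n + m)
    # Only rows strictly older than the cutoff are due, in sequence order.
    due = sorted(
        (row for row in temp_table if row["ts"] < cutoff),
        key=lambda row: (row["seq"], row["ts"]),
    )
    for row in due:
        data_table.append(analyze(row["request"]))
        temp_table.remove(row)  # delete the corresponding data from the temporary table
    return len(due)
```

Holding rows back for N + M means the data table reflects only requests whose effect on the flow tables has had time to age out and to surface as an abnormality, so the data table can serve as a known-good source when the controller is pointed at it.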

In an alternative, the control state analysis module is further adapted to:

if the physical network element fault does not exist, detecting whether the abnormality is the abnormality of the SDN network entering from the normal state for the first time, and if so, informing the SDN service management module to stop the analysis processing.

In an optional manner, the self-healing execution module is further adapted to:

if the first addresses are the same or different, and the time interval to which the earliest abnormal reporting time belongs is a first set time interval, controlling a northbound control interface of an SDN controller in the SDN network to point to the data table, and providing an analysis result in the data table to the SDN controller through the northbound control interface.

An embodiment of the present invention provides a non-volatile computer storage medium, where the computer storage medium stores at least one executable instruction, and the computer executable instruction may execute the method for recovering an exception of an SDN control plane in any method embodiment described above.

Fig. 10 is a schematic structural diagram of an embodiment of a computing device according to the present invention, and the specific embodiment of the present invention does not limit the specific implementation of the computing device.

As shown in fig. 10, the computing device may include: a processor (processor)102, a Communications Interface (Communications Interface)104, a memory (memory)106, and a communication bus 108.

Wherein: the processor 102, the communication interface 104, and the memory 106 communicate with each other via the communication bus 108. The communication interface 104 is used for communicating with network elements of other devices, such as clients or other servers. The processor 102 is configured to execute the program 100, and may specifically perform the relevant steps in the above-described embodiments of the method for recovering an exception of an SDN control plane.

In particular, the program 100 may include program code that includes computer operating instructions.

The processor 102 may be a central processing unit CPU, or an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present invention. The computing device includes one or more processors, which may be the same type of processor, such as one or more CPUs; or may be different types of processors such as one or more CPUs and one or more ASICs.

The memory 106 is used for storing the program 100. The memory 106 may comprise high-speed RAM memory, and may also include non-volatile memory, such as at least one disk memory.

The program 100 may be specifically configured to cause the processor 102 to perform the following operations:

In an alternative, the program 100 causes the processor to:

for any server which does not return heartbeat information, judging whether the server is contained in the historical abnormal node information table; if not, adding the abnormal characteristics of the server to the abnormal node information table; and/or,

for any server which returns heartbeat information, judging whether the server is contained in the historical abnormal node information table; and if so, deleting the abnormal characteristics of the server from the abnormal node information table.

In an alternative, the program 100 causes the processor to: if a physical network element fault exists in the SDN network, send abnormal alarm information and mark the server which does not return the heartbeat information as out of service.

In an alternative, the program 100 causes the processor to:

monitoring a call request of an SDN service module in an application plane of the SDN network; adding a sequence number and a time stamp to the calling request and then synchronizing the sequence number and the time stamp to a temporary table;

periodically analyzing the synchronous data before N + M time in the temporary table one by one according to the sequence number and the timestamp of the synchronous data in the temporary table, wherein N represents the aging time of the SDN network flow table, and M represents the estimated time until the flow table is aged to find the SDN abnormity;

and updating the analysis result into the data table and deleting the corresponding data in the temporary table.

In an alternative, the program 100 causes the processor to:

if no physical network element fault exists, detecting whether the abnormality is the first time the SDN network has entered an abnormal state from the normal state, and if so, stopping the analysis processing.

In an alternative, the program 100 causes the processor to:

The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specified otherwise.

Claims

1. An exception recovery method for an SDN control plane comprises the following steps:

2. The method according to claim 1, wherein the updating the historical abnormal node information table according to the request result of the heartbeat detection request returned by each server further comprises:

3. The method of claim 1, wherein if it is determined that there is a physical network element failure in the SDN network, the method further comprises: sending abnormal alarm information and marking the server which does not return the heartbeat information as out of service.

4. The method according to any one of claims 1-3, wherein the matching the exception recovery policy according to the exception characteristics of all exception servers in the updated exception node information table further comprises:

5. The method of claim 4, wherein the method further comprises:

6. The method of claim 5, wherein after the determining whether there is a physical network element failure in the SDN network, the method further comprises:

7. The method of claim 6, wherein the performing of the exception recovery processing on the SDN control plane using the exception recovery policy is specifically:

8. An exception recovery apparatus for an SDN control plane, comprising:

9. A computing device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;

the memory is configured to store at least one executable instruction, which causes the processor to perform an operation corresponding to the method for recovering an exception of the SDN control plane according to any one of claims 1 to 7.

10. A computer storage medium having stored therein at least one executable instruction to cause a processor to perform operations corresponding to the method for recovering an anomaly of an SDN control plane according to any one of claims 1-7.