WO2024119777A1 - Communication link anomaly processing method for frame-based device, frame-based device, and medium - Google Patents

Publication number
WO2024119777A1
Authority
WO
WIPO (PCT)
Prior art keywords
main control
board
control board
standby
current
Prior art date
Application number
PCT/CN2023/101886
Other languages
French (fr)
Chinese (zh)
Inventor
刘妙阁
Original Assignee
中兴通讯股份有限公司
Application filed by 中兴通讯股份有限公司
Publication of WO2024119777A1

Abstract

The present application belongs to the technical field of frame-based devices. Disclosed are a communication link anomaly processing method for a frame-based device, a frame-based device, and a medium. In the present application, island detection is introduced, and an action of restoring a normal communication link for a corresponding single board is executed according to an island state, such that full-frame reset caused by an anomaly in an active/standby communication link and an anomaly in a communication link between the current main control board and a service single board is avoided, thereby ensuring the normal operation and processing of the current service.

Description

框式设备通信链路异常处理方法、框式设备及介质Frame device communication link abnormality processing method, frame device and medium
相关申请Related Applications
本申请要求于2022年12月6日申请的、申请号为202211559772.X的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims priority to Chinese patent application No. 202211559772.X filed on December 6, 2022, the entire contents of which are incorporated by reference into this application.
技术领域Technical Field
本申请涉及框式设备的技术领域,尤其涉及一种框式设备通信链路异常处理方法、框式设备及计算机可读存储介质。The present application relates to the technical field of frame devices, and in particular to a frame device communication link exception processing method, a frame device, and a computer-readable storage medium.
背景技术Background
目前,在通讯设备和工控设备等框式设备领域,为了达到高性能和高可靠性,往往采用分布式架构设计,由一个主控板对整个框式设备的其他业务单板进行管理并同步每张业务单板的相关信息。框式设备一般包括主控板、背板、交换板以及业务单板等。主控板通过背板和其他所有业务单板连接,对业务单板进行管理;背板提供业务单板的供电、数据、管理、控制平面的各种通道;业务板用来接收和发送数据,主控板控制数据在交换机内部的走向;交换网板用于交换机内部的数据交换。但是,交换芯片硬件问题或者通信链路硬件问题等都可能导致板间通信链路异常,异常场景包括单通和互相不通,而通信链路异常会进一步导致板卡复位或者整框复位。At present, frame-type devices such as communication equipment and industrial control equipment often adopt a distributed architecture to achieve high performance and high reliability: one main control board manages the other service boards of the entire frame-type device and synchronizes the relevant information of each service board. A frame-type device generally includes a main control board, a backplane, a switch board, and service boards. The main control board connects to all service boards through the backplane and manages them; the backplane provides the power-supply, data, management, and control-plane channels for the service boards; the service boards receive and send data, and the main control board controls how data flows inside the switch; the switch fabric board handles data exchange inside the switch. However, hardware faults in the switching chip or in the communication link can make an inter-board communication link abnormal. Abnormal scenarios include one-way connectivity (only one direction works) and no connectivity in either direction, and a link anomaly can in turn cause a board reset or a reset of the entire frame.
以主用主控板和备用主控板之间的主备通信链路为例,当前采用软件心跳机制确保系统主备可靠性。主控板的主备管理进程中心跳机制线程每秒发送一个L2组播报文,通告自己的身份信息。而主机、备机或背板硬件问题都有可能导致主备通信链路异常,其中一种异常场景是主机无法收到备机发送的心跳组播报文,主机便会发送老化备机报文至备机,备机收到老化报文就复位自身,但是备机复位后之后会产生双主冲突,因为双主冲突主机也会复位自身,进一步导致整框复位,从而影响当前业务。Take the active/standby communication link between the active main control board and the standby main control board as an example. A software heartbeat mechanism is currently used to ensure the active/standby reliability of the system. The heartbeat thread in the active/standby management process of each main control board sends one L2 multicast message per second to announce its own identity. Hardware faults in the active board, the standby board, or the backplane can all make the active/standby communication link abnormal. In one such scenario, the active board cannot receive the heartbeat multicast messages sent by the standby board, so it sends an aging message to the standby board, and the standby board resets itself on receiving it. After the standby board resets, however, a dual-master conflict arises; because of that conflict the active board also resets itself, which resets the entire frame and disrupts current services.
发明内容Summary of the invention
本申请的主要目的在于提供一种框式设备通信链路异常处理方法、框式设备及计算机可读存储介质,旨在解决现有技术中难以避免通信链路异常导致双主冲突、进一步导致整框复位的技术问题。The main purpose of the present application is to provide a method for handling abnormal communication links of a frame-type device, a frame-type device and a computer-readable storage medium, aiming to solve the technical problem in the prior art that it is difficult to avoid dual-master conflicts caused by abnormal communication links, which further leads to the reset of the entire frame.
为实现上述目的,本申请提供一种框式设备通信链路异常处理方法,所述框式设备通信链路异常处理方法应用于主用主控板,包括以下步骤:To achieve the above object, the present application provides a method for handling abnormalities in a communication link of a frame-type device, wherein the method is applied to a main control board and comprises the following steps:
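在所述主用主控板在预设第一时长内未收到备用主控板的第一心跳报文之后,对所述主用主控板进行孤岛检测,确定所述主用主控板是否为孤岛板卡;After the active main control board does not receive the first heartbeat message from the standby main control board within a preset first time period, performing an island detection on the active main control board to determine whether the active main control board is an island board;
在确定所述主用主控板为孤岛板卡之后,将所述主用主控板设置为非工作待命状态并将所述备用主控板设置为当前主控板;After determining that the active main control board is an island board, setting the active main control board to a non-working standby state and setting the standby main control board as the current main control board;
在预设第二时长内处于所述非工作待命状态的所述主用主控板收到所述备用主控板的所述第一心跳报文之后,将所述主用主控板退出所述非工作待命状态并将所述主用主控板设置为当前主控板。After the active main control board in the non-working standby state within the preset second time period receives the first heartbeat message from the standby main control board, the active main control board exits the non-working standby state and is set as the current main control board.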
为实现上述目的,本申请提供一种框式设备通信链路异常处理方法,所述框式设备通信链路异常处理方法应用于备用主控板,包括以下步骤:To achieve the above-mentioned purpose, the present application provides a method for handling abnormalities in a communication link of a frame-type device, and the method for handling abnormalities in a communication link of a frame-type device is applied to a standby main control board, and comprises the following steps:
在所述备用主控板在预设第三时长内未收到主用主控板的第一心跳报文之后,对所述备用主控板进行孤岛检测,确定所述备用主控板是否为孤岛板卡;After the standby main control board fails to receive the first heartbeat message from the active main control board within a preset third time period, performing an island detection on the standby main control board to determine whether the standby main control board is an island board;
在所述备用主控板为孤岛板卡之后,将所述备用主控板设置为非工作待命状态;After the standby main control board becomes an island board, setting the standby main control board to a non-working standby state;
在预设第四时长内处于所述非工作待命状态的所述备用主控板收到所述主用主控板的所述第一心跳报文之后,将所述备用主控板退出所述非工作待命状态。After the standby main control board in the non-working standby state within the preset fourth time period receives the first heartbeat message from the active main control board, the standby main control board exits the non-working standby state.
为实现上述目的,本申请提供一种框式设备通信链路异常处理方法,所述框式设备通信链路异常处理方法应用于业务单板,包括以下步骤:To achieve the above object, the present application provides a method for handling abnormalities in a communication link of a frame-type device, wherein the method is applied to a service board and comprises the following steps:
在所述业务单板在预设第五时长内未收到当前主控板的第二心跳报文之后,对所述业务单板进行孤岛检测,确定所述业务单板是否为孤岛板卡;After the service board fails to receive the second heartbeat message from the current main control board within a preset fifth time period, performing an island detection on the service board to determine whether the service board is an island board;
在所述业务单板为孤岛板卡之后,复位重启当前单板,其中,所述当前单板为所述业务单板。After the service board is determined to be an island board, the current board is reset and restarted, wherein the current board is the service board.
本申请还提供一种框式设备,所述框式设备包括主用主控板、备用主控板、业务单板、存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序,所述计算机程序配置为实现如上所述的框式设备通信链路异常处理方法的步骤。The present application also provides a frame-type device, which includes an active main control board, a standby main control board, a service board, a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program is configured to implement the steps of the frame-type device communication link abnormality processing method described above.
本申请还提供一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机程序,所述计算机程序被处理器执行时实现上述框式设备通信链路异常处理方法的步骤。The present application also provides a computer-readable storage medium, on which a computer program is stored. When the computer program is executed by a processor, the steps of the above-mentioned frame device communication link exception processing method are implemented.
本申请还提供一种应用于主用主控板的第一装置,所述第一装置包括:The present application also provides a first device applied to a main control board, the first device comprising:
第一孤岛检测模块,用于在所述主用主控板在预设第一时长内未收到备用主控板的第一心跳报文之后,对所述主用主控板进行孤岛检测,确定所述主用主控板是否为孤岛板卡;A first island detection module, configured to perform an island detection on the main control board after the main control board fails to receive a first heartbeat message from the standby control board within a preset first time period, to determine whether the main control board is an island board;
第一状态切换模块,用于在确定所述主用主控板为孤岛板卡之后,将所述主用主控板设置为非工作待命状态并将所述备用主控板设置为当前主控板;A first state switching module is used to set the main main control board to a non-working standby state and set the standby main control board to the current main control board after determining that the main main control board is an island board;
状态回切模块,用于在预设第二时长内处于所述非工作待命状态的所述主用主控板收到所述备用主控板的所述第一心跳报文之后,将所述主用主控板退出所述非工作待命状态并将所述主用主控板设置为当前主控板。The state switch-back module is used to, after the active main control board in the non-working standby state receives the first heartbeat message from the standby main control board within a preset second time period, take the active main control board out of the non-working standby state and set the active main control board as the current main control board.
本申请还提供一种应用于备用主控板的第二装置,所述第二装置包括:The present application also provides a second device applied to a standby main control board, the second device comprising:
第二孤岛检测模块,用于在所述备用主控板在预设第三时长内未收到主用主控板的第一心跳报文之后,对所述备用主控板进行孤岛检测,确定所述备用主控板是否为孤岛板卡;A second island detection module is used to perform an island detection on the standby main control board after the standby main control board fails to receive the first heartbeat message from the active main control board within a preset third time period to determine whether the standby main control board is an island board;
第二状态切换模块,用于在所述备用主控板为孤岛板卡之后,将所述备用主控板设置为非工作待命状态;The second state switching module is used to set the standby main control board to a non-working standby state after the standby main control board becomes an island board;
状态恢复模块,用于在预设第四时长内处于所述非工作待命状态的所述备用主控板收到所述主用主控板的所述第一心跳报文之后,将所述备用主控板退出所述非工作待命状态。The state recovery module is used to make the standby main control board exit the non-working standby state after the standby main control board in the non-working standby state receives the first heartbeat message from the active main control board within a preset fourth time period.
本申请还提供一种应用于业务单板的第三装置,所述第三装置包括:The present application also provides a third device applied to a service board, the third device comprising:
第三孤岛检测模块,用于在所述业务单板在预设第五时长内未收到当前主控板的第二心跳报文之后,对所述业务单板进行孤岛检测,确定所述业务单板是否为孤岛板卡;A third island detection module is used to perform island detection on the service board after the service board fails to receive the second heartbeat message of the current main control board within a preset fifth time length to determine whether the service board is an island board;
复位重启模块,用于在所述业务单板为孤岛板卡之后,复位重启当前单板,其中,所述当前单板为所述业务单板。The reset and restart module is used to reset and restart the current board after the service board is an island board, wherein the current board is the service board.
本申请公开了一种框式设备通信链路异常处理方法、框式设备及计算机可读存储介质,在所述主用主控板在预设第一时长内未收到备用主控板的第一心跳报文之后,对所述主用主控板进行孤岛检测,确定所述主用主控板是否为孤岛板卡;在确定所述主用主控板为孤岛板卡之后,将所述主用主控板设置为非工作待命状态并将所述备用主控板设置为当前主控板;在预设第二时长内处于所述非工作待命状态的所述主用主控板收到所述备用主控板的所述第一心跳报文之后,将所述主用主控板退出所述非工作待命状态并将所述主用主控板设置为当前主控板。The present application discloses a method for handling abnormalities in a communication link of a frame-type device, a frame-type device, and a computer-readable storage medium. After the active main control board does not receive a first heartbeat message from a backup main control board within a preset first time period, an island detection is performed on the active main control board to determine whether the active main control board is an island board; after determining that the active main control board is an island board, the active main control board is set to a non-working standby state and the backup main control board is set to be the current main control board; after the active main control board in the non-working standby state receives the first heartbeat message from the backup main control board within a preset second time period, the active main control board exits the non-working standby state and sets the active main control board as the current main control board.
在所述备用主控板在预设第三时长内未收到主用主控板的第一心跳报文之后,对所述备用主控板进行孤岛检测,确定所述备用主控板是否为孤岛板卡;在所述备用主控板为孤岛板卡之后,将所述备用主控板设置为非工作待命状态;在预设第四时长内处于所述非工作待命状态的所述备用主控板收到所述主用主控板的所述第一心跳报文之后,将所述备用主控板退出所述非工作待命状态。After the standby main control board does not receive the first heartbeat message from the active main control board within a preset third time period, an island detection is performed on the standby main control board to determine whether the standby main control board is an island board; after the standby main control board is an island board, the standby main control board is set to a non-working standby state; after the standby main control board in the non-working standby state receives the first heartbeat message from the active main control board within a preset fourth time period, the standby main control board exits the non-working standby state.
在所述业务单板在预设第五时长内未收到当前主控板的第二心跳报文之后,对所述业务单板进行孤岛检测,确定所述业务单板是否为孤岛板卡;在所述业务单板为孤岛板卡之后,复位重启当前单板,其中,所述当前单板为所述业务单板。After the service board does not receive the second heartbeat message from the current main control board within the preset fifth time period, an island detection is performed on the service board to determine whether the service board is an island board; after the service board is an island board, the current board is reset and restarted, wherein the current board is the service board.
通过在主用主控板、备用主控板和业务单板中引入孤岛检测,确定板卡是否为孤岛板卡。当当前单板为孤岛状态的主控板时,通过将其设置为非工作待命状态并监听伙伴板的第一心跳报文,若接收到伙伴板的第一心跳报文,则可以确定当前单板为正常单板,主控板可通过退出非工作待命状态的方式自恢复主备通信链路的正常通信而不必通过复位重启恢复正常通信,避免重启复位后的主控板和当前主控板之间的双主冲突导致整框复位。By introducing island detection in the active main control board, standby main control board and service board, it is determined whether the board is an island board. When the current board is an island main control board, it is set to a non-working standby state and listens to the first heartbeat message of the partner board. If the first heartbeat message of the partner board is received, it can be determined that the current board is a normal board. The main control board can restore the normal communication of the active-standby communication link by exiting the non-working standby state without having to restore normal communication by resetting and restarting, thus avoiding the dual-master conflict between the main control board after the reset and the current main control board, which causes the entire frame to reset.
当当前单板为孤岛状态的业务单板时,直接可以确定当前单板为异常单板,通过复位重启业务单板恢复当前主控板和业务单板之间的正常通信,避免由于当前主控板和业务单板之间的通信链路异常导致当前主控板复位重启、导致整框复位,从而通过引入孤岛检测,根据孤岛状态执行对应单板的恢复通信链路正常的动作,避免由于主备通信链路异常以及当前主控板和业务单板之间的通信链路异常导致整框复位,保证当前业务的正常运行处理。When the current board is a service board in the island state, it can be directly determined to be an abnormal board. Resetting and restarting the service board restores normal communication between the current main control board and the service board, which prevents an abnormal communication link between them from causing the current main control board to reset and restart and, in turn, the entire frame to reset. Thus, by introducing island detection and executing, according to the island state, the action that restores a normal communication link for the corresponding board, a full-frame reset caused by an anomaly in the active/standby communication link or in the communication link between the current main control board and a service board is avoided, ensuring the normal operation and processing of current services.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
图1是本申请实施例方案涉及的硬件运行环境的运行设备的结构示意图;FIG1 is a schematic diagram of the structure of an operating device of a hardware operating environment involved in an embodiment of the present application;
图2为本申请实施例方案涉及的框式设备通信链路异常处理方法一实施例的流程示意图;FIG2 is a flow chart of an embodiment of a method for handling abnormal communication links of a frame-type device according to an embodiment of the present application;
图3为本申请实施例方案涉及的框式设备一实施例的系统示意图;FIG3 is a system diagram of an embodiment of a frame device according to an embodiment of the present application;
图4为本申请实施例方案涉及的框式设备通信链路异常处理方法另一实施例的流程示意图;FIG4 is a flow chart of another embodiment of a method for handling abnormal communication links of a frame-type device according to an embodiment of the present application;
图5为本申请实施例方案涉及的框式设备通信链路异常处理方法另一实施例的流程示意图;FIG5 is a flow chart of another embodiment of a method for handling abnormal communication links of a frame-type device according to an embodiment of the present application;
图6为本申请实施例方案涉及的第一装置的示意图;FIG6 is a schematic diagram of a first device involved in an embodiment of the present application;
图7为本申请实施例方案涉及的第二装置的示意图;FIG7 is a schematic diagram of a second device involved in an embodiment of the present application;
图8为本申请实施例方案涉及的第三装置的示意图。FIG8 is a schematic diagram of a third device involved in an embodiment of the present application.
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。The realization of the purpose, functional features and advantages of this application will be further explained in conjunction with embodiments and with reference to the accompanying drawings.
具体实施方式Detailed Description
应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。It should be understood that the specific embodiments described herein are only used to explain the present application and are not used to limit the present application.
参照图1,图1为本申请实施例方案涉及的硬件运行环境的运行设备结构示意图。Refer to Figure 1, which is a schematic diagram of the operating device structure of the hardware operating environment involved in the embodiment of the present application.
如图1所示,该运行设备可以包括:处理器1001,例如中央处理器(Central Processing Unit,CPU),通信总线1002、用户接口1003,网络接口1004,存储器1005。其中,通信总线1002用于实现这些组件之间的连接通信。用户接口1003可以包括显示屏(Display)、输入单元比如键盘(Keyboard),用户接口1003还可以包括标准的有线接口、无线接口。网络接口1004可以包括标准的有线接口、无线接口(如无线保真(Wireless-Fidelity,WI-FI)接口)。存储器1005可以是高速的随机存取存储器(Random Access Memory,RAM)存储器,也可以是稳定的非易失性存储器(Non-Volatile Memory,NVM),例如磁盘存储器。存储器1005还可以是独立于前述处理器1001的存储装置。As shown in FIG1 , the operating device may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. Among them, the communication bus 1002 is used to realize the connection and communication between these components. The user interface 1003 may include a display screen (Display), an input unit such as a keyboard (Keyboard), and the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may include a standard wired interface and a wireless interface (such as a wireless fidelity (Wireless-Fidelity, WI-FI) interface). The memory 1005 may be a high-speed random access memory (Random Access Memory, RAM) memory, or a stable non-volatile memory (Non-Volatile Memory, NVM), such as a disk memory. The memory 1005 may also be a storage device independent of the aforementioned processor 1001.
本领域技术人员可以理解,图1中示出的结构并不构成对运行设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。Those skilled in the art will appreciate that the structure shown in FIG. 1 does not constitute a limitation on the operating device, and may include more or fewer components than shown, or a combination of certain components, or a different arrangement of components.
如图1所示,作为一种存储介质的存储器1005中可以包括操作系统、数据存储模块、网络通信模块、用户接口模块以及计算机程序。As shown in FIG. 1 , the memory 1005 as a storage medium may include an operating system, a data storage module, a network communication module, a user interface module, and a computer program.
在图1所示的运行设备中,网络接口1004主要用于与其他设备进行数据通信;用户接口1003主要用于与用户进行数据交互;本申请运行设备中的处理器1001、存储器1005可以设置在运行设备中,所述运行设备通过处理器1001调用存储器1005中存储的计算机程序,并执行以下操作:In the operating device shown in FIG. 1, the network interface 1004 is mainly used for data communication with other devices; the user interface 1003 is mainly used for data interaction with the user; the processor 1001 and the memory 1005 of the present application may be provided in the operating device, and the operating device calls, through the processor 1001, the computer program stored in the memory 1005 and performs the following operations:
所述框式设备通信链路异常处理方法应用于主用主控板,包括以下步骤:The frame-type device communication link abnormality processing method is applied to the main control board, and includes the following steps:
在所述主用主控板在预设第一时长内未收到备用主控板的第一心跳报文之后,对所述主用主控板进行孤岛检测,确定所述主用主控板是否为孤岛板卡;After the active main control board does not receive the first heartbeat message from the standby main control board within a preset first time period, performing an island detection on the active main control board to determine whether the active main control board is an island board;
在确定所述主用主控板为孤岛板卡之后,将所述主用主控板设置为非工作待命状态并将所述备用主控板设置为当前主控板;After determining that the main main control board is an island board, setting the main main control board to a non-working standby state and setting the standby main control board to a current main control board;
在预设第二时长内处于所述非工作待命状态的所述主用主控板收到所述备用主控板的所述第一心跳报文之后,将所述主用主控板退出所述非工作待命状态并将所述主用主控板设置为当前主控板。After the active main control board in the non-working standby state within the preset second time period receives the first heartbeat message from the standby main control board, the active main control board exits the non-working standby state and is set as the current main control board.
在一实施例中,处理器1001可以调用存储器1005中存储的计算机程序,还执行以下操作:In one embodiment, the processor 1001 may call a computer program stored in the memory 1005 and further perform the following operations:
所述将所述主用主控板设置为非工作待命状态并将所述备用主控板设置为当前主控板的步骤之后,还包括:After the step of setting the active main control board to a non-working standby state and setting the standby main control board to a current main control board, the method further includes:
在预设第二时长内处于所述非工作待命状态的所述主用主控板未收到所述备用主控板的所述第一心跳报文之后,复位重启当前单板,其中,所述当前单板为所述主用主控板。After the active main control board in the non-working standby state does not receive the first heartbeat message from the standby main control board within a preset second time period, the current board is reset and restarted, wherein the current board is the active main control board.
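The active-board behaviour described above (enter the non-working standby state when found to be an island, switch back if the first heartbeat returns within the second duration, otherwise reset) can be read as a small state machine. The following Python sketch is only an illustration under assumptions: the class, state, and method names are hypothetical and do not come from the application, and the timers are reduced to explicit event calls.

```python
from enum import Enum, auto


class State(Enum):
    ACTIVE = auto()               # board is the current main control board
    NON_WORKING_STANDBY = auto()  # parked, only listening for heartbeats
    RESETTING = auto()            # board resets and restarts itself


class ActiveBoard:
    """Hypothetical model of the active main control board's link handling."""

    def __init__(self):
        self.state = State.ACTIVE
        self.is_current_main = True

    def on_heartbeat_timeout(self, island: bool) -> None:
        # First duration elapsed without a first heartbeat message;
        # 'island' is the result of island detection on this board.
        if self.state is State.ACTIVE and island:
            self.state = State.NON_WORKING_STANDBY
            self.is_current_main = False  # the standby board takes over

    def on_heartbeat(self) -> None:
        # First heartbeat message received while parked: switch back.
        if self.state is State.NON_WORKING_STANDBY:
            self.state = State.ACTIVE
            self.is_current_main = True

    def on_second_duration_expired(self) -> None:
        # Still no heartbeat within the second duration: reset this board.
        if self.state is State.NON_WORKING_STANDBY:
            self.state = State.RESETTING
```

Modelling the parked state explicitly makes the key property visible: while the board is in `NON_WORKING_STANDBY` it never claims to be the current main control board, so it cannot take part in a dual-master conflict.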
在一实施例中,处理器1001可以调用存储器1005中存储的计算机程序,还执行以下操作:In one embodiment, the processor 1001 may call a computer program stored in the memory 1005 and further perform the following operations:
所述框式设备通信链路异常处理方法应用于备用主控板,包括以下步骤:The frame-type device communication link abnormality processing method is applied to a standby main control board, comprising the following steps:
在所述备用主控板在预设第三时长内未收到主用主控板的第一心跳报文之后,对所述备用主控板进行孤岛检测,确定所述备用主控板是否为孤岛板卡;After the standby main control board fails to receive the first heartbeat message from the active main control board within a preset third time period, performing an island detection on the standby main control board to determine whether the standby main control board is an island board;
在所述备用主控板为孤岛板卡之后,将所述备用主控板设置为非工作待命状态;After the standby main control board becomes an island board, setting the standby main control board to a non-working standby state;
在预设第四时长内处于所述非工作待命状态的所述备用主控板收到所述主用主控板的所述第一心跳报文之后,将所述备用主控板退出所述非工作待命状态。After the standby main control board in the non-working standby state within the preset fourth time period receives the first heartbeat message from the active main control board, the standby main control board exits the non-working standby state.
在一实施例中,处理器1001可以调用存储器1005中存储的计算机程序,还执行以下操作:In one embodiment, the processor 1001 may call a computer program stored in the memory 1005 and further perform the following operations:
所述将所述备用主控板设置为非工作待命状态的步骤之后,还包括:After the step of setting the standby main control board to a non-working standby state, the method further includes:
在预设第四时长内处于所述非工作待命状态的所述备用主控板未收到所述主用主控板的所述第一心跳报文之后,复位重启当前单板,其中,所述当前单板为所述备用主控板。After the standby main control board in the non-working standby state does not receive the first heartbeat message from the active main control board within a preset fourth time period, the current board is reset and restarted, wherein the current board is the standby main control board.
在一实施例中,处理器1001可以调用存储器1005中存储的计算机程序,还执行以下操作:In one embodiment, the processor 1001 may call a computer program stored in the memory 1005 and further perform the following operations:
所述框式设备通信链路异常处理方法应用于业务单板,包括以下步骤:The frame device communication link exception processing method is applied to a service board, and comprises the following steps:
在所述业务单板在预设第五时长内未收到当前主控板的第二心跳报文之后,对所述业务单板进行孤岛检测,确定所述业务单板是否为孤岛板卡;After the service board fails to receive the second heartbeat message from the current main control board within a preset fifth time period, performing an island detection on the service board to determine whether the service board is an island board;
在所述业务单板为孤岛板卡之后,复位重启当前单板,其中,所述当前单板为所述业务单板。After the service board is determined to be an island board, the current board is reset and restarted, wherein the current board is the service board.
在一实施例中,处理器1001可以调用存储器1005中存储的计算机程序,还执行以下操作: In one embodiment, the processor 1001 may call a computer program stored in the memory 1005 and further perform the following operations:
所述复位重启当前单板的步骤之后,还包括:After the step of resetting and restarting the current board, the method further includes:
对当前单板进行孤岛检测,在复位重启后的当前单板为孤岛板卡之后,进行上报告警。Perform island detection on the current board, and report an alarm if the current board becomes an island board after reset and restart.
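For a service board, the sequence described above is: detect the island state, reset and restart, detect again, and report an alarm if the board is still an island. A minimal Python sketch of that ordering follows; the function name and the three callbacks are hypothetical placeholders for platform-specific operations that the application does not spell out.

```python
from typing import Callable, List


def handle_second_heartbeat_loss(
    detect_island: Callable[[], bool],
    reset_board: Callable[[], None],
    report_alarm: Callable[[], None],
) -> List[str]:
    """Run after the fifth duration passes without a second heartbeat message.

    Returns the list of actions taken, in order, for easy inspection.
    """
    actions: List[str] = []
    if not detect_island():
        return actions            # board is not an island: nothing to do here
    reset_board()                 # reset and restart the current board
    actions.append("reset")
    if detect_island():           # island detection again after the restart
        report_alarm()            # the reset did not help: report an alarm
        actions.append("alarm")
    return actions
```

In a real system the second detection would run after the board has rebooted, so the two calls would live in different boot cycles; the sketch collapses them into one function only to show the order of decisions.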
在一实施例中,处理器1001可以调用存储器1005中存储的计算机程序,还执行以下操作:In one embodiment, the processor 1001 may call a computer program stored in the memory 1005 and further perform the following operations:
所述进行孤岛检测的步骤,包括:The steps of performing island detection include:
在当前单板是主控板之后,确定所述主控板的伙伴板是否正在运行,并确定所述主控板是否存在在线的业务单板;If the current board is a main control board, determining whether the partner board of the main control board is running, and determining whether the main control board has any online service board;
在所述主控板的伙伴板正在运行或所述主控板存在在线的业务单板之后,或者在当前单板不是主控板之后,确定在预设第六时长内当前单板与其它单板之间是否存在透明内部进程通信;If the partner board of the main control board is running or the main control board has an online service board, or if the current board is not a main control board, determining whether any transparent inter-process communication took place between the current board and other boards within a preset sixth time period;
在预设第六时长内当前单板与其它单板之间不存在透明内部进程通信之后,确定当前单板为孤岛板卡。If no transparent inter-process communication took place between the current board and other boards within the preset sixth time period, determining that the current board is an island board.
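The island-detection steps above amount to a short decision procedure. The sketch below expresses it in Python under stated assumptions: the `BoardView` fields are hypothetical names for inputs the board would gather elsewhere, and the case the text leaves open (a main control board with no running partner and no online service board) is treated here as "not an island", which is an assumption rather than something the application states.

```python
from dataclasses import dataclass


@dataclass
class BoardView:
    is_main_control: bool       # is the current board a main control board?
    partner_running: bool       # partner board's run signal is asserted
    service_board_online: bool  # at least one service board is online
    ipc_seen_in_window: bool    # inter-board process communication was seen
                                # within the sixth duration


def is_island(view: BoardView) -> bool:
    """Return True if the current board should be treated as an island board."""
    if view.is_main_control and not (
        view.partner_running or view.service_board_online
    ):
        # Assumption: with no peer to talk to, silence is expected,
        # so it is not attributed to this board.
        return False
    # A peer exists (or the board is not a main control board):
    # no communication within the window means this board is isolated.
    return not view.ipc_seen_in_window
```

The peer check comes first because, for a main control board, an empty frame and a broken link look identical from the traffic alone.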
在一实施例中,处理器1001可以调用存储器1005中存储的计算机程序,还执行以下操作:In one embodiment, the processor 1001 may call a computer program stored in the memory 1005 and further perform the following operations:
所述确定所述主控板的伙伴板是否正在运行,并确定所述主控板是否存在在线的业务单板的步骤,包括:The step of determining whether the partner board of the main control board is running and determining whether there is an online service single board on the main control board includes:
通过背板获取所述伙伴板的运行信号,在所述运行信号为高电平之后,确定所述伙伴板正在运行;Obtaining the run signal of the partner board through the backplane, and determining that the partner board is running when the run signal is at a high level;
通过读取所述业务单板的在位寄存器状态,并基于所述在位寄存器状态确定所述业务单板是否在线。Reading the in-place register state of the service board, and determining from that state whether the service board is online.
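The two hardware checks above can be illustrated with plain integer operations. This Python sketch assumes a register layout that the application does not specify: one bit per slot in the in-place register, with 1 meaning a board is present. The function names are likewise hypothetical, and reading the actual signal or register is left to platform code.

```python
from typing import List


def partner_is_running(run_signal_level: int) -> bool:
    """A high level (1) on the partner's run signal means it is running."""
    return run_signal_level == 1


def online_slots(in_place_register: int, slot_count: int) -> List[int]:
    """Return the slot numbers whose in-place bit is set.

    Assumed layout: bit s corresponds to slot s, 1 = board present.
    """
    return [s for s in range(slot_count) if (in_place_register >> s) & 1]


def has_online_service_board(in_place_register: int, slot_count: int) -> bool:
    return len(online_slots(in_place_register, slot_count)) > 0
```

For example, under the assumed layout `online_slots(0b0101, 4)` reports slots 0 and 2 as present.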
本申请实施例提供了一种框式设备通信链路异常处理方法,参照图2,在框式设备通信链路异常处理方法的一实施例中,所述框式设备通信链路异常处理方法应用于主用主控板,包括以下步骤:The present application embodiment provides a method for handling abnormal communication links of a frame device. Referring to FIG. 2 , in one embodiment of the method for handling abnormal communication links of a frame device, the method for handling abnormal communication links of a frame device is applied to a main main control board, and includes the following steps:
步骤A1,在所述主用主控板在预设第一时长内未收到备用主控板的第一心跳报文之后,对所述主用主控板进行孤岛检测,确定所述主用主控板是否为孤岛板卡。Step A1, after the active main control board does not receive the first heartbeat message from the standby main control board within a preset first time period, perform an island detection on the active main control board to determine whether the active main control board is an island board.
参照图3,框式设备包括主用主控板、备用主控板、业务单板和背板。其中,主控板至少包括:CPU,板间通讯网口,控制面交换芯片(管理板间通讯);业务单板至少包括:CPU,板间通讯网口;背板至少包括:单板的供电、数据、管理、控制平面的各种通道。框式设备系统由一个或多个框组成,每个框由两个主控单板和若干个其他单板组成。主控单板包含DHCP(Dynamic Host Configuration Protocol,动态主机配置协议)服务端、板间通讯网口、交换芯片(负责管理与其他单板的通讯网络)和能够访问背板存储信息的通讯通道。整个系统都只有一个主用主控,包含版本管理服务端和DHCP服务端,主要负责对整个系统的运行版本进行统一管理、本框内其他单板机框信息管理和传递。其他单板包含DHCP客户端、板间通讯网口(与主控板交换芯片端口连接),在正常启动后,主要负责设备业务功能的正常运行。Referring to Figure 3, the frame device includes a main control board, a standby control board, a service board and a backplane. Among them, the main control board includes at least: CPU, inter-board communication network port, control plane switching chip (management inter-board communication); the service board includes at least: CPU, inter-board communication network port; the backplane includes at least: various channels of the board's power supply, data, management, and control plane. The frame device system consists of one or more frames, each of which consists of two main control boards and several other boards. The main control board includes a DHCP (Dynamic Host Configuration Protocol) server, an inter-board communication network port, a switching chip (responsible for managing the communication network with other boards) and a communication channel that can access the backplane storage information. The entire system has only one main control, including a version management server and a DHCP server, which is mainly responsible for unified management of the running version of the entire system, management and transmission of other board frame information in this frame. Other boards include a DHCP client and an inter-board communication network port (connected to the main control board switching chip port). After normal startup, they are mainly responsible for the normal operation of the equipment business functions.
主用主控板在预设第一时长内未收到备用主控板的第一心跳报文,则说明主备通信链路异常,则说明要么主用主控板出错,要么备用主控板出错,要么主用主控板和备用主控板均出错。因此,在这种情况下,对主用主控板进行孤岛检测,确定主用主控板是否为孤岛板卡、判断主用主控板是否处于孤岛状态,确定是否是主用主控板出错。示例性的,预设第一时长可以设置为120秒。If the active main control board does not receive the first heartbeat message from the standby main control board within the preset first time period, the active/standby communication link is abnormal: either the active main control board is faulty, the standby main control board is faulty, or both are faulty. In this case, island detection is performed on the active main control board to determine whether it is an island board, that is, whether it is in the island state and therefore whether the fault lies with the active main control board. For example, the preset first time period may be set to 120 seconds.
In this embodiment, only faults of the active main control board and/or the standby main control board are considered as causes of an abnormal active-standby communication link; faults of the communication link itself or of other devices and components are not considered for now.
Step A2: after the active main control board is determined to be an island board, set the active main control board to a non-working standby state and set the standby main control board as the current main control board.
If the active main control board is determined to be an island board, the abnormal active-standby communication link is caused by a fault of the active main control board. When the active main control board is in the island state, it is set to the non-working standby state, in which it listens for the first heartbeat message from its partner board (the standby main control board); at the same time, the current main control board is switched from the active main control board to the standby main control board.
If an active main control board in the island state were reset and restarted, a dual-master conflict would arise between the restarted active main control board and the standby main control board, and the whole frame would be reset. This embodiment therefore introduces a non-working standby state for the active main control board. In this state the frame has only one current main control board, namely the standby main control board, so the active main control board in the non-working standby state cannot cause a dual-master conflict.
Step A31: after the active main control board in the non-working standby state receives the first heartbeat message from the standby main control board within a preset second duration, the active main control board exits the non-working standby state and is set as the current main control board.
If the active main control board in the non-working standby state receives the first heartbeat message from the standby main control board within the preset second duration, the active main control board, previously in the island state, has recovered by itself while in the non-working standby state. It can then restore normal communication on the active-standby link simply by exiting the non-working standby state, without a reset and restart. At the same time, the current main control board is switched from the standby main control board back to the recovered active main control board, which takes over as the main control board. This embodiment does not limit when that switch takes place or how the moment is determined. For example, the preset second duration may be set to 5 minutes.
The condition of whether the active main control board in the non-working standby state receives the first heartbeat message from the standby main control board within the preset second duration serves to determine whether the active main control board has completed self-recovery and whether the active-standby communication link is back to normal. If so, the active main control board receives the first heartbeat message from the standby main control board and can become the current main control board again.
The active main control board and the standby main control board are each other's partner boards. Whichever of them is the current main control board periodically sends the first heartbeat message to its partner board so that the state of the active-standby communication link can be checked.
For example, after the step of setting the active main control board to the non-working standby state and setting the standby main control board as the current main control board, the method further includes:
Step A32: after the active main control board in the non-working standby state fails to receive the first heartbeat message from the standby main control board within the preset second duration, reset and restart the current board, where the current board is the active main control board.
If the active main control board in the non-working standby state does not receive the first heartbeat message from the standby main control board within the preset second duration, the active main control board in the island state has not been able to recover by itself in the non-working standby state; switching the faulty board to the non-working standby state was not enough to restore it, and a reset and restart is needed. Unlike a restart in the working state, a restart in the non-working standby state does not cause a dual-master conflict with the standby main control board. The active main control board in the non-working standby state is therefore reset and restarted in an attempt to restore it to normal.
Afterwards, a new current main control board can be determined between the standby main control board and the active main control board that has recovered from the non-working standby state, using a suitable active-standby election strategy. This embodiment does not limit how the new current main control board is determined.
In this embodiment, island detection is introduced on the active main control board to determine whether the board is an island board. When the current board is a main control board in the island state, it is set to the non-working standby state and listens for the first heartbeat message from its partner board. If that message is received, the current board is determined to be normal, and the main control board restores normal communication on the active-standby link by exiting the non-working standby state rather than by a reset and restart, which avoids a dual-master conflict between a restarted main control board and the current main control board and the resulting whole-frame reset. By acting on the island state to restore each board's communication link, the method avoids a whole-frame reset caused by an abnormal active-standby communication link and keeps current services running normally.
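The decision flow on the active main control board, from the heartbeat timeout through steps A2, A31, and A32, can be summarized in a short sketch. The function name, the action strings, and the `Master` enum are hypothetical labels chosen for illustration, and the later active-standby election after a reset is not modeled.

```python
from enum import Enum


class Master(Enum):
    ACTIVE = "active main control board"
    STANDBY = "standby main control board"


def active_board_flow(heartbeat_missed, is_island, heartbeat_in_standby):
    """Return (actions taken by the active board, current main control board)."""
    actions = []
    current = Master.ACTIVE
    # Island detection only matters after the first-duration heartbeat timeout.
    if not (heartbeat_missed and is_island):
        return actions, current
    # Step A2: park the active board and hand over to the standby board.
    actions.append("enter_non_working_standby")
    current = Master.STANDBY
    if heartbeat_in_standby:
        # Step A31: heartbeat seen within the second duration, take over again.
        actions.append("exit_non_working_standby")
        current = Master.ACTIVE
    else:
        # Step A32: no heartbeat within the second duration, reset this board.
        actions.append("reset_and_restart")
    return actions, current
```

In both branches the frame has exactly one current main control board at every point, which is the property the non-working standby state is meant to preserve.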
In another embodiment of the communication link anomaly processing method for a frame-based device of the present application, referring to FIG. 4, the method is applied to the standby main control board and includes the following steps:
Step B1: after the standby main control board fails to receive the first heartbeat message from the active main control board within a preset third duration, perform island detection on the standby main control board to determine whether the standby main control board is an island board.
If the standby main control board does not receive the first heartbeat message from the active main control board within the preset third duration, the active-standby communication link is abnormal: either the active main control board is faulty, or the standby main control board is faulty, or both are. In this case, island detection is performed on the standby main control board to determine whether it is an island board, that is, whether it is in the island state and therefore whether the fault lies with the standby main control board.
Step B2: after the standby main control board is determined to be an island board, set the standby main control board to the non-working standby state.
If the standby main control board is determined to be an island board, the abnormal active-standby communication link is caused by a fault of the standby main control board. When the standby main control board is in the island state, it is set to the non-working standby state, in which it listens for the first heartbeat message from its partner board (the active main control board); meanwhile, the active main control board remains the current main control board throughout.
If a standby main control board in the island state were reset and restarted, a dual-master conflict would arise between the restarted standby main control board and the active main control board, which has been acting as the current main control board all along, and the whole frame would be reset. This embodiment therefore introduces a non-working standby state for the standby main control board. In this state the frame has only one current main control board, namely the active main control board, and a reset and restart of the standby main control board in the non-working standby state does not cause a dual-master conflict.
Step B31: after the standby main control board in the non-working standby state receives the first heartbeat message from the active main control board within a preset fourth duration, the standby main control board exits the non-working standby state.
If the standby main control board in the non-working standby state receives the first heartbeat message from the active main control board within the preset fourth duration, the standby main control board, previously in the island state, has recovered by itself while in the non-working standby state. It can then restore normal communication on the active-standby link by exiting the non-working standby state, without a reset and restart.
The condition of whether the standby main control board in the non-working standby state receives the first heartbeat message from the active main control board within the preset fourth duration serves to determine whether the standby main control board has completed self-recovery and whether the active-standby communication link is back to normal. If so, the standby main control board receives the first heartbeat message from the active main control board, and the standby main control board is thereby back to normal.
For example, after the step of setting the standby main control board to the non-working standby state, the method further includes:
Step B32: after the standby main control board in the non-working standby state fails to receive the first heartbeat message from the active main control board within the preset fourth duration, reset and restart the current board, where the current board is the standby main control board.
If the standby main control board in the non-working standby state does not receive the first heartbeat message from the active main control board within the preset fourth duration, the standby main control board in the island state has not been able to recover by itself in the non-working standby state; switching the faulty board to the non-working standby state was not enough to restore it, and a reset and restart is needed. Unlike a restart in the working state, a restart in the non-working standby state does not cause a dual-master conflict with the active main control board. The standby main control board in the non-working standby state can therefore be reset and restarted in an attempt to restore it to normal.
In addition, because the active main control board has been the current main control board throughout and no switch between the active and standby main control boards has taken place, there is no need afterwards to apply an active-standby election strategy to determine a new current main control board between the active main control board and the standby main control board that has recovered from the non-working standby state.
In this embodiment, island detection is introduced on the standby main control board to determine whether the board is an island board. When the current board is a standby main control board in the island state, it is set to the non-working standby state and listens for the first heartbeat message from its partner board. If that message is received, the current board is determined to be normal, and the board restores normal communication on the active-standby link by exiting the non-working standby state rather than by a reset and restart, which avoids a dual-master conflict between a restarted main control board and the current main control board and the resulting whole-frame reset. By acting on the island state to restore each board's communication link, the method avoids a whole-frame reset caused by an abnormal active-standby communication link and keeps current services running normally.
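The waiting window of steps B31 and B32 can be pictured as a bounded polling loop. The sketch below is an assumption-laden illustration: `wait_in_standby`, its parameters, and the returned strings are hypothetical, and the application does not say whether the board polls or is notified when a heartbeat arrives.

```python
import time


def wait_in_standby(heartbeat_arrived, window_s, clock=time.monotonic,
                    sleep=time.sleep, poll_s=1.0):
    """Wait in the non-working standby state for at most window_s seconds.

    heartbeat_arrived: callable returning True once the partner's first
    heartbeat message has been received (hypothetical hook).
    """
    deadline = clock() + window_s
    while clock() < deadline:
        if heartbeat_arrived():
            return "exit_standby"   # step B31: self-recovered, no reset needed
        sleep(poll_s)
    return "reset_board"            # step B32: window expired without a heartbeat
```

Passing `clock` and `sleep` as parameters keeps the loop testable with a simulated clock; on real hardware the defaults would be used.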
In another embodiment of the communication link anomaly processing method for a frame-based device of the present application, referring to FIG. 5, the method is applied to a service board and includes the following steps:
Step C1: after the service board fails to receive the second heartbeat message from the current main control board within a preset fifth duration, perform island detection on the service board to determine whether the service board is an island board.
If the service board does not receive the second heartbeat message from the current main control board within the preset fifth duration, the communication link between the current main control board and the service board is abnormal: either the current main control board is faulty, or the service board is faulty, or both are. In this case, island detection is performed on the service board to determine whether it is an island board, that is, whether it is in the island state and therefore whether the fault lies with the service board.
Step C2: after the service board is determined to be an island board, reset and restart the current board, where the current board is the service board.
If the service board is determined to be an island board, the abnormal communication link between the current main control board and the service board is caused by a fault of the service board. The current main control board and the service boards are in a one-to-many relationship, and each service board answers to and communicates with only one current main control board. Therefore, when a service board is in the island state it can be reset and restarted directly; there is no need to put it into a non-working standby state and have it listen for the second heartbeat message from the current main control board.
Whether the current main control board is the active main control board or the standby main control board, it periodically sends the second heartbeat message to the service boards so that the state of the communication link between the current main control board and each service board can be checked.
In existing frame-based devices it is generally assumed that the communication link between the current main control board and the service boards does not fail. Methods for detecting such faults are rare, and methods that detect them from the service board's side are rarer still. Consequently, in such devices the current main control board generally does not send a second heartbeat message to the service boards for detecting and handling faults on this link.
In this embodiment, by contrast, island detection is introduced on the service board to determine whether the board is an island board. When the current board is a service board in the island state, it can be directly determined to be abnormal, and normal communication between the current main control board and the service board is restored by resetting and restarting the service board. This avoids resetting and restarting the current main control board, and hence the whole frame, because of an abnormal link between the current main control board and the service board. By acting on the island state to restore each board's communication link, the method avoids a whole-frame reset caused by an abnormal link between the current main control board and a service board and keeps current services running normally.
In another embodiment of the communication link anomaly processing method for a frame-based device of the present application, after the step of resetting and restarting the current board, the method further includes:
performing island detection on the current board and, after the reset and restarted current board is determined to be an island board, reporting an alarm.
After the current board has been reset and restarted as in the embodiments above, island detection is performed on it again to determine whether the communication link associated with it has become abnormal again. If the current board is still an island board after the reset and restart, its communication link is abnormal again. When the same board fails repeatedly after being restored, an alarm must be reported. This avoids wasting resources on repeated recovery attempts that still cannot resolve the board's communication fault and prompts the responsible administrators to deal with the problem promptly.
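The reset-then-recheck behaviour can be sketched in a few lines. The function name and its three callable parameters are hypothetical hooks introduced for illustration; the application does not specify how the reset, the island check, or the alarm are exposed.

```python
def reset_then_recheck(reset_board, is_island, report_alarm):
    """Reset the board, rerun island detection, and alarm if it is still isolated.

    Returns True if an alarm was reported, False otherwise.
    """
    reset_board()
    if is_island():
        # The reset did not cure the fault: escalate instead of resetting again.
        report_alarm("board is still an island board after reset")
        return True
    return False
```

Reporting an alarm instead of looping on further resets is what prevents the repeated, fruitless recovery attempts mentioned above.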
In another embodiment of the communication link anomaly processing method for a frame-based device of the present application, the step of performing island detection includes:
Step D1: when the current board is a main control board, determine whether the partner board of the main control board is running, and determine whether the main control board has an online service board.
Step D2: when the partner board of the main control board is running or the main control board has an online service board, or when the current board is not a main control board, determine whether Transparent Inter-Process Communication (TIPC) exists between the current board and other boards within a preset sixth duration.
If the current board is a main control board, it is determined whether the partner board of the main control board is running and whether the main control board has an online service board. If the partner board is running, or the main control board has an online service board, or both, it is then determined whether Transparent Inter-Process Communication (TIPC) exists between the current board and other boards within the preset sixth duration.
Checking first whether the partner board of the main control board is running and whether the main control board has an online service board establishes in advance whether any partner board or service board has a communication link with the main control board. If no such board exists, the main control board can be determined to be an island board right away, without checking whether TIPC communication exists between the current board and other boards within the preset sixth duration. In other words, the TIPC check within the preset sixth duration is needed only when the main control board is known to have a communication link.
In addition, if the current board is not a main control board but a service board, service boards communicate with one another directly through the backplane, so the same check of whether TIPC communication exists between the current board and other boards within the preset sixth duration can be applied directly to determine whether the service board is an island board.
Step D3: when no TIPC communication exists between the current board and other boards within the preset sixth duration, determine that the current board is an island board.
If no TIPC communication exists between the current board and other boards within the preset sixth duration, the current board is determined to be an island board, that is, it is in a state in which it cannot receive any message sent by other boards.
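Steps D1 to D3 can be condensed into a single decision function. The sketch below is an illustration under assumptions: the names `IslandResult` and `detect_island` and the boolean inputs are hypothetical, and the case in which the precondition of step D2 fails is returned as its own outcome because steps D1 to D3 themselves do not assign it a result.

```python
from enum import Enum


class IslandResult(Enum):
    ISLAND = "island board"
    NOT_ISLAND = "not an island board"
    NO_PEERS = "no running partner board and no online service board"


def detect_island(is_main_control, partner_running, any_service_online,
                  tipc_seen_in_window):
    """Apply steps D1-D3 to one board.

    tipc_seen_in_window: True if any TIPC communication with another board
    was observed within the preset sixth duration.
    """
    # Step D1 / D2 precondition, only relevant for a main control board.
    if is_main_control and not (partner_running or any_service_online):
        return IslandResult.NO_PEERS
    # Step D2 / D3: the TIPC check decides the result.
    if tipc_seen_in_window:
        return IslandResult.NOT_ISLAND
    return IslandResult.ISLAND
```

For a service board the first branch is skipped entirely, so the TIPC check alone decides the outcome.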
For example, the step of determining whether the partner board of the main control board is running and whether the main control board has an online service board includes:
Step D11: obtain the run signal of the partner board through the backplane, and when the run signal is at a high level, determine that the partner board is running.
If the current board is a main control board, it checks whether the run signal of the partner board is at a high level. After power-on, the ADM management process sets the run signal register on the EPLD (Erasable Programmable Logic Device) of its board. The run signal then goes high, and the high-level signal is carried through the backplane to the partner board, which uses it to judge whether this main control board is running.
A high run signal level indicates that the partner board is present and running, and the remaining conditions are then evaluated, such as whether TIPC communication exists. A low run signal level indicates that the partner board is not present and running, and no further conditions need to be evaluated. In one embodiment, if a board is found not to be in the island state, the cause of the communication link fault is unrelated to that board, and no action is taken on it. Therefore, when the run signal level is low, the current board can be directly judged not to be in the island state and need not be handled.
Step D12: read the state of the presence register of the service board, and determine from that state whether the service board is online.
This step determines whether the hardware of a peripheral board (line card) is online by reading the board presence register. In one embodiment, if bits 2:0 (bits 0, 1, and 2) are all 1, the peripheral board is considered online; otherwise it is considered offline. If all peripheral boards are offline, the current board is directly judged not to be in the island state, and a board found not to be in the island state is not handled. If any peripheral board is online, detection continues. If all TIPC links between this CPU node and the other CPU nodes stay down for 130 consecutive seconds, the board is set to the island state, an island-state log for the board is recorded and uploaded to the active main control board for storage, and an alarm that the board is in the island state is reported. Island-state determination and handling based on the presence register are similar to those described above for the run signal and are not repeated here.
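The presence check on bits 2:0 amounts to a mask comparison. The following sketch assumes the register value has already been read into an integer; the names `PRESENCE_MASK`, `service_board_online`, and `any_service_board_online` are hypothetical, and how the register is actually accessed is hardware-specific and not shown.

```python
PRESENCE_MASK = 0b111  # bits 0, 1 and 2 of the presence register


def service_board_online(reg_value):
    """A peripheral board counts as online only if bits 2:0 are all 1."""
    return (reg_value & PRESENCE_MASK) == PRESENCE_MASK


def any_service_board_online(reg_values):
    """True if at least one peripheral board's register reports it online."""
    return any(service_board_online(v) for v in reg_values)
```

Masking before comparing means that higher bits of the register, whatever they encode, do not affect the result.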
Referring to FIG. 6, an embodiment of the present application further provides a first apparatus applied to the active main control board, the first apparatus including:
a first island detection module M1, configured to perform island detection on the active main control board after the active main control board fails to receive the first heartbeat message from the standby main control board within the preset first duration, to determine whether the active main control board is an island board;
a first state switching module M2, configured to set the active main control board to the non-working standby state and set the standby main control board as the current main control board after the active main control board is determined to be an island board;
a state switch-back module M3, configured to cause the active main control board to exit the non-working standby state and to set the active main control board as the current main control board after the active main control board in the non-working standby state receives the first heartbeat message from the standby main control board within the preset second duration.
示例性的,所述第一装置还包括第一复位重启模块,用于:Exemplarily, the first device further includes a first resetting and restarting module, configured to:
在预设第二时长内处于所述非工作待命状态的所述主用主控板未收到所述备用主控板的所述第一心跳报文之后,复位重启当前单板,其中,所述当前单板为所述主用主控板。After the active main control board in the non-working standby state does not receive the first heartbeat message from the standby main control board within a preset second time period, the current board is reset and restarted, wherein the current board is the active main control board.
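The behavior of modules M1 to M3 and the first reset and restart module can be read as a small state machine. The following Python sketch is one possible reading; the class, state, and method names are hypothetical, and the timers that would trigger each callback are assumed to exist elsewhere.

```python
from enum import Enum, auto


class State(Enum):
    CURRENT_MAIN = auto()   # acting as the current main control board
    IDLE_STANDBY = auto()   # the non-working standby state
    RESETTING = auto()      # board is being reset and restarted


class ActiveBoard:
    """Hypothetical model of the active main control board's transitions."""

    def __init__(self) -> None:
        self.state = State.CURRENT_MAIN

    def on_first_duration_expired(self, is_island: bool) -> None:
        """M1 + M2: no heartbeat within the first duration, island check done."""
        if self.state is State.CURRENT_MAIN and is_island:
            self.state = State.IDLE_STANDBY   # the standby board takes over

    def on_heartbeat(self) -> None:
        """M3: heartbeat from the standby board arrived within the second duration."""
        if self.state is State.IDLE_STANDBY:
            self.state = State.CURRENT_MAIN   # switch back

    def on_second_duration_expired(self) -> None:
        """Reset module: the second duration elapsed without a heartbeat."""
        if self.state is State.IDLE_STANDBY:
            self.state = State.RESETTING
```

Only the isolated board changes state here, which matches the stated goal of avoiding a full-frame reset.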
Referring to FIG. 7, an embodiment of the present application further provides a second device applied to a standby main control board, the second device comprising:
a second island detection module N1, configured to perform island detection on the standby main control board after the standby main control board fails to receive a first heartbeat message from the active main control board within a preset third duration, so as to determine whether the standby main control board is an island board;
a second state switching module N2, configured to set the standby main control board to a non-working standby state after the standby main control board is determined to be an island board;
a state recovery module N3, configured to take the standby main control board out of the non-working standby state after the standby main control board in the non-working standby state receives the first heartbeat message from the active main control board within a preset fourth duration.
Exemplarily, the second device further comprises a second reset and restart module, configured to:
reset and restart the current board after the standby main control board in the non-working standby state fails to receive the first heartbeat message from the active main control board within the preset fourth duration, wherein the current board is the standby main control board.
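For the standby main control board, the outcome depends only on the island check and on whether a heartbeat arrives within the fourth duration. A minimal Python sketch, assuming a hypothetical function name and outcome strings, and assuming the delay is measured from the moment the board enters the non-working standby state:

```python
from typing import Optional


def standby_outcome(is_island: bool,
                    heartbeat_delay_s: Optional[float],
                    fourth_duration_s: float) -> str:
    """Outcome once the heartbeat has been missing for the third duration.

    heartbeat_delay_s is the time after entering the non-working standby
    state at which the next heartbeat arrives, or None if none arrives.
    """
    if not is_island:
        return "remain standby"                 # N1 found no island: nothing to do
    if heartbeat_delay_s is not None and heartbeat_delay_s <= fourth_duration_s:
        return "exit non-working standby"       # N3: link recovered in time
    return "reset and restart"                  # second reset and restart module


# Example with an assumed 60 s fourth duration (the source gives no value).
print(standby_outcome(True, 12.5, 60.0))
```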
Referring to FIG. 8, an embodiment of the present application further provides a third device applied to a service board, the third device comprising:
a third island detection module P1, configured to perform island detection on the service board after the service board fails to receive a second heartbeat message from the current main control board within a preset fifth duration, so as to determine whether the service board is an island board;
a reset and restart module P2, configured to reset and restart the current board after the service board is determined to be an island board, wherein the current board is the service board.
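On a service board, island detection runs only after the second heartbeat message has been missing for the fifth duration. The Python sketch below illustrates that gating with a simple watchdog; the class and function names and the timeout value used in the example are assumptions, not taken from the source.

```python
from typing import Callable


class HeartbeatWatchdog:
    """Remembers when the last second heartbeat message arrived (hypothetical helper)."""

    def __init__(self, timeout_s: float, now: float) -> None:
        self.timeout_s = timeout_s   # stands in for the preset fifth duration
        self.last_seen = now

    def feed(self, now: float) -> None:
        """Call whenever a heartbeat from the current main control board arrives."""
        self.last_seen = now

    def expired(self, now: float) -> bool:
        return now - self.last_seen >= self.timeout_s


def service_board_step(wd: HeartbeatWatchdog, now: float,
                       detect_island: Callable[[], bool]) -> str:
    """P1 + P2: run island detection only after the heartbeat has timed out."""
    if not wd.expired(now):
        return "run"
    return "reset" if detect_island() else "run"


wd = HeartbeatWatchdog(timeout_s=10.0, now=0.0)   # 10 s is an assumed value
print(service_board_step(wd, 5.0, lambda: True))  # heartbeat not yet overdue
```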
Exemplarily, the first device, the second device, and the third device each further comprise a repeated island detection module, configured to:
after the step of resetting and restarting the current board,
perform island detection on the current board, and report an alarm if the current board is still an island board after the reset and restart.
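The repeated check after a reset can be sketched in a few lines of Python. The function name, the callback parameters, and the alarm text are hypothetical; the source only states that island detection is run again and an alarm is reported if the board is still an island.

```python
from typing import Callable


def recheck_after_reset(detect_island: Callable[[], bool],
                        report_alarm: Callable[[str], None]) -> bool:
    """Run island detection once more after a reset and raise an alarm if it still fails."""
    still_island = detect_island()
    if still_island:
        report_alarm("board is still an island after reset and restart")
    return still_island
```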
Exemplarily, the first device, the second device, and the third device each further comprise an island detection implementation module, configured to:
when the current board is a main control board, determine whether the partner board of the main control board is running, and determine whether the main control board has any online service board;
when the partner board of the main control board is running or the main control board has an online service board, or when the current board is not a main control board, determine whether any transparent inter-process communication (TIPC) exists between the current board and other boards within a preset sixth duration;
when no transparent inter-process communication exists between the current board and other boards within the preset sixth duration, determine that the current board is an island board.
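The decision flow of the island detection implementation module can be condensed into one Python function. This is a sketch under assumptions: the function and parameter names are hypothetical, and the "not an island" result when a main control board has neither a running partner nor an online service board follows the presence-register embodiment in the description rather than an explicit statement in this paragraph.

```python
def detect_island(is_main_control: bool,
                  partner_running: bool,
                  any_service_board_online: bool,
                  tipc_seen_in_window: bool) -> bool:
    """Return True if the current board should be treated as an island board.

    tipc_seen_in_window: whether any TIPC communication with another board
    was observed during the preset sixth duration.
    """
    if is_main_control and not (partner_running or any_service_board_online):
        # No other board is present to talk to, so silence is expected.
        return False
    # A service board, or a main control board with live peers: total TIPC
    # silence for the whole window means the board is isolated.
    return not tipc_seen_in_window
```

Checking for live peers first prevents a lone main control board in an otherwise empty frame from being treated as an island.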
Exemplarily, the island detection implementation module is further configured to:
obtain a run signal of the partner board through the backplane, and determine that the partner board is running when the run signal is at a high level;
read the presence register state of the service board, and determine, based on the presence register state, whether the service board is online.
The first device, the second device, and the third device provided in the present application adopt the communication link anomaly processing method for a frame-based device of the above embodiments, and solve the technical problem in the prior art that it is difficult to prevent a communication link anomaly from causing a dual-master conflict and, in turn, a full-frame reset. Compared with the prior art, the beneficial effects of the first device, the second device, and the third device provided in the embodiments of the present application are the same as those of the method provided in the above embodiments, and the other technical features of the first device, the second device, and the third device are the same as the features disclosed in the method of the above embodiments; they are not repeated here.
In addition, an embodiment of the present application further provides a frame-based device, the frame-based device comprising an active main control board, a standby main control board, a service board, a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program being configured to implement the steps of the communication link anomaly processing method for a frame-based device described above.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the communication link anomaly processing method for a frame-based device described above.
As used herein, the terms "comprise", "include", and any other variants thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or system that comprises the element.
From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments may be implemented by software together with a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application.
The above are only preferred embodiments of the present application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present application.

Claims (10)

  1. A communication link anomaly processing method for a frame-based device, wherein the method is applied to an active main control board and comprises the following steps:
    after the active main control board fails to receive a first heartbeat message from a standby main control board within a preset first duration, performing island detection on the active main control board to determine whether the active main control board is an island board;
    after determining that the active main control board is an island board, setting the active main control board to a non-working standby state and setting the standby main control board as the current main control board;
    after the active main control board in the non-working standby state receives the first heartbeat message from the standby main control board within a preset second duration, taking the active main control board out of the non-working standby state and setting the active main control board as the current main control board.
  2. The communication link anomaly processing method for a frame-based device according to claim 1, wherein after the step of setting the active main control board to a non-working standby state and setting the standby main control board as the current main control board, the method further comprises:
    after the active main control board in the non-working standby state fails to receive the first heartbeat message from the standby main control board within the preset second duration, resetting and restarting the current board, wherein the current board is the active main control board.
  3. A communication link anomaly processing method for a frame-based device, wherein the method is applied to a standby main control board and comprises the following steps:
    after the standby main control board fails to receive a first heartbeat message from an active main control board within a preset third duration, performing island detection on the standby main control board to determine whether the standby main control board is an island board;
    after the standby main control board is determined to be an island board, setting the standby main control board to a non-working standby state;
    after the standby main control board in the non-working standby state receives the first heartbeat message from the active main control board within a preset fourth duration, taking the standby main control board out of the non-working standby state.
  4. The communication link anomaly processing method for a frame-based device according to claim 3, wherein after the step of setting the standby main control board to a non-working standby state, the method further comprises:
    after the standby main control board in the non-working standby state fails to receive the first heartbeat message from the active main control board within the preset fourth duration, resetting and restarting the current board, wherein the current board is the standby main control board.
  5. A communication link anomaly processing method for a frame-based device, wherein the method is applied to a service board and comprises the following steps:
    after the service board fails to receive a second heartbeat message from the current main control board within a preset fifth duration, performing island detection on the service board to determine whether the service board is an island board;
    after the service board is determined to be an island board, resetting and restarting the current board, wherein the current board is the service board.
  6. The communication link anomaly processing method for a frame-based device according to claim 2, 4, or 5, wherein after the step of resetting and restarting the current board, the method further comprises:
    performing island detection on the current board, and reporting an alarm if the current board is still an island board after the reset and restart.
  7. The communication link anomaly processing method for a frame-based device according to any one of claims 1 to 5, wherein the step of performing island detection comprises:
    when the current board is a main control board, determining whether the partner board of the main control board is running, and determining whether the main control board has any online service board;
    when the partner board of the main control board is running or the main control board has an online service board, or when the current board is not a main control board, determining whether any transparent inter-process communication exists between the current board and other boards within a preset sixth duration;
    when no transparent inter-process communication exists between the current board and other boards within the preset sixth duration, determining that the current board is an island board.
  8. The communication link anomaly processing method for a frame-based device according to claim 7, wherein the step of determining whether the partner board of the main control board is running and determining whether the main control board has any online service board comprises:
    obtaining a run signal of the partner board through the backplane, and determining that the partner board is running when the run signal is at a high level;
    reading the presence register state of the service board, and determining, based on the presence register state, whether the service board is online.
  9. A frame-based device, wherein the frame-based device comprises an active main control board, a standby main control board, a service board, a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program being configured to implement the steps of the communication link anomaly processing method for a frame-based device according to any one of claims 1 to 8.
  10. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the communication link anomaly processing method for a frame-based device according to any one of claims 1 to 8.
PCT/CN2023/101886 2022-12-06 2023-06-21 Communication link anomaly processing method for frame-based device, frame-based device, and medium WO2024119777A1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211559772.X 2022-12-06

Publications (1)

Publication Number Publication Date
WO2024119777A1 (en) 2024-06-13
