CN109918230B - Method and system for recovering abnormity of service board card

Info

Publication number: CN109918230B
Application number: CN201910124873.6A
Authority: CN
Inventors: 项东阳
Original assignee: Hangzhou DPTech Technologies Co Ltd
Current assignee: Hangzhou DPTech Technologies Co Ltd
Priority date: 2019-02-20
Filing date: 2019-02-20
Publication date: 2021-01-26
Anticipated expiration: 2039-02-20
Also published as: CN109918230A

Abstract

The application provides a method and a system for recovering from service board exceptions. The method comprises the following steps: a CPU on a main control board sends an access request message to an FPGA; the FPGA receives the access request message, parses the PCIe bus address of the service board to be accessed that is carried in the message, forwards the message to that service board according to the address, and judges whether response data returned by the service board is received; if not, the FPGA determines that access to the service board has failed, reports an exception interrupt to the CPU on the main control board, and stores the PCIe bus address of the service board in a cache; after receiving the exception interrupt reported by the FPGA, the CPU on the main control board reads the PCIe bus address from the cache, identifies the abnormal service board, and sends a reset or restart instruction to the FPGA.

Description

Method and system for recovering abnormity of service board card

Technical Field

The present application relates to the field of communications technologies, and in particular to a method and a system for recovering from service board exceptions.

Background

In a centralized control frame-type device, the CPU on the main control board manages and controls all service boards, and the service boards are connected to the main control board by a PCIe (Peripheral Component Interconnect Express) bus. PCIe is widely used because of its high transmission speed and because each service board gets dedicated rather than shared bandwidth. However, as the functions of such devices grow, the CPU on the main control board must manage more and more service boards. Because of link and hardware faults, service boards occasionally become abnormal, and if such an exception is not handled in time, the whole centralized control frame-type device may go down.

The existing solution handles service board exceptions through a combination of software and hardware: the CPU on the main control board accesses a service board over the PCIe bus; when a message sent by the CPU goes unanswered by the service board for a long time, the CPU judges the service board to be abnormal and sets a dedicated register inside the CPU; an application program subsequently detects that the register is set, identifies the abnormal service board, and resets it. This prevents an exception on a single service board from bringing down the entire centralized control frame-type device.

However, the existing solution relies mainly on instruction detection and registers inside the CPU, so the corresponding functions must be integrated into the CPU, which greatly increases hardware cost.

Disclosure of Invention

In view of this, the present application provides a method and a system for recovering an exception of a service board.

Specifically, the method is realized through the following technical scheme:

A method for recovering from service board exceptions, applied to a centralized control frame-type device, wherein the centralized control frame-type device comprises a main control board, an FPGA and at least one service board, the FPGA is connected to the main control board and to the at least one service board, and the method comprises the following steps:

in the preparation stage: a Central Processing Unit (CPU) on the main control board allocates a corresponding PCIe bus address space to each service board according to PCIe bus address space configuration information, and stores the address range of the PCIe bus address space allocated to each service board according to a board distribution topology;

in the processing stage: the CPU on the main control board sends an access request message to the FPGA;

the FPGA receives the access request message and parses the PCIe bus address of the service board to be accessed, which is carried in the access request message;

the FPGA forwards the access request message to the service board to be accessed according to the PCIe bus address of that service board;

the FPGA judges whether response data returned by the service board to be accessed is received;

if not, the FPGA determines that access to the service board has failed, reports an exception interrupt to the CPU on the main control board, and stores the PCIe bus address of the service board in a cache;

after receiving the exception interrupt reported by the FPGA, the CPU on the main control board reads the PCIe bus address of the service board from the cache, matches it against the address ranges of the PCIe bus address spaces pre-allocated to the service boards, and identifies the abnormal service board;

and the CPU on the main control board sends a reset or restart instruction to the FPGA, so that the FPGA resets or restarts the abnormal service board according to the instruction.

A service board exception recovery system, applied to a centralized control frame-type device, wherein the centralized control frame-type device comprises a main control board, an FPGA and at least one service board, the FPGA is connected to the main control board and to the at least one service board, and the system comprises:

With the technical solution provided by the application, neither instruction detection nor registers inside the CPU are required, the corresponding functions need not be integrated into the CPU, and hardware cost is greatly reduced.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required in the description of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art according to the drawings.

FIG. 1 is a diagram illustrating hardware connections in accordance with an exemplary embodiment of the present application;

FIG. 2 is another hardware connection diagram shown in an exemplary embodiment of the present application;

FIG. 3 is an interaction flow diagram of a method for recovering from service board exceptions according to an exemplary embodiment of the present application;

FIG. 4 is a board distribution topology diagram according to an exemplary embodiment of the present application.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with aspects of the present application.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination", depending on the context.

First, a method for recovering from service board exceptions provided by an embodiment of the present application is described. The method is applied to a centralized control frame-type device that includes a main control board, an FPGA, and at least one service board, with the FPGA connected to the main control board and to the at least one service board. The method mainly includes:

a preparation stage: a Central Processing Unit (CPU) on the main control board allocates a corresponding PCIe bus address space to each service board according to PCIe bus address space configuration information, and stores the address range of the PCIe bus address space allocated to each service board according to a board distribution topology;

In the background art, as shown in the exemplary hardware connection diagram of FIG. 1, the CPU on the main control board accesses a service board over the PCIe bus. When a message sent by the CPU receives no response from the service board for a long time, the CPU judges the service board to be abnormal and sets a dedicated register inside the CPU; an application program subsequently detects that the register is set, identifies the abnormal service board, and resets it. Although this prevents an exception on a single service board from bringing down the entire centralized control frame-type device, it relies mainly on instruction detection and registers inside the CPU, so the corresponding functions must be integrated into the CPU, which greatly increases hardware cost.

To this end, FIG. 2 shows an exemplary hardware connection diagram in which an FPGA is connected to the CPU on the main control board and to the service boards. When the FPGA determines that access to a service board has failed, it stores the PCIe bus address of that service board in the cache (the memory shown in FIG. 2) and reports an exception interrupt to the CPU on the main control board. After receiving the interrupt, the CPU reads the PCIe bus address from the cache, matches it against the address ranges of the PCIe bus address spaces pre-allocated to the service boards, and identifies the abnormal service board. The CPU then sends a reset or restart instruction to the FPGA, and the FPGA resets or restarts the abnormal service board according to the instruction. The technical solution provided by the application therefore prevents an exception on a single service board from bringing down the entire centralized control frame-type device, without requiring instruction detection or registers inside the CPU and without integrating the corresponding functions into the CPU, which greatly reduces hardware cost. The application is further explained by the following embodiment:

FIG. 3 shows an interaction flow diagram of a method for recovering from service board exceptions according to an embodiment of the present application. The method includes the following steps:

In the preparation stage:

S301: The CPU on the main control board allocates a corresponding PCIe bus address space to each service board according to PCIe bus address space configuration information, and stores the address range of the PCIe bus address space allocated to each service board according to a board distribution topology;

In the application, the CPU on the main control board pre-allocates a PCIe bus address space to each service board, stores the address range allocated to each service board according to the board distribution topology, and stores the correspondence between service boards and PCIe bus address spaces on both the main control board and the FPGA. The board distribution topology resembles a device tree, as shown in FIG. 4; that is, the address ranges of the PCIe bus address spaces allocated to the service boards are stored in a tree structure.

The CPU on the main control board pre-allocates the PCIe bus address space for each service board according to PCIe bus address space configuration information, which is derived from the PCIe bus address space that each service board requires. For example, according to the configuration information, the CPU allocates 6M of PCIe bus address space to service board 1, 5M to service board 2, 10M to service board 3, and so on, and stores the address range allocated to each service board according to the board distribution topology.
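The allocation in the example above can be sketched as follows. This is a minimal illustration only, not the implementation of the application: the base address 0x90000000, the reading of "M" as mebibytes, and the name allocate_board_ranges are all assumptions, and real PCIe windows have alignment constraints that the sketch ignores.

```python
MIB = 1024 * 1024  # assumption: "M" in the text means mebibytes


def allocate_board_ranges(config, base=0x9000_0000):
    """Give each board a contiguous [start, end) window, in configuration order.

    `config` maps a board name to the number of bytes it requires.
    `base` is a hypothetical start of the PCIe memory window.
    """
    ranges = {}
    cursor = base
    for board, size in config.items():
        ranges[board] = (cursor, cursor + size)
        cursor += size
    return ranges


# The sizes from the example: 6M, 5M and 10M for service boards 1 to 3.
config = {"board1": 6 * MIB, "board2": 5 * MIB, "board3": 10 * MIB}
board_ranges = allocate_board_ranges(config)
```

The resulting dictionary plays the role of the stored address ranges that the CPU later matches a faulting address against.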

Preferably, after allocating a PCIe bus address space to each service board, the CPU on the main control board may recursively scan the PCIe devices on each service board of the centralized control frame-type device, allocate to each scanned PCIe device, on a depth-first basis, a PCIe bus address space taken from the space allocated to its service board, and store the address information of the allocated space. For example, if 6M of PCIe bus address space is allocated to service board 1, the CPU may recursively scan the PCIe devices on the service boards and, on a depth-first basis, allocate 2M of the 6M allocated to service board 1 to the scanned PCIe device 1, storing the address information of the allocated space.
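A recursive, depth-first assignment of device ranges inside one board's window might look like the sketch below. The Node type, the tree shape, the device names other than the 2M device 1, and the base address are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

MIB = 1024 * 1024  # assumption: "M" in the text means mebibytes


@dataclass
class Node:
    name: str
    size: int = 0  # bytes requested by this device; 0 for a pure bridge/switch
    children: list = field(default_factory=list)


def assign_depth_first(node, cursor, limit, out):
    """Visit a node, then its children recursively, carving ranges from [cursor, limit)."""
    if node.size:
        if cursor + node.size > limit:
            raise ValueError(f"{node.name} does not fit in the board window")
        out[node.name] = (cursor, cursor + node.size)
        cursor += node.size
    for child in node.children:
        cursor = assign_depth_first(child, cursor, limit, out)
    return cursor


# Hypothetical topology for service board 1, which owns a 6M window.
board1 = Node("board1", children=[
    Node("switch", children=[Node("dev1", 2 * MIB), Node("dev2", 1 * MIB)]),
    Node("dev3", 1 * MIB),
])
start = 0x9000_0000
device_ranges = {}
next_free = assign_depth_first(board1, start, start + 6 * MIB, device_ranges)
```

Because the traversal descends into the switch before moving on to its sibling, dev1 and dev2 receive addresses ahead of dev3.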

In addition, the CPU on the main control board reserves a region of a certain size in the cache in advance, for example 4 bytes of memory, dedicated to the write-back of the PCIe bus address of the service board to be accessed, and stores the start address of that region in an FPGA register, for example the dedicated FPGA register SMB Cache Addr, so that the CPU on the main control board can read the written-back address.
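The write-back slot can be modelled in a few lines. In this sketch the byte array stands in for CPU-visible memory, the slot offset 0x10 and the little-endian layout are assumptions, and the dictionary key SMB_CACHE_ADDR merely mirrors the register name given in the text.

```python
import struct

memory = bytearray(64)   # stand-in for memory visible to both CPU and FPGA
CACHE_SLOT_ADDR = 0x10   # hypothetical offset of the reserved 4-byte slot

# The CPU records where the slot starts in an FPGA register.
fpga_registers = {"SMB_CACHE_ADDR": CACHE_SLOT_ADDR}


def fpga_write_back(fault_addr):
    """FPGA side: store the failing 32-bit PCIe bus address in the slot."""
    slot = fpga_registers["SMB_CACHE_ADDR"]
    struct.pack_into("<I", memory, slot, fault_addr)


def cpu_read_fault_addr():
    """CPU side: read the written-back address from the same slot."""
    slot = fpga_registers["SMB_CACHE_ADDR"]
    return struct.unpack_from("<I", memory, slot)[0]


fpga_write_back(0x9012_3456)
```

A 4-byte slot holds exactly one 32-bit address, which matches the example size in the text.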

In the processing stage:

S302: The CPU on the main control board sends an access request message to the FPGA;

When the CPU on the main control board accesses a service board or a PCIe device on a service board, it sends an access request message to the FPGA, so that the FPGA forwards the message to the corresponding service board or PCIe device.

S303: The FPGA receives the access request message and parses the PCIe bus address of the service board to be accessed that is carried in the message;

S304: The FPGA forwards the access request message to the service board to be accessed according to the PCIe bus address of that service board;

The FPGA receives the access request message issued by the CPU on the main control board and parses the PCIe bus address of the service board to be accessed that is carried in the message. For example, if the CPU needs to access service board 1, the FPGA parses out the PCIe bus address corresponding to service board 1. Using that address, the FPGA queries its locally stored correspondence between service boards and PCIe bus address spaces and forwards the access request message to the corresponding service board, for example to service board 1.
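Steps S303 and S304 amount to an address-to-board lookup, which the following sketch models in software. The AccessRequest fields, the slot numbers, and the address windows (6M, 5M and 10M starting at a hypothetical base of 0x90000000) are assumptions made for illustration; an actual FPGA would implement this in logic rather than in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    addr: int          # PCIe bus address carried in the request
    is_write: bool
    data: int = 0


# Hypothetical copy of the board/address correspondence stored on the FPGA.
FPGA_BOARD_RANGES = {
    1: (0x9000_0000, 0x9060_0000),  # service board 1, 6M
    2: (0x9060_0000, 0x90B0_0000),  # service board 2, 5M
    3: (0x90B0_0000, 0x9150_0000),  # service board 3, 10M
}


def route(request):
    """Return the board whose window contains the request address, else None."""
    for board, (start, end) in FPGA_BOARD_RANGES.items():
        if start <= request.addr < end:
            return board
    return None
```

For instance, `route(AccessRequest(0x9000_0010, is_write=False))` selects service board 1, while an address outside every window yields `None`.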

S305: The FPGA judges whether response data returned by the service board to be accessed is received;

A timer is set in the FPGA. After forwarding the access request message to the service board to be accessed, the FPGA judges whether response data returned by the service board is received within a preset time period. If not, the FPGA determines that access to the service board has failed; otherwise, it receives the response data returned by the service board in real time.
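The timer check can be illustrated with a tick-based model. The sketch below assumes the timer is counted in polling ticks rather than wall-clock time, and the helper make_board and the value 0xCAFE are hypothetical stand-ins for a service board and its response data.

```python
def wait_for_response(poll, timeout_ticks):
    """Poll once per tick; return the response, or None when the timer expires."""
    for _ in range(timeout_ticks):
        response = poll()
        if response is not None:
            return response
    return None


def make_board(ready_after):
    """Hypothetical board that answers only from tick `ready_after` onward."""
    ticks = iter(range(10_000))

    def poll():
        return 0xCAFE if next(ticks) >= ready_after else None

    return poll


healthy = wait_for_response(make_board(ready_after=3), timeout_ticks=8)
hung = wait_for_response(make_board(ready_after=10_000), timeout_ticks=8)
```

Here `healthy` holds the response data, while `hung` is `None`, which corresponds to the access-failure branch.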

S306: If not, the FPGA determines that access to the service board to be accessed has failed, reports an exception interrupt to the CPU on the main control board, and stores the PCIe bus address of the service board in the cache;

If no response data returned by the service board to be accessed is received, access to that service board is determined to have failed. Preferably, if no response data is received within the preset time period, the FPGA continues to send the access request message to the service board up to a preset number of times; if the FPGA still receives no response data after sending the message the preset number of times, it determines that access to the service board has failed.

For example, if no response data is received within the preset time period, the FPGA continues to send the access request message to the service board up to the preset number of times (5 times), judging after each transmission whether response data has been received. If no response data is received throughout these transmissions, the FPGA determines that access to the service board has failed.
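The retransmission rule can be sketched as a bounded loop. The text does not say whether the first transmission counts toward the preset number; this sketch assumes it does, and the helper make_sender is a hypothetical stand-in for one send-and-wait cycle.

```python
MAX_SENDS = 5  # the example count given in the text


def access_with_retries(send, max_sends=MAX_SENDS):
    """Call send() up to max_sends times.

    send() issues one request and returns response data, or None on timeout.
    Returns (response, sends_used); response is None if every send timed out.
    """
    for attempt in range(1, max_sends + 1):
        response = send()
        if response is not None:
            return response, attempt
    return None, max_sends


def make_sender(respond_on):
    """Hypothetical board that answers only on the given send (never if None)."""
    calls = {"n": 0}

    def send():
        calls["n"] += 1
        return 0xBEEF if calls["n"] == respond_on else None

    return send


flaky = access_with_retries(make_sender(respond_on=3))
dead = access_with_retries(make_sender(respond_on=None))
```

A board that answers on the third send is treated as healthy, while a board that never answers is declared failed after the fifth.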

Once access to the service board to be accessed is determined to have failed, the FPGA stops sending access request messages to that service board, reports an exception interrupt to the CPU on the main control board, and stores the PCIe bus address of the service board in the pre-reserved cache region.

Preferably, to prevent the CPU on the main control board from becoming abnormal because its access to the service board times out, the FPGA returns invalid response data that it constructs itself to the CPU when it determines that access to the service board has failed.
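The failure path of the FPGA can be summarized in a small model. The class FpgaModel and its attribute names are hypothetical, and the all-ones value used as invalid response data is an assumption; the text only states that the FPGA constructs the invalid data itself.

```python
INVALID_RESPONSE = 0xFFFF_FFFF  # assumption: all-ones marks invalid read data


class FpgaModel:
    """Software model of the failure path; not the actual FPGA logic."""

    def __init__(self, send):
        self.send = send                # forwards a request, None on timeout
        self.cache_addr = None          # models the pre-reserved cache slot
        self.interrupt_pending = False  # models the exception interrupt line

    def access(self, addr):
        response = self.send(addr)
        if response is not None:
            return response
        # Access failed: record the address, raise the interrupt, and still
        # complete the CPU's request so that the CPU itself does not time out.
        self.cache_addr = addr
        self.interrupt_pending = True
        return INVALID_RESPONSE


def dead_board(addr):
    return None  # hypothetical board that never responds


fpga = FpgaModel(dead_board)
result = fpga.access(0x9060_0010)
```

Returning a value in every case is the point of the design: the CPU's access always completes, and the failure is signalled out of band through the interrupt and the cached address.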

S307: After receiving the exception interrupt reported by the FPGA, the CPU on the main control board reads the PCIe bus address of the service board to be accessed from the cache, matches it against the address ranges of the PCIe bus address spaces pre-allocated to the service boards, and identifies the abnormal service board;

After receiving the exception interrupt reported by the FPGA, the CPU on the main control board reads the PCIe bus address of the service board to be accessed from the cache and matches it against the address ranges of the PCIe bus address spaces pre-allocated to the service boards. This identifies the abnormal service board and, further, the abnormal PCIe device on that service board.

Once the abnormal service board has been identified, the CPU on the main control board stops accessing it.
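The matching in S307 can be sketched as a two-level range lookup, first by board and then by device. The tables BOARDS and DEVICES, their addresses, and the use of binary search are illustrative assumptions; the application stores the ranges in a tree structure and does not prescribe a search method.

```python
from bisect import bisect_right


def locate(addr, ranges):
    """Find the window containing addr.

    `ranges` is a list of non-overlapping (start, end, name) tuples sorted by
    start, with end exclusive. Returns the name, or None if nothing matches.
    """
    starts = [start for start, _, _ in ranges]
    i = bisect_right(starts, addr) - 1
    if i >= 0 and addr < ranges[i][1]:
        return ranges[i][2]
    return None


# Hypothetical tables built during the preparation stage.
BOARDS = [
    (0x9000_0000, 0x9060_0000, "board1"),
    (0x9060_0000, 0x90B0_0000, "board2"),
    (0x90B0_0000, 0x9150_0000, "board3"),
]
DEVICES = {"board1": [(0x9000_0000, 0x9020_0000, "dev1")]}


def handle_exception_interrupt(cached_addr):
    """Resolve the address read from the cache to (board, device)."""
    board = locate(cached_addr, BOARDS)
    device = locate(cached_addr, DEVICES.get(board, [])) if board else None
    return board, device
```

An address that falls inside a board's window but outside every known device range still identifies the board, which is enough to decide what to reset.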

S308: The CPU on the main control board sends a reset or restart instruction to the FPGA, so that the FPGA resets or restarts the abnormal service board according to the instruction.

Once the CPU on the main control board has identified the abnormal service board, it resets or restarts that board. Specifically, the CPU sends a reset or restart instruction to the FPGA, and the FPGA resets or restarts the abnormal service board according to the instruction; the CPU then sends an initialization instruction to the FPGA, and the FPGA initializes the service board according to the initialization instruction.

As the above description of the technical solution shows, when the FPGA determines that access to a service board has failed, it stores the PCIe bus address of that service board in the cache and reports an exception interrupt to the CPU on the main control board. After receiving the interrupt, the CPU reads the PCIe bus address from the cache, matches it against the address ranges of the PCIe bus address spaces pre-allocated to the service boards, and identifies the abnormal service board. The CPU then sends a reset or restart instruction to the FPGA, so that the FPGA resets or restarts the abnormal service board according to the instruction. The technical solution provided by the application therefore prevents an exception on a single service board from bringing down the entire centralized control frame-type device, without requiring instruction detection or registers inside the CPU and without integrating the corresponding functions into the CPU, which greatly reduces hardware cost.

In addition, after the above steps, the CPU on the main control board can access the service board again. If the service board still has a problem after being reset or restarted, the CPU resets or restarts it again; if the problem still cannot be resolved, the CPU powers off the service board (specifically, it sends a power-off instruction to the FPGA, and the FPGA powers off the service board according to the instruction) and prompts maintenance personnel to handle it.
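The escalation described above (reset and initialize, access again, reset once more, then power off) can be sketched as follows. The function recover, the instruction names, and the limit of two resets are assumptions drawn from a plain reading of the text; the list `commands` merely records the instructions that the CPU would send to the FPGA.

```python
def recover(board, commands, probe, max_resets=2):
    """Reset and re-initialize up to max_resets times; power off if still failing.

    `commands` collects the instructions sent to the FPGA.
    `probe(board)` re-accesses the board and returns True if it responds.
    """
    for _ in range(max_resets):
        commands.append(("reset", board))
        commands.append(("init", board))
        if probe(board):
            return "recovered"
    commands.append(("power_off", board))
    return "powered_off"  # the caller should then alert maintenance personnel


def make_probe(healthy_on_call):
    """Hypothetical board that responds from the given probe onward (never if None)."""
    state = {"calls": 0}

    def probe(board):
        state["calls"] += 1
        return healthy_on_call is not None and state["calls"] >= healthy_on_call

    return probe


log_ok, log_bad = [], []
outcome_ok = recover("board1", log_ok, make_probe(healthy_on_call=2))
outcome_bad = recover("board1", log_bad, make_probe(healthy_on_call=None))
```

In the first run the board responds after the second reset and is kept in service; in the second run it never responds, so the final instruction is a power-off.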

Corresponding to the above method embodiment, the present application further provides an embodiment of a service board exception recovery system, applied to a centralized control frame-type device that includes a main control board, an FPGA, and at least one service board, with the FPGA connected to the main control board and to the at least one service board. The system includes:

The system implementation process is detailed in the implementation process of the corresponding steps in the method, and is not described herein again.

As the above description of the technical solution provided in the embodiment of the present application shows, when the FPGA determines that access to a service board has failed, it stores the PCIe bus address of that service board in the cache and reports an exception interrupt to the CPU on the main control board. After receiving the interrupt, the CPU reads the PCIe bus address from the cache, matches it against the address ranges of the PCIe bus address spaces pre-allocated to the service boards, and identifies the abnormal service board. The CPU then sends a reset or restart instruction to the FPGA, so that the FPGA resets or restarts the abnormal service board according to the instruction. The technical solution provided by the application therefore prevents an exception on a single service board from bringing down the entire centralized control frame-type device, without requiring instruction detection or registers inside the CPU and without integrating the corresponding functions into the CPU, which greatly reduces hardware cost.

For the system embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant points. The above-described system embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application. One of ordinary skill in the art can understand and implement it without inventive effort.

The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

The foregoing is directed to embodiments of the present invention, and it is understood that various modifications and improvements can be made by those skilled in the art without departing from the spirit of the invention.

Claims

1. A method for recovering an abnormal service board card is characterized by being applied to centralized control frame type equipment, wherein the centralized control frame type equipment comprises a main control board card, an FPGA and at least one service board card, the FPGA is respectively connected with the main control board card and the at least one service board card, and the method comprises the following steps:

2. The method of claim 1, wherein the determining, by the FPGA, whether response data returned by the service board to be accessed is received includes:

and the FPGA judges whether response data returned by the service board card to be accessed is received within a preset time period.

3. The method according to claim 2, wherein if not, the FPGA determining that the access to the service board to be accessed fails comprises:

if response data returned by the service board card to be accessed are not received within a preset time period, the FPGA continuously sends the access request message to the corresponding service board card according to the preset sending times of the access request message;

if the FPGA does not receive the response data of the service board card to be accessed in the period of continuously sending the access request message to the corresponding service board card according to the preset sending times of the access request message, determining that the service board card to be accessed fails to be accessed.

4. The method of claim 1, further comprising:

5. The method according to any one of claims 1 to 4, further comprising:

and under the condition that the FPGA determines that the access to the service board card to be accessed fails, the self-organized invalid response data is returned to the CPU on the main control board card, so that the CPU on the main control board card cannot be abnormal due to overtime access.

6. A service board abnormity recovery system is characterized by being applied to centralized control frame type equipment, wherein the centralized control frame type equipment comprises a main control board, an FPGA and at least one service board, the FPGA is respectively connected with the main control board and the at least one service board, and the system is used for realizing the following method:

7. The system according to claim 6, wherein the FPGA specifically determines whether response data returned by the service board card to be accessed is received by the following method:

8. The system according to claim 7, wherein the FPGA determines that the access to the service board card to be accessed fails by specifically:

9. The system of claim 6, wherein the system is further configured to implement the method of:

10. The system according to any of claims 6 to 9, characterized in that the system is further adapted to implement the method of: