CN117573403A

CN117573403A - Method, device, equipment and medium for processing exception of system-level chip

Info

Publication number: CN117573403A
Application number: CN202311536057.9A
Authority: CN
Inventors: 杨明伟; 高堂成; 何鹏飞; 陈静静
Original assignee: Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Current assignee: Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Priority date: 2023-11-17
Filing date: 2023-11-17
Publication date: 2024-02-20

Abstract

The invention discloses a method, an apparatus, a device and a medium for handling exceptions of a system-on-chip, and relates to the field of electronic technologies. During debugging and development of the system-on-chip, for first-type exceptions of high severity, the problem scene is saved in a core dump file, so that debugging through the Joint Test Action Group (JTAG) interface is not needed; a user can inspect the information needed to locate the problem from the contents of the core dump file, which improves problem-location efficiency and shortens the debugging and development cycle of the system-on-chip. For second-type exceptions of lower severity, the information of the second-type exception is sent to an exception handling subsystem configured with an exception recovery strategy corresponding to the second-type exception, so that the exception is handled by the exception handling subsystem; this reduces the impact of second-type exceptions on the services provided by the server and on the system, while ensuring the usability of the system-on-chip.

Description

Method, device, equipment and medium for processing exception of system-level chip

Technical Field

The present invention relates to the field of electronic technologies, and in particular, to a method, an apparatus, a device, and a medium for processing an exception of a system-on-chip.

Background

In a server, a plurality of external system-on-chip (SoC) devices are connected through Peripheral Component Interconnect Express (PCIe). These system-on-chips provide various high-performance services for the server, thereby greatly improving its performance. The development of a server's external system-on-chip comprises two main stages: a development and debugging stage and a production stage.

When debugging a system-on-chip that integrates a plurality of hardware engine cores, engine exceptions inside the chip are considerably more likely than in the production stage. Many of these exceptions are sporadic, and without information about the problem scene they are difficult to reproduce and investigate afterwards.

Therefore, in the development and debugging stage of the system-level chip, how to recover the generated abnormality to reduce the influence of the abnormality on the system and ensure the usability of the system is a technical problem which needs to be solved by the person skilled in the art.

Disclosure of Invention

The invention aims to provide a method, a device, equipment and a medium for processing abnormality of a system-on-chip, which are used for solving the problem that the system performance is affected due to the fact that the abnormality is not recovered in the development and debugging stage of the system-on-chip, so that the usability of the system is reduced.

In order to solve the technical problems, the invention provides a method for processing abnormality of a system-on-chip, which is applied to the debugging and developing process of the system-on-chip, and comprises the following steps:

when an exception of a hardware engine core in the system-on-chip is detected, obtaining exception information through an interrupt callback function;

acquiring first-type abnormal information and second-type abnormal information in the abnormal information; wherein the severity of the first type of anomaly is greater than the severity of the second type of anomaly;

sending the information of the second-type exception to an exception handling subsystem, and creating, in the interrupt callback function, a core dump file corresponding to the first-type exception according to the information of the first-type exception; wherein the exception handling subsystem is preset with an exception recovery strategy corresponding to the second-type exception;

and processing the first type of exception according to the core dump file and processing the second type of exception by using the exception processing subsystem.

In one aspect, the core dump file contains hardware state information and software state information; the hardware state information comprises current version information of the system-on-chip, the serial number of the system-on-chip, register information of the central processing unit, current state machine information of the hardware, and hardware cache information; the software state information comprises stack information, global variables and software state information;

the information of the second-type exception sent to the exception handling subsystem comprises an error code, a unique identification code of the hardware engine core, a control block address, and a processing completion function pointer; the error code identifies the exception that occurred, the unique identification code of the hardware engine core identifies the hardware in which the exception occurred, the control block address identifies the address of the service block in which the exception occurred, and the processing completion function pointer points to the function that restores the interrupt mask of the current exception.

In another aspect, the processing the first type of exception according to the core dump file includes:

uploading the core dump file to a server host, and storing the core dump file to a preset path in the server host;

and controlling the restarting of the server host in a preset time, so that the server host acquires the core dump file from the preset path after restarting, and analyzing and recovering the first type of anomalies according to the information in the core dump file.

In another aspect, when the exception recovery strategy is determined by an exception handling information table and an exception recovery behavior table, before the second-type exception is handled according to the exception recovery strategy corresponding to the second-type exception preset in the exception handling subsystem, the method further comprises:

establishing the exception handling information table and the exception recovery behavior table in the exception handling subsystem; wherein the exception handling information table comprises an information table for each hardware engine core in the system-on-chip and a general information table; and the exception recovery behavior table comprises behavior tables of a plurality of the hardware engine cores;

correspondingly, the processing the second type of exception according to the preset exception recovery strategy corresponding to the second type of exception in the exception processing subsystem comprises the following steps:

and processing the second type of exception by using the exception handling information table and the exception recovery behavior table in the exception handling subsystem.

In another aspect, the general information table and the information table of each hardware engine core comprise error codes, error code masks, reporting modes, exception recovery modes, and unique identification codes of exception response modules; the exception response module identified by the unique identification code recovers the exception according to the exception recovery behavior table and in combination with the reporting mode; the exception recovery mode comprises at least one of task retry, notification, and reconfiguration of the hardware engine core;

The processing the second type of exception by using the exception handling information table and the exception recovery behavior table in the exception handling subsystem comprises:

acquiring an information table of a current hardware engine core corresponding to the unique identification code of the current hardware engine core from the exception handling information table according to the unique identification code of the current hardware engine core in the information of the second type of exceptions sent to the exception handling subsystem;

matching the current error code in the information of the second-type exception sent to the exception handling subsystem against the error codes in the information table of the current hardware engine core;

if the current error code matches an error code in the information table of the current hardware engine core, acquiring the current exception recovery behavior table corresponding to the current hardware engine core from the exception recovery behavior table according to the unique identification code of the current exception response module in the information table of the current hardware engine core;

finding the corresponding current exception recovery behavior function pointer in the current exception recovery behavior table according to the current exception recovery mode in the information table of the current hardware engine core;

recovering the second-type exception by using the current exception recovery behavior function;

if the current error code does not match any error code in the information table of the current hardware engine core, matching the current error code against the error codes in the general information table;

if the current error code does not match any error code in the general information table, outputting prompt information indicating that the lookup failed;

if the current error code matches an error code in the general information table, acquiring the current exception recovery behavior table corresponding to the current hardware engine core from the exception recovery behavior table according to the unique identification code of the current exception response module in the general information table;

finding the corresponding current exception recovery behavior function pointer in the current exception recovery behavior table according to the current exception recovery mode in the general information table;

recovering the second-type exception by using the current exception recovery behavior function.
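The two-level lookup described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `InfoEntry`, `find_entry` and `recover`, the string recovery modes, and the sample table contents are all hypothetical. Error codes are compared for equality; how the error code mask takes part in matching is not specified in the text, so the mask is left out of the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class InfoEntry:
    error_code: int
    recovery_mode: str   # hypothetical labels: "retry", "notify", "reconfigure"
    responder_id: int    # unique ID of the exception response module


def find_entry(table: List[InfoEntry], error_code: int) -> Optional[InfoEntry]:
    """Return the first entry whose error code equals the reported one."""
    for entry in table:
        if entry.error_code == error_code:
            return entry
    return None


def recover(
    core_id: int,
    error_code: int,
    core_tables: Dict[int, List[InfoEntry]],
    general_table: List[InfoEntry],
    behavior_tables: Dict[int, Dict[str, Callable[[int], str]]],
) -> str:
    # 1. Look in the information table of the reporting hardware engine core.
    entry = find_entry(core_tables.get(core_id, []), error_code)
    # 2. Fall back to the general information table.
    if entry is None:
        entry = find_entry(general_table, error_code)
    # 3. No match in either table: report the failed lookup.
    if entry is None:
        return "lookup failed"
    # 4. Pick the behavior table by response-module ID, then the recovery
    #    function by recovery mode, and run it.
    behavior = behavior_tables[entry.responder_id][entry.recovery_mode]
    return behavior(core_id)


# Hypothetical sample tables.
core_tables = {1: [InfoEntry(0x10, "retry", 1)]}
general_table = [InfoEntry(0x20, "notify", 0)]
behavior_tables = {
    0: {"notify": lambda core: f"notify core {core}"},
    1: {"retry": lambda core: f"retry task on core {core}"},
}

own_hit = recover(1, 0x10, core_tables, general_table, behavior_tables)
general_hit = recover(1, 0x20, core_tables, general_table, behavior_tables)
miss = recover(1, 0x99, core_tables, general_table, behavior_tables)
```

Keeping the per-core tables ahead of the general table lets a core override a generic recovery action without duplicating every shared error code.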

In another aspect, after the recovering the second class of anomalies by using the current anomaly recovery behavior function, the method further includes:

judging whether the processing completion function in the information of the second-type exception is empty;

if not, executing the processing completion function;

restoring the interrupt mask corresponding to the second-type exception according to the processing completion function;

determining, according to the reporting mode in the general information table or in the information table of the current hardware engine core, whether to report the second-type exception to the server host or to record the second-type exception in the system-on-chip;

wherein, determining the reporting mode includes:

determining that the reporting mode is reporting the second-type exception to the server host when the severity of the second-type exception is detected to be greater than a preset value and less than the severity of the first-type exception, and/or when the server host is detected to have sent the system-on-chip prompt information indicating that the exception is to be reported;

and determining that the reporting mode is recording the second-type exception in the system-on-chip when the system-on-chip is not damaged and/or when the severity of the second-type exception is detected to be less than the preset value.
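The completion-function check and the choice of reporting mode can be sketched as below. The sketch rests on several assumptions that are not in the text: the function names are hypothetical, "and/or" is read as a plain "or", a severity exactly equal to the preset value (which the text does not cover) is recorded on chip, and restoring the interrupt mask is modeled as removing the interrupt number from a set of masked interrupts. The condition that the severity is below that of a first-type exception is taken as given for any second-type exception.

```python
from typing import Callable, Optional, Set

REPORT_TO_HOST = "report_to_host"
RECORD_ON_CHIP = "record_on_chip"


def choose_report_mode(severity: int, threshold: int, host_wants_reports: bool) -> str:
    """Pick the reporting mode for a second-type exception (assumptions above)."""
    if severity > threshold or host_wants_reports:
        return REPORT_TO_HOST
    return RECORD_ON_CHIP


def finish_handling(on_done: Optional[Callable[[], None]]) -> bool:
    """Run the processing completion function unless it is empty (None)."""
    if on_done is None:
        return False
    on_done()
    return True


# Interrupt 0x20 was masked while its exception was being handled.
masked_interrupts: Set[int] = {0x20}
ran = finish_handling(lambda: masked_interrupts.discard(0x20))

mode_high = choose_report_mode(severity=7, threshold=5, host_wants_reports=False)
mode_low = choose_report_mode(severity=2, threshold=5, host_wants_reports=False)
mode_host = choose_report_mode(severity=2, threshold=5, host_wants_reports=True)
```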

In order to solve the technical problem, the invention also provides a device for processing the exception of the system-on-chip, which is applied to the debugging and developing process of the system-on-chip, and comprises:

the first acquisition module is used for obtaining exception information through an interrupt callback function when an exception of a hardware engine core in the system-on-chip is detected;

the second acquisition module is used for acquiring the first type of abnormal information and the second type of abnormal information in the abnormal information; wherein the severity of the first type of anomaly is greater than the severity of the second type of anomaly;

the establishing and processing module is used for establishing a core dump file corresponding to the first type of exception according to the information of the first type of exception in the interrupt callback function and processing the first type of exception according to the core dump file;

and the sending and processing module is used for sending the information of the second type of abnormality to an abnormality processing subsystem, and processing the second type of abnormality according to an abnormality recovery strategy corresponding to the second type of abnormality preset in the abnormality processing subsystem.

In order to solve the above technical problem, the present invention further provides an apparatus for processing an exception of a system-on-chip, including:

A memory for storing a computer program;

and the processor is used for realizing the steps of the method for processing the abnormality of the system-in-chip when executing the computer program.

In order to solve the above technical problem, the present invention further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method for processing an exception of a system-on-chip described above.

The method for handling exceptions of a system-on-chip provided by the invention is applied to the debugging and development process of the system-on-chip, and comprises: when an exception of a hardware engine core in the system-on-chip is detected, obtaining exception information through an interrupt callback function; acquiring information of a first-type exception and information of a second-type exception from the exception information, wherein the severity of the first-type exception is greater than that of the second-type exception; creating, in the interrupt callback function, a core dump file corresponding to the first-type exception according to the information of the first-type exception, and handling the first-type exception according to the core dump file; and sending the information of the second-type exception to an exception handling subsystem, and handling the second-type exception according to the exception recovery strategy, preset in the exception handling subsystem, that corresponds to the second-type exception.

The advantages of the invention are as follows. During debugging and development of the system-on-chip, for first-type exceptions of high severity, the problem scene is saved in a core dump file, so that debugging through the Joint Test Action Group (JTAG) interface is not needed; a user can inspect the information needed to locate the problem from the contents of the core dump file, which improves problem-location efficiency and shortens the debugging and development cycle of the system-on-chip. For second-type exceptions of lower severity, the information of the second-type exception is sent to the exception handling subsystem configured with the exception recovery strategy corresponding to the second-type exception, so that the exception is handled by the exception handling subsystem; this reduces the impact of second-type exceptions on the services provided by the server and on the system, while ensuring the usability of the system-on-chip.

In addition, the core dump file contains hardware state information and software state information, and the information of the second type of exception sent to the exception handling subsystem comprises error codes, unique identification codes of hardware engine cores, control block addresses and processing completion function pointers, so that problem sites can be stored according to the core dump file and the information sent to the exception handling subsystem, and therefore, when the exception is recovered, the problem is not required to be reproduced, and the exception recovery efficiency is greatly improved.

The core dump file is uploaded to the server host, and the host is restarted, so that the duration of chip abnormality is reduced, and the subsequent abnormality investigation and positioning are facilitated.

An exception handling information table and an exception recovery behavior table are established in the exception handling subsystem, so that second-type exceptions are handled by table lookup, which simplifies handling and greatly improves its efficiency. The exception handling information table comprises an information table for each hardware engine core and a general information table; each of these tables comprises error codes, error code masks, reporting modes, exception recovery modes, and unique identification codes of exception response modules. An exception can therefore be handled using either the information table of a hardware engine core or the general information table: the information table of the hardware engine core is consulted first, and the general information table is used only if no match is found there.

Restoring the interrupt mask allows the exception to be reported again. Masking the interrupt reduces the reporting frequency of alarm interrupts: if a given alarm interrupt were reported frequently, normal threads could not be scheduled because the interrupt handling callback function would be entered too often, and the service could not run normally. In addition, if a given error alarm is already being recovered, error alarms of the same type that might be reported afterwards are meaningless and need not be processed.

And determining a reporting mode according to the severity of the second abnormality and the requirement of the server host, so that the reporting mode can meet the actual requirement more.

In addition, the invention also provides a processing device of the system-level chip abnormality, processing equipment of the system-level chip abnormality and a computer readable storage medium, which have the same or corresponding technical characteristics as the processing method of the system-level chip abnormality, and the effects are the same as those of the processing method.

Drawings

For a clearer description of embodiments of the present invention, the drawings that are required to be used in the embodiments will be briefly described, it being apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.

FIG. 1 is a schematic diagram of a server according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for handling exceptions in a system-on-chip provided by an embodiment of the present invention;

FIG. 3 is a diagram illustrating information related to a core dump file portion according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of an error type anomaly information retrieval relationship according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a processing manner when a chip hardware exception occurs according to an embodiment of the present invention;

FIG. 6 is a flowchart of a method for handling fatal exceptions according to an embodiment of the present invention;

FIG. 7 is a flowchart of a method for handling error type exceptions according to an embodiment of the present invention;

FIG. 8 is a block diagram of an apparatus for handling exceptions in a system-on-chip provided by an embodiment of the present invention;

FIG. 9 is a block diagram of an apparatus for handling an exception of a system-on-chip according to another embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without making any inventive effort are within the scope of the present invention.

The core of the invention is to provide a method, a device, equipment and a medium for processing the abnormality of a system-level chip, so as to solve the problem that the system performance is affected due to the fact that the abnormality is not recovered in the development and debugging stage of the system-level chip, and the usability of the system is reduced.

In a server, many external system-on-chip devices are connected through Peripheral Component Interconnect Express (PCIe). FIG. 1 is a schematic diagram of a server according to an embodiment of the present invention. As shown in FIG. 1, a server host 1 in the server is connected to each system-on-chip 2 through a plurality of PCIe connections. The system-on-chips provide various high-performance services for the server and greatly improve its performance. The development of a server's external system-on-chip comprises two main stages: a development and debugging stage and a production stage. In the development and debugging stage, a simulator and other debug means can obtain all current information of the chip intuitively and conveniently. However, in a system-on-chip system, multiple modules are debugged by multiple developers at the same time, and resources such as simulators are limited, so a simulator is often unavailable during debugging, which slows debugging progress. In addition, when debugging a system-on-chip that integrates a plurality of hardware engine cores, engine exceptions inside the chip are much more likely than in the production stage. Many of these exceptions are sporadic, and without information about the problem scene they are difficult to reproduce and investigate afterwards. How to recover when an exception occurs, so as to reduce its impact on the system and ensure the usability of the system, is therefore an important problem in the chip debugging stage.

Therefore, the embodiment of the present invention provides a handling scheme for when an exception occurs in a hardware engine core of a server system-on-chip. Specifically, during debugging and development of the server system-on-chip, for high-hazard exceptions, a software and hardware core dump is performed in the interrupt callback function. A core dump (also called a core file or crash dump) is a file containing the memory contents of a program at the time the program crashes or aborts. Low-hazard recoverable exceptions are sent asynchronously to an exception handling subsystem, which performs recovery operations and records the relevant exception information, so that the cause of the exception can be analyzed later and tuning can be carried out, reducing how often the exception occurs and improving the usability of the system-on-chip.

In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description. The method for processing the exception of the system-on-chip provided by the embodiment of the invention is applied to the debugging and developing process of the system-on-chip. Fig. 2 is a flowchart of a method for processing an exception of a system-on-chip according to an embodiment of the present invention, as shown in fig. 2, where the method includes:

S10: when an exception of a hardware engine core in the system-on-chip is detected, obtaining exception information through an interrupt callback function;

S11: acquiring information of a first-type exception and information of a second-type exception from the exception information; wherein the severity of the first-type exception is greater than the severity of the second-type exception;

S12: creating, in the interrupt callback function, a core dump file corresponding to the first-type exception according to the information of the first-type exception, and handling the first-type exception according to the core dump file;

S13: sending the information of the second-type exception to an exception handling subsystem, and handling the second-type exception according to the exception recovery strategy, preset in the exception handling subsystem, that corresponds to the second-type exception.
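The routing performed in these four steps can be sketched as follows. This is a minimal illustration under stated assumptions: the names `Severity`, `ExceptionInfo` and `interrupt_callback` are hypothetical, the core dump is stood in for by a plain dictionary appended to a list, and the exception handling subsystem is stood in for by a queue.

```python
from dataclasses import dataclass
from enum import Enum
from queue import Queue
from typing import Dict, List


class Severity(Enum):
    FATAL = 1   # first-type exception: high severity
    ERROR = 2   # second-type exception: lower severity, recoverable


@dataclass
class ExceptionInfo:
    core_id: int        # hypothetical identifier of the hardware engine core
    error_code: int
    severity: Severity


def interrupt_callback(events: List[ExceptionInfo], error_queue: Queue,
                       dumps: List[Dict[str, int]]) -> None:
    """Classify each reported exception and route it by severity."""
    for info in events:                          # S10: exception information
        if info.severity is Severity.FATAL:      # S11: split by severity
            # S12: save the problem scene synchronously, inside the callback.
            dumps.append({"core_id": info.core_id, "error_code": info.error_code})
        else:
            # S13: hand over to the exception handling subsystem.
            error_queue.put(info)


error_queue: Queue = Queue()
dumps: List[Dict[str, int]] = []
interrupt_callback(
    [ExceptionInfo(1, 0x10, Severity.FATAL), ExceptionInfo(2, 0x20, Severity.ERROR)],
    error_queue,
    dumps,
)
```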

To handle an exception of the system-on-chip, it is first necessary to determine whether a hardware engine core in the system-on-chip is abnormal. The method for making this determination is not limited; for example, if the value of a register is detected to be outside a preset range, the hardware engine core is determined to be abnormal.

To record the scene when an exception occurs, so that its cause can be analyzed later, in this embodiment the exception information is obtained through an interrupt callback function. An interrupt callback function is a function that an application program registers with the operating system or a driver. When an interrupt event occurs, the operating system or driver calls the interrupt callback function registered by the application program and passes the information about the interrupt event to it as parameters, so that the application program can process the interrupt event.

After the exception information is obtained through the interrupt callback function, it is split into information of first-type exceptions and information of second-type exceptions according to the severity of the exception. In the embodiment of the invention, the severity of the first-type exception is set higher than that of the second-type exception. The first-type exception may be referred to as a fatal exception and the second-type exception as an error-type exception. Fatal exceptions are highly hazardous, and error-type exceptions are of low hazard. After the information of the two types is obtained, different measures are applied to each type.

A fatal exception can affect the operation of the whole system and is highly hazardous. At this point the system service is no longer trustworthy and cannot be recovered, so the exception must be handled and investigated promptly, at the first moment it occurs. After a fatal exception occurs, the software and hardware core dump is performed directly in the interrupt callback function, saving the scene where the problem occurred for subsequent investigation and location. The effect of the core dump operation on overall system traffic need not be considered, because the traffic is already untrustworthy. The information recorded in the core dump file comprises two parts: hardware information and software information. Fatal exceptions are handled synchronously, and the conventional software core dump operation is extended so that key hardware information is saved alongside key software information, allowing software and hardware personnel to investigate together. The hardware state information comprises current version information of the system-on-chip, the serial number of the system-on-chip, register information of the central processing unit, current state machine information of the hardware, and hardware cache information; the software state information is the current key information of each software module, and comprises stack information, global variables and software state information.

FIG. 3 shows part of the information in a core dump file provided in an embodiment of the present invention. As shown in FIG. 3, the core dump file includes hardware state information and software state information. The hardware state information comprises system-on-chip information, central processing unit information and hardware engine core information; the system-on-chip information comprises chip version information, the central processing unit information comprises central processing unit register information, and the hardware engine core information comprises the configuration of each hardware engine core. The software state information comprises firmware basic information, stack information, operating system related information and information about the memory in use; the firmware basic information comprises a version number, and the stack information comprises a stack base address. After the problem scene has been saved in the core dump file, the problem is reported to the host, the host is restarted, and analysis is performed according to the core dump file after the restart.
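The two-part layout of the core dump file can be sketched as a pair of records. This is an illustration only: the class and field names are hypothetical, only a subset of the listed fields is modeled, and JSON is an assumed serialization chosen for readability; the text does not state the file format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class HardwareState:
    chip_version: str
    chip_serial: str
    cpu_registers: Dict[str, int]
    state_machine: str
    hw_cache: List[int] = field(default_factory=list)


@dataclass
class SoftwareState:
    firmware_version: str
    stack_base: int
    global_variables: Dict[str, int] = field(default_factory=dict)


@dataclass
class CoreDump:
    hardware: HardwareState
    software: SoftwareState

    def to_bytes(self) -> bytes:
        """Serialize both parts so the host can store and later parse them."""
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")


# Hypothetical sample values.
dump = CoreDump(
    HardwareState("v0.1", "SN-0001", {"pc": 0x80000000, "sp": 0x20001000}, "IDLE"),
    SoftwareState("fw-0.1", 0x20000000, {"task_count": 3}),
)
blob = dump.to_bytes()
restored = json.loads(blob)
```

Keeping hardware and software state in one file is what lets software and hardware personnel work from the same saved scene.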

An error-type exception does not affect the running state of the system, but it can cause the task currently being processed to fail. Timeliness requirements for handling it are therefore not high, and the exception information is handed to the exception handling subsystem by sending the exception data to an exception queue. The information of the second-type exception sent to the exception handling subsystem comprises an error code, a unique identification code of the hardware engine core, a control block address, and a processing completion function pointer; the error code identifies the exception that occurred, the unique identification code of the hardware engine core identifies the hardware in which the exception occurred, the control block address identifies the address of the service block in which the exception occurred, and the processing completion function pointer points to the function that restores the interrupt mask of the current exception. After the exception data is sent to the exception queue, the exception handling subsystem takes the exception data out of the queue and handles the exception.

In the method provided by the embodiment of the invention, during debugging and development of the system-on-chip, for first-type exceptions of high severity, the problem scene is saved in a core dump file, so that debugging through the Joint Test Action Group (JTAG) interface is not needed; a user can inspect the information needed to locate the problem from the contents of the core dump file, which improves problem-location efficiency and shortens the debugging and development cycle of the system-on-chip. For second-type exceptions of lower severity, the information of the second-type exception is sent to the exception handling subsystem configured with the exception recovery strategy corresponding to the second-type exception, so that the exception is handled by the exception handling subsystem; this reduces the impact of second-type exceptions on the services provided by the server and on the system, while ensuring the usability of the system-on-chip.
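The hand-off through the exception queue can be sketched as below. The record carries the four fields listed above; the names `ErrorRecord` and `drain`, the sample addresses, and the use of Python's standard `queue.Queue` as the exception queue are assumptions made for illustration.

```python
from dataclasses import dataclass
from queue import Empty, Queue
from typing import Callable, List, Optional


@dataclass
class ErrorRecord:
    error_code: int                               # which exception occurred
    core_id: int                                  # which hardware engine core
    control_block_addr: int                       # affected service block
    on_done: Optional[Callable[[], None]] = None  # restores the interrupt mask


def drain(exception_queue: Queue, handle: Callable[[ErrorRecord], None]) -> int:
    """Take every pending record out of the queue, handle it, return the count."""
    handled = 0
    while True:
        try:
            record = exception_queue.get_nowait()
        except Empty:
            return handled
        handle(record)
        handled += 1


# Producer side: the interrupt callback only enqueues and returns quickly.
exception_queue: Queue = Queue()
exception_queue.put(ErrorRecord(0x20, core_id=2, control_block_addr=0x1000))
exception_queue.put(ErrorRecord(0x21, core_id=3, control_block_addr=0x2000))

# Consumer side: the exception handling subsystem processes records later.
seen: List[int] = []
count = drain(exception_queue, lambda record: seen.append(record.error_code))
```

Because the callback only enqueues, the time spent in interrupt context stays short, which fits the low timeliness requirement of error-type exceptions.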

To improve the efficiency of handling first-type exceptions, in some embodiments, processing the first type of exception according to the core dump file includes:

uploading the core dump file to a server host, and storing the core dump file on a preset path in the server host;

and controlling the restarting of the server host within a preset time, so that the server host acquires a core dump file from a preset path after restarting, and analyzing and recovering the first type of abnormality according to the information in the core dump file.

The preset path and the preset time are not limited and are determined according to actual conditions. To improve the efficiency of handling first-type exceptions, the preset time set in practice should not be too long.

To improve the efficiency with which the exception handling subsystem handles second-type exceptions, in some embodiments the exception recovery strategy is determined by an exception handling information table and an exception recovery behavior table. Before the second type of exception is processed according to the exception recovery strategy corresponding to the second type of exception preset in the exception handling subsystem, the method further comprises:

establishing an exception handling information table and an exception recovery behavior table in the exception handling subsystem; the exception handling information table comprises an information table of each hardware engine core in the system-on-chip and a general information table; the exception recovery behavior table comprises behavior tables of a plurality of hardware engine cores. So that exceptions of all hardware engine cores can be recovered from, in an implementation each hardware engine core has its own behavior table. For example, if there are three hardware engine cores A1, A2 and A3, the exception recovery behavior table includes three behavior tables: the behavior table of hardware engine core A1, the behavior table of hardware engine core A2 and the behavior table of hardware engine core A3.

Correspondingly, the processing of the second type of exception according to the preset exception recovery strategy corresponding to the second type of exception in the exception processing subsystem comprises the following steps:

and processing the second type of exception by using an exception handling information table and an exception recovery behavior table in the exception handling subsystem.

Specifically, the general information table and the information table of each hardware engine core each comprise an error code, an error code mask, a reporting mode, an exception recovery mode and the unique identification code of an exception response module; the exception response module identified by the unique identification code recovers from the exception according to the exception recovery behavior table and in combination with the reporting mode; the exception recovery mode comprises at least one of task retry, notification and reconfiguration of a hardware engine core;

the processing of the second type of exception by using the exception handling information table and the exception recovery behavior table in the exception handling subsystem comprises the following steps:

acquiring an information table of the current hardware engine core corresponding to the unique identification code of the current hardware engine core from the exception handling information table according to the unique identification code of the current hardware engine core in the information of the second type of exceptions sent to the exception handling subsystem;

matching the current error code in the information of the second type of exception sent to the exception handling subsystem against the error codes in the information table of the current hardware engine core;

if the current error code matches an error code in the information table of the current hardware engine core, acquiring the current exception recovery behavior table corresponding to the current hardware engine core from the exception recovery behavior table according to the unique identification code of the current exception response module in the information table of the current hardware engine core;

finding the corresponding current exception recovery behavior function pointer in the current exception recovery behavior table according to the current exception recovery mode in the information table of the current hardware engine core;

if the current error code does not match any error code in the information table of the current hardware engine core, matching the current error code against the error codes in the general information table;

if the current error code does not match any error code in the general information table, outputting prompt information indicating that the lookup failed;

if the current error code matches an error code in the general information table, acquiring the current exception recovery behavior table corresponding to the current hardware engine core from the exception recovery behavior table according to the unique identification code of the current exception response module in the general information table;

finding the corresponding current exception recovery behavior function pointer in the current exception recovery behavior table according to the current exception recovery mode in the general information table;

recovering from the second type of exception by using the current exception recovery behavior function.
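The lookup steps above can be sketched as a two-level table search. The sketch is illustrative only: the names `InfoEntry`, `find_entry` and `lookup_recovery`, the string encoding of the reporting and recovery modes, and the exact-match comparison are assumptions. The error code mask is left out because the text does not state how it is applied.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

RecoveryFn = Callable[[int], str]  # takes a control block address (assumed signature)


@dataclass
class InfoEntry:
    """One row of an exception handling information table (illustrative)."""
    error_code: int
    report_mode: str     # assumed encoding, e.g. "host" or "record"
    recovery_mode: str   # assumed encoding: "retry", "notify" or "reconfigure"
    responder_id: int    # unique identification code of the exception response module


def find_entry(table: List[InfoEntry], error_code: int) -> Optional[InfoEntry]:
    """Return the first row whose error code equals the reported one."""
    for entry in table:
        if entry.error_code == error_code:
            return entry
    return None


def lookup_recovery(
    engine_id: int,
    error_code: int,
    engine_tables: Dict[int, List[InfoEntry]],
    general_table: List[InfoEntry],
    behavior_tables: Dict[int, Dict[str, RecoveryFn]],
) -> Optional[RecoveryFn]:
    """Search the engine's own table first, then fall back to the general table."""
    entry = find_entry(engine_tables.get(engine_id, []), error_code)
    if entry is None:
        entry = find_entry(general_table, error_code)
    if entry is None:
        print(f"lookup failed: engine {engine_id}, error code {error_code:#x}")
        return None
    # The responder id selects the module's behavior table; the recovery
    # mode selects the function pointer inside that table.
    return behavior_tables.get(entry.responder_id, {}).get(entry.recovery_mode)
```

The returned function is then called to carry out the recovery; a return value of `None` corresponds to the lookup-failure prompt.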

In the method provided by this embodiment, for error-type exceptions, the exception handling subsystem uses the exception handling information table and the exception recovery behavior table to define the handling behavior when an exception occurs. It searches the information table of the hardware engine core and the general information table according to the unique identification code of the hardware engine core in which the exception occurred and the error code, and finally calls the exception recovery behavior dynamically registered by each module to complete the recovery, thereby reducing the influence of hardware engine core exceptions on the system and improving the availability of the system.

Fig. 4 is a schematic diagram of the error-type exception information retrieval relationship provided in an embodiment of the present invention. As shown in Fig. 4, the retrieval relationship involves the exception data (the information of the second type of exception sent to the exception handling subsystem), the exception handling information table (a global table), the exception recovery behavior table (a global table) and the module exception recovery behavior tables. The exception data comprises the error code, the unique identification code of the hardware engine core, the control block address and the processing completion function pointer. The exception handling information table includes the information table of each hardware engine core (for example, the hardware engine core 1 information table, the hardware engine core 2 information table, …, the hardware engine core n information table) and the general information table. The information table of each hardware engine core and the general information table each comprise an error code, an error code mask, a reporting mode, an exception recovery mode and the unique identification code of an exception response module. The exception recovery behavior table includes the behavior table of each hardware engine core, and the recovery modes are recorded in the module exception recovery behavior tables; task retry, notification and hardware engine core reconfiguration are listed in Fig. 4. After the exception data is sent to the exception queue, the exception handling subsystem takes the exception data out of the queue and handles the exception in four main steps: exception lookup, exception reporting, exception recovery, and calling the processing completion function.

Two tables are the most important in the whole exception handling process: the exception handling information table and the exception recovery behavior table, whose index relationship is shown in Fig. 4. The contents of the exception handling information table are prepared during initialization, and the exception recovery behavior table is registered by the driver of each hardware engine core during initialization. The index relationship between the two tables is predefined. Because the errors reported by a hardware engine core are well defined, the handling required after each of them occurs is also well defined, and the whole exception handling process is a process of table lookup and execution.
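Registration of a behavior table by a driver during initialization might look like the following sketch. The names `behavior_table`, `register_behaviors` and `init_engine_driver`, the module identification code `1` and the rejection of duplicate registrations are all assumptions made for illustration.

```python
from typing import Callable, Dict

RecoveryFn = Callable[[int], None]  # takes a control block address (assumed signature)

# Global exception recovery behavior table:
# module identification code -> {recovery mode -> recovery function}.
behavior_table: Dict[int, Dict[str, RecoveryFn]] = {}


def register_behaviors(module_id: int, behaviors: Dict[str, RecoveryFn]) -> None:
    """Register one module's behavior table; a second registration is rejected."""
    if module_id in behavior_table:
        raise ValueError(f"module {module_id} is already registered")
    behavior_table[module_id] = dict(behaviors)


def init_engine_driver() -> None:
    """Hypothetical driver initialization for a module with identification code 1."""
    def retry_task(control_block_addr: int) -> None:
        pass  # resubmit the task described by the control block

    def reconfigure_engine(control_block_addr: int) -> None:
        pass  # reset and reconfigure the hardware engine core

    register_behaviors(1, {"retry": retry_task, "reconfigure": reconfigure_engine})
```

Keeping the recovery functions in the drivers means the exception handling subsystem needs no knowledge of any particular engine; it only follows the predefined index relationship.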

Exception lookup relies on the unique identification code of the hardware engine core and the error code in the exception data; the lookup succeeds if the error code matches the error code and error code mask in the exception handling information table. The information table of the hardware engine core corresponding to the unique identification code in the exception data is searched first to obtain the corresponding exception handling information. If that search fails, the general information table is searched, and if that search also fails, an error is returned.

After recovering from the second type of exception by using the current exception recovery behavior function, the method further comprises:

according to the reporting mode in the exception handling information, either reporting the exception to the host or only reporting it to a recording module in the chip, so that it can be queried later. Specifically, according to the reporting mode in the general information table or in the information table of the current hardware engine core, determining whether to report the second type of exception to the server host or to record it in the system-on-chip;

the method for determining the reporting mode comprises the following steps:

determining that the reporting mode is to report the second type of exception to the server host when the severity of the second type of exception is detected to be greater than a preset value and less than the severity of the first type of exception, and/or when the server host is detected to have sent the system-on-chip prompt information requesting that exceptions be reported;

determining that the reporting mode is to record the second type of exception in the system-on-chip when the system-on-chip is not damaged and/or when the severity of the second type of exception is detected to be less than the preset value.
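One possible reading of the reporting-mode rules above is sketched below. It assumes that severity is a number, that the "and/or" conditions are combined with a logical OR, and that a severity equal to the preset value is recorded in the chip; none of these points is fixed by the text. The names `ReportMode` and `choose_report_mode` are illustrative.

```python
from enum import Enum


class ReportMode(Enum):
    HOST = "report to the server host"
    RECORD = "record in the system-on-chip"


def choose_report_mode(
    severity: int,
    preset_value: int,
    first_type_severity: int,
    host_requested: bool = False,
) -> ReportMode:
    """Pick the reporting mode for a second-type exception (assumed semantics)."""
    if host_requested or preset_value < severity < first_type_severity:
        return ReportMode.HOST
    # Everything else, including severity == preset_value, stays in the chip.
    return ReportMode.RECORD
```

For example, with a preset value of 3 and a first-type severity of 10, a severity of 5 is reported to the host, while a severity of 2 is only recorded unless the host has asked for reports.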

During exception recovery, the behavior table registered by the exception response module is found in the exception recovery behavior table according to the unique identification code of the exception response module in the exception handling information table. The corresponding exception recovery behavior function pointer is then found in that module's behavior table according to the exception recovery mode, and the recovery is executed. Common exception recovery modes include task retry, notification and hardware engine core reconfiguration; different handling actions are selected for different exceptions, and all of them are predefined in the tables.

In an implementation, in order to reduce the reporting frequency of the alarm interrupt, after recovering from the second type of abnormality by using the current abnormality recovery behavior function, the method further includes:

judging whether a processing completion function in the information of the second type of exception is empty or not;

if not, executing a processing completion function;

Executing the processing completion function restores the interrupt mask corresponding to the exception so that the exception can be reported again. The processing completion function is executed in the exception handling thread for two purposes. First, this reduces the reporting frequency of alarm interrupts: if an alarm interrupt is reported frequently, the interrupt handling callback function is entered so often that normal threads cannot be scheduled and services cannot run normally, and frequent error-type exceptions can even prevent fatal exceptions from being handled. Second, if an error alarm is being recovered from, error alarms of the same type that may be reported subsequently are meaningless and need not be processed.
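The role of the processing completion function can be illustrated with a toy model of an interrupt mask. The sketch assumes that the interrupt callback masks the interrupt bit when it defers the work, which the text implies but does not state. The class `InterruptController` and the functions `on_alarm_interrupt` and `handle_pending` are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple

Pending = List[Tuple[int, Optional[Callable[[], None]]]]


class InterruptController:
    """Toy per-bit interrupt mask; a set bit means the interrupt is masked."""

    def __init__(self) -> None:
        self.mask = 0

    def mask_bit(self, bit: int) -> None:
        self.mask |= 1 << bit

    def unmask_bit(self, bit: int) -> None:
        self.mask &= ~(1 << bit)

    def is_masked(self, bit: int) -> bool:
        return bool(self.mask & (1 << bit))


def on_alarm_interrupt(ctrl: InterruptController, bit: int, pending: Pending) -> bool:
    """Interrupt callback: mask the source and defer the work to the handler thread."""
    if ctrl.is_masked(bit):
        return False  # a masked interrupt is not delivered again
    ctrl.mask_bit(bit)
    # The completion function restores the mask once handling has finished.
    pending.append((bit, lambda: ctrl.unmask_bit(bit)))
    return True


def handle_pending(pending: Pending) -> None:
    """Exception handling thread: recover, then run the completion function."""
    while pending:
        _bit, on_complete = pending.pop(0)
        # ... exception lookup, recovery and reporting would happen here ...
        if on_complete is not None:
            on_complete()
```

While the bit stays masked, repeated alarms of the same type do not re-enter the callback, which is the effect the two purposes above rely on.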

So that those skilled in the art can better understand the present invention, it is described further below with reference to the accompanying drawings and specific embodiments. Fig. 5 is a schematic diagram of how a chip hardware exception is processed according to an embodiment of the present invention, in which a server host is connected to the system-on-chip through a Peripheral Component Interconnect Express (PCI Express) link. When a hardware engine core in the server system-on-chip is abnormal, the exception information is notified to software by means of an interrupt. In the interrupt callback function, exceptions are classified according to their severity and their degree of influence on the system: exceptions with higher hazard are classified as fatal exceptions, and exceptions with lower hazard that are recoverable are classified as error exceptions. The two types are processed differently. Fatal exceptions need to be processed synchronously, so that the saved problem site is as current as possible, which facilitates subsequent investigation and problem locating. Error exceptions need to be processed asynchronously, so that recovery is performed without affecting the processing of fatal exceptions and the availability of the system is ensured. Fig. 5 shows three hardware engine cores: a first hardware engine core, a second hardware engine core and a third hardware engine core. If the first hardware engine core is abnormal, its exception information is acquired through a first interrupt callback function; if the second hardware engine core is abnormal, its exception information is acquired through a second interrupt callback function; if the third hardware engine core is abnormal, its exception information is acquired through a third interrupt callback function. The exception information is then classified: if it is a fatal exception, the problem site is saved by using a core dump file, and if it is an error exception, the exception information is sent to the exception queue for processing.
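The classification performed in the interrupt callback can be sketched as follows. The bit assignments `FATAL_BITS` and `ERROR_BITS`, the function name `interrupt_callback`, and the choice to handle a fatal bit before any error bit are assumptions for illustration; the embodiment only states that different interrupt bits denote different exceptions and that each exception has a predefined classification.

```python
import queue
from typing import Callable

FATAL_BITS = 0b0011  # hypothetical: status bits 0-1 denote fatal exceptions
ERROR_BITS = 0b1100  # hypothetical: status bits 2-3 denote error exceptions


def interrupt_callback(
    engine_id: int,
    status: int,
    error_queue: "queue.Queue",
    save_core_dump: Callable[[int, int], None],
) -> str:
    """Classify an exception interrupt and dispatch it by type."""
    if status & FATAL_BITS:
        # Fatal: save the problem site synchronously, inside the callback.
        save_core_dump(engine_id, status & FATAL_BITS)
        return "fatal"
    if status & ERROR_BITS:
        # Error: enqueue for asynchronous recovery and return immediately.
        error_queue.put_nowait((engine_id, status & ERROR_BITS))
        return "error"
    return "none"
```

Each hardware engine core would install its own callback of this shape, passing its own unique identification code as `engine_id`.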

In the callback function of the exception interrupt, different interrupt bits in the interrupt status code represent different types of exceptions. The possible exceptions are all well defined at hardware design time, so each exception has a corresponding error code and classification, and whether a subsequent operation is carried out is determined by the type of the exception. Fig. 6 is a flowchart of a method for handling fatal exceptions according to an embodiment of the present invention. As shown in Fig. 6, the method includes:

S14: a hardware engine core of the system-on-chip reports an exception interrupt;

S15: determining whether the exception is a fatal exception; if yes, go to step S16;

S16: performing a core dump in the interrupt callback function;

S17: reporting the exception information to the server host through the core dump file interface;

S18: restarting the chip;

S19: analyzing the core dump file log and troubleshooting the problem.

In the exception interrupt handling function, if the exception is an error-type exception, the exception information is organized into exception data and sent to the exception handling queue, and the exception handling subsystem takes the data out of the queue and handles the exception. Fig. 7 is a flowchart of a method for handling error-type exceptions according to an embodiment of the present invention. As shown in Fig. 7, the method includes:

S20: a hardware engine core of the system-on-chip reports an exception interrupt;

S21: determining whether the exception is an error-type exception; if yes, go to step S22;

S22: sending the exception to the exception handling queue in the interrupt callback function;

S23: the exception handling subsystem obtains the exception data from the queue;

S24: searching the exception handling information table to obtain the exception recovery mode;

S25: performing exception recovery;

S26: performing exception reporting;

S27: determining whether the processing completion function is null; if not, go to step S28; if yes, end;

S28: executing the processing completion function.

In the recovery process for error-type exceptions, the exception data is sent to the exception handling queue in the interrupt function; the exception handling subsystem obtains the exception data from the exception queue; the exception handling subsystem obtains the exception handling information from the global exception handling information table according to the unique module identification code and the error code in the exception data; according to the exception handling information, it searches the exception recovery behavior table, obtains the exception recovery function pointer and executes the exception recovery behavior; according to the reporting mode in the exception handling information, it reports the exception; and if the processing completion function in the exception data is not null, it executes the processing completion function to restore the interrupt mask corresponding to the exception.
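The order of these steps for a single exception record can be sketched as below. The function `process_exception`, the dictionary keys and the injected `lookup`, `recover` and `report` callables are hypothetical. The early return on a failed lookup is also an assumption, since the text does not say whether the processing completion function runs in that case.

```python
from typing import Any, Callable, Dict, List, Optional

Info = Dict[str, Any]


def process_exception(
    data: Dict[str, Any],
    lookup: Callable[[int, int], Optional[Info]],
    recover: Callable[[Info, Dict[str, Any]], None],
    report: Callable[[Info, Dict[str, Any]], None],
) -> List[str]:
    """Handle one exception record and return the steps that were executed."""
    info = lookup(data["engine_id"], data["error_code"])
    if info is None:
        return ["lookup-failed"]        # the lookup returns an error
    steps = ["lookup"]
    recover(info, data)                 # run the recovery behavior
    steps.append("recover")
    report(info, data)                  # report according to the reporting mode
    steps.append("report")
    on_complete = data.get("on_complete")
    if on_complete is not None:         # completion function is not null
        on_complete()                   # restore the interrupt mask
        steps.append("complete")
    return steps
```

Passing the three stages in as callables keeps the sequencing separate from the table contents, which mirrors the table-driven design of the subsystem.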

The embodiment of the invention mainly solves the problem of how the system-on-chip handles hardware engine core exceptions in the debugging and development stage. For fatal exceptions, which have higher hazard, the problem site is saved in time, so that neither JTAG debugging nor reproduction of the problem is needed. Software and hardware personnel can see the information required for debugging from the files saved by the software and hardware core dumps, which improves the efficiency of locating problems and shortens the debugging and development cycle of the system-on-chip.

For error-type exceptions, which have lower hazard, the exception handling subsystem executes predefined exception recovery behaviors, which have been confirmed with the corresponding driver and service personnel and can eliminate the corresponding exceptions. Eliminating error-type exceptions reduces the influence on the service provided by the server and on the system, and ensures the availability of the system-on-chip.

The above embodiments describe the method for processing an exception of a system-on-chip in detail. The present invention further provides corresponding embodiments of an apparatus for processing an exception of a system-on-chip and of a device for processing an exception of a system-on-chip. It should be noted that the present invention describes the embodiments of the apparatus portion from two angles, one based on functional modules and the other based on hardware.

Fig. 8 is a block diagram of an apparatus for handling exceptions of a system-on-chip according to an embodiment of the present invention. The embodiment is based on the angle of the functional module, and comprises:

the first obtaining module 10 is configured to obtain, when detecting that a hardware engine core in the system-in-chip is abnormal, abnormal information through an interrupt callback function;

a second obtaining module 11, configured to obtain information of a first type of abnormality and information of a second type of abnormality in the abnormality information; wherein the severity of the first type of anomaly is greater than the severity of the second type of anomaly;

the establishing and processing module 12 is configured to establish a core dump file corresponding to the first type of exception according to the information of the first type of exception in the interrupt callback function, and process the first type of exception according to the core dump file;

the sending and processing module 13 is configured to send information of the second type exception to the exception handling subsystem, and process the second type exception according to an exception recovery policy corresponding to the second type exception preset in the exception handling subsystem.

The establishing and processing module 12 specifically includes:

the uploading module is used for uploading the core dump file to the server host and storing the core dump file to a preset path in the server host;

The control module is used for controlling the restarting of the server host in a preset time, so that the server host can acquire a core dump file from a preset path after restarting, and the first type of abnormality is analyzed and recovered according to the information in the core dump file.

The system-in-chip exception handling apparatus further includes:

the building module is used for building an exception handling information table and an exception recovery behavior table in the exception handling subsystem; the exception handling information table comprises an information table of each hardware engine core in the system-on-chip and a general information table; the exception recovery behavior table comprises behavior tables of a plurality of hardware engine cores;

correspondingly, the sending and processing module 13 specifically includes:

the first processing module is used for processing the second type of exception by utilizing an exception handling information table and an exception recovery behavior table in the exception handling subsystem.

The information table of each hardware engine core comprises an error code, an error code mask, a reporting mode, an exception recovery mode and the unique identification code of an exception response module; the exception response module identified by the unique identification code recovers from the exception according to the exception recovery behavior table and in combination with the reporting mode; the exception recovery mode comprises at least one of task retry, notification and reconfiguration of a hardware engine core;

The first processing module specifically includes:

the third acquisition module is used for acquiring an information table of the current hardware engine core corresponding to the unique identification code of the current hardware engine core from the exception handling information table according to the unique identification code of the current hardware engine core in the information of the second type exception sent to the exception handling subsystem;

the first matching module is used for matching the current error code in the information of the second type of exception sent to the exception handling subsystem with the current error code in the information table of the current hardware engine core;

the fourth acquisition module is used for acquiring, if the current error code is detected to match an error code in the information table of the current hardware engine core, the current exception recovery behavior table corresponding to the current hardware engine core from the exception recovery behavior table according to the unique identification code of the current exception response module in the information table of the current hardware engine core;

the first finding module is used for finding out a corresponding current abnormal recovery behavior function pointer from the current abnormal recovery behavior table according to the current abnormal recovery mode in the information table of the current hardware engine core;

the first recovery module is used for recovering against the second type of abnormality by utilizing the current abnormality recovery behavior function;

the second matching module is used for matching the current error code against the error codes in the general information table if the current error code is detected not to match any error code in the information table of the current hardware engine core;

the output module is used for outputting prompt information for representing search failure if the current error code is detected to be not matched with the current error code in the general information table;

the fifth acquisition module is used for acquiring a current abnormal recovery behavior table corresponding to the current hardware engine core from the abnormal recovery behavior table according to the unique identification code of the current abnormal response module in the general information table if the current error code is detected to be matched with the current error code in the general information table;

the second finding module is used for finding out a corresponding current abnormal recovery behavior function pointer from the current abnormal recovery behavior table according to the current abnormal recovery mode in the general information table;

and the second recovery module is used for recovering against the second type of abnormality by utilizing the current abnormality recovery behavior function.

The device for processing the exception of the system-in-chip further comprises:

the judging module is used for judging whether the processing completion function in the information of the second type of exception is empty, and for triggering the execution module if it is not;

The execution module is used for executing the processing completion function;

and the third recovery module is used for restoring the interrupt mask corresponding to the second type of exception according to the processing completion function.

The system-in-chip exception handling apparatus further includes:

the recording module is used for determining to report the second exception to the server host or record the second exception in the system-in-chip according to the reporting mode in the general information table or the information table of the current hardware engine core;

the system-in-chip exception handling apparatus further includes: and the first determining module is used for determining the reporting mode.

The first determining module specifically includes:

the second determining module is used for determining that the reporting mode is to report the second abnormality to the server host when the severity of the second abnormality is detected to be larger than a preset value and smaller than the severity of the first abnormality and/or when the server host is detected to send prompt information for representing acquisition abnormality reporting to the system-in-chip;

and the third determining module is used for determining that the reporting mode is to record the second abnormality in the system-in-chip under the condition that the system-in-chip is not damaged and/or when the severity of the second abnormality is detected to be smaller than a preset value.

Since the embodiments of the apparatus portion correspond to the embodiments of the method portion, refer to the description of the method embodiments for details of the apparatus embodiments, which are not repeated here. The apparatus has the same advantageous effects as the above method for processing an exception of a system-on-chip.

Fig. 9 is a block diagram of an apparatus for handling an exception of a system-on-chip according to another embodiment of the present invention. The processing apparatus of the abnormality of the system-on-chip of the present embodiment includes, based on the hardware angle, as shown in fig. 9:

a memory 20 for storing a computer program;

a processor 21 for implementing the steps of the method of handling exceptions of a system-on-chip as mentioned in the above embodiments when executing a computer program.

Processor 21 may include one or more processing cores, such as a 4-core processor, an 8-core processor, etc. The processor 21 may be implemented in hardware in at least one of a digital signal processor (Digital Signal Processor, DSP), a Field programmable gate array (Field-Programmable Gate Array, FPGA), a programmable logic array (Programmable Logic Array, PLA). The processor 21 may also comprise a main processor, which is a processor for processing data in an awake state, also called central processor (Central Processing Unit, CPU), and a coprocessor; a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 21 may be integrated with a graphics processor (Graphics Processing Unit, GPU) for taking care of rendering and drawing of content that the display screen is required to display. In some embodiments, the processor 21 may also include an artificial intelligence (Artificial Intelligence, AI) processor for processing computing operations related to machine learning.

Memory 20 may include one or more computer-readable storage media, which may be non-transitory. Memory 20 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 20 is at least used for storing a computer program 201 which, after being loaded and executed by the processor 21, can implement the relevant steps of the method for processing an exception of a system-on-chip disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 20 may further include an operating system 202, data 203, and the like, and the storage may be transient or permanent. The operating system 202 may include Windows, Unix, Linux, among others. The data 203 may include, but is not limited to, data related to the above-mentioned method for processing an exception of a system-on-chip.

In some embodiments, the system-on-chip exception handling device may further include a display 22, an input-output interface 23, a communication interface 24, a power supply 25, and a communication bus 26.

Those skilled in the art will appreciate that the structure shown in Fig. 9 does not constitute a limitation on the device for processing an exception of a system-on-chip, which may include more or fewer components than shown.

The device for processing an exception of a system-on-chip provided by the embodiment of the invention comprises a memory and a processor. When executing the program stored in the memory, the processor can implement the above method for processing an exception of a system-on-chip, with the same effects.

Finally, the invention also provides a corresponding embodiment of the computer readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps as described in the method embodiments above.

It will be appreciated that the methods of the above embodiments, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored on a computer readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium for performing all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

The computer-readable storage medium provided by the invention stores a program implementing the above method for processing an exception of a system-on-chip, and its effects are the same as those of the method.

The method, the device, the equipment and the medium for processing an exception of a system-on-chip provided by the invention have been described in detail above. In the description, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method. It should be noted that those skilled in the art can make various modifications and adaptations of the invention without departing from its principles, and these modifications and adaptations are intended to fall within the scope of the invention as defined in the following claims.

It should also be noted that in this specification, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A method for processing an exception of a system-on-chip, characterized in that the method is applied to the debugging and development process of the system-on-chip and comprises the following steps:

establishing a core dump file corresponding to the first type of exception according to the information of the first type of exception in the interrupt callback function, and processing the first type of exception according to the core dump file;

and sending the information of the second type of exception to an exception handling subsystem, and processing the second type of exception according to an exception recovery strategy that is preset in the exception handling subsystem and corresponds to the second type of exception.
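For illustration only, the two branches recited in claim 1 can be sketched as follows. This is a minimal model under stated assumptions, not the implementation of the invention: the names `Severity`, `ExceptionInfo`, `RecoverySubsystem`, `write_core_dump` and `interrupt_callback`, the JSON dump format, and the file naming are all hypothetical and do not appear in the claims.

```python
import json
import os
import tempfile
from dataclasses import dataclass, asdict
from enum import Enum


class Severity(Enum):
    FIRST_TYPE = 1   # high severity: preserve the problem scene in a core dump
    SECOND_TYPE = 2  # lower severity: hand over to the recovery subsystem


@dataclass
class ExceptionInfo:
    error_code: int
    severity: Severity
    detail: str


class RecoverySubsystem:
    """Hypothetical stand-in for the exception handling subsystem."""

    def __init__(self, policies):
        # policies maps an error code to a preset recovery callable (assumed shape)
        self.policies = policies

    def handle(self, info):
        policy = self.policies.get(info.error_code)
        if policy is None:
            return "no-policy"
        return policy(info)


def write_core_dump(info, directory):
    """Write the exception information to a dump file and return its path."""
    path = os.path.join(directory, f"coredump_{info.error_code:#06x}.json")
    record = asdict(info)
    record["severity"] = info.severity.name  # make the enum JSON-serializable
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(record, fh)
    return path


def interrupt_callback(info, subsystem, dump_dir):
    """Route an exception according to its class."""
    if info.severity is Severity.FIRST_TYPE:
        return ("core_dump", write_core_dump(info, dump_dir))
    return ("recovered", subsystem.handle(info))


if __name__ == "__main__":
    subsystem = RecoverySubsystem({0x21: lambda info: "task_retry"})
    dump_dir = tempfile.mkdtemp()
    print(interrupt_callback(
        ExceptionInfo(0x10, Severity.FIRST_TYPE, "bus fault"), subsystem, dump_dir))
    print(interrupt_callback(
        ExceptionInfo(0x21, Severity.SECOND_TYPE, "timeout"), subsystem, dump_dir))
```

In this sketch the first-type branch only persists information for later inspection, while the second-type branch returns whatever the preset policy decides, which mirrors the split between problem localization and service recovery described in the abstract.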

2. The method for handling exceptions in a system-on-chip as recited in claim 1, wherein the core dump file contains hardware state information and software state information; the hardware state information comprises current version information of the system-level chip, serial numbers of the system-level chip, register information of a central processing unit, current state machine information of hardware and hardware cache information; the software state information comprises stack information, global variables and software state information;
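As a non-limiting illustration, the hardware state information and software state information listed in claim 2 could be grouped in a record such as the one below. The class names, field names, field types, and the hexadecimal encoding of the cache contents are assumptions made for this sketch; the claim itself does not prescribe any layout or file format.

```python
import json
from dataclasses import dataclass, asdict
from typing import Dict, List


@dataclass
class HardwareState:
    chip_version: str              # current version information of the chip
    chip_serial_number: str        # serial number of the chip
    cpu_registers: Dict[str, int]  # register information of the CPU
    state_machine: Dict[str, str]  # current state machine information of hardware
    hardware_cache: bytes          # hardware cache information


@dataclass
class SoftwareState:
    stack_frames: List[str]            # stack information
    global_variables: Dict[str, int]   # global variables


@dataclass
class CoreDump:
    hardware: HardwareState
    software: SoftwareState

    def to_dict(self):
        """Return a JSON-friendly dictionary (bytes are hex-encoded)."""
        d = asdict(self)
        d["hardware"]["hardware_cache"] = self.hardware.hardware_cache.hex()
        return d


if __name__ == "__main__":
    dump = CoreDump(
        HardwareState("A0", "SN0001", {"pc": 0x80000100, "sp": 0x20001000},
                      {"dma": "BUSY"}, b"\xde\xad"),
        SoftwareState(["isr_entry", "dma_submit"], {"pending_tasks": 3}),
    )
    print(json.dumps(dump.to_dict(), indent=2))
```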

3. The method for processing the exception of the system-on-chip according to claim 1 or 2, wherein the processing the first type of exception according to the core dump file comprises:

4. The method for processing an exception of a system-on-chip according to claim 2, wherein the exception recovery policy is determined by an exception handling information table and an exception recovery behavior table, and before the processing of the second type of exception according to the exception recovery policy corresponding to the second type of exception preset in the exception handling subsystem, the method further comprises:

5. The method for processing an exception of a system-on-chip according to claim 4, wherein the common information table and the information table of each hardware engine core each include an error code, an error code mask, a reporting mode, an exception recovery mode, and a unique identification code of an exception response module; the exception response module identified by the unique identification code recovers from the exception according to the exception recovery behavior table and in combination with the reporting mode; the exception recovery mode comprises at least one of task retry, notification, and reconfiguration of the hardware engine core;
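The table fields named in claim 5 can be pictured with the following sketch, which looks up an information table entry by masked error code and then dispatches through a recovery behavior table. Everything beyond the field list is an assumption for illustration: the matching rule `(status & mask) == code`, the example reporting mode strings "interrupt" and "polling", and the names `InfoTableEntry`, `find_entry`, `make_behavior_table` and `recover` are hypothetical and are not taken from the claims.

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryMode(Enum):
    TASK_RETRY = "task_retry"
    NOTIFY = "notify"
    RECONFIGURE_ENGINE = "reconfigure_engine"


@dataclass(frozen=True)
class InfoTableEntry:
    error_code: int
    error_mask: int
    reporting_mode: str        # assumed example values: "interrupt", "polling"
    recovery_mode: RecoveryMode
    response_module_id: int    # unique identification code of the response module


def find_entry(table, raw_status):
    """Return the first entry whose masked status equals its error code."""
    for entry in table:
        if (raw_status & entry.error_mask) == entry.error_code:
            return entry
    return None


def make_behavior_table(log):
    """Build a hypothetical recovery behavior table that records actions in log."""
    return {
        RecoveryMode.TASK_RETRY:
            lambda e: log.append(f"module {e.response_module_id}: retry task"),
        RecoveryMode.NOTIFY:
            lambda e: log.append(f"module {e.response_module_id}: notify"),
        RecoveryMode.RECONFIGURE_ENGINE:
            lambda e: log.append(f"module {e.response_module_id}: reconfigure engine"),
    }


def recover(table, behaviors, raw_status):
    """Look up the entry for raw_status and run its recovery behavior."""
    entry = find_entry(table, raw_status)
    if entry is None:
        return None
    behaviors[entry.recovery_mode](entry)
    return entry.recovery_mode


if __name__ == "__main__":
    log = []
    table = [
        InfoTableEntry(0x01, 0x0F, "interrupt", RecoveryMode.TASK_RETRY, 7),
        InfoTableEntry(0x20, 0xF0, "polling", RecoveryMode.RECONFIGURE_ENGINE, 9),
    ]
    recover(table, make_behavior_table(log), 0xA1)
    recover(table, make_behavior_table(log), 0x22)
    print(log)
```

Keeping the recovery behaviors in a separate table keyed by recovery mode means that adding a new mode only requires a new table row, not a change to the lookup logic.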

6. The method for processing an exception of a system-on-chip according to claim 5, further comprising, after said recovering from said second type of exception using said current exception recovery behavior function:

if not, executing the processing completion function;

7. The method for processing an exception of a system-on-chip according to claim 5, further comprising, after said recovering from said second type of exception using said current exception recovery behavior function:

wherein, determining the reporting mode includes:

8. An apparatus for processing exception of a system-on-chip, the apparatus being applied to a system-on-chip debugging and developing process, the apparatus comprising:

9. An apparatus for handling exceptions in a system-on-chip, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the method for processing an exception of a system-on-chip according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the method for processing an exception of a system-on-chip according to any one of claims 1 to 7.