CN115617557A - Abnormality supervision system, abnormality supervision method, storage medium, and vehicle - Google Patents

Abnormality supervision system, abnormality supervision method, storage medium, and vehicle

Info

Publication number
CN115617557A
CN115617557A CN202211288134.9A CN202211288134A CN115617557A CN 115617557 A CN115617557 A CN 115617557A CN 202211288134 A CN202211288134 A CN 202211288134A CN 115617557 A CN115617557 A CN 115617557A
Authority
CN
China
Prior art keywords
layer
supervision
unit
exception
subsystem
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211288134.9A
Other languages
Chinese (zh)
Inventor
Name withheld on request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambrian Jixingge Nanjing Technology Co ltd
Original Assignee
Cambrian Jixingge Nanjing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambrian Jixingge Nanjing Technology Co ltd
Priority to CN202211288134.9A
Publication of CN115617557A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0736 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in functional embedded systems, i.e. in a data processing system designed as a combination of hardware and software dedicated to performing a certain function
    • G06F11/0739 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in functional embedded systems, i.e. in a data processing system designed as a combination of hardware and software dedicated to performing a certain function in a data processing system embedded in automotive or aircraft systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/24 Resetting means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Hardware Redundancy (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The disclosure provides an abnormality supervision system, an abnormality supervision method, a storage medium, and a vehicle. A supervision method implemented on the abnormality supervision system can reduce the processing load of the external high-safety-level MCU; in addition, when an abnormality is repaired, resetting the whole SoC because of an abnormality in an individual subsystem can be avoided.

Description

Abnormality supervision system, abnormality supervision method, storage medium, and vehicle
Technical Field
The present disclosure relates generally to the field of intelligent driving. More particularly, the present disclosure relates to an abnormality supervision system, method, storage medium, and vehicle.
Background
An Advanced Driver Assistance System (ADAS) is an active safety technology. The safety technology can sense the surrounding environment at any time in the driving process of the automobile, collect data, identify, detect and track static and dynamic objects, combine navigation map data and carry out systematic operation and analysis, so that a driver can perceive possible dangers in advance, and the comfort and safety of automobile driving are effectively improved.
An ADAS has a safety diagnosis mechanism: when hardware or software in the system becomes abnormal, the abnormal event is reported and repaired. Existing reporting mechanisms are generally implemented on an MCU-SoC supervision architecture, in which the MCU (Micro Controller Unit) is an external high-safety-level microcontroller used to monitor abnormal events in the SoC (System on Chip) that runs the ADAS.
The existing exception reporting mechanism based on the MCU-SoC supervision architecture mainly has the following defects:
1. An existing SoC is internally divided into a plurality of subsystems, including a performance domain subsystem, a real-time domain subsystem, a boot domain subsystem, a security domain subsystem, and the like. Abnormal events can occur in any subsystem, and every subsystem reports its running exceptions to the MCU, which places an excessive load on the MCU.
2. When the software of a subsystem hangs because of a software or hardware fault in that subsystem, the whole SoC must be reset, so the repair cost is too high.
Disclosure of Invention
To address at least one or more of the above-mentioned technical problems, the present disclosure proposes, in various aspects, an exception supervision scheme for a system on chip, in which at least two layers of supervision units are divided within the system on chip, and each subsystem in the system on chip is independently supervised.
In a first aspect, the present disclosure provides an anomaly supervision system for supervising a chip system, the chip system having a system on chip (SoC) with a plurality of first subsystems and a second subsystem, the anomaly supervision system comprising: first-layer supervision units, each running in one of the first subsystems and used for supervising the running exceptions of that first subsystem and for selectively reporting the corresponding running exception to the second-layer supervision unit according to the exception type of the running exception; and a second-layer supervision unit, running in the second subsystem and used for collecting the running exceptions of each subsystem in the system on chip.
In a second aspect, the present disclosure provides an anomaly supervision method for supervising a chip system, the chip system having a system on chip (SoC) with a plurality of first subsystems and a second subsystem, the method comprising: a first-layer supervision unit supervises the running exceptions of a first subsystem, wherein the first-layer supervision units run in the first subsystems and correspond to the first subsystems one to one; if a first subsystem has a running exception, its first-layer supervision unit selectively reports the corresponding running exception to the second-layer supervision unit according to the exception type of the running exception; and the second-layer supervision unit collects the running exceptions of each subsystem in the system on chip, wherein the second-layer supervision unit runs in the second subsystem.
In a third aspect, the present disclosure provides a computer readable storage medium storing computer program code for an exception supervision method, which when executed, performs the above exception supervision method.
In a fourth aspect, the present disclosure provides a vehicle having a chip system mounted thereon which is supervised by the abnormality supervision system as described above.
With the exception supervision scheme provided above, supervision is divided into layers inside the system on chip: a specific subsystem (e.g., the security domain subsystem) is set up as an intermediate layer (e.g., the second-layer supervision unit) for exception reporting within the system on chip. Some running exceptions can therefore be repaired at the second-layer supervision unit without every running exception being reported to the MCU, which reduces the processing load of the MCU. In addition, each subsystem has its own supervision unit that reports running exceptions independently, so the upper-layer supervision unit can accurately locate the subsystem in which a running exception originated and carry out a targeted repair operation (such as a reset) on that subsystem, avoiding a reset of the whole SoC because of an exception in an individual subsystem.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to like or corresponding parts and in which:
FIG. 1 illustrates a simplified deployment diagram of an anomaly supervision system and a chip system supervised by the anomaly supervision system according to an embodiment of the disclosure;
FIG. 2 illustrates another schematic deployment diagram of an anomaly supervision system and a chip system supervised by the anomaly supervision system according to an embodiment of the disclosure;
FIG. 3 shows a schematic structural diagram of an FMU deployment in an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a security redundancy mechanism for exception reporting in an embodiment of the disclosure;
FIG. 5 illustrates an exemplary method flow diagram of an anomaly supervision method according to one embodiment of the present disclosure; and
FIG. 6 illustrates an exemplary flow chart of an anomaly supervision method according to another embodiment of the present disclosure;
The reference numerals in the figures denote the following modules:
10-SoC, 20-MCU, 11-first subsystem, 12-second subsystem, 21-safety system, 111-first layer supervision unit, 121-second layer supervision unit, 211-third layer supervision unit.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," and "third," etc. as may be used in the claims, the description, and the drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
As mentioned in the background, the existing exception reporting mechanism based on the MCU-SoC supervision architecture is not suited, in terms of system design, to today's highly complex SoCs. On the one hand, because the SoC is highly complex and has many exception sources, these sources are usually distributed across different SoC subsystems, and it is impractical for every subsystem to report its exceptions directly to the MCU. Even if the exception signals of all subsystems were gathered at the MCU, the MCU supervision software would become complicated and its load too high. On the other hand, each SoC subsystem often runs an independent software system, and when the MCU detects that the software of some SoC subsystem has hung because of a software or hardware fault in that subsystem, the whole SoC chip is reset, which is costly.
In view of this, the embodiments of the present disclosure provide an exception supervision system to meet the system function security requirements of such high performance SoC chips. In some embodiments, the exception oversight scheme of the present disclosure both satisfies the time constraints of the security mechanisms and covers as seamlessly as possible the multitude of exception checkpoints distributed within the SoC. Thus, the exception supervision scheme of the disclosed embodiments introduces a design idea of multi-level supervision, distributing supervision behavior (optionally also diagnostic behavior) to different levels of hardware entities at the system level.
Fig. 1 is a schematic deployment diagram of the anomaly supervision system and the chip system it supervises. As shown, the chip system requiring safety supervision in the present disclosure includes an SoC. The SoC includes a plurality of processing cores on which the subsystems run (depending on the actual situation, one processing core may run one subsystem, or several subsystems may run on one processing core by means of virtual core partitioning; the figure shows only one such case). In the disclosed embodiment, the subsystems running on the SoC are supervised hierarchically by the anomaly supervision system. That is, a subsystem with a higher hardware safety level (for example, one whose processing core has a lockstep mechanism and a memory isolation mechanism) is selected as the second subsystem 12, and a second-layer supervision unit 121 is deployed in it to collect the running exceptions of each subsystem in the system on chip. The remaining subsystems in the SoC serve as first subsystems 11; a first-layer supervision unit 111 is deployed in each first subsystem to supervise that subsystem's running exceptions and, according to the exception type, to selectively report the corresponding running exception to the second-layer supervision unit.
Running exception: an exception of a hardware module or software program that occurs while a subsystem is running.
Exception type: the class into which a running exception is placed according to a particular rule.
In a more complete chip system scenario, the chip system also includes an MCU. Fig. 2 is another schematic deployment diagram of the anomaly supervision system and the chip system it supervises. As shown in the figure, the chip system requiring safety supervision in the present disclosure includes an MCU and an SoC. The MCU is an external high-safety-level MCU used for monitoring abnormal events in the SoC that runs the ADAS. The deployment of the SoC is similar to that of Fig. 1 and is not repeated here.
A safety system 21 (e.g., a safety real-time operating system) runs in the MCU, and a third-layer supervision unit 211 is deployed in the safety system to supervise the running state of each subsystem in the system on chip according to the running exceptions reported by the first-layer supervision unit and/or the second-layer supervision unit.
In one embodiment, running exceptions may be classified by severity. In increasing order of severity, the exception types are: warning-level exceptions, error-level exceptions, and fatal-error-level exceptions.
In another embodiment, running exceptions may be classified by reporting target, for example into a first reporting type and a second reporting type. An exception of the first reporting type is reported to the second-layer supervision unit, and an exception of the second reporting type is reported to the third-layer supervision unit. The reporting target needs to be preset by taking into account the software and hardware characteristics of the second subsystem and of the MCU, as well as the efficiency with which each can repair the various running exceptions.
In practical applications, the dividing manner of the exception type may be determined according to actual requirements, and is not limited herein.
Taking exception types classified by severity as an example, the supervision units of each layer are configured so that a fatal-error-level exception must be reported to the third-layer supervision unit (that is, the third-layer supervision unit is configured with the exception repair mechanism for fatal-error-level exceptions, or only the third-layer supervision unit is able to repair them). A fatal-error-level exception may be reported step by step (the first-layer supervision unit reports to the second-layer supervision unit, which then reports to the third-layer supervision unit), or the first-layer supervision unit may report it directly to the third-layer supervision unit. If the exception type is a warning-level exception or an error-level exception, the first-layer supervision unit reports it to the second-layer supervision unit. If that warning-level or error-level exception can be repaired locally by the first-layer supervision unit, the corresponding exception repair mechanism is executed at the same time as the report. When the second-layer supervision unit observes, within the set supervision time, that the processing state of the warning-level or error-level exception is "repaired", the exception does not need to be reported further to the third-layer supervision unit, which further reduces the processing load of the MCU.
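The severity-based reporting rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `Severity`, `first_layer_actions`, and `must_escalate_to_layer3` are hypothetical, the direct-reporting variant is chosen for fatal-error-level exceptions, and escalation of an unrepaired exception after the supervision time is an assumption drawn from the timer behaviour described for the upper layers.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Exception types in increasing order of severity."""
    WARNING = 1
    ERROR = 2
    FATAL = 3


def first_layer_actions(severity: Severity, locally_repairable: bool) -> list:
    """Actions a first-layer supervision unit takes for one running exception.

    Assumption: fatal-error-level exceptions use the direct-reporting variant;
    the step-by-step variant would report to layer 2 first.
    """
    if severity is Severity.FATAL:
        return ["report_to_layer3"]
    actions = ["report_to_layer2"]
    if locally_repairable:
        # Local repair runs at the same time as the report.
        actions.append("run_local_repair")
    return actions


def must_escalate_to_layer3(severity: Severity, repaired_in_window: bool) -> bool:
    """Whether the second layer still has to involve the third layer (MCU)."""
    if severity is Severity.FATAL:
        return True
    # A warning/error-level exception seen as "repaired" in time stays on the SoC.
    return not repaired_in_window
```

For instance, `first_layer_actions(Severity.WARNING, True)` yields both a report to the second layer and a local repair, while a repaired warning never reaches the MCU.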
Selection of the second subsystem
The second-layer supervision unit is responsible for supervising the running exceptions of every subsystem in the system on chip, so the second subsystem in which it resides needs stronger safety functions, so that errors can still be reported externally when other subsystems can no longer operate. These safety functions include, but are not limited to, a lockstep mechanism and a memory isolation mechanism.
Lockstep mechanism: the corresponding processing core has a hardware self-checking module that can periodically execute the same instruction, compare the results of the executions, and determine that a running exception exists if the results are inconsistent.
Memory isolation mechanism: key programs and data are stored in an isolated memory area, so that this area is not occupied when other programs run.
During system configuration, a processing core with a lockstep mechanism and a memory isolation mechanism is generally selected for deploying the security domain subsystem. Thus, when setting the hierarchy of supervision units, the security domain subsystem is preferably selected as the second subsystem, and the remaining subsystems (the performance domain subsystem, the real-time domain subsystem, or the boot domain subsystem) serve as first subsystems.
Reporting mode
Heartbeat packet mode: the lower-layer supervision unit periodically sends heartbeat packet messages to the upper-layer supervision unit over a Serial Peripheral Interface (SPI) bus. Each heartbeat packet message includes a running state flag bit, which indicates whether the corresponding subsystem has a running exception. Note that "lower layer" and "upper layer" are relative concepts in the embodiments of the present disclosure and do not refer to a fixed layer of supervision unit. For example, the second layer is an upper layer relative to the first layer and a lower layer relative to the third layer.
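As a rough illustration of a heartbeat packet message carrying a running state flag bit, the following Python sketch encodes and decodes such a message. The 4-byte layout (subsystem identifier, sequence number, flags byte) and all names are assumptions made for this example; the disclosure only states that the message contains a running state flag bit, and the SPI transfer itself is not modelled.

```python
import struct

# Assumed layout, big-endian: 1-byte subsystem id, 2-byte sequence number,
# 1-byte flags. Only the flag bit itself comes from the description.
HEARTBEAT_FORMAT = ">BHB"
FLAG_RUNNING_EXCEPTION = 0x01


def encode_heartbeat(subsystem_id: int, sequence: int, has_exception: bool) -> bytes:
    """Build the payload a lower-layer unit would send periodically."""
    flags = FLAG_RUNNING_EXCEPTION if has_exception else 0
    return struct.pack(HEARTBEAT_FORMAT, subsystem_id, sequence & 0xFFFF, flags)


def decode_heartbeat(payload: bytes) -> dict:
    """Parse a payload on the upper-layer side and expose the flag bit."""
    subsystem_id, sequence, flags = struct.unpack(HEARTBEAT_FORMAT, payload)
    return {
        "subsystem_id": subsystem_id,
        "sequence": sequence,
        "has_exception": bool(flags & FLAG_RUNNING_EXCEPTION),
    }
```

The upper-layer unit would call `decode_heartbeat` on each received payload and treat a set flag bit as a reported running exception of the identified subsystem.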
FMU mode: a Fault Management Unit (FMU) is a hardware module used to collect the abnormal interrupt signals that a first subsystem reports to the second subsystem. Specifically, the FMU in a first subsystem collects the abnormal interrupt signals of the hardware modules in that subsystem and, in response to an abnormal interrupt signal that indicates a running exception, reports the running exception to the FMU in the second subsystem. In one implementation scenario, an abnormal interrupt signal at a high level indicates that the first subsystem sending it has a running exception, and a signal at a low level indicates that the first subsystem is running normally. The opposite polarity may also be used.
For example, referring to fig. 3, the FMU is provided with multiple levels of associated FMU signals arranged as a tree. Each FMU signal is configured with a unique identifier, which is recorded in an FMU signal routing table of the anomaly supervision system. Specifically, each first subsystem is provided with a plurality of primary FMU signals (the leaf nodes of the tree), which correspond to the various running exceptions in that first subsystem. The second subsystem is provided with multi-level FMU signals: a secondary FMU signal corresponding to running exceptions in the second subsystem itself and a secondary FMU signal corresponding to the first subsystems, with all secondary FMU signals connected to one tertiary FMU signal. A single secondary FMU signal may be configured for the first subsystems (as shown in FIG. 3), in which case the primary FMU signals of all first subsystems are connected to that one secondary FMU signal; in another embodiment, a separate secondary FMU signal may be provided for each first subsystem. The specific configuration of the secondary FMU signals may be determined according to actual requirements and is not limited herein.
When a lower-level FMU signal is at a high level (the secondary level being above the primary level), the corresponding upper-level FMU signal also goes high; the high level propagates all the way up to the tertiary FMU signal and triggers the Error pin. Through the Error pin, the MCU learns that the SoC has a running exception. The FMU mode is a hardware signal reporting mode with timely signal transmission and good stability.
Error pin: the Error pin is one of the pins in the hardware connection between the SoC and the MCU and is used for Error reporting communication between the SoC and the MCU.
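The tree-shaped FMU signal propagation can be modelled in software as a simple OR-tree. The sketch below is only a behavioural model in Python, assuming the active-high convention; the class `FmuSignal`, the signal identifiers, and `build_example_tree` are hypothetical and do not describe the actual hardware.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FmuSignal:
    """One FMU signal; a node is high if it, or any signal below it, is raised."""
    signal_id: str  # unique identifier, as kept in the FMU signal routing table
    children: List["FmuSignal"] = field(default_factory=list)
    raised: bool = False  # only set directly on leaf signals

    def level(self) -> bool:
        if not self.children:
            return self.raised
        return any(child.level() for child in self.children)

    def active_leaves(self) -> List[str]:
        """Identifiers of the raised leaf signals, used to locate the source."""
        if not self.children:
            return [self.signal_id] if self.raised else []
        return [s for child in self.children for s in child.active_leaves()]

    def find(self, signal_id: str) -> Optional["FmuSignal"]:
        if self.signal_id == signal_id:
            return self
        for child in self.children:
            hit = child.find(signal_id)
            if hit is not None:
                return hit
        return None


def build_example_tree() -> FmuSignal:
    """Primary signals feed one shared secondary signal; secondaries feed the tertiary."""
    first = FmuSignal("secondary.first_subsystems", [
        FmuSignal("perf.ecc"), FmuSignal("perf.watchdog"), FmuSignal("rt.bus"),
    ])
    own = FmuSignal("secondary.second_subsystem")
    return FmuSignal("tertiary", [first, own])
```

In this model the level of the root node plays the role of the Error pin: raising any primary signal makes `build_example_tree().level()` true, and `active_leaves()` identifies which signal caused it.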
Mailbox mode: the Mailbox is a hardware module configured to enable data transmission among the subsystems in the SoC. A corresponding Mailbox channel is configured between the second subsystem and each first subsystem. The second-layer supervision unit collects the running exceptions from the first-layer supervision units by using the Mailbox and the Inter-Process Communication (IPC) software framework built on it. Unlike the FMU reporting path, the Mailbox reporting path carries not only hardware exception signals but also other exceptions, such as a fault of a sensor connected to the subsystem where the first-layer supervision unit resides, an application fault, and the like.
In practical applications, when a lower-layer supervision unit reports a running exception to an upper-layer supervision unit, two or more reporting modes can be used together to form a safety redundancy mechanism for exception reporting, so that at least one reporting mode successfully delivers the running exception to the upper-layer supervision unit and the exception can be repaired in time. For example, fig. 4 shows a safety redundancy mechanism that uses the heartbeat packet mode and the FMU mode simultaneously: when a hardware module in a first subsystem has a running exception, the exception is reported to the second subsystem through the heartbeat packet path (bold line) and the FMU path (thin line) at the same time.
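The redundancy idea, sending the same running exception over every configured reporting path and succeeding if at least one path delivers it, can be sketched as follows. The function `report_redundantly` and the callable-based path interface are assumptions for illustration; real heartbeat, FMU, and Mailbox paths are hardware or driver interfaces, not Python callables.

```python
from typing import Callable, Dict, List

# A reporting path takes an exception identifier and returns True on delivery.
ReportPath = Callable[[str], bool]


def report_redundantly(exception_id: str, paths: Dict[str, ReportPath]) -> List[str]:
    """Send one running exception over every path; return the paths that delivered it."""
    delivered = []
    for name, send in paths.items():
        try:
            if send(exception_id):
                delivered.append(name)
        except OSError:
            # A broken path must not prevent the remaining paths from being tried.
            continue
    return delivered


def is_reported(delivered: List[str]) -> bool:
    """The report counts as successful if at least one path delivered it."""
    return len(delivered) > 0
```

With two paths configured, a failure of the heartbeat path alone still leaves the exception reported through the FMU path.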
Configuration of the supervision units of each layer
- First-layer supervision unit
Operating environment: the subsystems other than the security domain subsystem (e.g., a performance domain subsystem, a real-time domain subsystem, or a boot domain subsystem);
Supervised object: the subsystem in which the unit is deployed; the unit determines that this subsystem has a running exception by receiving an abnormal interrupt signal within the subsystem;
Reporting target: the second-layer supervision unit or the third-layer supervision unit;
Reporting mode: at least one of the heartbeat packet mode, the FMU mode, and the Mailbox mode;
Exceptions repaired: some warning-level exceptions (e.g., software exceptions and sensor exceptions). When the first-layer supervision unit detects a software exception or a sensor exception, it reports the running exception and may also determine whether the exception can be repaired locally; if so, it executes a first exception repair mechanism. The first exception repair mechanism is the set of repair mechanisms for running exceptions that the first-layer supervision unit can repair locally.
- Second-layer supervision unit
Operating environment: the security domain subsystem (also called a security island, whose hardware resources cannot be accessed from outside), which is deployed on a processing core having a lockstep mechanism and a memory isolation mechanism;
Supervised object: the first subsystems and the second subsystem;
Reporting target: the third-layer supervision unit;
Reporting mode: the heartbeat packet mode and/or the Error pin;
Exceptions repaired: warning-level exceptions and some error-level exceptions;
After receiving a running exception reported by a first-layer supervision unit, the second-layer supervision unit determines whether the exception is one that the first-layer supervision unit can repair. If not, it directly executes a second exception repair mechanism. If so, it sets a first timer that starts when the running exception is received, and executes the second exception repair mechanism when the first timer exceeds a first repair time limit. The first timer measures how long the second-layer supervision unit has waited for the exception to be repaired. The first repair time limit is the maximum time the second-layer supervision unit waits for that repair. The second exception repair mechanism is the set of repair mechanisms for running exceptions that can be repaired by the first-layer supervision unit or by the second-layer supervision unit.
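The timer-based fallback of the second-layer supervision unit can be illustrated with a small Python model. This is a sketch under assumptions: time is passed in explicitly instead of using a hardware timer, and the class `SecondLayerSupervisor` with its method names is hypothetical.

```python
class SecondLayerSupervisor:
    """Behavioural model of the second-layer repair decision with the first timer."""

    def __init__(self, first_repair_time_limit: float):
        self.first_repair_time_limit = first_repair_time_limit
        self.pending = {}          # exception id -> time the first timer started
        self.second_repairs = []   # exceptions handled by the second repair mechanism

    def on_report(self, exception_id: str, repairable_by_first_layer: bool, now: float) -> None:
        if repairable_by_first_layer:
            # Give the first layer a chance: start the first timer on receipt.
            self.pending[exception_id] = now
        else:
            self._run_second_repair(exception_id)

    def on_first_layer_repaired(self, exception_id: str) -> None:
        """The first layer repaired the exception in time; stop waiting."""
        self.pending.pop(exception_id, None)

    def tick(self, now: float) -> list:
        """Intervene for every exception whose first timer exceeded the limit."""
        expired = [e for e, started in self.pending.items()
                   if now - started > self.first_repair_time_limit]
        for exception_id in expired:
            del self.pending[exception_id]
            self._run_second_repair(exception_id)
        return expired

    def _run_second_repair(self, exception_id: str) -> None:
        # Placeholder for the second exception repair mechanism (e.g. a subsystem reset).
        self.second_repairs.append(exception_id)
```

A periodic call to `tick` stands in for the timer expiry; an exception that the first layer repairs before the limit never reaches the second repair mechanism.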
-a third level supervision unit
Operating environment: a safety system of the MCU; the third-layer supervision unit may be integrated as a software component in AUTOSAR-based MCU firmware. AUTomotive Open System ARchitecture (AUTOSAR) is a set of standards jointly established by automobile manufacturers, automotive parts suppliers, and automotive electronics and software companies worldwide.
Supervision objects: the first subsystem and the second subsystem;
Exception repair objects: all fatal-error-level exceptions, and warning-level exceptions;
After receiving an operation exception reported by the first-layer supervision unit or the second-layer supervision unit, the third-layer supervision unit judges whether the operation exception is one that the first-layer or second-layer supervision unit can repair. If not, it directly executes the third exception repair mechanism. If so, it sets a second timer, starts timing from the moment the operation exception is received, and executes the third exception repair mechanism when the second timer exceeds a second repair time limit. The second timer is the timer with which the third-layer supervision unit counts the time it has waited for the exception to be repaired. The second repair time limit is the maximum time the third-layer supervision unit waits for that repair. The third exception repair mechanism is the set of repair mechanisms for repairing operation exceptions of the first subsystem or the second subsystem.
It should be noted that the exception repair capabilities of the supervision units are downward compatible. That is, the third-layer supervision unit has exception repair mechanisms for all operation exceptions. For an exception that the first-layer or second-layer supervision unit can repair, a second timer is set locally in the third-layer supervision unit and starts timing when the operation exception reported by the first-layer or second-layer supervision unit is received. If the second timer exceeds the second repair time limit, an unknown condition has occurred and the first-layer or second-layer supervision unit has failed to repair the operation exception as usual; the third-layer supervision unit then needs to intervene and execute the third exception repair mechanism to repair the operation exception. Similarly, an operation exception that the first-layer supervision unit can repair locally can also be repaired by the second-layer supervision unit. For such an exception, a first timer is set locally in the second-layer supervision unit and starts timing when the operation exception reported by the first-layer supervision unit is received. If the first timer exceeds the first repair time limit, an unknown condition has occurred and the first-layer supervision unit has failed to repair the operation exception as usual; the second-layer supervision unit then needs to intervene and execute the second exception repair mechanism to repair the operation exception.
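Downward compatibility can be read as a subset relation between the repair capabilities of the layers. The sketch below illustrates that reading only; the exception names and their assignment to layers are assumptions, not taken from the disclosure.

```python
# Hypothetical exception names and layer assignments, for illustration only.
LAYER1_REPAIRABLE = frozenset({"sram_ecc_1bit"})
LAYER2_REPAIRABLE = LAYER1_REPAIRABLE | {"sram_ecc_2bit"}
LAYER3_REPAIRABLE = LAYER2_REPAIRABLE | {"watchdog_timeout"}

CAPABILITIES = {1: LAYER1_REPAIRABLE, 2: LAYER2_REPAIRABLE, 3: LAYER3_REPAIRABLE}


def is_downward_compatible(capabilities: dict) -> bool:
    """Check that each layer can repair everything the layer below it can."""
    layers = sorted(capabilities)
    return all(
        capabilities[lower] <= capabilities[upper]
        for lower, upper in zip(layers, layers[1:])
    )


def fallback_chain(exception: str) -> list:
    """Layers able to repair the exception, lowest first.

    The first entry attempts the repair; each later entry steps in only if
    its timer expires before the repair is confirmed.
    """
    return [layer for layer in sorted(CAPABILITIES) if exception in CAPABILITIES[layer]]
```

With these assumed sets, `fallback_chain("sram_ecc_1bit")` lists all three layers, while a watchdog timeout is handled by the third layer alone.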
The exception supervision system described in the embodiments of the present disclosure divides supervision into levels within the system on chip and sets a specific subsystem (e.g., the security domain subsystem) as an intermediate layer (i.e., the second-layer supervision unit) for exception reporting, so that some operation exceptions can be repaired in the second-layer supervision unit without reporting every operation exception to the MCU, thereby reducing the processing load of the MCU. In addition, each subsystem is provided with its own supervision unit that reports operation exceptions independently, so that the upper-level supervision unit can accurately locate the subsystem in which the operation exception occurred and carry out a targeted repair operation (such as a reset) on that subsystem, avoiding the situation in which the whole SoC must be reset because of an exception in an individual subsystem. Furthermore, when a subsystem has an operation exception, a repair mechanism is available in the first-layer or second-layer supervision unit, so the repair does not have to wait until the exception has been reported layer by layer. This improves the repair efficiency for operation exceptions and reduces the processing burden of the upper-layer supervision units.
Accordingly, based on the above-mentioned anomaly supervision system, an anomaly supervision method is further provided in the embodiments of the present disclosure, please refer to fig. 5, and fig. 5 shows an exemplary method flowchart of the anomaly supervision method according to an embodiment of the present disclosure.
As shown in fig. 5, in step 501, a first layer supervision unit supervises an abnormal operation of a first subsystem; the first-layer supervision units run in the first subsystem, and the first-layer supervision units correspond to the first subsystem one to one.
An operation exception in the embodiments of the present disclosure refers to an exception of a hardware module or a software program occurring during the operation of a subsystem.
As shown in fig. 2, the exception supervision system described in the embodiments of the present disclosure includes a first-layer supervision unit, a second-layer supervision unit, and a third-layer supervision unit, deployed respectively in the systems run by the first subsystem and the second subsystem of the SoC and by the MCU. The SoC has a plurality of first subsystems, each provided with one first-layer supervision unit. The second subsystem of the SoC has one second-layer supervision unit, which is deployed on the second subsystem, supervises the operation exceptions of each first subsystem, and also supervises the operation exceptions of the second subsystem locally.
In step 502, if an operation abnormality occurs in the first subsystem, the first layer supervisory unit determines an abnormality type of the operation abnormality.
The exception type in the embodiments of the present disclosure refers to a category into which the operation exception is classified according to a specific rule.
In one embodiment of the exception type, operation exceptions may be classified according to their severity. In order of increasing severity, the exception types are: warning-level exceptions, error-level exceptions, and fatal-error-level exceptions. For example:
A warning-level exception is generally defined as an operation exception that does not affect the normal operation of the SoC, such as an SRAM ECC 1-bit error; although the operation exception occurs, the subsystem can repair it by itself through its checking mechanism.
An error-level exception is generally defined as an operation exception that the first subsystem cannot repair locally by itself, such as an SRAM ECC 2-bit error, which may cause a failure immediately or may not cause one for a considerable period of time.
A fatal-error-level exception is generally defined as an operation exception that can quickly cause the whole SoC or any subsystem to stop working, such as a Watchdog Timer interrupt; if a Watchdog Timer timeout interrupt occurs, the system is usually stuck and can only be reset.
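The three severity levels and the example faults named above can be captured in a small lookup. This is a sketch only: the string identifiers are hypothetical, and treating an unknown fault as fatal is an assumed fail-safe default that the source does not specify.

```python
from enum import IntEnum


class Severity(IntEnum):
    WARNING = 1       # does not affect normal SoC operation
    ERROR = 2         # the first subsystem cannot repair it locally
    FATAL_ERROR = 3   # quickly stops the SoC or a subsystem


# Example faults from the text; the string keys are hypothetical identifiers.
SEVERITY_BY_FAULT = {
    "sram_ecc_1bit": Severity.WARNING,
    "sram_ecc_2bit": Severity.ERROR,
    "watchdog_timeout": Severity.FATAL_ERROR,
}


def classify(fault: str) -> Severity:
    """Return the severity of a fault; unknown faults are assumed fatal."""
    return SEVERITY_BY_FAULT.get(fault, Severity.FATAL_ERROR)
```

Using an ordered enumeration keeps the "small to large" ordering of the text directly comparable in code.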
In another embodiment of the exception type, operation exceptions may be classified according to the reporting object, for example into a first reporting type and a second reporting type. An exception of the first reporting type is reported to the second-layer supervision unit, and an exception of the second reporting type is reported to the third-layer supervision unit. The reporting object needs to be preset by taking into account the software and hardware characteristics of the second subsystem and the MCU, as well as the repair efficiency for the various operation exceptions.
In practical applications, the dividing manner of the exception type may depend on actual requirements, and is not limited herein.
When the second-layer supervision unit receives an operation exception reported by the first-layer supervision unit, it determines whether the operation exception is one that the first-layer supervision unit can repair, and if so, starts a third timer. When the operation exception is repaired, the third timer stops timing. When the third timer exceeds the reporting time limit, the second-layer supervision unit reports the operation exception to the third-layer supervision unit. The third timer measures, within the second-layer supervision unit, the time taken by the first-layer supervision unit to repair the operation exception; if that time exceeds the preset reporting time limit, the corresponding operation exception is reported to the third-layer supervision unit.
In step 503, the first layer supervising unit reports the abnormal operation to the second layer supervising unit or the third layer supervising unit according to the abnormal type of the abnormal operation; the second layer of supervision unit runs in the second subsystem, and the third layer of supervision unit runs in the MCU.
Taking the types of the exceptions classified according to the severity as examples (i.e., the exception types include warning-level exception, error-level exception, and fatal error-level exception), it is assumed that the fatal error-level exception must be reported to the third-layer supervisory unit (i.e., the third-layer supervisory unit is configured with an exception repair mechanism corresponding to the fatal error-level exception, or only the third-layer supervisory unit has the capability of repairing the fatal error-level exception). And if the exception type is determined to be warning-level exception or error-level exception, the first-layer supervision unit selects to report the running exception to the second-layer supervision unit.
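The routing rule of step 503 under the severity-based classification can be sketched as a single function. The names `Severity`, `Target`, and `report_target` are hypothetical and only illustrate the rule stated above.

```python
from enum import Enum


class Severity(Enum):
    WARNING = "warning"
    ERROR = "error"
    FATAL_ERROR = "fatal_error"


class Target(Enum):
    SECOND_LAYER = "second-layer supervision unit (second subsystem)"
    THIRD_LAYER = "third-layer supervision unit (MCU)"


def report_target(severity: Severity) -> Target:
    """Choose the reporting object for an operation exception (step 503)."""
    if severity is Severity.FATAL_ERROR:
        # Fatal-error-level exceptions are assumed to go straight to the MCU.
        return Target.THIRD_LAYER
    # Warning-level and error-level exceptions go to the second layer.
    return Target.SECOND_LAYER
```

Under the alternative classification by reporting object, the same function would simply map the first reporting type to the second layer and the second reporting type to the third layer.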
In the exception supervision method described in the embodiments of the present disclosure, the first-layer supervision unit may choose to report some operation exceptions to the second-layer supervision unit (that is, some operation exceptions can be repaired in the second-layer supervision unit), so that not every operation exception needs to be reported to the MCU, which reduces the processing load of the MCU. In addition, each subsystem is provided with its own supervision unit that reports operation exceptions independently, so that the upper-level supervision unit can accurately locate the subsystem in which the operation exception occurred and carry out a targeted repair operation (such as a reset) on that subsystem, avoiding the situation in which the whole SoC must be reset because of an exception in an individual subsystem.
In practical applications, there are multiple reporting modes for operation exceptions. The disclosure provides another embodiment in which at least two reporting modes are executed simultaneously when reporting, so as to form safety redundancy. If a hardware reporting mechanism and a software reporting mechanism exist at the same time, the interrupt state caused by an operation exception in the hardware circuit can also be updated through the software mechanism, so that the third-layer supervision unit running in the MCU can keep track of the current operation state of each subsystem in the SoC in a timely manner. Referring to fig. 6, fig. 6 illustrates an exemplary method flowchart of an exception supervision method according to another embodiment of the present disclosure.
In step 601, the first level supervisory unit monitors the first subsystem for a fatal error level anomaly.
In the embodiment of the present disclosure, the first-layer supervision unit reports the fatal-error-level exception through the FMU mode (flow a below, corresponding to steps 602a to 606a) and the heartbeat packet mode (flow b below, corresponding to steps 602b to 606b) at the same time.
In step 602a, the first layer supervisory unit determines a hardware FMU signal corresponding to the fatal error-level anomaly and sets the hardware FMU signal to a high level.
In step 603a, the second layer supervisory unit sets the Error pin to high level according to the high level hardware FMU signal.
For example, referring to fig. 4, when the corresponding FMU signal in the first subsystem is set to high level, the high level is propagated level by level to the upper-level FMU signals, and when the highest-level FMU signal is set to high level, the Error pin connecting the SoC and the MCU is also pulled to high level.
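The level-by-level propagation can be modelled as a tree in which each FMU output is high when its own fault signal or any lower-level output is high. This is an illustrative model only; the OR-combination, the `FmuNode` class, and the subsystem names are assumptions, since fig. 4 is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class FmuNode:
    """One FMU level; `fault` models a locally latched hardware fault signal."""
    name: str
    fault: bool = False
    children: list = field(default_factory=list)

    def output(self) -> bool:
        # The signal is high if this level or any lower level is high.
        return self.fault or any(child.output() for child in self.children)


def error_pin(top_level: FmuNode) -> bool:
    """The Error pin follows the highest-level FMU signal."""
    return top_level.output()
```

For instance, setting `fault = True` on a node for a first subsystem drives `error_pin` of the top-level node high, and clearing it lets the pin return low.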
In step 604a, the third layer supervisory unit monitors the SoC via the Error pin for an abnormal operation, and starts a second timer.
The second timer starts counting when the Error pin is set to high level, and waits for the operation abnormity to be repaired. The second timer is a timer for waiting for the time length used for abnormal repair counted by the third layer of supervision units.
In step 605a, the third layer supervising unit determines whether the second timer exceeds a second repair time limit.
When the second timer exceeds the second repair time limit, step 606a is performed and the third layer supervisory unit runs a third abnormal repair mechanism. The second repair time limit is a maximum time threshold of the third layer supervision unit waiting for the abnormal repair time length. The third abnormal repair mechanism is a set of repair mechanisms for repairing the abnormal operation of the first subsystem or the second subsystem. When a fatal Error level abnormality occurs in the SoC, the first layer of supervision units and the second layer of supervision units often have no repair capability, and the operation abnormality cannot be repaired all the time, so that the Error pin can be in a high level state for a long time. When the second repair time limit is exceeded, the third tier supervisory unit may perform a third exception repair mechanism (e.g., reset the corresponding first subsystem).
When the second timer does not exceed the second repair time limit, step 607 is performed and the third-layer supervision unit determines that the operation exception has been repaired.
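Flow a on the MCU side (steps 604a to 607) can be sketched as a function over sampled Error pin values. The function name, the return strings, and the sampling model are assumptions made for illustration; samples are assumed to arrive more often than the second repair time limit.

```python
def supervise_error_pin(samples, second_repair_time_limit):
    """Follow steps 604a to 607 over (time, pin_is_high) samples.

    Returns "third_repair" if the pin stays high longer than the limit
    (step 606a), "repaired" if it returns low in time (step 607), and
    "idle" if no exception was seen or the outcome is still undecided.
    """
    started = None
    for now, pin_is_high in samples:
        if pin_is_high and started is None:
            started = now                      # step 604a: start the second timer
        elif started is not None:
            if not pin_is_high:
                return "repaired"              # step 607
            if now - started > second_repair_time_limit:
                return "third_repair"          # step 606a
    return "idle"
```

In firmware the same logic would more likely be driven by a pin-change interrupt and a hardware timer than by polling.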
In step 602b, the first layer supervisory unit periodically updates the current hardware state and updates the next heartbeat packet message.
If the hardware is still in the interrupt state, the operation exception information is encapsulated into the next heartbeat packet message.
If the hardware is in a normal state, the normal-operation information is encapsulated into the next heartbeat packet message.
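Step 602b can be illustrated by packing the current hardware state into a fixed-size message. The byte layout, field names, and flag value below are hypothetical; the source states only that the heartbeat packet message carries the operation state.

```python
import struct

# Hypothetical layout: subsystem id (1 byte), sequence number (2 bytes),
# operation state flags (1 byte). Bit 0 set means "operation exception present".
HEARTBEAT_FORMAT = ">BHB"
FLAG_EXCEPTION = 0x01


def build_heartbeat(subsystem_id: int, sequence: int, interrupt_pending: bool) -> bytes:
    """Encode the next heartbeat packet from the current hardware state."""
    flags = FLAG_EXCEPTION if interrupt_pending else 0
    return struct.pack(HEARTBEAT_FORMAT, subsystem_id, sequence & 0xFFFF, flags)


def parse_heartbeat(packet: bytes) -> dict:
    """Decode a heartbeat packet on the receiving supervision unit."""
    subsystem_id, sequence, flags = struct.unpack(HEARTBEAT_FORMAT, packet)
    return {
        "subsystem_id": subsystem_id,
        "sequence": sequence,
        "exception": bool(flags & FLAG_EXCEPTION),
    }
```

A sequence number is included here so that a receiver could detect missed heartbeats; that use is an addition of this sketch, not something the source describes.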
In step 603b, the second layer supervising unit monitors that the first subsystem has abnormal operation through the heartbeat packet message, and reports the abnormal operation to the third layer supervising unit through the heartbeat packet message.
In step 604b, the second-layer supervision unit determines, according to the current heartbeat packet message, whether the corresponding first subsystem has been restored to normal operation.
If yes, step 605b is executed, and the second layer supervision unit restores the Error pin to a low level state and feeds back a signal which runs normally to the third layer supervision unit through the heartbeat packet message.
If not, go to step 606b to wait for the third layer supervisory unit to execute the third abnormal repair mechanism.
In step 607, the third level supervisory unit determines that the operational anomaly has been repaired.
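Flow b on the second-layer side (steps 603b to 606b) can be sketched as a small state update per received heartbeat. The `state` dictionary, its two fields, and the action strings are hypothetical names chosen for this illustration.

```python
def handle_heartbeat(exception_flag: bool, state: dict) -> list:
    """Process one heartbeat from a first subsystem; return the actions taken.

    `state` holds two hypothetical fields: "exception_active" and
    "error_pin_high".
    """
    actions = []
    if exception_flag:
        if not state["exception_active"]:
            state["exception_active"] = True
            actions.append("report_exception_to_third_layer")   # step 603b
        else:
            actions.append("wait_for_third_repair")              # step 606b
    elif state["exception_active"]:
        state["exception_active"] = False
        state["error_pin_high"] = False                          # step 605b
        actions.append("report_normal_to_third_layer")
    return actions
```

Returning the actions as a list keeps the sketch free of hardware access; a real unit would drive the Error pin and send its own heartbeat packet at these points.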
In the embodiment of the present disclosure, the FMU mode and the heartbeat packet mode are executed at the same time, forming a safety redundancy mechanism for reporting operation exceptions and improving the safety of the supervised chip system. Moreover, through the heartbeat packet flow, the second-layer supervision unit can learn in time that the corresponding operation exception has been repaired and update the Error pin accordingly, so that the third-layer supervision unit is promptly synchronized with the information that the operation exception has been repaired, which improves the supervision efficiency of the exception supervision system.
Correspondingly, the embodiment of the disclosure also provides a vehicle, and the chip system loaded on the vehicle is supervised by the abnormity supervision system. For a specific exception supervision process, reference may be made to the above embodiments of the exception supervision system and the exception supervision method, which are not described herein again.
It is noted that for the sake of brevity, this disclosure presents some methods and embodiments thereof as a series of acts or combinations thereof, but those skilled in the art will appreciate that the disclosed aspects are not limited by the order of acts described. Accordingly, it will be appreciated by those skilled in the art in light of the disclosure or teachings of the present disclosure that certain steps therein may be performed in other sequences or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure are capable of alternative embodiments, in that the acts or modules involved are not necessarily required for the implementation of the solution or solutions of the disclosure. In addition, the present disclosure also focuses on the description of some embodiments, depending on the solution. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment may also be referred to in the description of other embodiments.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, method or computer program product. Thus, the present invention may be embodied in the form of: the term "computer readable medium" as used herein refers to any tangible medium that can contain, store, communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Furthermore, in some embodiments, the invention may also be embodied in the form of a computer program product in one or more computer-readable media having computer-readable program code embodied in the medium.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive example) of the computer readable storage medium may include, for example: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. It is intended that the following claims define the scope of the disclosure and that equivalents or alternatives within the scope of these claims be covered thereby.

Claims (23)

1. An anomaly supervision system for supervising a chip system having a system-on-chip SoC with a plurality of first subsystems and a second subsystem, the anomaly supervision system comprising:
the first-layer monitoring units run in the first subsystems and are used for monitoring the running abnormity of the first subsystems and selectively reporting the corresponding running abnormity to the second-layer monitoring units according to the abnormity types of the running abnormity; and
the second layer supervision unit is operated in the second subsystem and used for collecting the operation abnormity of each subsystem in the system on chip.
2. The anomaly supervision system according to claim 1, said system-on-chip further comprising a microcontroller MCU, characterized in that said anomaly supervision system further comprises:
and the third layer of supervision unit runs in the microcontroller and is used for supervising the running state of each subsystem in the system on chip according to the running abnormity reported by the first layer of supervision unit and/or the second layer of supervision unit.
3. The anomaly supervision system according to claim 2,
the second layer supervisory unit is further to: and selecting corresponding operation exception to report to the third layer supervision unit according to the processing state of the operation exception.
4. The anomaly supervision system according to claim 2,
the first layer supervisory unit is further to: and according to the abnormal type of the running abnormity, selecting to report the running abnormity to the second layer supervision unit or the third layer supervision unit.
5. The anomaly supervision system according to claim 4,
the exception types include: warning-level exceptions, error-level exceptions, and fatal-error-level exceptions;
correspondingly, when the first layer supervisory unit executes the reporting operation, the method further includes:
if the abnormal type of the operation abnormity is warning grade abnormity or error grade abnormity, reporting the operation abnormity to the second layer supervision unit;
and if the abnormal type of the running exception is a fatal error-level exception, reporting the running exception to the third-layer supervision unit.
6. Anomaly supervision system according to any one of claims 2 to 5,
when the first layer supervising unit executes the reporting operation, the first layer supervising unit is further configured to: report the abnormal operation through at least one mode of a heartbeat packet mode, a Fault Management Unit (FMU) mode and a Mailbox mode;
and when the second layer supervising unit executes the reporting operation, the method is further configured to: and reporting the abnormal operation through a heartbeat packet mode and/or an Error pin.
7. The anomaly supervision system according to claim 6,
the heartbeat packet mode includes: the lower layer supervising unit periodically sends heartbeat packet messages to the upper layer supervising unit through a Serial Peripheral Interface (SPI) bus;
the heartbeat packet message comprises an operation state flag bit, and the operation state flag bit indicates whether the corresponding subsystem has abnormal operation.
8. The anomaly supervision system according to claim 6,
the FMU mode comprises:
the FMU in the first subsystem collects an abnormal interrupt signal of a hardware module in the subsystem, responds to the abnormal interrupt signal to indicate abnormal operation, and reports the abnormal operation to the FMU in the second subsystem.
9. Exception supervision system according to any of the claims 2 to 5,
the first layer supervisory unit is further to: determining whether the running exception is an exception which can be repaired by a first layer of supervision unit, and if so, executing a first exception repair mechanism;
the second layer supervisory unit is further to: determining whether the operation exception reported by the first layer of monitoring unit is an exception which can be repaired by the second layer of monitoring unit, if so, executing a second exception repairing mechanism;
and the third layer supervision unit is further configured to: and executing a third abnormal repairing mechanism for the abnormal operation reported by the first layer supervision unit and/or the second layer supervision unit.
10. The anomaly supervision system according to claim 9,
the second layer supervisory unit to: if the first layer of supervision unit executes a first abnormal repairing mechanism, a first timer is started, and when the timing of the first timer exceeds a first repairing time limit, a second abnormal repairing mechanism is executed;
the third tier supervisory unit is further to: and if the second layer supervision unit executes a second abnormal repairing mechanism, starting a second timer, and executing a third abnormal repairing mechanism when the timing of the second timer exceeds a second repairing time limit.
11. Exception supervision system according to any of claims 1 to 5,
the first subsystem comprises: a performance domain subsystem, a real-time domain subsystem or a start-up domain subsystem;
the second subsystem comprises: a security domain subsystem.
12. The exception supervision system according to any of claims 1 to 5, wherein the second subsystem operates in a processing core having a lockstep mechanism and a memory isolation mechanism.
13. An exception supervision method for supervising a chip system having a system-on-chip SoC with a plurality of first subsystems and a second subsystem, the method comprising:
the first layer of supervision unit supervises the abnormal operation of the first subsystem; the first-layer supervision units run in the first subsystem, and the first-layer supervision units are in one-to-one correspondence with the first subsystem;
if the first subsystem has abnormal operation, the first layer supervision unit selects whether to report the abnormal operation to a second layer supervision unit according to the abnormal type of the abnormal operation; wherein the second layer supervisory unit operates in the second subsystem.
14. The method of claim 13, wherein the selecting, by the first-layer supervisory unit, whether to report the running exception to the second-layer supervisory unit according to the exception type of the running exception comprises:
the first layer of supervision unit selects to report the running exception to the second layer of supervision unit or the third layer of supervision unit according to the exception type of the running exception; wherein the third layer of supervision units operates in an MCU.
15. The method of claim 14,
the exception types include: warning-level exceptions, error-level exceptions, and fatal-error-level exceptions;
correspondingly, the selecting, by the first-layer supervising unit, to report the operation exception to the second-layer supervising unit or the third-layer supervising unit according to the exception type of the operation exception includes:
if the abnormal type of the operation abnormity is warning grade abnormity or error grade abnormity, reporting the operation abnormity to the second layer supervision unit;
and if the abnormal type of the running abnormity is a fatal error grade abnormity, reporting the running abnormity to the third-layer supervision unit.
16. The method of claim 14, wherein after the second-layer supervisory unit receives the operation exception reported by the first-layer supervisory unit, the method further comprises:
the second layer of supervision unit determines whether the running abnormity is the abnormity which can be repaired by the first layer of supervision unit, and if so, a third timer is started;
when the abnormal operation is repaired, the third timer stops timing;
and when the timing of the third timer exceeds a reporting time limit, the second-layer supervision unit reports the running exception to the third-layer supervision unit.
17. The method of claim 14, further comprising:
the second layer supervision unit supervises the abnormal operation of the second subsystem;
and if the second subsystem has abnormal operation, the second layer supervision unit reports the abnormal operation to a third layer supervision unit.
18. The method of claim 13,
and when the first layer supervision unit executes the reporting operation, reporting the running abnormity through at least one mode of a heartbeat packet mode, a Fault Management Unit (FMU) mode and a Mailbox mode.
19. The method of claim 17,
and when the second-layer supervision unit executes the reporting operation, reporting the running abnormity through a heartbeat packet mode and/or an Error pin.
20. The method of any of claims 14 to 16, wherein after determining that the first subsystem is malfunctioning, the method further comprises:
the first layer of supervision unit determines whether the running exception is an exception which can be repaired by the first layer of supervision unit, and if so, a first exception repair mechanism is executed;
the second-layer supervision unit determines whether the operation exception is an exception which can be repaired by the second-layer supervision unit, and if so, a second exception repair mechanism is executed;
and the third-layer monitoring unit executes a third abnormal repairing mechanism on the abnormal operation reported by the first-layer monitoring unit and/or the second-layer monitoring unit.
21. The method of claim 20, further comprising:
if the first layer of monitoring unit executes a first abnormal repairing mechanism, the second layer of monitoring unit starts a first timer after receiving the running abnormity reported by the first layer of monitoring unit, and executes a second abnormal repairing mechanism when the timing of the first timer exceeds a first repairing time limit;
and if the second-layer monitoring unit executes a second abnormal repairing mechanism, the third-layer monitoring unit starts a second timer after receiving the abnormal operation reported by the second-layer monitoring unit, and executes a third abnormal repairing mechanism when the timing of the second timer exceeds a second repairing time limit.
22. A computer-readable storage medium, characterized in that it stores computer program code for an abnormality supervision method which, when executed, performs the method of any of claims 13 to 21.
23. A vehicle, characterized in that a chip system carried by the vehicle is supervised by the abnormality supervision system according to any one of claims 1 to 12.
CN202211288134.9A 2022-10-20 2022-10-20 Abnormality supervision system, abnormality supervision method, storage medium, and vehicle Pending CN115617557A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211288134.9A CN115617557A (en) 2022-10-20 2022-10-20 Abnormality supervision system, abnormality supervision method, storage medium, and vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211288134.9A CN115617557A (en) 2022-10-20 2022-10-20 Abnormality supervision system, abnormality supervision method, storage medium, and vehicle

Publications (1)

Publication Number Publication Date
CN115617557A true CN115617557A (en) 2023-01-17

Family

ID=84863878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211288134.9A Pending CN115617557A (en) 2022-10-20 2022-10-20 Abnormality supervision system, abnormality supervision method, storage medium, and vehicle

Country Status (1)

Country Link
CN (1) CN115617557A (en)

Similar Documents

Publication Publication Date Title
US7607043B2 (en) Analysis of mutually exclusive conflicts among redundant devices
US9760468B2 (en) Methods and arrangements to collect data
US8713350B2 (en) Handling errors in a data processing system
US7802128B2 (en) Method to avoid continuous application failovers in a cluster
EP2659371B1 (en) Predicting, diagnosing, and recovering from application failures based on resource access patterns
US8108724B2 (en) Field replaceable unit failure determination
US20120239973A1 (en) Managing Errors In A Data Processing System
US20110154091A1 (en) Error log consolidation
CN105607973B (en) Method, device and system for processing equipment fault in virtual machine system
CN113672415A (en) Disk fault processing method, device, equipment and storage medium
US20080010494A1 (en) Raid control device and failure monitoring method
CN113147776A (en) Hot backup fault processing system and method for vehicle and vehicle adopting hot backup fault processing system
CN115617557A (en) Abnormality supervision system, abnormality supervision method, storage medium, and vehicle
US8812916B2 (en) Failure data management for a distributed computer system
CN115617558A (en) Vehicle diagnostic system, method, storage medium, and vehicle
US20080008166A1 (en) Method of detecting defective module and signal processing apparatus
CN108762999A (en) A kind of kernel failure collection method and device
CN114217925A (en) Business program operation monitoring method and system for realizing abnormal automatic restart
JP5696492B2 (en) Failure detection apparatus, failure detection method, and failure detection program
CN108897645B (en) Database cluster disaster tolerance method and system based on standby heartbeat disk
CN115150253B (en) Fault root cause determining method and device and electronic equipment
CN118012695A (en) Log data management method and device in distributed cluster
CN107707402B (en) Management system and management method for service arbitration in distributed system
CN116737435A (en) IOT Agent-based 5G camera system diagnosis and upgrading method
JP2017117065A (en) Information processing device, information processing method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination