CN117130832A - Monitoring reset method and system of multi-core heterogeneous system, chip and electronic equipment - Google Patents


Info

Publication number
CN117130832A
Authority
CN
China
Prior art keywords
hardware domain
target process
state
watchdog
fault
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311396176.9A
Other languages
Chinese (zh)
Other versions
CN117130832B (en)
Inventor
陈星宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Semidrive Technology Co Ltd
Original Assignee
Nanjing Semidrive Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Semidrive Technology Co Ltd
Priority to CN202311396176.9A
Publication of CN117130832A
Application granted
Publication of CN117130832B
Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1438 Restarting or rejuvenating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0757 Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1441 Resetting or repowering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)
  • Hardware Redundancy (AREA)

Abstract

The application discloses a monitoring and reset method for a multi-core heterogeneous system, together with a system, a chip and an electronic device. The method comprises: outputting, by a first watchdog, a first timeout signal in response to a first target process in an application space of a first operating system being in a fault state; sending, by a system error monitoring module in response to the first timeout signal, a first fault identifier to a second hardware domain, the first fault identifier identifying that the first target process is in a fault state; sending, by the second hardware domain, a first reset instruction to the first hardware domain over inter-core communication based on the first fault signal, instructing the first hardware domain to restart at least the first target process in the fault state; and outputting, by a second watchdog, a second timeout signal in response to the kernel space of the first operating system being in a fault state, the second timeout signal triggering the first hardware domain to restart. This improves the stability of the multi-core heterogeneous system and the user experience.

Description

Monitoring reset method and system of multi-core heterogeneous system, chip and electronic equipment
Technical Field
The application relates to the technical field of multi-core heterogeneous systems, in particular to a monitoring and resetting method, a system, a chip and electronic equipment of the multi-core heterogeneous system.
Background
A watchdog is typically used to monitor whether critical objects of an operating system have failed. The monitored object may be a critical application in the user space of the operating system, the kernel space of the operating system, or both.
If a critical application fails, restarting that application alone is in most cases enough to clear the fault. If the kernel space fails, the operating system usually has to be restarted to clear the fault. When a single watchdog monitors both critical applications and the kernel space, it is usually impossible to tell whether a timeout was caused by a critical application failing or by the kernel space failing. To guarantee recovery, the watchdog is therefore typically configured by default to restart the operating system on timeout. If only a critical application has failed, this solution slows recovery and disrupts the other applications running on the operating system.
Disclosure of Invention
In view of the above problems in the prior art, the present application provides a monitoring and reset method for a multi-core heterogeneous system, a multi-core heterogeneous system, a chip and an electronic device.
A first aspect of the application provides a monitoring and reset method applied to a multi-core heterogeneous system. The multi-core heterogeneous system comprises a first hardware domain, a second hardware domain and a system error monitoring module; the first hardware domain comprises a first watchdog and a second watchdog and is used to support the running of a first operating system. The method comprises the following steps:
outputting, by the first watchdog, a first timeout signal in response to a first target process in an application space of the first operating system being in a fault state;
sending, by the system error monitoring module in response to the first timeout signal, a first fault identifier to the second hardware domain, the first fault identifier identifying that the first target process is in a fault state;
sending, by the second hardware domain, a first reset instruction to the first hardware domain over inter-core communication based on the first fault signal, instructing the first hardware domain to restart at least the first target process in the fault state;
and outputting, by the second watchdog, a second timeout signal in response to the kernel space of the first operating system being in a fault state, and triggering the first hardware domain to restart by means of the second timeout signal.
In some embodiments, sending, by the second hardware domain, a first reset instruction to the first hardware domain over inter-core communication based on the first fault signal, instructing the first hardware domain to restart at least the first target process in the fault state, includes:
sending, by the second hardware domain, the first reset instruction to the first hardware domain over inter-core communication based on the first fault signal;
and restarting, by a second target process in the application space of the first operating system in response to the first reset instruction, at least the first target process in the fault state.
In some embodiments, the method further comprises:
periodically resetting, by the second target process, a state identifier corresponding to the second target process from a first state value to a second state value;
periodically polling, by the first target process, the state identifier of each second target process;
restarting the second target process corresponding to the state identifier when the polling determines that the state identifier holds the first state value;
and resetting the state identifier from the second state value to the first state value when the polling determines that the state identifier holds the second state value.
In some embodiments, periodically polling, by the first target process, the state identifier of each second target process includes:
periodically polling the state identifier of each second target process through a timed polling thread in the first target process.
In some embodiments, the method further comprises:
when the polling determines that the state identifier holds the first state value, capturing process information of the second target process corresponding to the state identifier and generating a crash log.
In some embodiments, triggering the first hardware domain restart by the second timeout signal includes:
sending, by the system error monitoring module in response to the second timeout signal, a second fault identifier to the second hardware domain, the second fault identifier identifying that the kernel space of the first operating system is in a fault state;
and calling, by a system error handler in the second hardware domain based on the second fault identifier, a system reset function to trigger the first hardware domain to restart.
In some embodiments, triggering the first hardware domain restart by the second timeout signal includes:
triggering, by a reset module of the multi-core heterogeneous system in response to the second timeout signal, the multi-core heterogeneous system to restart.
A second aspect of the present application provides a multi-core heterogeneous system, which includes a first hardware domain, a second hardware domain and a system error monitoring module, where the first hardware domain includes a first watchdog and a second watchdog, and the first hardware domain is used to support the running of a first operating system;
the first watchdog is configured to output a first timeout signal in response to at least one first target process in an application space of the first operating system being in a fault state;
the system error monitoring module is configured to send, in response to the first timeout signal, a first fault identifier to the second hardware domain, the first fault identifier identifying that the first target process is in a fault state;
the second hardware domain is configured to send a first reset instruction to the first hardware domain over inter-core communication based on the first fault signal, instructing the first hardware domain to restart at least the first target process in the fault state;
the second watchdog is configured to output a second timeout signal in response to the kernel space of the first operating system being in a fault state, and to trigger the first hardware domain to restart by means of the second timeout signal.
A third aspect of the application provides a chip comprising a multi-core heterogeneous system as described above.
A fourth aspect of the application provides an electronic device comprising a chip as described above.
According to the monitoring and reset method for the multi-core heterogeneous system, the application space of the first operating system is monitored by the first watchdog, and the kernel space of the first operating system is monitored by the second watchdog. Faults in the application space and faults in the kernel space of the first operating system are therefore handled by different fault-handling measures. When a first target process in the application space fails, the first target process is restarted through the cooperation of the first watchdog, the system error monitoring module and the second hardware domain. When the kernel space fails, the first hardware domain is triggered to restart by the second watchdog. This makes fault handling more flexible, improves the stability of the multi-core heterogeneous system, and improves the user experience.
Drawings
FIG. 1 is a block diagram of one embodiment of a multi-core heterogeneous system of the present application;
FIG. 2 is a block diagram of another embodiment of a multi-core heterogeneous system of the present application;
FIG. 3 is a block diagram of a first hardware domain of yet another embodiment of a multi-core heterogeneous system of the present application;
FIG. 4 is a flow chart of a monitor reset method of the multi-core heterogeneous system of the present application.
Detailed Description
Various aspects and features of the present application are described herein with reference to the accompanying drawings.
It should be understood that various modifications may be made to the embodiments of the application herein. Therefore, the above description should not be taken as limiting, but merely as exemplification of the embodiments. Other modifications within the scope and spirit of the application will occur to persons of ordinary skill in the art.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the application and, together with a general description of the application given above, and the detailed description of the embodiments given below, serve to explain the principles of the application.
These and other characteristics of the application will become apparent from the following description of a preferred form of embodiment, given as a non-limiting example, with reference to the accompanying drawings.
It is also to be understood that, although the application has been described with reference to some specific examples, many other equivalent forms of implementing the application can be made by those skilled in the art, which are intended to be within the scope of the application as defined in the claims.
The above and other aspects, features and advantages of the present application will become more apparent in light of the following detailed description when taken in conjunction with the accompanying drawings.
Specific embodiments of the present application will be described hereinafter with reference to the accompanying drawings; however, it is to be understood that the disclosed embodiments are merely exemplary of the application, which can be embodied in various forms. Well-known and/or repeated functions and constructions are not described in detail to avoid obscuring the application with unnecessary detail. Therefore, specific structural and functional details disclosed herein are not intended to be limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present application in virtually any appropriately detailed structure.
The specification may use the phrases "in one embodiment," "in another embodiment," "in yet another embodiment," or "in other embodiments," which may each refer to one or more of the same or different embodiments in accordance with the application.
An embodiment of the present application provides a multi-core heterogeneous system. FIG. 1 is a block diagram of one embodiment of the multi-core heterogeneous system. Referring to FIG. 1, the multi-core heterogeneous system may include a first hardware domain 110, a second hardware domain 120, and a system error monitoring module 130.
The first hardware domain 110 includes a first watchdog 111 and a second watchdog 112, and the first hardware domain 110 is configured to support the running of a first operating system. Optionally, the first hardware domain 110 and the second hardware domain 120 may each be any one of a plurality of hardware domains of the multi-core heterogeneous system. Optionally, the first operating system includes, but is not limited to, a Linux operating system, an Android operating system, and the like. For example, the first hardware domain 110 may be an application domain and the second hardware domain 120 may be a security domain. The application domain may be used to support the running of a Linux operating system and the security domain may be used to support the running of a real-time operating system (RTOS).
The first watchdog 111 is configured to: output a first timeout signal in response to a first target process in the application space of the first operating system being in a fault state.
For example, the first watchdog 111 may include a first timer circuit and a first reset circuit. The first timer circuit is used for timing, and the first reset circuit is configured to be triggered and to output the first timeout signal when the first timer circuit times out. For example, the first reset circuit may be configured to output a high level signal or a low level signal as the first timeout signal.
Optionally, a single first target process in the application space of the first operating system may be monitored by the first watchdog 111, or a plurality of different first target processes may be monitored by the first watchdog 111. In the case where a plurality of first target processes are monitored by the first watchdog 111, the first watchdog 111 may be configured to output the first timeout signal in response to any one of the first target processes being in a fault state.
It should be noted that the first target process may be any of various processes in the application space of the first operating system, and the type of the first target process is not limited herein. For example, the first target process may include, but is not limited to, a process of an application, an inter-core communication daemon (MBX Daemon), or a software watchdog process.
Optionally, the first target process may be provided with a dog-feeding thread, and the dog-feeding thread may be configured to periodically send a feeding signal to the first watchdog 111, which causes the first watchdog 111 to clear its count value. If the first target process falls into a fault state such as an infinite loop or disordered execution, the dog-feeding thread can no longer send the feeding signal to the first watchdog 111 periodically as preconfigured. The first watchdog 111 then receives no feeding signal within the unit time period, its timer times out, and it outputs the first timeout signal.
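The feed-and-timeout behaviour described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class `WatchdogTimer`, the function `first_timeout_tick`, and the tick-based time base are hypothetical names and simplifications introduced here, and the real first watchdog 111 is a hardware timer circuit rather than Python code.

```python
from typing import Optional


class WatchdogTimer:
    """Toy model of a watchdog: a counter that must be cleared before it hits a limit."""

    def __init__(self, timeout_ticks: int) -> None:
        self.timeout_ticks = timeout_ticks  # hypothetical timeout, in abstract ticks
        self.count = 0
        self.timed_out = False

    def feed(self) -> None:
        """Feeding signal: clear the count value."""
        self.count = 0

    def tick(self) -> bool:
        """Advance one time unit; return True once the timeout signal is asserted."""
        if not self.timed_out:
            self.count += 1
            if self.count >= self.timeout_ticks:
                self.timed_out = True
        return self.timed_out


def first_timeout_tick(healthy_until: int, total_ticks: int,
                       timeout_ticks: int = 3) -> Optional[int]:
    """Return the tick at which the timeout signal first appears, or None."""
    wdt = WatchdogTimer(timeout_ticks)
    for t in range(1, total_ticks + 1):
        if t <= healthy_until:
            wdt.feed()  # healthy process: its dog-feeding thread runs every tick
        if wdt.tick():
            return t
    return None


# The process stops feeding after tick 5, so the count reaches 3 at tick 7.
print(first_timeout_tick(healthy_until=5, total_ticks=20))   # 7
# A process that keeps feeding never lets the timer expire.
print(first_timeout_tick(healthy_until=20, total_ticks=20))  # None
```

In this model the timeout appears a fixed number of ticks after the last feed, which mirrors the statement that the timeout signal is output when no feeding signal arrives within the unit time period.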
Optionally, a plurality of first target processes may be monitored by the first watchdog 111. Each first target process may be provided with a dog-feeding thread and may have a corresponding first state identifier. The first state identifier may take a first state value or a second state value; for example, the first state value may be 1 and the second state value may be 0. The dog-feeding thread of a first target process may be configured to periodically reset the first state identifier from the first state value to the second state value.
A third target process may further be provided in the application space of the first operating system, and the third target process may be provided with a timed polling thread and a dog-feeding thread. The timed polling thread of the third target process may be configured to periodically poll the first state identifier of each of the first target processes. If a first state identifier holds the second state value, it may be determined that the corresponding first target process is in a non-fault state. If a first state identifier holds the first state value, it may be determined that the corresponding first target process is in a fault state. The dog-feeding thread of the third target process may be configured to send a feeding signal to the first watchdog 111 if all of the monitored first target processes are in a non-fault state, and to stop sending the feeding signal to the first watchdog 111 if any one of the monitored first target processes is in a fault state.
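The gating logic of the third target process can be sketched as follows. The function names and the dictionary layout are hypothetical. The sketch also assumes that the poller re-arms each identifier to the first state value after reading the second state value; the text above does not state this for the first state identifier, and the assumption is borrowed from the analogous scheme the description gives for the second state identifier.

```python
from typing import Dict, List

FIRST_STATE_VALUE = 1   # identifier not cleared since the last poll: fault suspected
SECOND_STATE_VALUE = 0  # identifier cleared by the process's dog-feeding thread


def poll_first_state_ids(state_ids: Dict[str, int]) -> List[str]:
    """Return the names of processes judged to be in a fault state.

    Assumption: identifiers found at the second state value are re-armed to
    the first state value so that the next poll detects a missed heartbeat.
    """
    faulty = []
    for name, value in state_ids.items():
        if value == FIRST_STATE_VALUE:
            faulty.append(name)
        else:
            state_ids[name] = FIRST_STATE_VALUE  # re-arm (assumed behaviour)
    return faulty


def should_feed_watchdog(state_ids: Dict[str, int]) -> bool:
    """Feed the first watchdog only if every monitored process is healthy."""
    return not poll_first_state_ids(state_ids)


# Both hypothetical processes cleared their identifiers before the first poll.
ids = {"app": SECOND_STATE_VALUE, "mbx_daemon": SECOND_STATE_VALUE}
print(should_feed_watchdog(ids))  # True

# Before the next poll only "app" clears its identifier; "mbx_daemon" is hung.
ids["app"] = SECOND_STATE_VALUE
print(should_feed_watchdog(ids))  # False
```

Once `should_feed_watchdog` returns False the feeding signal stops, so the hardware watchdog eventually times out on behalf of whichever process failed.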
The system error monitoring module 130 is configured to: send, in response to the first timeout signal, a first fault identifier to the second hardware domain 120, the first fault identifier identifying that the first target process is in a fault state.
Optionally, the system error monitoring module 130 (SEM module) may be independent of the individual hardware domains. Alternatively, the system error monitoring module 130 may be disposed in a hardware domain other than the first hardware domain 110 and the second hardware domain 120; for example, it may be disposed in a third hardware domain of the multi-core heterogeneous system. Alternatively, the system error monitoring module 130 may be disposed in the second hardware domain 120.
Optionally, the system error monitoring module 130 may be connected to the first watchdog 111 through a signal interface, and, in response to the signal interface receiving a first timeout signal, may determine the first fault identifier corresponding to that first timeout signal. For example, the system error monitoring module 130 may determine the corresponding first fault identifier based on an interface identifier associated with the first watchdog 111. As another example, the system error monitoring module 130 may determine the corresponding first fault identifier based on power information of the first timeout signal.
Optionally, the first fault identifier may include any type of information that uniquely identifies the fault event in which the first target process is in a fault state. For example, the system error monitoring module 130 may send to the second hardware domain 120 a digital signal that identifies the fault event in which the first target process is in a fault state.
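The interface-based determination mentioned above amounts to a lookup from the signal interface that received a timeout signal to a fault identifier. The following is a minimal sketch of that idea; the interface numbers, the `FaultId` enumeration and the function `sem_on_timeout` are hypothetical and are not taken from any real SEM hardware.

```python
from enum import Enum
from typing import Optional


class FaultId(Enum):
    FIRST_FAULT = 1   # a first target process in the application space is faulty
    SECOND_FAULT = 2  # the kernel space of the first operating system is faulty


# Hypothetical wiring: which watchdog drives which SEM signal interface.
SEM_INTERFACE_TO_FAULT = {
    0: FaultId.FIRST_FAULT,   # interface assumed to be wired to the first watchdog
    1: FaultId.SECOND_FAULT,  # interface assumed to be wired to the second watchdog
}


def sem_on_timeout(interface_id: int) -> Optional[FaultId]:
    """Map the interface that received a timeout signal to a fault identifier."""
    return SEM_INTERFACE_TO_FAULT.get(interface_id)


print(sem_on_timeout(0))  # FaultId.FIRST_FAULT
print(sem_on_timeout(7))  # None: no watchdog is wired to this interface
```

Because the identifier is derived from where the timeout arrived, the two timeout signals themselves do not need to differ for the faults to be told apart.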
The second hardware domain 120 is configured to: send, based on the first fault signal, a first reset instruction to the first hardware domain 110 over inter-core communication, instructing the first hardware domain 110 to restart at least the first target process in the fault state.
Optionally, a system error handler (SEM handler) corresponding to the system error monitoring module 130 may be provided in the second hardware domain 120. On receiving the first fault signal, the second hardware domain 120 may call the system error handler. The system error handler then invokes an inter-core communication program (MBX handler) based on the first fault signal, and the inter-core communication program sends the first reset instruction to the first hardware domain 110 over inter-core communication.
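The call chain from the system error handler to the inter-core communication program can be modelled with a simple in-memory queue standing in for the mailbox. Everything named here (`Mailbox`, `MailboxMessage`, the string constants) is a hypothetical stand-in chosen for illustration; an actual mailbox is a hardware or driver facility.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass(frozen=True)
class MailboxMessage:
    command: str
    source: str
    target: str


class Mailbox:
    """In-memory stand-in for an inter-core communication channel."""

    def __init__(self) -> None:
        self._queue: Deque[MailboxMessage] = deque()

    def send(self, message: MailboxMessage) -> None:
        self._queue.append(message)

    def receive(self) -> Optional[MailboxMessage]:
        return self._queue.popleft() if self._queue else None


def mbx_handler(mailbox: Mailbox) -> None:
    """Inter-core communication program: send the first reset instruction."""
    mailbox.send(MailboxMessage(command="FIRST_RESET",
                                source="second_hardware_domain",
                                target="first_hardware_domain"))


def sem_handler(fault_signal: str, mailbox: Mailbox) -> bool:
    """System error handler: on the first fault signal, invoke the MBX handler."""
    if fault_signal == "FIRST_FAULT":
        mbx_handler(mailbox)
        return True
    return False  # other signals are outside the scope of this sketch


mailbox = Mailbox()
sem_handler("FIRST_FAULT", mailbox)
print(mailbox.receive())
```

A receiver in the first hardware domain would read the message from the mailbox and act on the `FIRST_RESET` command.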
Optionally, in the case where the first watchdog 111 monitors a single first target process, the first hardware domain 110 may restart that first target process based on the first reset instruction. In the case where the first watchdog 111 monitors a plurality of first target processes, the first hardware domain 110 may restart all the first target processes monitored by the first watchdog 111 based on the first reset instruction, which keeps the control logic relatively simple.
Alternatively, in the case where the first watchdog 111 monitors a plurality of first target processes, the first hardware domain 110 may restart only the one or more first target processes that are in a fault state based on the first reset instruction. For example, the first hardware domain 110 may determine which first target processes are in a fault state from the first state identifier of each first target process, and restart those whose first state identifier holds the first state value. As another example, the third target process may be further configured to generate a faulty-process identifier when the polling determines that a first state identifier holds the first state value. When the first hardware domain 110 obtains the first reset instruction, it may restart the first target process in the fault state based on the faulty-process identifier.
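The two restart policies described above differ only in how the set of processes to restart is chosen. A minimal sketch, assuming the first state identifiers are available as a dictionary keyed by hypothetical process names:

```python
from typing import Dict, List

FIRST_STATE_VALUE = 1   # identifier value that marks a process as faulty
SECOND_STATE_VALUE = 0


def select_restart_targets(first_state_ids: Dict[str, int],
                           restart_all: bool) -> List[str]:
    """Choose which monitored first target processes to restart.

    restart_all=True  -> simple policy: restart every monitored process.
    restart_all=False -> selective policy: restart only processes whose
                         first state identifier holds the first state value.
    """
    if restart_all:
        return sorted(first_state_ids)
    return sorted(name for name, value in first_state_ids.items()
                  if value == FIRST_STATE_VALUE)


state_ids = {"navigation": 0, "media": 1, "mbx_daemon": 0}
print(select_restart_targets(state_ids, restart_all=True))
# ['mbx_daemon', 'media', 'navigation']
print(select_restart_targets(state_ids, restart_all=False))
# ['media']
```

The simple policy needs no per-process bookkeeping at restart time, while the selective policy leaves healthy processes running at the cost of tracking which identifier tripped.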
It will be appreciated that the above manner of restarting the first target process is merely exemplary, and that the purpose of restarting at least the first target process in a failed state may be achieved by other manners when actually applied.
The second watchdog 112 is configured to: output a second timeout signal in response to the kernel space of the first operating system being in a fault state, and trigger the first hardware domain 110 to restart by means of the second timeout signal.
Optionally, the second watchdog 112 may include a second timer circuit and a second reset circuit, where the second timer circuit is used for timing, and the second reset circuit may be configured to be triggered and to output the second timeout signal when the second timer circuit times out. For example, the second reset circuit may be configured to output a low level signal or a high level signal as the second timeout signal.
Optionally, the second timeout signal may be the same as or different from the first timeout signal. For example, the first timeout signal may be a high level signal and the second timeout signal may be a low level signal. Alternatively, the first timeout signal may be a low level signal and the second timeout signal may be a high level signal.
Optionally, as illustrated in FIG. 1, the second watchdog 112 may be connected to the system error monitoring module 130 and may send the second timeout signal to the system error monitoring module 130. In response to the second timeout signal, the system error monitoring module 130 sends a second fault identifier to the second hardware domain 120. The second fault identifier identifies that the kernel space of the first operating system is in a fault state. Based on the second fault identifier, the system error handler in the second hardware domain 120 calls a system reset function to trigger a restart of the first hardware domain 110. In this way, the restart of the first hardware domain 110 can be precisely controlled without affecting the normal operation of other hardware domains.
Illustratively, the first hardware domain 110 may be an application domain and the second hardware domain 120 may be a security domain. The second watchdog 112 may be disposed within the application domain and may be connected to another signal interface of the system error monitoring module 130 (SEM module). The second watchdog 112 may be configured to monitor the kernel space of the application domain, and may send a second timeout signal to the system error monitoring module 130 in response to a fault in the kernel space of the Linux operating system in the application domain. The system error monitoring module 130 may send a second fault identifier to the security domain based on the interface identifier of the other signal interface, the second fault identifier identifying the fault event in which the kernel space of the first hardware domain 110 is in a fault state. Based on the second fault identifier, the security domain may invoke the system error handler, which calls a system reset function to trigger the first hardware domain 110 to restart. In practical applications, the system error monitoring module 130 is not limited to determining the second fault identifier from the interface identifier; it may also determine the corresponding second fault identifier based on the power information or other information of the second timeout signal.
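The property emphasised above, that only the first hardware domain is restarted while other domains keep running, can be shown with a toy model. The `HardwareDomain` class, its `boot_count` field and the string fault identifier are hypothetical constructs for illustration; a real system reset function would act on reset-control hardware.

```python
from typing import Dict


class HardwareDomain:
    """Toy model of a hardware domain that counts how often it has booted."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.boot_count = 1

    def restart(self) -> None:
        self.boot_count += 1


def system_reset(domain: HardwareDomain) -> None:
    """Stand-in for the system reset function: restart one domain only."""
    domain.restart()


def system_error_handler(fault_id: str, domains: Dict[str, HardwareDomain]) -> bool:
    """Runs in the second hardware domain; reacts to the second fault identifier."""
    if fault_id == "SECOND_FAULT":
        system_reset(domains["first"])
        return True
    return False


domains = {"first": HardwareDomain("application"),
           "second": HardwareDomain("security")}
system_error_handler("SECOND_FAULT", domains)
print(domains["first"].boot_count, domains["second"].boot_count)  # 2 1
```

The second domain's boot count is unchanged, which corresponds to the statement that other hardware domains continue to operate normally.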
Optionally, in conjunction with the illustration of fig. 2, the multi-core heterogeneous system may further include a reset module 140 (RSTGEN), and the second watchdog 112 may be connected to the reset module 140. The reset module 140 may trigger the multi-core heterogeneous system to restart in response to the second timeout signal, so as to restart the first hardware domain 110. Thus, the control logic is simple and easy to realize.
While the first hardware domain 110 is running, the first watchdog 111 continuously monitors the first target process in the application space of the first operating system, and the second watchdog 112 continuously monitors the kernel space of the first operating system. In most cases the first target process and the kernel space of the first operating system are not in a fault state at the same time. In rare cases the kernel space and the application space of the first operating system may fail simultaneously. Because a kernel-space fault requires the first hardware domain 110 to be restarted, the first hardware domain 110 then either cannot respond to the first reset instruction or responds to it only after restarting.
According to the multi-core heterogeneous system provided by the embodiment of the application, the first watchdog 111 and the second watchdog 112 are provided for the application space and the kernel space of the first operating system respectively: the application space of the first operating system is monitored by the first watchdog 111, and the kernel space of the first operating system is monitored by the second watchdog 112. The multi-core heterogeneous system can therefore handle application-space faults and kernel-space faults of the first operating system with different fault-handling measures. When a first target process in the application space fails, the first target process is restarted through the cooperation of the first watchdog 111, the system error monitoring module 130 and the second hardware domain 120. When the kernel space fails, the first hardware domain 110 is triggered to restart by the second watchdog 112. This makes fault handling more flexible, improves the stability of the multi-core heterogeneous system, and improves the user experience.
As shown in conjunction with fig. 1 and 2, in some embodiments, the second hardware domain 120 may be specifically configured to: based on the first fault signal, a first reset instruction is sent to the first hardware domain 110 using inter-core communication. A second target process may be running in the application space of the first operating system, and the second target process may be configured to: and responding to the first reset instruction, and restarting at least the first target process in a fault state.
Optionally, the second target process may include an inter-core communication daemon (MBX Daemon), which monitors inter-core communication messages directed to the first hardware domain 110. When the MBX Daemon receives the first reset instruction, it may restart the first target process in the fault state. Alternatively, where the first watchdog 111 monitors a plurality of first target processes, the MBX Daemon may restart all the first target processes monitored by the first watchdog 111. Of course, the second target process may also be a process other than the inter-core communication daemon.
In some embodiments, in conjunction with the illustration of fig. 3, the second target process may be configured with a state identifier, and the state identifier corresponding to the second target process is hereinafter referred to as the second state identifier. The second state identification may include a first state value and a second state value.
The second target process may be configured to: the second state identification corresponding to the second target process is periodically reset from the first state value to the second state value.
The first target process may be configured to: periodically polling a second state identifier of each of the second target processes; restarting a second target process corresponding to the second state identifier under the condition that the polling determines that the second state identifier is a first state value; and resetting the second state identifier from the second state value to the first state value in the case that the polling determines that the second state identifier is the second state value.
Thus, the second target process can restart the first target process, and the first target process can monitor one or more second target processes and restart any second target process that is in a fault state. Process monitoring in the user space is thereby more complete and self-contained, which improves system stability.
Optionally, with reference to fig. 3, one or more second target processes may be run, one of which may be the master process. The master process may restart the first target process in response to the first reset instruction sent by the second hardware domain 120.
Optionally, second state identifiers in one-to-one correspondence with the second target processes may be set, and the first target process uses each second state identifier to monitor whether the corresponding second target process is in a fault state. The second state identifier includes a first state value, which may be 1, and a second state value, which may be 0. The first target process may include a software watchdog process, which may include a timed polling thread. The second target process may be configured to periodically reset the second state identifier from 1 to 0. The timed polling thread may be configured to periodically poll the second state identifier of each second target process. If polling determines that the second state identifier is 0, the corresponding second target process is determined to be in a healthy state, and the second state identifier may be reset to 1. If polling determines that the second state identifier is 1, the corresponding second target process has not reset the second state identifier as configured within the current polling period, which indicates that the second target process may be in a fault state, and the timed polling thread may trigger a restart of that second target process.
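The flag handshake described above can be illustrated with a minimal Python sketch. It is not taken from the application: the names `MonitoredProcess`, `heartbeat`, `poll_once` and the `restart` callback are hypothetical, and a real system would place the state identifier in shared memory or a similar inter-process medium rather than in an in-process object. The sketch also assumes that a healthy process writes its heartbeat at least once per polling period.

```python
from dataclasses import dataclass
from typing import Callable, List

FIRST_STATE_VALUE = 1   # written by the poller ("not yet acknowledged")
SECOND_STATE_VALUE = 0  # written by a healthy monitored process


@dataclass
class MonitoredProcess:
    """Hypothetical stand-in for one second target process."""
    name: str
    state_flag: int = FIRST_STATE_VALUE

    def heartbeat(self) -> None:
        """Periodic action of a healthy process: reset the flag from 1 to 0."""
        self.state_flag = SECOND_STATE_VALUE


def poll_once(procs: List[MonitoredProcess],
              restart: Callable[[str], None]) -> List[str]:
    """One polling period of the timed polling thread.

    Returns the names of the processes for which a restart was triggered.
    """
    restarted = []
    for proc in procs:
        if proc.state_flag == SECOND_STATE_VALUE:
            # Flag was cleared during this period: healthy, so re-arm it.
            proc.state_flag = FIRST_STATE_VALUE
        else:
            # Flag still 1: the process missed its heartbeat.
            restart(proc.name)
            restarted.append(proc.name)
    return restarted
```

In this sketch a restarted process keeps the flag at 1 and is expected to clear it again once it is back up, so a process that fails to recover is restarted again in the next period.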
In some embodiments, the first target process may be further configured to: and under the condition that the polling determines that the second state identifier is the first state value, capturing the process information of the second target process corresponding to the second state identifier, and generating a crash log. The crash log can be used for analyzing the fault reason of the second target process, so that the fault hidden danger can be thoroughly eliminated.
Optionally, the crash log may be generated by the timed polling thread. Alternatively, the first target process may be provided with a log generating thread, which may capture the process information of the second target process in the fault state. For example, the kernel call stack (/proc/<pid>/stack) and the user call stack (pstack <pid>) of the second target process may be captured, and the crash log may be generated from them.
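As an illustration of how such a crash log might be assembled, the following Python sketch reads the two sources named above. The function names and the log layout are assumptions made for this example only. Reading /proc/<pid>/stack works only on Linux and usually requires elevated privileges, and the `pstack` tool may not be installed, so both readers fall back to a placeholder string instead of raising.

```python
import subprocess
from pathlib import Path
from typing import Callable


def read_kernel_stack(pid: int) -> str:
    """Read /proc/<pid>/stack; Linux only and usually restricted to root."""
    try:
        return Path(f"/proc/{pid}/stack").read_text()
    except OSError as exc:
        return f"<kernel stack unavailable: {exc}>"


def read_user_stack(pid: int) -> str:
    """Run `pstack <pid>`; the tool may be missing on some systems."""
    try:
        done = subprocess.run(
            ["pstack", str(pid)], capture_output=True, text=True, timeout=5
        )
        return done.stdout or done.stderr
    except (OSError, subprocess.SubprocessError) as exc:
        return f"<user stack unavailable: {exc}>"


def build_crash_log(
    pid: int,
    name: str,
    kernel_reader: Callable[[int], str] = read_kernel_stack,
    user_reader: Callable[[int], str] = read_user_stack,
) -> str:
    """Assemble a plain-text crash log for one faulty process."""
    return "\n".join(
        [
            f"crash log: process={name} pid={pid}",
            "--- kernel call stack ---",
            kernel_reader(pid).rstrip(),
            "--- user call stack ---",
            user_reader(pid).rstrip(),
        ]
    )
```

The readers are passed in as parameters so that the log format can be exercised without a live faulty process.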
The embodiment of the application also provides a monitoring and resetting method of the multi-core heterogeneous system, which can be applied to the multi-core heterogeneous system in any embodiment. Fig. 4 is a flowchart of a monitoring and resetting method of a multi-core heterogeneous system according to an embodiment of the present application, and referring to fig. 4, the monitoring and resetting method of an embodiment of the present application may specifically include the following steps.
S210, outputting, by the first watchdog, a first timeout signal in response to a first target process in the application space of the first operating system being in a fault state.
For example, the first watchdog may include a first timer circuit for counting time and a first reset circuit configured to be triggered and output the first timeout signal if a count result of the first timer circuit is timeout. For example, the first reset circuit may be configured to output a high level signal or a low level signal as the first timeout signal.
Alternatively, a single first target process in the application space of the first operating system may be monitored by the first watchdog, and a plurality of first target processes may also be monitored by the first watchdog. In the case of monitoring a plurality of first target processes by the first watchdog, the first watchdog may be configured to output a first timeout signal in response to any one of the first target processes being in a failure state.
It should be noted that, the first target process may be various processes in an application space of the first operating system, and the type of the first target process is not limited herein. For example, the first target process may include, but is not limited to, a process of an application, an inter-core communication Daemon (MBX Daemon), or a software watchdog process, among others.
For example, the first target process may be provided with a dog-feeding thread, which may be configured to periodically send a feeding signal to the first watchdog and thereby control the first watchdog to clear its count value. If the first target process falls into a fault state, such as an infinite loop or disordered execution, the dog-feeding thread cannot send the feeding signal to the first watchdog periodically as configured. The first watchdog then receives no feeding signal within the unit time period, its timing result is a timeout, and it may output the first timeout signal.
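The feed-or-timeout behaviour can be modelled in a few lines of Python. This is a software model for illustration only, not the watchdog circuit of the application: the class name `WatchdogTimer`, the tick-based time base and the latched `timeout_signal` attribute are assumptions.

```python
class WatchdogTimer:
    """Toy model of a watchdog: a counter that must be cleared in time."""

    def __init__(self, timeout_ticks: int) -> None:
        if timeout_ticks <= 0:
            raise ValueError("timeout_ticks must be positive")
        self.timeout_ticks = timeout_ticks
        self.count = 0
        self.timeout_signal = False  # latched once the counter expires

    def feed(self) -> None:
        """Feeding signal from the dog-feeding thread: clear the count value."""
        self.count = 0

    def tick(self) -> bool:
        """Advance the timer by one tick and report the timeout signal."""
        self.count += 1
        if self.count >= self.timeout_ticks:
            self.timeout_signal = True
        return self.timeout_signal
```

As long as `feed()` is called more often than every `timeout_ticks` ticks, `tick()` keeps returning `False`; once the feeding stops, the signal is asserted and stays asserted.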
For another example, a plurality of first target processes may be monitored by the first watchdog. Each first target process can be provided with a dog feeding thread, and each first target process can have a first state identification corresponding to the first target process. The first state identification may include a first state value and a second state value. For example, the first state value may be 1 and the second state value may be 0. The dog-feeding thread of the first target process may be configured to periodically reset the first status identification from a first status value to a second status value.
A third target process may be further disposed in the application space of the first operating system, and the third target process may be disposed with a timing polling thread and a dog feeding thread. The timed polling thread of the third target process may be configured to periodically poll the first status identification of each of the first target processes. If the first state identification is a second state value, it may be determined that a first target process corresponding to the first state identification is in a non-failed state. If the first state identification is a first state value, it may be determined that a first target process corresponding to the first state identification is in a failed state. The dog-feeding thread of the third target process may be configured to send a feeding signal to the first watchdog if all of the monitored first target processes are in a non-faulty state; and is configured to stop sending a feeding signal to the first watchdog if any one of the monitored first target processes is in a failure state.
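The aggregation performed by the third target process can be sketched as follows. The function `poll_and_feed`, the dictionary of flags and the `feed_watchdog` callback are hypothetical. Re-arming the flags to the first state value after a successful poll is an assumption made by analogy with the second state identifier; the text above does not state who writes the first state value back.

```python
from typing import Callable, Dict, List

FIRST_STATE_VALUE = 1   # flag not cleared: process treated as faulty
SECOND_STATE_VALUE = 0  # flag cleared by the process's dog-feeding thread


def poll_and_feed(first_state_flags: Dict[str, int],
                  feed_watchdog: Callable[[], None]) -> List[str]:
    """One polling period of the third target process.

    Feeds the first watchdog only if every monitored first target process
    has cleared its flag. Returns the names of the faulty processes.
    """
    failed = [name for name, flag in first_state_flags.items()
              if flag == FIRST_STATE_VALUE]
    if failed:
        # Withhold the feeding signal so that the first watchdog times out.
        return failed
    for name in first_state_flags:
        first_state_flags[name] = FIRST_STATE_VALUE  # re-arm (assumption)
    feed_watchdog()
    return failed
```

With this arrangement a single hardware watchdog covers several processes: one stuck process is enough to stop the feeding and produce the first timeout signal.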
S220, sending, by the system error monitoring module in response to the first timeout signal, a first fault identification to the second hardware domain, the first fault identification identifying that the first target process is in a fault state.
Alternatively, the system error monitoring module (SEM module) may be independent of the individual hardware domains. Alternatively, the system error monitoring module may be disposed in a hardware domain other than the first hardware domain and the second hardware domain. For example, the system error monitoring module may be disposed in a third hardware domain of the multi-core heterogeneous system. Alternatively, the system error monitoring module may also be disposed in the second hardware domain.
Optionally, the system error monitoring module may be connected to the first watchdog through a signal interface, and the system error monitoring module may determine a first fault identifier corresponding to the first timeout signal in response to the signal interface receiving the first timeout signal. For example, the system error monitoring module may determine a corresponding first failure identification based on an interface identification of the first watchdog connection. For example, the system error monitoring module may determine the corresponding first fault identification based on power information of the first timeout signal.
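Determining the fault identification from the signal interface amounts to a table lookup, as in the following Python sketch. The interface numbers 0 and 1, the `FaultId` enumeration and the function `on_timeout_signal` are assumptions for illustration; the second entry anticipates the case in which the second watchdog is also connected to the system error monitoring module.

```python
from enum import Enum
from typing import Optional


class FaultId(Enum):
    """Hypothetical fault identifications reported to the second hardware domain."""
    FIRST_TARGET_PROCESS_FAULT = 1
    KERNEL_SPACE_FAULT = 2


# Hypothetical wiring: which watchdog is attached to which signal interface.
INTERFACE_TO_FAULT = {
    0: FaultId.FIRST_TARGET_PROCESS_FAULT,  # first watchdog
    1: FaultId.KERNEL_SPACE_FAULT,          # second watchdog
}


def on_timeout_signal(interface_id: int) -> Optional[FaultId]:
    """Map the interface on which a timeout signal arrived to a fault identification."""
    return INTERFACE_TO_FAULT.get(interface_id)
```

An unknown interface yields `None`, leaving it to the caller to decide whether to ignore or log the event.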
Optionally, the first fault identification may include any type of information that can uniquely identify the fault event in which the first target process is in a fault state. For example, the system error monitoring module may send a digital signal to the second hardware domain, and the fault event in which the first target process is in a fault state is identified by that digital signal.
S230, based on the first fault signal, the second hardware domain sends a first reset instruction to the first hardware domain by utilizing inter-core communication, and the first hardware domain is instructed to restart at least a first target process in a fault state.
Optionally, a system error handler (SEM handler) corresponding to the system error monitoring module may be disposed in the second hardware domain. The second hardware domain may call the system error handler upon receiving the first fault signal. Based on the first fault signal, the system error handler invokes an inter-core communication program (MBX handler), and the inter-core communication program sends the first reset instruction to the first hardware domain using inter-core communication.
Optionally, in the case where the first watchdog monitors a single first target process, the first hardware domain may restart that first target process based on the first reset instruction. In the case where the first watchdog monitors a plurality of first target processes, the first hardware domain may restart all first target processes monitored by the first watchdog based on the first reset instruction, which keeps the control logic relatively simple.
Optionally, in the case where the first watchdog monitors a plurality of first target processes, the first hardware domain may restart only the one or more first target processes in a fault state based on the first reset instruction. For example, the first hardware domain may determine which first target processes are in a fault state based on the first state identification of each first target process, and restart each first target process whose first state identification is the first state value. As another example, the third target process may be further configured to generate a failed process identification when polling determines that a first state identification is the first state value. When the first hardware domain receives the first reset instruction, it restarts the first target process in the fault state based on the failed process identification.
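The handler chain in the second hardware domain can be sketched in Python as follows. Everything concrete here is hypothetical: the opcode value, the fault identification value, the three-byte message layout and the `deque` that stands in for the hardware mailbox. The application does not specify a message format.

```python
import struct
from collections import deque
from typing import Tuple

OP_RESET_PROCESS = 0x01   # hypothetical opcode for the first reset instruction
FIRST_FAULT_ID = 0x0001   # hypothetical value of the first fault signal
MSG_FORMAT = "<BH"        # hypothetical layout: 1-byte opcode, 2-byte fault id

# Stand-in for the inter-core mailbox towards the first hardware domain.
mailbox_to_first_domain: deque = deque()


def mbx_handler_send(opcode: int, fault_id: int) -> None:
    """Inter-core communication program: encode and post one message."""
    mailbox_to_first_domain.append(struct.pack(MSG_FORMAT, opcode, fault_id))


def sem_handler(fault_id: int) -> bool:
    """System error handler: react to a fault reported by the SEM module.

    Returns True if a reset instruction was sent.
    """
    if fault_id == FIRST_FAULT_ID:
        mbx_handler_send(OP_RESET_PROCESS, fault_id)
        return True
    return False


def mbx_daemon_receive() -> Tuple[int, int]:
    """Receiving side in the first hardware domain: decode one message."""
    return struct.unpack(MSG_FORMAT, mailbox_to_first_domain.popleft())
```

The receiving function corresponds to what a process such as the MBX Daemon would do before restarting the faulty first target process.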
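The two restart policies described above, restarting every monitored process or only the faulty ones, can be compared in a short Python sketch. The function `handle_reset_instruction`, its `restart_all` switch and the `restart` callback are hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List

FIRST_STATE_VALUE = 1  # first state identification of a faulty process


def handle_reset_instruction(first_state_flags: Dict[str, int],
                             restart: Callable[[str], None],
                             restart_all: bool = False) -> List[str]:
    """React to the first reset instruction inside the first hardware domain.

    With restart_all=True every monitored first target process is restarted
    (simple control logic); otherwise only processes whose first state
    identification still holds the first state value are restarted.
    """
    if restart_all:
        targets = list(first_state_flags)
    else:
        targets = [name for name, flag in first_state_flags.items()
                   if flag == FIRST_STATE_VALUE]
    for name in targets:
        restart(name)
    return targets
```

Selective restart limits the disruption to healthy processes, at the cost of having to keep the per-process state identification (or a failed process identification) available at the time the reset instruction arrives.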
It will be appreciated that the above manner of restarting the first target process is merely exemplary, and that the purpose of restarting at least the first target process in a failed state may be achieved by other manners when actually applied.
S240, outputting, by the second watchdog, a second timeout signal in response to the kernel space of the first operating system being in a fault state, and triggering the first hardware domain to restart through the second timeout signal.
Optionally, the second watchdog may include a second timer circuit and a second reset circuit, where the second timer circuit is configured to count time, and the second reset circuit may be configured to be triggered and output the second timeout signal if a count result of the second timer circuit is timeout. For example, the second reset circuit may be configured to output a low level signal or a high level signal as the second timeout signal.
Optionally, the second timeout signal may be the same as or different from the first timeout signal. For example, the first timeout signal may be a high level signal and the second timeout signal may be a low level signal. Alternatively, the first timeout signal may be a low level signal and the second timeout signal may be a high level signal.
Optionally, with reference to fig. 1, the second watchdog may also be connected to the system error monitoring module, and the second watchdog may send the second timeout signal to the system error monitoring module. In response to the second timeout signal, the system error monitoring module sends a second fault identification to the second hardware domain. The second fault identification identifies that the kernel space of the first operating system is in a fault state. Based on the second fault identification, the system error handler in the second hardware domain calls a system reset function to trigger the first hardware domain to restart. In this way, the restart of the first hardware domain can be controlled precisely without affecting the normal operation of other hardware domains.
Illustratively, the first hardware domain may be an application domain and the second hardware domain may be a security domain. The second watchdog may be disposed within the application domain and connected to another signal interface of the system error monitoring module (SEM module). The second watchdog may be configured to monitor the kernel space of the application domain and, in response to a fault in the kernel space of, for example, a Linux operating system in the application domain, send a second timeout signal to the system error monitoring module. The system error monitoring module may send a second fault identification to the security domain based on the interface identification of that other signal interface, the second fault identification identifying the fault event in which the kernel space of the first hardware domain is in a fault state. Based on the second fault identification, the security domain may invoke the system error handler, which calls a system reset function to trigger the first hardware domain to restart. In practical applications, the system error monitoring module is not limited to determining the second fault identification based on the interface identification; it may also determine the second fault identification based on the power information or other information of the second timeout signal.
Optionally, with reference to fig. 2, the multi-core heterogeneous system may further include a reset module, and the second watchdog may be connected to the reset module. The reset module may trigger the multi-core heterogeneous system to restart in response to the second timeout signal. In this way, the control logic is simplified and easy to implement.
It should be noted that monitoring the first target process in the application space of the first operating system by the first watchdog and monitoring the kernel space of the first operating system by the second watchdog generally proceed concurrently. In most cases, however, the first target process and the kernel space of the first operating system are not in a fault state at the same time, so steps S210 to S230 are not performed simultaneously with step S240. When the first target process is in a fault state, the first target process is restarted by performing steps S210 to S230. When the kernel space of the first operating system is in a fault state, the first hardware domain and the first operating system are restarted by performing step S240.
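The difference in recovery scope between the two fault identifications can be summarised by a small dispatcher, sketched here in Python. The numeric identifiers, the function name `system_error_handler` and the two callbacks are hypothetical; in particular, `reset_first_domain` merely stands in for whatever system reset function the platform provides.

```python
from typing import Callable

FIRST_FAULT_ID = 1   # hypothetical: a first target process is faulty
SECOND_FAULT_ID = 2  # hypothetical: the kernel space is faulty


def system_error_handler(fault_id: int,
                         send_reset_instruction: Callable[[], None],
                         reset_first_domain: Callable[[], None]) -> str:
    """Choose the recovery action in the security domain by fault identification."""
    if fault_id == FIRST_FAULT_ID:
        # Application-space fault: ask the application domain to restart
        # the faulty process over inter-core communication.
        send_reset_instruction()
        return "process-restart"
    if fault_id == SECOND_FAULT_ID:
        # Kernel-space fault: the application domain cannot help itself,
        # so restart the whole first hardware domain.
        reset_first_domain()
        return "domain-reset"
    return "ignored"
```

Only the second branch interrupts the whole first hardware domain; other hardware domains keep running in both cases.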
According to the monitoring and resetting method for the multi-core heterogeneous system, the application space of the first operating system is monitored by the first watchdog, and the kernel space of the first operating system is monitored by the second watchdog. When the first target process in the application space is in a fault state, the first watchdog outputs a first timeout signal; the system error monitoring module, in response to the first timeout signal, sends to the second hardware domain a first fault signal identifying that the first target process is in a fault state; and the second hardware domain, based on the first fault signal, sends a first reset instruction to the first hardware domain using inter-core communication, instructing the first hardware domain to restart at least the first target process in the fault state. When the kernel space of the first operating system is in a fault state, the second watchdog outputs a second timeout signal to trigger the first hardware domain to restart. The multi-core heterogeneous system can therefore respond differently to faults in the application space and faults in the kernel space of the operating system, which improves the flexibility of fault handling, the stability of the multi-core heterogeneous system, and the user experience.
In some embodiments, sending, by the second hardware domain, a first reset instruction to the first hardware domain using inter-core communication based on the first failure signal, instructing the first hardware domain to restart at least a first target process in a failure state, including:
transmitting a first reset instruction to the first hardware domain by inter-core communication based on the first fault signal through the second hardware domain;
and responding to the first reset instruction through a second target process in the application space of the first operating system, and restarting at least the first target process in a fault state.
In some embodiments, the method further comprises:
periodically resetting, by the second target process, a state identifier corresponding to the second target process from a first state value to a second state value;
periodically polling the state identification of each second target process through the first target process;
restarting a second target process corresponding to the state identifier under the condition that the polling determines that the state identifier is a first state value;
in the event that the poll determines that the status flag is a second status value, the status flag is reset from the second status value to the first status value.
In some embodiments, periodically polling, by the first target process, the state identification of each of the second target processes includes:
and periodically polling the state identification of each second target process through a timing polling thread in the first target process.
In some embodiments, the method further comprises:
and under the condition that the polling determines that the state identifier is a first state value, capturing the process information of a second target process corresponding to the state identifier, and generating a crash log.
In some embodiments, triggering the first hardware domain restart by the second timeout signal includes:
transmitting a second fault identifier for identifying that the kernel space of the first operating system is in a fault state to the second hardware domain by the system error monitoring module in response to the second timeout signal;
and calling a system reset function to trigger the first hardware domain to restart through a system error processing program in the second hardware domain based on the second fault identifier.
In some embodiments, triggering the first hardware domain restart by the second timeout signal includes:
and responding to the second timeout signal through a reset module of the multi-core heterogeneous system, and triggering the multi-core heterogeneous system to restart.
The embodiment of the application also provides a chip, which comprises the multi-core heterogeneous system according to any embodiment.
The embodiment of the application also provides electronic equipment, which comprises the chip in any embodiment. Alternatively, the electronic device includes, but is not limited to, a vehicle, a server, a workstation, a personal computer, and the like.
Vehicles in the embodiments of the present application may be referred to as "automobiles," "vehicles," "whole vehicles," or other similar terms. They include general motor vehicles, such as passenger cars, SUVs, MPVs, buses, trucks, and other cargo or passenger vehicles; watercraft, including various boats and ships; and aircraft. They also include hybrid vehicles, electric vehicles, fuel vehicles, plug-in hybrid vehicles, fuel cell vehicles, and other alternative fuel vehicles. A hybrid vehicle is a vehicle having two or more power sources, and electric vehicles include pure electric vehicles, extended-range electric vehicles, and the like, which is not particularly limited in the present application.
The above embodiments are only exemplary embodiments of the present application and are not intended to limit the present application, the scope of which is defined by the claims. Various modifications and equivalent arrangements of this application will occur to those skilled in the art, and are intended to be within the spirit and scope of the application.

Claims (10)

1. The monitoring and resetting method for the multi-core heterogeneous system is characterized by being applied to the multi-core heterogeneous system, wherein the multi-core heterogeneous system comprises a first hardware domain, a second hardware domain and a system error monitoring module, the first hardware domain comprises a first watchdog and a second watchdog, and the first hardware domain is used for supporting the operation of a first operating system; the method comprises the following steps:
outputting a first timeout signal by the first watchdog in response to a first target process in an application space of the first operating system being in a fault state;
transmitting, by the system error monitoring module, a first failure identifier for identifying that the first target process is in a failure state to the second hardware domain in response to the first timeout signal;
sending a first reset instruction to the first hardware domain by using inter-core communication based on the first fault signal through the second hardware domain, and indicating the first hardware domain to restart at least a first target process in a fault state;
and outputting a second timeout signal by the second watchdog in response to the kernel space of the first operating system being in a fault state, and triggering the first hardware domain to restart through the second timeout signal.
2. The method of claim 1, wherein sending, by the second hardware domain, a first reset instruction to the first hardware domain using inter-core communication based on the first failure signal, instructing the first hardware domain to restart at least a first target process in a failure state, comprises:
transmitting a first reset instruction to the first hardware domain by inter-core communication based on the first fault signal through the second hardware domain;
and responding to the first reset instruction through a second target process in the application space of the first operating system, and restarting at least the first target process in a fault state.
3. The method according to claim 2, wherein the method further comprises:
periodically resetting, by the second target process, a state identifier corresponding to the second target process from a first state value to a second state value;
periodically polling the state identification of each second target process through the first target process;
restarting a second target process corresponding to the state identifier under the condition that the polling determines that the state identifier is a first state value;
in the event that the poll determines that the status flag is a second status value, the status flag is reset from the second status value to the first status value.
4. A method according to claim 3, wherein periodically polling, by the first target process, the state identification of each of the second target processes comprises:
and periodically polling the state identification of each second target process through a timing polling thread in the first target process.
5. A method according to claim 3, characterized in that the method further comprises:
and under the condition that the polling determines that the state identifier is a first state value, capturing the process information of a second target process corresponding to the state identifier, and generating a crash log.
6. The method of claim 1, wherein triggering the first hardware domain restart by the second timeout signal comprises:
transmitting a second fault identifier for identifying that the kernel space of the first operating system is in a fault state to the second hardware domain by the system error monitoring module in response to the second timeout signal;
and calling a system reset function to trigger the first hardware domain to restart through a system error processing program in the second hardware domain based on the second fault identifier.
7. The method of claim 1, wherein triggering the first hardware domain restart by the second timeout signal comprises:
And responding to the second timeout signal through a reset module of the multi-core heterogeneous system, and triggering the multi-core heterogeneous system to restart.
8. The multi-core heterogeneous system is characterized by comprising a first hardware domain, a second hardware domain and a system error monitoring module, wherein the first hardware domain comprises a first watchdog and a second watchdog, and the first hardware domain is used for supporting the operation of a first operating system;
the first watchdog is configured to output a first timeout signal in response to at least one first target process in an application space of the first operating system being in a failure state;
the system error monitoring module is configured to respond to the first timeout signal and send a first fault identification for identifying that the first target process is in a fault state to the second hardware domain;
the second hardware domain is configured to send a first reset instruction to the first hardware domain by utilizing inter-core communication based on the first fault signal, and instruct the first hardware domain to restart at least a first target process in a fault state;
the second watchdog is configured to output a second timeout signal in response to the kernel space of the first operating system being in a fault state, and trigger the first hardware domain to restart through the second timeout signal.
9. A chip comprising the multi-core heterogeneous system of claim 8.
10. An electronic device comprising the chip of claim 9.
CN202311396176.9A 2023-10-25 2023-10-25 Monitoring reset method and system of multi-core heterogeneous system, chip and electronic equipment Active CN117130832B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311396176.9A CN117130832B (en) 2023-10-25 2023-10-25 Monitoring reset method and system of multi-core heterogeneous system, chip and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311396176.9A CN117130832B (en) 2023-10-25 2023-10-25 Monitoring reset method and system of multi-core heterogeneous system, chip and electronic equipment

Publications (2)

Publication Number Publication Date
CN117130832A true CN117130832A (en) 2023-11-28
CN117130832B CN117130832B (en) 2024-02-23

Family

ID=88851162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311396176.9A Active CN117130832B (en) 2023-10-25 2023-10-25 Monitoring reset method and system of multi-core heterogeneous system, chip and electronic equipment

Country Status (1)

Country Link
CN (1) CN117130832B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117931529A (en) * 2024-03-21 2024-04-26 上海励驰半导体有限公司 Startup management method and device, electronic device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120079328A1 (en) * 2010-09-27 2012-03-29 Hitachi Cable, Ltd. Information processing apparatus
US20160132378A1 (en) * 2014-11-12 2016-05-12 Hyundai Motor Company Method and apparatus for controlling watchdog
CN106326055A (en) * 2016-08-29 2017-01-11 四川九洲空管科技有限责任公司 Method for software and hardware crashing detection and resetting of airborne collision avoidance system
CN116048861A (en) * 2023-01-17 2023-05-02 厦门四信通信科技有限公司 Multi-level watchdog design method, device, equipment and storage medium
CN116450390A (en) * 2022-01-07 2023-07-18 荣耀终端有限公司 Watchdog detection method and electronic equipment
CN116450386A (en) * 2022-01-07 2023-07-18 荣耀终端有限公司 Watchdog detection method, device and storage medium

Also Published As

Publication number Publication date
CN117130832B (en) 2024-02-23

Similar Documents

Publication Publication Date Title
CN117130832B (en) Monitoring reset method and system of multi-core heterogeneous system, chip and electronic equipment
US10095576B2 (en) Anomaly recovery method for virtual machine in distributed environment
JP4586750B2 (en) Computer system and start monitoring method
US8868968B2 (en) Partial fault processing method in computer system
JP2006191338A (en) Gateway apparatus for diagnosing fault of device in bus
CN109144873B (en) Linux kernel processing method and device
CN105868060B (en) Method for operating a data processing unit of a driver assistance system and data processing unit
US6526527B1 (en) Single-processor system
US20220055637A1 (en) Electronic control unit and computer readable medium
CN115904793B (en) Memory transfer method, system and chip based on multi-core heterogeneous system
CN107179911B (en) Method and equipment for restarting management engine
CN114217925A (en) Business program operation monitoring method and system for realizing abnormal automatic restart
US11467865B2 (en) Vehicle control device
WO2006127493A2 (en) Software process monitor
US10514970B2 (en) Method of ensuring operation of calculator
CN113043969A (en) Vehicle function safety monitoring method and system
CN111782515A (en) Web application state detection method and device, server and storage medium
US20190332506A1 (en) Controller and function testing method
CN111338914A (en) Fault notification method and related equipment
CN113711209A (en) Electronic control device
JPH08329006A (en) Fault information system
CN117234787B (en) Method and system for monitoring running state of system-level chip
CN113500913B (en) Drawing assembly of full liquid crystal instrument
CN116991637B (en) Operation control method and device of embedded system, electronic equipment and storage medium
CN113515397B (en) IPMI command processing method, server, and non-transitory computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant