CN110874279B - Fault positioning method, device and system

Info

Publication number: CN110874279B
Application number: CN201810998164.6A
Authority: CN
Inventors: 杨骁�
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2018-08-29
Filing date: 2018-08-29
Publication date: 2023-05-30
Anticipated expiration: 2038-08-29
Also published as: CN110874279A

Abstract

The application discloses a fault locating method, device and system. The method comprises: if a system fault is detected, printing a stack area of the kernel stack and a stack area of the user stack to obtain a print result; storing the print result and controlling the system to restart; and, after the system has restarted, determining a fault type based on the print result. The application solves the technical problem that prior-art fault locating methods cannot obtain fault information in time when a fault occurs, which results in low locating accuracy.

Description

Fault positioning method, device and system

Technical Field

The present disclosure relates to the field of embedded systems, and in particular, to a fault locating method, device and system.

Background

In the field of business data analysis, an intelligent router must meet two demands. On the one hand, it must stably report WiFi sniffing data to the cloud through a long-lived connection over a long period. On the other hand, it must run stably for a long time without hanging in a way that it cannot recover from; otherwise resources and time are wasted. When the router starts, it establishes a long-lived TCP connection with the cloud server, which is used for reporting data and for receiving control instructions or system upgrade instructions and data issued by the server. Therefore, if a system-level crash occurs, the information collected at the time of the abnormality can be sent to the server for analysis after the device restarts, whatever the reason the system stopped working normally. A hang, however, is a fault that obstructs information collection: the programs on the router that collect data and accept instructions can no longer be scheduled and executed, and nothing short of cutting the power can restart the device and restore service. Even then, because the hardware loses power, none of the software logic gets a chance to run, so no information that could help debugging is collected. At most, the cloud learns that the device was restarted manually, but not why it became abnormal.

A hang typically arises when a relatively high-priority program is stuck in the kernel and occupies the CPU intensively, for example in a dead loop with no external storage input or output; other, lower-priority programs then lose the chance to obtain the CPU, starve for a long time, and cannot complete their business logic. For debugging such abnormalities there are three traditional solutions. The first is a software debugger, such as gdb, which can attach to the program threads to be debugged and inspect the executed statements and their context; its disadvantage is that the device stops responding when the system hangs, so there is no opportunity to run the debugger. The second is a hardware debugger, which can observe memory and registers; its disadvantages are that it must be physically wired to the device to issue debugging instructions, that finished machines deployed in commercial premises have no such connection, and that the conditions that trigger a hang are unclear and may take several days to reproduce, so a hardware debugger cannot simply be left attached until the hang happens. The third is a running log: the program prints log text to the serial port while it runs, and when an abnormality occurs the text is used to analyze the logic executed beforehand. For a hang, however, it is not known which program logic causes the abnormality, and the code base of the whole system is very large, so a debugging statement is almost never already present in the offending logic. The running log can therefore only supplement an approximate location that has already been found; it cannot serve as the direct entry point for locating the abnormal logic.

The conventional solution generally uses a watchdog mechanism to discover that the device has hung: if the dog-feeding role stops feeding the dog for a certain period, the watchdog program triggers a restart of the device. The disadvantage is that, although the device escapes the inoperable hung state, the valuable abnormal scene is completely lost through the power cycle by the time of the next start, and development and maintenance personnel have no clue as to why the device hung the last time. If such hangs are frequent, the watchdog kills and restarts the system again and again; the device ends up restarting continuously, and its service capability is still not restored.

For the problem that prior-art fault locating methods cannot obtain fault information in time when a fault occurs, resulting in low locating accuracy, no effective solution has yet been proposed.

Disclosure of Invention

The embodiment of the application provides a fault positioning method, device and system, which at least solve the technical problem that the fault positioning method in the prior art cannot acquire fault information in time when a fault occurs, so that the positioning accuracy is low.

According to an aspect of the embodiments of the present application, there is provided a fault locating method, including: if the system is detected to be faulty, printing a stack area of the kernel stack and a stack area of the user stack to obtain a printing result; storing the printing result and controlling the restarting of the system; if the system is restarted, a fault type is determined based on the print result.

According to another aspect of the embodiments of the present application, there is also provided a fault locating device, including: the printing module is used for printing the stack area of the kernel stack and the stack area of the user stack if the system is detected to be faulty, so as to obtain a printing result; the storage module is used for storing the printing result and controlling the restarting of the system; and the determining module is used for determining the fault type based on the printing result if the system is restarted.

According to another aspect of the embodiments of the present application, there is also provided a storage medium, including a stored program, where the program controls a device in which the storage medium is located to perform the following steps when running: if the system is detected to be faulty, printing a stack area of the kernel stack and a stack area of the user stack to obtain a printing result; storing the printing result and controlling the restarting of the system; if the system is restarted, a fault type is determined based on the print result.

According to another aspect of the embodiments of the present application, there is also provided a processor for running a program, wherein the program executes the following steps: if the system is detected to be faulty, printing a stack area of the kernel stack and a stack area of the user stack to obtain a printing result; storing the printing result and controlling the restarting of the system; if the system is restarted, a fault type is determined based on the print result.

According to another aspect of the embodiments of the present application, there is also provided a fault locating system, including: a processor; and a memory, coupled to the processor, for providing instructions to the processor for processing the steps of: if the system is detected to be faulty, printing a stack area of the kernel stack and a stack area of the user stack to obtain a printing result; storing the printing result and controlling the restarting of the system; if the system is restarted, a fault type is determined based on the print result.

In the embodiments of the application, if a system fault is detected, the stack area of the kernel stack and the stack area of the user stack are printed to obtain a print result, the print result is stored, and the system is controlled to restart; once the system has restarted successfully, the fault type is determined based on the print result. The hang fault that has occurred is thereby used to locate accurately the code section that caused it.

It is easy to notice that the kernel stack and the user stack are printed before the reset, and that the printed information, combined with a disassembly of the firmware image, is analyzed against the binary symbols to locate the code section where the dead loop occurs. Compared with the prior art, the abnormal scene is not completely lost when the device powers down, which improves locating accuracy, reduces locating cost, and improves locating efficiency.

Therefore, the embodiment of the application solves the technical problem that in the prior art, the fault positioning method cannot acquire fault information when a fault occurs in time, so that the positioning accuracy is low.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:

FIG. 1 is a block diagram of a hardware architecture of a computer terminal (or mobile device) for implementing a fault localization method according to an embodiment of the present application;

FIG. 2 is a flow chart of a fault localization method according to embodiment 1 of the present application;

FIG. 3 is a flow chart of an alternative fault localization method according to an embodiment of the present application;

FIG. 4 is a schematic diagram of code of an alternative print kernel stack and user stack according to an embodiment of the present application;

FIG. 5 is a schematic illustration of an alternative print result according to an embodiment of the present application;

FIG. 6 is a process flow diagram of an alternative interrupt according to an embodiment of the present application;

FIG. 7 is a schematic view of a fault location device according to embodiment 2 of the present application; and

FIG. 8 is a block diagram of a computer terminal according to an embodiment of the present application.

Detailed Description

In order to make the present application solution better understood by those skilled in the art, the following description will be made in detail and with reference to the accompanying drawings in the embodiments of the present application, it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

First, partial terms or terminology appearing in describing embodiments of the present application are applicable to the following explanation:

Interrupt: the entire process by which the processor handles an emergency event that occurs during program execution. While a program is running, if an emergency event occurs outside the system, inside the system, or in the current program itself, the processor immediately suspends the current program, automatically transfers to the corresponding handling program (the interrupt service routine), and returns to the original program once handling is finished. The whole process is called a program interrupt.

Interrupt Service Routine (ISR): the processor handles an "emergency event" by executing a specific program written in advance, which can be understood as performing a service. This program for handling the "emergency event" is called the interrupt service routine.

Watchdog (watchdog): a mechanism for enhancing the robustness of a computer system. One task with the lowest priority is created, called the dog-feeding task, which clears a counter each time it gets to run. In addition, one code segment that is guaranteed to run periodically is created, such as the clock interrupt ISR; each time it runs, it increments the counter and checks it, and it is called the watchdog. When the watchdog finds that the dog-feeding task has not run for a period of time, so that the counter has grown beyond a certain threshold, it judges that the lowest-priority dog-feeding task is starving, that is, that higher-priority tasks are occupying too much CPU, and it resets the device. Sometimes the role of the watchdog is instead played by one task with the highest priority. In summary, the key to the mechanism is one weak role that feeds the dog plus one supervisor role that acts as the watchdog.
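The counter mechanism can be illustrated with a small simulation. This is a minimal sketch in Python, not the kernel code of the application; the class name, the function names, and the threshold unit (clock ticks) are assumptions chosen for the example.

```python
class Watchdog:
    """Toy model of the counter-based watchdog described above."""

    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical tolerance, in clock ticks
        self.counter = 0

    def feed(self):
        # Run by the lowest-priority dog-feeding task whenever it is scheduled.
        self.counter = 0

    def tick(self):
        # Run by the periodic clock-interrupt ISR. Returns True once the
        # feeding task has starved for more than `threshold` ticks.
        self.counter += 1
        return self.counter > self.threshold


def ticks_until_reset(threshold, starved_from_tick, max_ticks=1000):
    """Simulate clock ticks; the feeding task stops running at `starved_from_tick`.

    Returns the tick at which the watchdog asks for a reset, or None.
    """
    dog = Watchdog(threshold)
    for tick in range(1, max_ticks + 1):
        if tick < starved_from_tick:
            dog.feed()
        if dog.tick():
            return tick
    return None
```

With a threshold of 3 and a feeding task that starves from tick 5 onward, the counter passes the threshold a few ticks later; a feeding task that keeps running never triggers a reset.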

Call stack (callstack): in computer science, a stack that stores information about the running subroutines, most often their return addresses. When any subroutine is called, the caller must save the address to which execution should return once the subroutine finishes; likewise, if the called subroutine calls further subroutines, its own return address must be saved on the execution stack and retrieved after each callee finishes. On this principle, any program can at any time trace back through its own stack space, which necessarily contains the return instruction address saved before the previous-level program made the call; repeating the step finds the return address saved one level further up, and so on. The call chain of the program can thus be resolved from the return addresses collected and identified, and the relation they form is called the call stack.

The kernel stack: the kernel, when creating a process, may create a corresponding stack for the process. Each process has a kernel stack stored in kernel space. When a process runs in kernel space, the content in the CPU stack pointer register is the kernel stack space address, and the kernel stack is used.

User stack: the kernel, when creating a process, may create a corresponding stack for the process. Each process has a user stack stored in user space. When the process runs in the user space, the content in the CPU stack pointer register is the user stack space address, and the user stack is used.

Example 1

In accordance with the embodiments of the present application, there is provided an embodiment of a fault localization method, it being noted that the steps shown in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order other than that shown.

The method embodiment provided in the first embodiment of the present application may be executed in a mobile terminal, a computer terminal or a similar computing device. Fig. 1 shows a block diagram of a hardware architecture of a computer terminal (or mobile device) for implementing a fault localization method. As shown in fig. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processors 102 may include, but are not limited to, processing means such as a microprocessor MCU or a programmable logic device FPGA), a memory 104 for storing data, and a transmission means 106 for communication functions. In addition, the computer terminal may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.

It should be noted that the one or more processors 102 and/or other data processing circuits described above may be referred to generally herein as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Furthermore, the data processing circuit may be a single stand-alone processing module, or incorporated, in whole or in part, into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the present application, the data processing circuit acts as a kind of processor control (e.g., selection of a variable-resistance termination path connected to an interface).

The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the fault location method in the embodiments of the present application, and the processor 102 executes the software programs and modules stored in the memory 104, thereby performing various functional applications and data processing, that is, implementing the fault location method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission means 106 is arranged to receive or transmit data via a network. The specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.

The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).

It should be noted here that, in some alternative embodiments, the computer device (or mobile device) shown in fig. 1 described above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that fig. 1 is only one example of a specific example, and is intended to illustrate the types of components that may be present in the computer device (or mobile device) described above.

In the above-described operating environment, the present application provides a fault localization method as shown in fig. 2. Fig. 2 is a flowchart of a fault location method according to embodiment 1 of the present application. As shown in fig. 2, the method comprises the steps of:

Step S22, if a system fault is detected, printing the stack area of the kernel stack and the stack area of the user stack to obtain a print result.

Specifically, the system may be a Linux operating system deployed on an intelligent router in a public area, where the intelligent router is built on MIPS CPU hardware and the Linux kernel. Whether the system has failed can be detected by the watchdog program. The embodiments of the application mainly target hang faults: when a hang is detected, the system can be reset and the device restarted, which releases the device from the hung state.

To locate the fault accurately, the abnormal scene must be exposed as fully as possible. Specifically, because the C language is sufficiently close to the underlying machine, the stack area of the kernel stack and the stack area of the user stack can be printed to obtain a print result; after the next boot, this information is uploaded to the cloud, so that the code of the application program that was occupying the CPU when the abnormality occurred can be located with source-line-level accuracy.

If the dead loop occurs in kernel space, analysis of the kernel stack shows that, once the call chains used by the watchdog itself are set aside, the end of the remaining part is the position where the dead loop occurred. If the dead loop occurs in user space, the same reasoning applies: analyzing the user stack directly shows the dead loop and the call chain that led to it. If the process is trapped in the kernel, for example after user space makes a system call and enters kernel space, the user stack shows the context of the system call and the kernel stack shows the call chain in which handling the system call fell into the dead loop, so the two together give the complete picture of how the dead loop occurred.

For example, as shown in fig. 3, assume that the watchdog is hooked onto the clock interrupt. After the watchdog is triggered, that is, after the counter exceeds the threshold, the user stack, the CPU registers, and the kernel stack are printed, and the print result is obtained and saved. The device is then restarted, and after this second start the print result is sent to the cloud for analysis to help locate the problem.

The code for printing the kernel stack and the user stack is shown in fig. 4, where kstk_eip(current) and kstk_esp(current) are $epc and $sp respectively. The working principle is that the scene saved for the thread before it was switched off the CPU can be fetched from the thread_info control block data structure of the current thread. The print result is shown in fig. 5.

Step S24, storing the printing result and controlling the system to restart.

Specifically, to ensure that the print result is not lost when the system restarts, the print result may be stored in storage that is preserved across a warm restart or in a nonvolatile storage device, and the system is then restarted.

Step S26, if the system is restarted, the failure type is determined based on the printing result.

In one alternative, after the system is restarted, the print result may be read from the storage device and analyzed to locate the specific position where the dead loop occurred and to determine the fault type, that is, which code of which application program was occupying the CPU when the abnormality occurred. In another alternative, after the system is restarted, the print result may be read from the storage device and uploaded to the server; the server analyzes the print result, locates the specific position where the dead loop occurred, and determines the fault type.

It should be noted that, when parsing the kernel stack, the top of the stack is occupied by the ISR and its subsequent call chain. The parser must walk through this content to reach the stack frame of the original ISR entry and then skip the length of that frame to reach the call chain that was active before the interrupt occurred, which is usually where the dead loop is.

According to the method provided by embodiment 1, if a system fault is detected, the stack area of the kernel stack and the stack area of the user stack are printed to obtain a print result, the print result is stored, and the system is controlled to restart; once the system has restarted successfully, the fault type is determined based on the print result. The hang fault that has occurred is thereby used to locate accurately the code section that caused it.

Therefore, the technical problem that the positioning accuracy is low because the fault information of the fault occurrence cannot be timely obtained by the fault positioning method in the prior art is solved by the scheme of the embodiment 1.

In the above embodiment of the present application, step S22, printing a stack area of a kernel stack and a stack area of a user stack to obtain a printing result includes:

Step S222, acquiring a preset print length.

Specifically, the preset print length may be the minimum length that still allows the specific position where the dead loop occurs to be located accurately, chosen according to the needs of the actual disassembly.

Step S224, according to the preset printing length, the stack area of the kernel stack and the stack area of the user stack are printed byte by byte.

In an alternative, by printing the stack area of the kernel stack and the stack area of the user stack each byte by byte for a sufficient length, the specific location where the dead loop occurs can be analyzed and the code segment that caused the fault determined.
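Steps S222 and S224 can be pictured with a small formatter. This is a minimal sketch in Python that treats a stack area as a byte string; the function name, the line layout, and the 16-bytes-per-line default are assumptions for illustration, and the actual print code is the one shown in fig. 4.

```python
def dump_bytes(memory, start, preset_length, bytes_per_line=16):
    """Format `preset_length` bytes of `memory` from offset `start`, byte by byte."""
    lines = []
    # Never read past the end of the area that is actually available.
    end = min(start + preset_length, len(memory))
    for line_start in range(start, end, bytes_per_line):
        chunk = memory[line_start:min(line_start + bytes_per_line, end)]
        hex_bytes = " ".join(f"{b:02x}" for b in chunk)
        lines.append(f"{line_start:08x}: {hex_bytes}")
    return lines
```

A longer preset length costs more storage but lets the later analysis reach deeper stack frames, which is why the length is chosen from the disassembly needs.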

In the above embodiment of the present application, the method further includes step S28: printing the processor registers while printing the stack area of the kernel stack and the stack area of the user stack, to obtain the print result.

Specifically, the interrupt response entry, such as handle_int or brcmIRQ, saves the current scene on the stack (the stack-raising and register-push operations are the passages involving $k0, $k1 and $sp, where $k0 and $k1 are the special registers that the MIPS architecture reserves for the OS kernel) and then sets the $RA register to <ret_from_irq>. The CPU registers can therefore be printed so that disassembly can determine how long the stack frame used by the interrupt response entry is (44 words for handle_int) and where it holds its $RA.

In the above embodiment of the present application, step S24, storing the printing result includes:

Step S242, storing the print result in any one or more of the following storage media: random access memory and external flash memory.

Specifically, to ensure that the print result is not lost when the system restarts, the print result may be stored in an area of random access memory (RAM) that is preserved across a warm restart, or directly in nonvolatile external flash memory, and the system is then restarted.
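After the restart, the software has to tell a genuine saved print result from whatever else the preserved region may contain. The application does not specify a record format; the following is a minimal sketch in Python of one possible framing, in which the magic value, the header layout, and the CRC32 check are all assumptions introduced for the example.

```python
import struct
import zlib

MAGIC = 0x57444F47               # hypothetical marker, ASCII "WDOG"
HEADER = struct.Struct(">III")   # magic, payload length, CRC32 of payload


def pack_record(print_result):
    """Wrap the print result so it can be recognised after the restart."""
    header = HEADER.pack(MAGIC, len(print_result), zlib.crc32(print_result))
    return header + print_result


def unpack_record(region):
    """Return the saved print result, or None if no valid record is present."""
    if len(region) < HEADER.size:
        return None
    magic, length, crc = HEADER.unpack_from(region)
    payload = region[HEADER.size:HEADER.size + length]
    if magic != MAGIC or len(payload) != length or zlib.crc32(payload) != crc:
        return None
    return payload
```

The checksum guards against a region that was only partly written before the reset, and the marker guards against a cold boot in which the region holds arbitrary data.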

In the above embodiment of the present application, step S26, determining the fault type based on the printing result includes:

step S262, the printing result is analyzed, and the address information of the system fault is determined.

Specifically, the address information of the system failure may be the address that was executing before the interrupt occurred. For example, on a Broadcom hardware platform based on the MIPS architecture, the dispatch entry of all ISRs is the brcmIRQ() function in the kernel, whose relevant assembly is:

move k1, sp

addiu sp, k1, -176    # equivalent to: addiu sp, sp, -176

sw k0, 140(sp)

sw ra, 148(sp)    # saves $ra, the return address, 148 bytes above the stack top at $sp

...

The passage above is where the brcmIRQ() function saves the context of the code that was running when the interrupt occurred. It shows how the interrupted scene is laid out on the stack after the interrupt occurs and before the interrupt is serviced.

For example, as shown in the print result of fig. 5, the stack frame used by the brcmIRQ() function is 176 bytes, that is, 44 words, and the word at byte offset 148, that is, word 37, holds the return address $ra. The address that was executing before the interrupt occurred is thus found: 0xc0693c50.

Step S264, disassembling the printing result to obtain the function information corresponding to the address information.

Specifically, after the address information is obtained from the print result, the binary image can be disassembled to determine immediately which function the dead loop occurred in. For example, for the print result shown in fig. 5, the function obtained by disassembly is ioc_expr_link().

Step S266, based on the function information, determines the fault type.

In particular, after the function information is obtained by disassembly, the fault type can be further determined, that is, which piece of code of which application program entered a dead loop.
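Steps S262 to S264 can be sketched off-device as follows. This is a minimal illustration in Python rather than the tooling used by the application: the frame constants come from the brcmIRQ() assembly quoted earlier, while the big-endian byte order, the synthetic stack frame, the placeholder neighbouring function names, and every symbol start address (including the one given to ioc_expr_link) are assumptions made only for the example.

```python
import bisect
import struct

FRAME_BYTES = 176  # brcmIRQ() frame size, from "addiu sp, k1, -176"
RA_OFFSET = 148    # byte offset of the saved $ra, from "sw ra, 148(sp)"


def saved_return_address(frame, byteorder=">"):
    """Step S262: read the saved $ra word out of one interrupt-entry frame."""
    if len(frame) < FRAME_BYTES:
        raise ValueError("dump is shorter than one full stack frame")
    return struct.unpack_from(byteorder + "I", frame, RA_OFFSET)[0]


def find_function(symbols, address):
    """Step S264: map an address to (function name, offset into the function).

    `symbols` is a list of (start_address, name) pairs sorted by address,
    such as could be taken from a disassembly of the firmware image.
    """
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None  # address lies below every known symbol
    start, name = symbols[index]
    return name, address - start


# Synthetic frame (assumption): all zeros except the $ra slot, which holds
# the address quoted in the text.
frame = bytearray(FRAME_BYTES)
struct.pack_into(">I", frame, RA_OFFSET, 0xC0693C50)

# Hypothetical symbol table: start addresses and neighbours are invented.
symbols = [
    (0xC0693A00, "hypothetical_prev_func"),
    (0xC0693C00, "ioc_expr_link"),
    (0xC0693D00, "hypothetical_next_func"),
]

address = saved_return_address(bytes(frame))
location = find_function(symbols, address)
```

The returned offset within the function is what allows the disassembly listing to be narrowed down from a function to a specific instruction, and from there to a source line.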

It should be noted that, when the hardware platform or the Linux kernel version differs, the details of the interrupt response flow and of the assembly code differ, but the principle is always the same. The overall response flow is always: the interrupt breaks into whatever kernel thread code is executing and enters the total interrupt response entry, such as handle_int or brcmIRQ. Whatever this total entry is, it saves the current scene on the stack, then determines the interrupt source and begins dispatching the interrupt.

Because of this, once the ISR has finished its service, execution always returns to ret_from_irq, which restores the scene and finally returns, through the ERET assembly instruction, to the code that the interrupt broke into. When a value of $RA found in the printed stack content is ret_from_irq, the function one level above the current function must be the "total interrupt response entry", because only that entry sets $RA to ret_from_irq. The entry is handle_int, or brcmIRQ, or whatever other symbol was designated as exception handler number 0 when the exception vector table was installed. Whatever the total entry is, it always raises the stack and saves the scene using $k0, $k1 and $sp, so finding its assembly code reveals how long its stack frame is (44 words for handle_int) and where it saves its $RA (word 37). That is to say, when $RA is found to point to ret_from_irq, the next stack frame belongs to the "total entry" installed as exception handler number 0; that frame (for example, 44 words) should be skipped accordingly, and the $RA saved in it by the "total entry" gives the assembly address of the kernel code that the interrupt broke into. When debugging a dead loop, this address is where the dead loop occurs, or close to it.

In the above embodiment of the present application, step S262, analyzing the printing result, determining the address information of the system failure, includes:

Step S2622, analyzing the printing result, and determining the stack frame corresponding to the interrupt service routine.

Specifically, whatever the "interrupt response total entry" is, it always pushes the stack and saves the context using $k0, $k1 and $sp. By finding the assembly code that operates on $k0, $k1 and $sp, the stack frame corresponding to the interrupt service routine can be determined, together with the length of that stack frame, for example 44 words for handle_int.

Step S2624, obtaining data with a preset length from the stack top of the stack frame to obtain address information.

Specifically, the preset length may be the distance from the top of the stack frame to the saved $RA; for example, for handle_int, the preset length may be 37 words.
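Steps S2622 and S2624 can be sketched as follows. This is a minimal illustration, not the implementation of the application: it assumes the stack dump has already been parsed into a list of 32-bit words, that the index of the top of the entry frame is already known, and that "the 37th word" means a zero-based word offset of 37 from the frame top. The dump contents and the address 0x80123124 are made up.

```python
FRAME_WORDS = 44  # frame length of handle_int in 32-bit words (from the text)
RA_WORD = 37      # word offset of the saved $RA in that frame (assumed zero-based)


def saved_ra(stack_words, frame_top, ra_word=RA_WORD, frame_words=FRAME_WORDS):
    """Return the $RA saved in the entry frame that starts at index frame_top."""
    if frame_top < 0 or frame_top + frame_words > len(stack_words):
        raise ValueError("stack dump does not contain the whole frame")
    return stack_words[frame_top + ra_word]


def next_frame(frame_top, frame_words=FRAME_WORDS):
    """Index of the first word past the entry frame (the interrupted code's stack)."""
    return frame_top + frame_words


# Made-up dump: 4 words belonging to the ISR, then one 44-word entry frame.
dump = [0] * 4 + [0] * FRAME_WORDS + [0xDEADBEEF]
dump[4 + RA_WORD] = 0x80123124  # hypothetical address inside the looping code

print(hex(saved_ra(dump, frame_top=4)))  # prints "0x80123124"
```

On a different platform or kernel version, the two constants would have to be re-derived from the assembly code of whatever symbol serves as the total entry.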

In the above embodiment of the present application, after the system is restarted in step S26, the method further includes: step S210, uploading the printing result to the server, where the server obtains the fault type based on the printing result.

Specifically, since the intelligent router is connected with the cloud server and can report data to the server, the stored printing result can be sent to the cloud server after the system is restarted, so that the cloud server can determine the specific position where the dead loop occurred by analyzing and disassembling the printing result.

In the above embodiment of the present application, step S20, detecting whether the system has a fault includes:

Step S202, judging whether the current count value of the counter exceeds a first preset value.

Specifically, the first preset value may be a tolerance threshold for tolerating occurrence of a hanging failure.

Step S204, if the current count value exceeds the first preset value, the system is determined to be faulty.

Step S206, if the current count value does not exceed the first preset value, it is determined that the system has not failed.

In an alternative scheme, for the hang fault, the main principle of the watchdog is as follows: program logic that has difficulty obtaining CPU time in the system plays the role of periodically feeding the dog, and program logic that easily obtains CPU time plays the role of the watchdog. When no hang fault occurs, the dog-feeding logic, as the weak party, still gets to run after a period of time and clears a counter. When a hang fault occurs, the dog-feeding logic cannot be executed, so the counter keeps growing until it approaches and then exceeds the tolerance threshold. The watchdog program logic, as the strong party, then sees that the counter exceeds the threshold, which means that the dog-feeding program has not been executed for a sufficient number of periods and the system may have hung.

For example, as shown in FIG. 3, when the counter exceeds a threshold, the user stack and kernel stack may be printed and the CPU registers printed; when the counter does not exceed the threshold value, no processing may be performed.

In the above embodiment of the present application, the method further includes: step S212, when the first preset thread is detected to run, controlling the current count value of the counter to increase by a second preset value, where the first preset thread runs once every preset time period.

Specifically, the second preset value may be 1, and the preset time period may be the execution period of the first preset thread.

Optionally, the running priority of the first preset thread is highest. Further, the first preset thread may be a clock interrupt thread.

In particular, the watchdog needs a program segment that has very high priority and is certain to get a chance to execute. In the most extreme case, this logic can be written in the clock interrupt ISR (i.e., the first preset thread). The clock interrupt is the mechanism by which the whole system perceives the passage of time, and it is driven by the periodic level triggering of a hardware circuit, so it is guaranteed a chance to execute.

It should be noted that the intelligent router is implemented on hardware with a MIPS CPU architecture and a Linux kernel. In order to make the clock interrupt ISR take the role of the watchdog program, the interrupt handling of this specific project needs to be analyzed; in the Linux kernel on this hardware, the interrupt processing flow is as shown in fig. 6.

For example, as shown in FIG. 3, the watchdog feed variable may be a positive integer counter; the highest-priority interrupt ISR increments the counter by 1 every second and determines whether the counter is greater than a threshold. Other business logic threads may have normal priority.

In the above embodiment of the present application, the method further includes: step S214, when the second preset thread is detected to run, controlling the counter to be cleared.

Optionally, the second preset thread has the lowest running priority.

Specifically, the second preset thread may be a dog-feeding thread. Based on the watchdog principle, program logic that has difficulty obtaining CPU time in the system is needed to play the role of periodically feeding the dog, and the dog-feeding thread clears the counter each time it is scheduled.

For example, as shown in FIG. 3, the kernel dog-feeding thread may be a low-priority thread that clears the counter each time it is scheduled.
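The interplay of steps S202 to S206, S212 and S214 can be simulated with a short sketch. This is only an illustration in user-space Python, not kernel code: the class name `HangDetector`, the function `simulate` and the schedule passed to it are hypothetical, and one loop iteration stands in for one period of the high-priority tick.

```python
class HangDetector:
    """Counter shared by the high-priority tick and the low-priority feeder."""

    def __init__(self, threshold, step=1):
        self.threshold = threshold  # first preset value (tolerance threshold)
        self.step = step            # second preset value
        self.count = 0

    def tick(self):
        """Periodic high-priority path: increment, then report whether a hang is suspected."""
        self.count += self.step
        return self.count > self.threshold

    def feed(self):
        """Low-priority path: clear the counter whenever the feeder is scheduled."""
        self.count = 0


def simulate(threshold, feeder_runs):
    """Return the period index at which a hang is reported, or None."""
    detector = HangDetector(threshold)
    for period, feeder_ran in enumerate(feeder_runs):
        if feeder_ran:
            detector.feed()
        if detector.tick():
            return period
    return None


# The feeder is starved from period 3 onwards; the threshold is 5 ticks.
print(simulate(5, [True, True, True] + [False] * 10))  # prints 7
print(simulate(5, [True] * 20))                        # prints None
```

In the first run the counter reaches 6 at period 7, which is the first value greater than the threshold; at that point the scheme described above would print the stacks and registers.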

It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.

From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method described in the embodiments of the present application.

Example 2

According to an embodiment of the present application, there is further provided a fault locating device for implementing the fault locating method, as shown in fig. 7, the device 700 includes: a printing module 702, a storage module 704, and a determination module 706.

The printing module 702 is configured to print a stack area of the kernel stack and a stack area of the user stack if a system failure is detected, so as to obtain a printing result; the storage module 704 is used for storing the printing result and controlling the restarting of the system; the determining module 706 is configured to determine a failure type based on the print result if the system is restarted.

Specifically, the system may be a Linux operating system deployed in an intelligent router in a public area, where the intelligent router is implemented on hardware with a MIPS CPU architecture and a Linux kernel. Whether the system fails can be detected through the watchdog program. In the embodiment of the application, detection is mainly aimed at the hang fault; if a hang fault is detected, the system can be reset and the equipment restarted, so that the equipment is released from the hung state. To ensure that the print results are not lost after the system is restarted, the print results may be stored in storage whose contents survive a warm restart, or in a nonvolatile storage device, and then the system is restarted.

Here, the printing module 702, the storage module 704, and the determining module 706 correspond to steps S22 to S26 in embodiment 1, and the three modules are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to those disclosed in embodiment 1. It should be noted that the above-described module may be operated as a part of the apparatus in the computer terminal 10 provided in embodiment 1.

According to the scheme provided by embodiment 2, if the system is detected to be faulty, the printing module prints the stack area of the kernel stack and the stack area of the user stack to obtain a printing result, the storage module stores the printing result and controls the system to restart, and if the system is restarted successfully, the determining module determines the fault type based on the printing result, thereby achieving the purpose of accurately locating the code segment that caused the fault by making use of the hang fault that has occurred.

Therefore, the technical problem that the positioning accuracy is low because the fault information of the fault occurrence cannot be timely obtained by the fault positioning method in the prior art is solved by the scheme of the embodiment 2.

In the above embodiments of the present application, the printing module includes: an acquisition unit and a printing unit.

The acquisition unit is used for acquiring a preset printing length; the printing unit is used for printing the stack area of the kernel stack and the stack area of the user stack byte by byte according to the preset printing length.
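A minimal sketch of printing a region byte by byte up to a preset printing length is given below. The function name `dump_region`, the line format and the sample data are assumptions made for illustration; the application does not specify the output format.

```python
def dump_region(memory, start, length, per_line=16):
    """Format `length` bytes of `memory` beginning at offset `start`, byte by byte."""
    end = min(start + length, len(memory))  # never read past the region
    lines = []
    for base in range(start, end, per_line):
        chunk = memory[base:min(base + per_line, end)]
        hex_bytes = " ".join(f"{byte:02x}" for byte in chunk)
        lines.append(f"{base:08x}: {hex_bytes}")
    return lines


stack = bytes(range(40))  # stand-in for a stack area
for line in dump_region(stack, start=8, length=20):
    print(line)
```

Clamping the end of the range keeps the dump inside the stack area even when the preset printing length is larger than what remains of the region.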

Here, it should be noted that the above-described acquisition unit and printing unit correspond to step S222 to step S224 in embodiment 1, and both units are the same as the examples and application scenarios realized by the corresponding steps, but are not limited to those disclosed in embodiment 1 above. It should be noted that the above-described module may be operated as a part of the apparatus in the computer terminal 10 provided in embodiment 1.

In the above embodiment of the present application, the printing module is further configured to print the processor register while printing the stack area of the kernel stack and the stack area of the user stack, to obtain a printing result.

Here, the print module corresponds to step S284 in embodiment 1, and the print module is the same as the example and application scenario implemented by the corresponding step, but is not limited to the disclosure of embodiment 1. It should be noted that the above-described module may be operated as a part of the apparatus in the computer terminal 10 provided in embodiment 1.

In the above embodiments of the present application, the storage module is further configured to store the print result in any one or more of the following storage media: random access memory and external memory flash memory.
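One possible way to make a stored printing result recognizable after the restart is to wrap it in a small header. The sketch below is an assumption for illustration only: the header layout, the marker value `MAGIC` and the use of CRC32 are not described in the application, which only states where the result is stored.

```python
import struct
import zlib

MAGIC = 0x48414E47               # arbitrary marker chosen for this sketch
HEADER = struct.Struct("<III")   # magic, payload length, CRC32 of payload


def pack_record(payload):
    """Wrap the printing result so it can be validated after a restart."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload


def unpack_record(blob):
    """Return the payload, or None if the region holds no valid record."""
    if len(blob) < HEADER.size:
        return None
    magic, length, crc = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:HEADER.size + length]
    if magic != MAGIC or len(payload) != length or zlib.crc32(payload) != crc:
        return None
    return payload


record = pack_record(b"ra=80123124 sp=87f3be00")  # made-up dump text
print(unpack_record(record))        # the original payload
print(unpack_record(b"\x00" * 64))  # None: region holds no record
```

With such a check, code running after the restart could distinguish a region that holds a complete dump from one that was never written or was only partly written.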

Here, the storage module corresponds to step S242 in embodiment 1, and the module is the same as the example and application scenario implemented by the corresponding step, but is not limited to the disclosure of embodiment 1. It should be noted that the above-described module may be operated as a part of the apparatus in the computer terminal 10 provided in embodiment 1.

In the above embodiments of the present application, the determining module includes: the device comprises an analysis unit, a processing unit and a first determination unit.

The analysis unit is used for analyzing the printing result and determining the address information of the system fault; the processing unit is used for disassembling the printing result to obtain function information corresponding to the address information; the first determination unit is used for determining the fault type based on the function information.

move k1,sp

addiu sp, k1, -176 # corresponds to: addiu sp, sp, -176

sw k0,140(sp)

...

After the address information is obtained from the print result, the binary image can be disassembled to immediately locate the function in which the dead loop occurred. After the function information is obtained by disassembly, the fault type, i.e., which piece of code of which application program has entered a dead loop, can be further determined.

Here, the above analysis unit, the processing unit, and the first determination unit correspond to steps S262 to S266 in embodiment 1, and the three units are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to those disclosed in embodiment 1 above. It should be noted that the above-described module may be operated as a part of the apparatus in the computer terminal 10 provided in embodiment 1.

In the above embodiment of the present application, the analysis unit includes: an analysis sub-module and an acquisition sub-module.

The analysis sub-module is used for analyzing the printing result and determining a stack frame corresponding to the interrupt service routine; the acquisition sub-module is used for acquiring data with a preset length from the stack top of the stack frame to obtain address information.

Specifically, whatever the "interrupt response total entry" is, it always pushes the stack and saves the context using $k0, $k1 and $sp. By finding the assembly code that operates on $k0, $k1 and $sp, the stack frame corresponding to the interrupt service routine can be determined, together with the length of that stack frame, for example 44 words for handle_int. The preset length may be the distance from the top of the stack frame to the saved $RA, for example 37 words for handle_int.

Here, the analysis sub-module and the acquisition sub-module correspond to step S2622 to step S2624 in embodiment 1, and the two sub-modules are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to those disclosed in embodiment 1. It should be noted that the above-described modules may be operated as a part of the apparatus in the computer terminal 10 provided in embodiment 1.

In the above embodiments of the present application, the apparatus further includes: an uploading module.

The uploading module is used for uploading the printing result to the server after the system is restarted, and the server obtains the fault type based on the printing result.

It should be noted that, the above uploading module corresponds to step S210 in embodiment 1, and the module is the same as the example and application scenario implemented by the corresponding step, but is not limited to the disclosure of embodiment 1. It should be noted that the above-described module may be operated as a part of the apparatus in the computer terminal 10 provided in embodiment 1.

In the above embodiments of the present application, the apparatus further includes: a detection module, where the detection module includes: a judging unit, a second determining unit and a third determining unit.

The detection module is used for detecting whether the system has a fault; the judging unit is used for judging whether the current count value of the counter exceeds a first preset value; the second determining unit is used for determining that the system has failed if the current count value exceeds the first preset value; the third determining unit is used for determining that the system has not failed if the current count value does not exceed the first preset value.

Here, it should be noted that the above-mentioned detection module corresponds to step S20 in embodiment 1, the above-mentioned judging unit, second determining unit and third determining unit correspond to steps S202 to S206 in embodiment 1, and the module and the three units are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to those disclosed in embodiment 1 above. It should be noted that the above-described module may be operated as a part of the apparatus in the computer terminal 10 provided in embodiment 1.

In the above embodiments of the present application, the apparatus further includes: a control module.

The control module is used for controlling the current count value of the counter to be increased by a second preset value under the condition that the first preset thread is detected to run, wherein the first preset thread runs at intervals of a preset time period.

Here, it should be noted that the above control module corresponds to step S212 in embodiment 1, and the module is the same as the example and application scenario implemented by the corresponding step, but is not limited to the disclosure of embodiment 1. It should be noted that the above-described module may be operated as a part of the apparatus in the computer terminal 10 provided in embodiment 1.

In the foregoing embodiment of the present application, the control module is further configured to control the counter to be cleared when the second preset thread is detected to run.

Optionally, the second preset thread has the lowest running priority.

It should be noted that, the control module corresponds to step S214 in embodiment 1, and the module is the same as the example and application scenario implemented by the corresponding step, but is not limited to the disclosure of embodiment 1. It should be noted that the above-described module may be operated as a part of the apparatus in the computer terminal 10 provided in embodiment 1.

Example 3

According to an embodiment of the present application, there is also provided a fault locating system, including:

a processor; and

a memory, coupled to the processor, for providing instructions to the processor for processing the steps of: if the system is detected to be faulty, printing a stack area of the kernel stack and a stack area of the user stack to obtain a printing result; storing the printing result and controlling the restarting of the system; if the system is restarted, a fault type is determined based on the print result.

According to the method provided by the embodiment 3, if the system is detected to be faulty, the stack area of the kernel stack and the stack area of the user stack are printed to obtain a printing result, the printing result is stored, the system is controlled to restart, and if the system is restarted successfully, the fault type is further determined based on the printing result, so that the purpose of accurately positioning the code section causing the fault by utilizing the fault caused by the hanging is achieved.

Therefore, the solution provided in the present application in embodiment 3 solves the technical problem that in the prior art, the fault positioning method cannot acquire the fault information when the fault occurs in time, resulting in low positioning accuracy.

Example 4

Embodiments of the present application may provide a computer terminal, which may be any one of a group of computer terminals. Alternatively, in the present embodiment, the above-described computer terminal may be replaced with a terminal device such as a mobile terminal.

Alternatively, in this embodiment, the above-mentioned computer terminal may be located in at least one network device among a plurality of network devices of the computer network.

In this embodiment, the above-mentioned computer terminal may execute the program code of the following steps in the fault locating method: if the system is detected to be faulty, printing a stack area of the kernel stack and a stack area of the user stack to obtain a printing result; storing the printing result and controlling the restarting of the system; if the system is restarted, a fault type is determined based on the print result.

Optionally, fig. 8 is a block diagram of a computer terminal according to an embodiment of the present application. As shown in fig. 8, the computer terminal A may include: one or more (only one is shown) processors 802 and a memory 804.

The memory may be used to store software programs and modules, such as program instructions/modules corresponding to the fault locating method and apparatus in the embodiments of the present application, and the processor executes the software programs and modules stored in the memory, thereby executing various functional applications and data processing, that is, implementing the fault locating method described above. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located with respect to the processor, which may be connected to terminal A through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The processor may call the information and the application program stored in the memory through the transmission device to perform the following steps: if the system is detected to be faulty, printing a stack area of the kernel stack and a stack area of the user stack to obtain a printing result; storing the printing result and controlling the restarting of the system; if the system is restarted, a fault type is determined based on the print result.

Optionally, the above processor may further execute program code for: acquiring a preset printing length; and printing the stack area of the kernel stack and the stack area of the user stack byte by byte according to the preset printing length.

Optionally, the above processor may further execute program code for: and printing a processor register while printing the stack area of the kernel stack and the stack area of the user stack to obtain a printing result.

Optionally, the above processor may further execute program code for: storing the print result in any one or more of the following storage media: random access memory and external memory flash memory.

Optionally, the above processor may further execute program code for: analyzing the printing result and determining the address information of the system fault; disassembling the printing result to obtain function information corresponding to the address information; based on the function information, a fault type is determined.

Optionally, the above processor may further execute program code for: analyzing the printing result and determining a stack frame corresponding to the interrupt service routine; and acquiring data with a preset length from the stack top of the stack frame to obtain address information.

Optionally, the above processor may further execute program code for: after the system is restarted, the printing result is uploaded to the server, and the fault type is obtained by the server based on the printing result.

Optionally, the above processor may further execute program code for: judging whether the current count value of the counter exceeds a first preset value; if the current count value exceeds the first preset value, determining that the system has failed; if the current count value does not exceed the first preset value, determining that the system has not failed.

Optionally, the above processor may further execute program code for: and under the condition that the operation of the first preset thread is detected, the current count value of the counter is controlled to be increased by a second preset value, wherein the first preset thread operates at intervals of a preset time period.

Optionally, the above processor may further execute program code for: controlling the counter to be cleared when the second preset thread is detected to run.

Optionally, the above processor may further execute program code for: the operation priority of the first preset thread is highest, and the operation priority of the second preset thread is lowest.

Optionally, the above processor may further execute program code for: the first preset thread is a clock interrupt thread.

By adopting the embodiment of the application, if the system is detected to have faults, the stack area of the kernel stack and the stack area of the user stack are printed to obtain a printing result, the printing result is stored, the system is controlled to restart, and if the system is restarted successfully, the fault type is further determined based on the printing result, so that the purpose of accurately positioning the code paragraph causing the faults by utilizing the occurred hang fault is realized.

It will be appreciated by those skilled in the art that the configuration shown in fig. 8 is only illustrative, and the computer terminal may be a smart phone (such as an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a mobile internet device (Mobile Internet Devices, MID), a PAD, etc. Fig. 8 does not limit the structure of the above electronic device. For example, the computer terminal A may also include more or fewer components (such as a network interface, a display device, etc.) than shown in fig. 8, or have a different configuration than shown in fig. 8.

Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing hardware associated with a terminal device, the program may be stored in a computer readable storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.

Example 5

Embodiments of the present application also provide a storage medium. Alternatively, in this embodiment, the storage medium may be used to store the program code executed by the fault locating method provided in the first embodiment.

Alternatively, in this embodiment, the storage medium may be located in any one of the computer terminals in the computer terminal group in the computer network, or in any one of the mobile terminals in the mobile terminal group.

Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: if the system is detected to be faulty, printing a stack area of the kernel stack and a stack area of the user stack to obtain a printing result; storing the printing result and controlling the restarting of the system; if the system is restarted, a fault type is determined based on the print result.

Optionally, the storage medium is further arranged to store program code for performing the steps of: acquiring a preset printing length; and printing the stack area of the kernel stack and the stack area of the user stack byte by byte according to the preset printing length.

Optionally, the storage medium is further arranged to store program code for performing the steps of: and printing a processor register while printing the stack area of the kernel stack and the stack area of the user stack to obtain a printing result.

Optionally, the storage medium is further arranged to store program code for performing the steps of: storing the print result in any one or more of the following storage media: random access memory and external memory flash memory.

Optionally, the storage medium is further arranged to store program code for performing the steps of: analyzing the printing result and determining the address information of the system fault; disassembling the printing result to obtain function information corresponding to the address information; based on the function information, a fault type is determined.

Optionally, the storage medium is further arranged to store program code for performing the steps of: analyzing the printing result and determining a stack frame corresponding to the interrupt service routine; and acquiring data with a preset length from the stack top of the stack frame to obtain address information.

Optionally, the storage medium is further arranged to store program code for performing the steps of: after the system is restarted, the printing result is uploaded to the server, and the fault type is obtained by the server based on the printing result.

Optionally, the storage medium is further arranged to store program code for performing the steps of: judging whether the current count value of the counter exceeds a first preset value; if the current count value exceeds the first preset value, determining that the system has failed; if the current count value does not exceed the first preset value, determining that the system has not failed.

Optionally, the storage medium is further arranged to store program code for performing the steps of: and under the condition that the operation of the first preset thread is detected, the current count value of the counter is controlled to be increased by a second preset value, wherein the first preset thread operates at intervals of a preset time period.

Optionally, the storage medium is further arranged to store program code for performing the steps of: controlling the counter to be cleared when the second preset thread is detected to run.

Optionally, the storage medium is further arranged to store program code for performing the steps of: the operation priority of the first preset thread is highest, and the operation priority of the second preset thread is lowest.

Optionally, the storage medium is further arranged to store program code for performing the steps of: the first preset thread is a clock interrupt thread.

The foregoing embodiment numbers of the present application are merely for describing, and do not represent advantages or disadvantages of the embodiments.

In the foregoing embodiments of the present application, the description of each embodiment has its own emphasis; for a portion that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary. For example, the division of the units is merely a logical function division and may be implemented in another manner: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection shown or discussed between components may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations are also to be regarded as falling within the scope of protection of the present application.

Claims

1. A fault location method, comprising:

if the system is detected to be faulty, printing a stack area of the kernel stack and a stack area of the user stack to obtain a printing result;

storing the printing result and controlling the restarting of the system;

if the system is restarted, determining a fault type based on the printing result;

wherein the determining the fault type based on the print result includes: analyzing the printing result and determining address information of system faults; performing disassembly processing on the printing result to obtain function information corresponding to the address information; and determining the fault type based on the function information.
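As a non-limiting illustration of the analysis step recited above, the following minimal sketch resolves a fault address to the function that contains it and then maps that function to a fault label. A sorted symbol-table lookup stands in here for the disassembly processing; the symbol table, the function names, and the fault labels are all hypothetical and are not taken from the application.

```python
import bisect

# Hypothetical symbol table: (start address, function name), sorted by address.
SYMBOLS = [
    (0x80001000, "schedule"),
    (0x80001400, "spin_lock_wait"),
    (0x80002000, "wifi_sniff_loop"),
]

# Hypothetical mapping from a function name to a fault label.
FAULT_BY_FUNCTION = {
    "spin_lock_wait": "deadlock",
    "wifi_sniff_loop": "infinite loop",
}


def resolve_function(address):
    """Return the name of the function whose start address is the closest
    one at or below `address`, or None if the address precedes all symbols."""
    starts = [start for start, _ in SYMBOLS]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None
    return SYMBOLS[index][1]


def classify_fault(address):
    """Map the function containing `address` to a fault label."""
    return FAULT_BY_FUNCTION.get(resolve_function(address), "unknown")
```

For example, under these assumed tables, `classify_fault(0x80001404)` resolves the address to `spin_lock_wait` and reports `"deadlock"`.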

2. The method of claim 1, wherein printing the stack area of the kernel stack and the stack area of the user stack to obtain the print result comprises:

acquiring a preset printing length;

and printing the stack area of the kernel stack and the stack area of the user stack byte by byte according to the preset printing length.
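By way of a hedged example, byte-by-byte printing up to a preset length might look like the sketch below. The stack contents are passed in as plain byte strings, and the function names, the 16-bytes-per-line layout, and the section labels are assumptions made for illustration only.

```python
def dump_bytes(region, print_length, per_line=16):
    """Format the first `print_length` bytes of `region` as hex,
    one byte at a time, `per_line` bytes per output line."""
    data = region[:print_length]
    lines = []
    for offset in range(0, len(data), per_line):
        chunk = data[offset:offset + per_line]
        lines.append(f"{offset:04x}: " + " ".join(f"{b:02x}" for b in chunk))
    return "\n".join(lines)


def dump_stacks(kernel_stack, user_stack, print_length):
    """Produce one print result covering both stack areas."""
    return "\n".join([
        "[kernel stack]", dump_bytes(kernel_stack, print_length),
        "[user stack]", dump_bytes(user_stack, print_length),
    ])
```

Bounding the dump by a preset length keeps the print result small enough to store before the restart, at the cost of truncating deep call chains.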

3. The method of claim 2, wherein the print result is obtained by printing a processor register while printing a stack region of the kernel stack and a stack region of the user stack.

4. The method of claim 1, wherein storing the print result comprises:

storing the print result in any one or more of the following storage media: random access memory and external flash memory.

5. The method of claim 1, wherein parsing the print result to determine address information for a system failure comprises:

analyzing the printing result to determine a stack frame corresponding to the interrupt service routine;

and acquiring data with a preset length from the stack top of the stack frame to obtain the address information.
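The extraction of address information from the top of the stack frame can be sketched as follows. This sketch assumes that the stack frame of the interrupt service routine has already been located in the print result, that the stack top is at offset 0 of the frame bytes, and that the preset length is one machine word (4 or 8 bytes); the function name and the default word size and byte order are hypothetical.

```python
import struct


def read_address(frame, word_size=4, little_endian=True):
    """Read one machine word from the top of the stack frame (offset 0)
    and return it as an integer address."""
    if word_size not in (4, 8):
        raise ValueError("word_size must be 4 or 8")
    fmt = ("<" if little_endian else ">") + ("I" if word_size == 4 else "Q")
    (address,) = struct.unpack_from(fmt, frame, 0)
    return address
```

With a frame that begins with the bytes `04 14 00 80`, a 32-bit little-endian read yields the address `0x80001404`.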

6. The method of claim 1, wherein after the system is restarted, the print result is uploaded to a server and the failure type is obtained by the server based on the print result.

7. The method of claim 1, wherein detecting whether the system is malfunctioning comprises:

judging whether the current count value of the counter exceeds a first preset value;

if the current count value is determined to exceed the first preset value, determining that the system has failed;

and if the current count value is determined not to exceed the first preset value, determining that the system has not failed.

8. The method of claim 7, wherein, when a first preset thread is detected to be running, the current count value of the counter is controlled to increase by a second preset value, the first preset thread running once every preset time period.

9. The method of claim 8, wherein the counter is controlled to be cleared when a second preset thread is detected to be running.

10. The method of claim 9, wherein the first preset thread has a highest priority of operation and the second preset thread has a lowest priority of operation.

11. The method of claim 10, wherein the first preset thread is a clock interrupt thread.
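The counter-based detection of claims 7 to 11 can be illustrated by the simulation below, offered as a sketch rather than as the implementation of the application. A highest-priority periodic thread (for example, a clock interrupt thread) increments the counter, and a lowest-priority thread clears it; if the lowest-priority thread is starved for long enough, the counter exceeds the threshold and a fault is reported. The class and method names are hypothetical, and real threads and interrupts are replaced by direct method calls.

```python
class HangDetector:
    """Simulated hang detector driven by two preset threads."""

    def __init__(self, threshold, step=1):
        self.threshold = threshold  # the "first preset value"
        self.step = step            # the "second preset value"
        self.count = 0

    def on_tick(self):
        """Called each time the highest-priority periodic thread runs.
        Returns True when the counter exceeds the threshold (fault)."""
        self.count += self.step
        return self.count > self.threshold

    def on_idle(self):
        """Called each time the lowest-priority thread runs; reaching it
        shows the scheduler is still making progress, so clear the counter."""
        self.count = 0
```

Because the lowest-priority thread only runs when nothing else is monopolizing the processor, a counter that is never cleared suggests that some higher-priority code is stuck.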

12. A fault locating device comprising:

the printing module is used for printing the stack area of the kernel stack and the stack area of the user stack if the system is detected to be faulty, so as to obtain a printing result;

the storage module is used for storing the printing result and controlling the restarting of the system;

a determining module, configured to determine a fault type based on the printing result if the system is restarted;

the determining module is further configured to: analyzing the printing result and determining address information of system faults; performing disassembly processing on the printing result to obtain function information corresponding to the address information; and determining the fault type based on the function information.

13. A storage medium comprising a stored program, wherein the program, when run, controls a device on which the storage medium resides to perform the steps of: if the system is detected to be faulty, printing a stack area of the kernel stack and a stack area of the user stack to obtain a printing result; storing the printing result and controlling the restarting of the system; if the system is restarted, determining a fault type based on the printing result; wherein the determining the fault type based on the print result includes: analyzing the printing result and determining address information of system faults; performing disassembly processing on the printing result to obtain function information corresponding to the address information; and determining the fault type based on the function information.

14. A processor for running a program, wherein the program when run performs the steps of: if the system is detected to be faulty, printing a stack area of the kernel stack and a stack area of the user stack to obtain a printing result; storing the printing result and controlling the restarting of the system; if the system is restarted, determining a fault type based on the printing result; wherein the program, when run, determines a fault type based on the print result by: analyzing the printing result and determining address information of system faults; performing disassembly processing on the printing result to obtain function information corresponding to the address information; and determining the fault type based on the function information.

15. A fault location system, comprising:

a processor; and

a memory, coupled to the processor, for providing the processor with instructions for processing the following steps: if the system is detected to be faulty, printing a stack area of the kernel stack and a stack area of the user stack to obtain a printing result; storing the printing result and controlling the restarting of the system; if the system is restarted, determining a fault type based on the printing result; wherein the memory is configured to determine a fault type based on the print result by: analyzing the printing result and determining address information of system faults; performing disassembly processing on the printing result to obtain function information corresponding to the address information; and determining the fault type based on the function information.