CN112650610A - Linux system crash control method, system and medium - Google Patents

Publication number
CN112650610A
Authority
CN
China
Prior art keywords
kernel
linux
crash
system crash
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011462215.7A
Other languages
Chinese (zh)
Other versions
CN112650610B (en)
Inventor
史慧娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-12-11
Filing date
2020-12-11
Publication date
2021-04-13
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202011462215.7A
Publication of CN112650610A
Application granted
Publication of CN112650610B
Legal status
Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1438 Restarting or rejuvenating

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a Linux system crash control method comprising the following steps: creating a kernel crash analysis thread and analyzing the cause of a system crash, which may be a crash caused by a user, by hardware, or by software; creating a kernel crash avoidance thread, running a test experiment for a software-caused system crash, and triggering that crash; reading the log code generated when the system crashes and writing it into the hung task function for protection and shielding; and packaging the hung task function into the Linux kernel, restarting the operating system, and re-entering the Linux kernel. In this way, the system log can be obtained through Linux commands, crashes triggered by hardware problems can be distinguished from those triggered by software problems by analyzing the system log, and task blocking caused by heavy service volume is protected and shielded so that it is not monitored. When the volume of service access data is large, it can then be handled properly and software faults are avoided.

Description

Linux system crash control method, system and medium
Technical Field
The invention relates to the field of system anomaly analysis, and in particular to a Linux system crash control method, system, and medium.
Background
In use, a server may crash because it goes down abnormally or because a Linux kernel panic is triggered. Such a crash may be caused by a hardware problem in the server; by the external environment, for example an ambient temperature that is too high or too low and trips the server's self-protection threshold; by a virus from the external environment; or by task blocking that brings the server system down. Whatever the cause, an abnormal restart or outage can seriously affect the customer's experience or use of the system.
kdump is a tool and service for dumping memory runtime parameters when the system crashes, deadlocks, or hangs. When a kernel panic is triggered, it generates a vmcore file under /var/crash, and a Linux engineer analyzes the cause of the crash from the generated vmcore-dmesg file and the vmcore file.
However, there is currently no good solution to system crashes that arise in customer use, for example in banking services, when a large volume of data access and an excessive load block Linux tasks and trigger a kernel panic.
Disclosure of Invention
The main technical problem solved by the invention is to provide a Linux system crash control method, system, and medium that can obtain the system log through Linux commands, distinguish system crashes triggered by hardware problems from those triggered by software problems by analyzing the system log, and set up a protective barrier around task blocking caused by service volume so that it is not monitored. A large volume of service access data can then be handled well and software faults are avoided.
To solve the above technical problem, the invention adopts the following technical solution. The Linux system crash control method comprises the following steps: creating a kernel crash analysis thread and analyzing the cause of the system crash:
if the system log contains system crash information caused by user behavior, the cause is defined as a user-caused system crash;
if the system log contains system crash information caused by a hardware failure in the server, the cause is defined as a hardware-caused system crash;
if the system log contains fault information caused by system task blocking, or soft lockup error information, the cause is defined as a software-caused system crash;
creating a kernel crash avoidance thread, running a test experiment for a software-caused system crash, triggering the system crash, reading the log code generated when the system crashes, and writing the log code into the hung task function;
and packaging the hung task function into the Linux kernel, restarting the operating system, and re-entering the Linux kernel.
Further, the log code generated when the system crashes contains a task process; the system crashes when it runs that task process.
Further, writing to the hung task function comprises the following steps:
reading the task processes, and the number of task processes, from the log code generated when the system crashes;
and writing the task processes and the number of task processes into the hung task function.
Further, packaging the hung task function into the Linux kernel comprises the following steps:
clearing the intermediate files and configuration files generated during Linux kernel compilation;
clearing the object files and executable files generated during Linux kernel compilation;
using an interface command to switch the kernel configuration interface to graphical mode, selecting the hung task function, and compiling it into the Linux kernel;
compiling the Linux kernel with the kernel compile command;
installing the Linux kernel driver modules with the installation command;
and installing the Linux kernel.
A Linux system crash control system comprises an analysis module, an avoidance module, and a packaging module;
the analysis module checks the system log and determines whether the system crash was caused by a user, by hardware, or by software;
the avoidance module runs a test experiment for a software-caused system crash, triggers the system crash, reads the log code generated when the system crashes, and writes the log code into the hung task function;
and the packaging module packages the hung task function from the avoidance module into the Linux kernel.
A computer-readable storage medium stores a computer program that, when executed by a processor, performs the steps of the Linux system crash control method described above.
The beneficial effects of the invention are as follows: the invention can better analyze and locate the cause of system crashes arising from a large volume of service data access, can distinguish restarts triggered by hardware problems from those triggered by software problems, and sets up a protective barrier around task blocking caused by service volume so that it is not monitored. Software faults, and with them system crashes, can thus be avoided when the volume of service access data is large.
Drawings
FIG. 1 is a flowchart illustrating a method for controlling a crash of a Linux system according to a preferred embodiment of the present invention;
FIG. 2 is a schematic diagram of a Linux system crash control system according to the present invention.
Detailed Description
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the invention can be more easily understood by those skilled in the art and the scope of protection of the invention is defined more clearly.
The embodiments of the invention are as follows:
In a first aspect, referring to FIG. 1, a Linux system crash control method includes:
creating a kernel crash analysis thread, which uses the Linux command # more /var/log/messages to display the time at which each event occurred, the host name of the system on which it occurred, and the name of the program that generated the log message, and then analyzes the cause of the system crash;
if the kernel crash analysis thread finds the following entries in the system log:
shutdown: shutting down for system reboot
init: Switching to runlevel: 6
exiting on signal 15
Got SIGTERM, quitting
the cause of the system crash is a restart initiated by the user;
the kernel crash analysis thread then identifies the commands that the previous user actually executed:
it displays the commands used by previous users through lastcomm, checks the commands executed by the root user through lastcomm root, and analyzes the system log; if a command used or executed by the user caused the system crash, the crash is classified as a user-caused system crash.
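The check for a user-initiated restart can be sketched as a scan of the system log for the messages quoted above. This is a minimal illustration, not the patented implementation: the function name and the regular expressions are assumptions, and only the three marker strings come from the log excerpt in the text.

```python
import re

# Marker strings taken from the log excerpt in the text; the regular
# expressions and the function name are illustrative assumptions.
USER_RESTART_MARKERS = (
    re.compile(r"shutting down for system reboot", re.IGNORECASE),
    re.compile(r"Switching to runlevel:\s*6", re.IGNORECASE),
    re.compile(r"Got SIGTERM,\s*quitting", re.IGNORECASE),
)


def is_user_initiated_restart(log_lines):
    """Return True if any log line contains a user-restart marker."""
    return any(
        pattern.search(line)
        for line in log_lines
        for pattern in USER_RESTART_MARKERS
    )
```

In practice the lines would come from /var/log/messages; passing them in as a list keeps the sketch independent of the log location.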
The system log is checked by the kernel crash analysis thread;
if the system log shows:
CPU 1: Machine Check Exception: 4 Bank 4: ba00000000070f0f
Kernel panic - not syncing: Machine check
the "Kernel panic - not syncing" message indicates that the system restarted because of a CPU hardware failure;
if the system log shows:
kernel: CPUX: Temperature above threshold, cpu clock throttled
kernel: CPUX: Core power limit notification (total events = 1)
Power Button Pressed received event "button/power PWRF 00000000 00000000"
the restart was caused by the server overheating, and the kernel crash analysis thread issues a prompt suggesting that the data center cooling system and the server fans be checked;
in both cases the cause of the system crash is a hardware failure;
other causes of hardware failure include bent CPU pins, a large number of uncorrectable ECC errors in memory, and a damaged disk or memory.
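The hardware checks described above can be sketched in the same way. The category names and patterns below are assumptions made for illustration; only the log messages they match are taken from the text.

```python
import re

# Category names are hypothetical; the matched messages come from the
# log excerpts quoted in the text.
HARDWARE_MARKERS = {
    "cpu_machine_check": re.compile(
        r"Machine Check Exception"
        r"|Kernel panic\s*-\s*not syncing:\s*Machine check",
        re.IGNORECASE,
    ),
    "overheating": re.compile(r"Temperature above threshold", re.IGNORECASE),
}


def find_hardware_faults(log_lines):
    """Return the names of the hardware fault categories seen in the log."""
    found = []
    for name, pattern in HARDWARE_MARKERS.items():
        if any(pattern.search(line) for line in log_lines):
            found.append(name)
    return found
```

An "overheating" result corresponds to the case in which the analysis thread suggests checking the data center cooling system and the server fans.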
The kernel crash analysis thread uses kdump;
the kernel crash analysis thread starts kdump;
when the system crashes, kdump boots a capture kernel that records the current runtime information; this kernel collects all runtime state and data held in memory at that moment into a vmcore file;
the system crash recorded in the vmcore file is diagnosed with the Kernel Oops Analyzer to determine the cause of the crash.
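Before relying on kdump, it can help to confirm that a capture kernel is loaded and to locate any dumps already written. The sketch below is an assumption-laden illustration: it reads the sysfs flag /sys/kernel/kexec_crash_loaded, which Linux kernels built with kexec crash support expose, and searches the /var/crash directory named in the Background section; the function names are hypothetical.

```python
from pathlib import Path


def crash_kernel_loaded(flag_path="/sys/kernel/kexec_crash_loaded"):
    """Return True if the sysfs flag reports a loaded capture kernel."""
    try:
        return Path(flag_path).read_text().strip() == "1"
    except OSError:
        # Flag file missing or unreadable: treat as "not loaded".
        return False


def list_vmcores(crash_dir="/var/crash"):
    """Return the vmcore files found under the crash directory, sorted."""
    root = Path(crash_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("vmcore") if p.is_file())
```

Both paths are parameters so that the functions can be pointed at a different dump location or exercised against a temporary directory.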
The system log is checked by the kernel crash analysis thread;
if the system log shows:
"kernel: INFO: task:60 blocked for more than 120 seconds."
the fault is caused by system task blocking;
if the system log shows:
BUG: soft lockup - CPU#2 stuck for 67s! [vmmemctl:894]
BUG: soft lockup - CPU#5 stuck for 67s! [bdi-default:49]
BUG: soft lockup - CPU#3 stuck for 67s! [irqbalance:1351]
BUG: soft lockup - CPU#4 stuck for 67s! [swapper:0]
BUG: soft lockup - CPU#6 stuck for 67s! [watchdog/6:30]
BUG: soft lockup - CPU#5 stuck for 67s! [vmmemctl:894]
BUG: soft lockup - CPU#0 stuck for 67s! [events/0:35]
BUG: soft lockup - CPU#7 stuck for 67s! [lldpad:1459]
BUG: soft lockup - CPU#6 stuck for 67s! [mpt_poll_0:376]
BUG: soft lockup - CPU#4 stuck for 67s! [ksoftirqd/4:21]
a driver in the system has a problem or CPU resources are insufficient, so the watchdog is too busy to run on time and cannot collect usage data for each running logical CPU, and a soft lockup error is thrown;
in both cases the cause of the system crash is a software failure;
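The soft lockup messages above carry the CPU number, the stall duration, and the task name and PID. A small parser, sketched here under the assumption that the lines follow the format shown in the text, makes those fields available for analysis; the function and field names are illustrative.

```python
import re

# Tolerates both the spaced and the unspaced forms of the message.
SOFT_LOCKUP = re.compile(
    r"BUG:\s*soft lockup\s*-\s*CPU#(?P<cpu>\d+)\s*stuck for\s*"
    r"(?P<secs>\d+)s!\s*\[(?P<task>[^\]:]+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(log_lines):
    """Return one dict per soft lockup message found in the log lines."""
    events = []
    for line in log_lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            events.append({
                "cpu": int(match.group("cpu")),
                "secs": int(match.group("secs")),
                "task": match.group("task"),
                "pid": int(match.group("pid")),
            })
    return events
```

Any non-empty result would place the crash in the software-caused category described in the text.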
To solve the problem of software-caused downtime due to system task blocking, a kernel crash avoidance thread is created. The thread first runs a test experiment for a software-caused system crash, triggering a system crash through system task blocking, and then reads the log code generated when the system crashes;
the log code generated when the system crashes is "kernel: INFO: task xxx:60 blocked for more than 120 seconds", and the task process shown in it, task xxx, is the process that caused the system crash;
the task processes in the log code are read, and the task processes and the number of task processes are written into the hung_task function for protection and shielding, so that these task processes are no longer monitored;
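The step of reading the task processes and their number from the log code can be sketched as follows. The text does not specify how the resulting list is embedded in the hung_task function, so this example covers only the extraction; the function name and return shape are assumptions.

```python
import re

# Matches e.g. "kernel: INFO: task xxx:60 blocked for more than 120 seconds".
HUNG_TASK = re.compile(
    r"INFO:\s*task\s+(?P<name>\S+?):(?P<pid>\d+)\s*"
    r"blocked for more than\s*(?P<secs>\d+)\s*seconds"
)


def extract_blocked_tasks(log_lines):
    """Return (task_names, count): distinct blocked task names in order
    of first appearance, and how many distinct names there are."""
    names = []
    for line in log_lines:
        match = HUNG_TASK.search(line)
        if match and match.group("name") not in names:
            names.append(match.group("name"))
    return names, len(names)
```

The returned names and count correspond to the "task processes and the number of task processes" that the avoidance thread writes into the hung_task function.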
the kernel crash avoidance thread then packages the hung_task function into the Linux kernel, restarts the operating system, and re-enters the Linux kernel; the newly compiled Linux kernel avoids system crashes caused by task blocking.
Packaging the hung_task function into the Linux kernel comprises the following steps:
# make mrproper clears all intermediate files generated during compilation, including any previously created kernel configuration file (.config); that is, the old configuration file is deleted before a new build so that it does not affect the new kernel build;
# make clean clears the object files (files with the .o suffix) and executable files generated by the previous build;
after a few seconds, # make menuconfig turns the terminal into a graphical kernel configuration interface, in which the modified function module (the hung_task function) is selected and compiled into the kernel;
# make -j2 compiles the kernel; use -j4 on a quad-core computer or -j8 on an eight-core computer. A larger number after -j generally shortens compilation, up to the number of available cores. This step generates the kernel modules, vmlinux, and initrd;
after compilation succeeds, the module installation step creates a subdirectory under /lib/modules that holds all loadable modules of the new kernel (that is, the compiled modules are copied to /lib/modules);
# make install installs the kernel, that is, it copies the .config, vmlinux, initrd.img, and System.map files to the /boot directory and updates grub. On a RedHat system, three grub files are updated automatically and the new kernel is booted by default.
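The build sequence above can be collected into an ordered command list. This is a hedged sketch: the function names are hypothetical, the -j value is derived from the CPU count as the text suggests, and make modules_install is an assumption for the module installation step, which the text describes without naming the command.

```python
import os
import subprocess


def kernel_build_steps(jobs=None):
    """Return the make invocations for the build, in order."""
    if jobs is None:
        jobs = os.cpu_count() or 1
    return [
        ["make", "mrproper"],         # remove intermediates and old .config
        ["make", "clean"],            # remove object files and executables
        ["make", "menuconfig"],       # interactive configuration
        ["make", f"-j{jobs}"],        # compile the kernel
        ["make", "modules_install"],  # assumed module installation command
        ["make", "install"],          # install kernel, update boot loader
    ]


def run_steps(steps, source_dir, dry_run=True):
    """Print each step, or run it in source_dir when dry_run is False."""
    for argv in steps:
        if dry_run:
            print("would run:", " ".join(argv))
        else:
            subprocess.run(argv, cwd=source_dir, check=True)
```

The dry-run default is deliberate: the install steps need root privileges and replace the boot kernel, so running them should be an explicit choice.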
kdump, the memory dump tool, is a tool and service for dumping memory runtime parameters when the system crashes, deadlocks, or hangs.
Kernel Oops Analyzer is a kernel crash analysis tool. The hung_task function is a self-protection module that detects whether any process in the system has been in the D state for longer than a specified time (the duration is configurable); if such a process exists, a kernel panic is triggered and the server crashes and restarts. The hung_task function checks all processes in a loop.
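The detection rule described above, together with the shielding idea of the method, can be modelled in user space. This is a simplified model and not kernel code: the TaskSample type, the shielded_names parameter, and the 120-second default (taken from the log message quoted earlier) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskSample:
    name: str
    pid: int
    state: str           # one-letter state; "D" is uninterruptible sleep
    blocked_secs: float  # how long the task has stayed in that state


def find_hung_tasks(samples, timeout_secs=120, shielded_names=frozenset()):
    """Return tasks in D state beyond the timeout, skipping shielded names."""
    return [
        task
        for task in samples
        if task.state == "D"
        and task.blocked_secs > timeout_secs
        and task.name not in shielded_names
    ]
```

With an empty shielded set the model reports every task that exceeds the timeout; adding a task name to the set mirrors the "protection and shielding" step, in which that task is no longer monitored.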
Linux provides a watchdog implementation for monitoring system operation, consisting of a kernel watchdog module and a user-space watchdog program. The kernel watchdog module communicates with user space through the /dev/watchdog character device. Once the user-space program opens /dev/watchdog, a 1-minute timer starts in the kernel, and the user-space program must then write data to the device before the timer runs out. Each write resets the timer; if the user-space program performs no write within 1 minute, the timer expires and the system reboots.
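The timer behaviour described above can be illustrated with a small simulation. It deliberately does not open /dev/watchdog, since doing so on a real machine can cause a reboot; the class name and the explicit clock argument are assumptions that keep the sketch deterministic.

```python
class SimulatedWatchdog:
    """Models a watchdog timer that is reset by every write."""

    def __init__(self, timeout_secs=60.0, now=0.0):
        self.timeout_secs = timeout_secs
        # Opening the device starts the timer.
        self.deadline = now + timeout_secs

    def write(self, now):
        """A write from user space resets the timer."""
        self.deadline = now + self.timeout_secs

    def expired(self, now):
        """True once the timer has run out, i.e. a reboot would occur."""
        return now >= self.deadline
```

For example, a watchdog opened at time 0 and written at time 50 would expire at time 110 rather than 60, which is the reset behaviour the paragraph describes.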
In a second aspect, based on the same inventive concept as the Linux system crash control method of the foregoing embodiment, an embodiment of this specification further provides a Linux system crash control system comprising an analysis module, an avoidance module, and a packaging module;
the analysis module checks the system log and determines whether the system crash was caused by a user, by hardware, or by software;
the avoidance module runs a test experiment for a software-caused system crash and triggers that crash; it then reads the log code generated when the system crashes, reads the task processes in the log code, and writes the task processes and the number of task processes into the hung_task function for protection and shielding, so that they are no longer monitored;
and the packaging module packages the hung_task function from the avoidance module into the Linux kernel.
In a third aspect, based on the same inventive concept as the Linux system crash control method of the foregoing embodiment, an embodiment of this specification further provides a computer-readable storage medium storing a computer program that, when executed by a processor, performs the steps of the Linux system crash control method.
The above is only an embodiment of the present invention and does not limit its scope. Any equivalent structure or equivalent process derived from this specification and the drawings, whether applied directly or indirectly in other related technical fields, falls within the scope of protection of the present invention.

Claims (6)

1. A Linux system crash control method, characterized by comprising: creating a kernel crash analysis thread and analyzing the cause of a system crash, the causes of a system crash including a user-caused system crash, a hardware-caused system crash, and a software-caused system crash;
creating a kernel crash avoidance thread, running a test experiment for a software-caused system crash, triggering the system crash, reading the log code generated when the system crashes, and writing the log code into a hung task function;
and packaging the hung task function into the Linux kernel, restarting the operating system, and re-entering the Linux kernel.
2. The Linux system crash control method of claim 1, wherein the log code generated when the system crashes comprises a task process, and the system crashes when it runs that task process.
3. The Linux system crash control method of claim 2, wherein writing to the hung task function comprises the following steps:
reading the task processes, and the number of task processes, from the log code generated when the system crashes;
and writing the task processes and the number of task processes into the hung task function.
4. The Linux system crash control method of claim 2, wherein packaging the hung task function into the Linux kernel comprises the following steps:
clearing the intermediate files and configuration files generated during Linux kernel compilation;
clearing the object files and executable files generated during Linux kernel compilation;
using an interface command to switch the kernel configuration interface to graphical mode, selecting the hung task function, and compiling it into the Linux kernel;
compiling the Linux kernel with the kernel compile command;
installing the Linux kernel driver modules with the installation command;
and installing the Linux kernel.
5. A Linux system crash control system, comprising an analysis module, an avoidance module, and a packaging module;
the analysis module checks the system log and determines whether the system crash was caused by a user, by hardware, or by software;
the avoidance module runs a test experiment for a software-caused system crash, triggers the system crash, reads the log code generated when the system crashes, and writes the log code into a hung task function;
and the packaging module packages the hung task function from the avoidance module into the Linux kernel.
6. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the Linux system crash control method of any one of claims 1-4.
CN202011462215.7A 2020-12-11 2020-12-11 Linux system crash control method, system and medium Active CN112650610B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011462215.7A CN112650610B (en) 2020-12-11 2020-12-11 Linux system crash control method, system and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011462215.7A CN112650610B (en) 2020-12-11 2020-12-11 Linux system crash control method, system and medium

Publications (2)

Publication Number Publication Date
CN112650610A (en) 2021-04-13
CN112650610B (en) 2023-01-10

Family

ID=75353715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011462215.7A Active CN112650610B (en) 2020-12-11 2020-12-11 Linux system crash control method, system and medium

Country Status (1)

Country Link
CN (1) CN112650610B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114706708A (en) * 2022-05-24 2022-07-05 北京拓林思软件有限公司 Fault analysis method and system for Linux operating system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929747A (en) * 2012-11-05 2013-02-13 中标软件有限公司 Method for treating crash dump of Linux operation system based on loongson server
CN106339285A (en) * 2016-08-19 2017-01-18 浪潮电子信息产业股份有限公司 Analysis method for accidental restart of LINUX system
CN106959909A (en) * 2017-03-27 2017-07-18 西安电子科技大学 A kind of application software abnormal restoring method in android system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114706708A (en) * 2022-05-24 2022-07-05 北京拓林思软件有限公司 Fault analysis method and system for Linux operating system
CN114706708B (en) * 2022-05-24 2022-08-30 北京拓林思软件有限公司 Fault analysis method and system for Linux operating system

Also Published As

Publication number Publication date
CN112650610B (en) 2023-01-10

Similar Documents

Publication Publication Date Title
WO2022160756A1 (en) Server fault positioning method, apparatus and system, and computer-readable storage medium
KR101036702B1 (en) Method, system, and apparatus for providing custom product support for a software program based upon states of program execution instability
US7243267B2 (en) Automatic failure detection and recovery of applications
US6457142B1 (en) Method and apparatus for target application program supervision
US5948112A (en) Method and apparatus for recovering from software faults
US6502208B1 (en) Method and system for check stop error handling
US7363546B2 (en) Latent fault detector
KR20060046281A (en) Method, system, and apparatus for identifying unresponsive portions of a computer program
US8984335B2 (en) Core diagnostics and repair
WO1995022794A1 (en) System for automatic recovery from software problems that cause computer failure
CN108292342B (en) Notification of intrusions into firmware
Murphy Automating Software Failure Reporting: We can only fix those bugs we know about.
US11314610B2 (en) Auto-recovery for software systems
CN110457907B (en) Firmware program detection method and device
CN112650610B (en) Linux system crash control method, system and medium
US7340594B2 (en) Bios-level incident response system and method
KR20180134677A (en) Method and apparatus for fault injection test
US9009671B2 (en) Crash notification between debuggers
KR100358278B1 (en) Method of Self-Diagnosis and Self-Restoration of System Error and A Computer System Using The Same
JPH02294739A (en) Fault detecting system
Tröger et al. WAP: What activates a bug? A refinement of the Laprie terminology model
CN113127245B (en) Method, system and device for processing system management interrupt
JP4269362B2 (en) Computer system
CN115599645A (en) Method and device for testing stability of linux drive module
CN114217925A (en) Business program operation monitoring method and system for realizing abnormal automatic restart

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant