CN112650610A - Linux system crash control method, system and medium - Google Patents
Linux system crash control method, system and medium
- Publication number
- CN112650610A
- Authority
- CN
- China
- Prior art keywords
- kernel
- linux
- crash
- system crash
- task
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1415—Saving, restoring, recovering or retrying at system level
- G06F11/1438—Restarting or rejuvenating
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a Linux system crash control method comprising: creating a kernel crash analysis thread and analyzing the cause of a system crash, classifying it as caused by a user, by hardware, or by software; creating a kernel crash avoidance thread, performing a test experiment on a software-caused system crash, and triggering that crash; reading the log code generated when the system crashes and writing it into a suspended task function for protection and shielding; and packaging the suspended task function into the Linux kernel, restarting the operating system, and re-entering the Linux kernel. In this way the system log can be obtained through a Linux command, crashes triggered by hardware problems can be distinguished from crashes triggered by software problems by analyzing the system log, and tasks blocked by heavy service traffic are protected and shielded so that they are not monitored. As a result, a large volume of service access data can be handled properly and software faults are avoided.
Description
Technical Field
The invention relates to the field of system anomaly analysis, and in particular to a method, a system and a medium for controlling crashes of a Linux system.
Background
In use, a server may crash because it goes down abnormally or because a Linux kernel panic is triggered. The trigger may be a hardware problem in the server; an external environmental problem, such as an ambient temperature that is too high or too low and trips the server's self-protection threshold; a virus from the external environment; or task blocking that brings the server system down. Whatever the cause, an abnormal restart or outage can have an immeasurable impact on customer experience or customer use.
kdump is a tool and service for dumping memory running parameters when the system crashes or deadlocks. When a kernel panic is triggered, it generates a vmcore file under /var/crash, and a Linux engineer analyzes the cause of the system crash from the generated vmcore-dmesg file and the vmcore.
However, there is currently no good solution to system crashes in customer use where heavy data access and excessive load, for example in banking services, block a Linux task and trigger a kernel panic.
Disclosure of Invention
The main technical problem solved by the invention is to provide a Linux system crash control method, system and medium that can obtain the system log through a Linux command, distinguish crashes triggered by hardware problems from crashes triggered by software problems by analyzing the system log, and set up a protection barrier around tasks blocked by heavy service traffic so that they are not monitored. A large volume of service access data can then be handled well, and software faults are avoided.
To solve the above technical problem, the invention adopts the following technical solution: a Linux system crash control method comprising the steps of creating a kernel crash analysis thread and analyzing the cause of the system crash:
if system crash information caused by user behavior exists in the system log, defining the system crash reason as the system crash caused by the user;
if the information in the system log has system crash information caused by hardware failure in the server, defining the system crash reason as the system crash caused by the hardware;
if the information in the system log comprises fault information or soft deadlock error information caused by system task blocking, defining the system crash reason as system crash caused by software;
creating a kernel crash avoidance thread, carrying out a test experiment on system crash caused by software, triggering the system crash, reading a log code generated when the system crashes, and writing the log code into a suspended task function;
and packaging the suspended task function into a Linux kernel, restarting the operating system, and re-entering the Linux kernel.
Further, the log code generated when the system crashes contains a task process; causing a system crash when the system runs a task process.
Further, the writing suspension task function comprises the following steps:
reading task processes and the number of the task processes in a log code generated when a system crashes;
and writing the task processes and the number of the task processes into a suspended task function.
Further, the step of packaging the task suspending function into the Linux kernel comprises the following steps:
clearing a compiling file and a configuration file generated in the compiling process of the Linux kernel;
clearing object files and executable files generated in the Linux kernel compiling process;
using an interface command to change a kernel configuration interface into a graphical mode, selecting a task suspending function, and compiling the task suspending function into a Linux kernel;
compiling the Linux kernel through a compiling kernel command;
installing a Linux kernel driver module by using an installation command;
and installing a Linux kernel.
A Linux system crash control system, comprising: the device comprises an analysis module, an avoidance module and a packaging module;
the analysis module checks the system log and determines whether the system crash was caused by a user, by hardware, or by software;
the avoidance module performs a test experiment on a software-caused system crash, triggers the system crash, reads the log code generated when the system crashes, and writes the log code into the suspended task function.
A computer-readable storage medium having stored thereon a computer program for executing the steps of a Linux system crash control method described above by a processor.
The beneficial effects of the invention are as follows: the invention can better analyze and locate the cause of system crashes brought on by heavy service data access, can distinguish restarts triggered by hardware problems from restarts triggered by software problems, and sets up a protection barrier around tasks blocked by heavy service traffic so that they are not monitored. Software faults are therefore avoided when the volume of service access data is large, and system crashes are avoided.
Drawings
FIG. 1 is a flowchart illustrating a method for controlling a crash of a Linux system according to a preferred embodiment of the present invention;
FIG. 2 is a schematic diagram of a Linux system crash control system according to the present invention.
Detailed Description
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that those skilled in the art can more easily understand the advantages and features of the invention and the scope of the invention is defined more clearly.
The embodiment of the invention comprises the following steps:
in a first aspect, referring to fig. 1, a Linux system crash control method includes:
creating a kernel crash analysis thread, which uses the Linux command # more /var/log/messages to display the time of each event, the host name of the system on which the event occurred, and the name of the program that generated the log message, and analyzes the cause of the system crash;
if the kernel crash analysis thread finds that the system log has the following display:
shutdown: shutting down for system reboot
init: Switching to runlevel: 6 exiting on signal 15
Got SIGTERM, quitting;
the cause of the system crash is a restart initiated by the user;
the kernel crash analysis thread then finds the specific commands last executed by the user:
the kernel crash analysis thread displays previously executed commands through lastcomm, checks the commands previously executed by the root user through lastcomm root, and analyzes the system log; if a command used or executed by a user caused the system crash, the crash is classified as a system crash caused by the user.
Checking the system log through a kernel crash analysis thread;
if the system log shows that:
CPU 1: Machine Check Exception: 4 Bank 4: ba00000000070f0f
Kernel panic - not syncing: Machine check
the message "Kernel panic - not syncing" indicates a system restart caused by a CPU hardware failure;
if the system log shows that:
kernel: CPUX: Temperature above threshold, cpu clock throttled
kernel: CPUX: Core power limit notification (total events = 1)
Power Button Pressed received event "button/power PWRF 00000000 00000000"
the restart was caused by the server overheating; the kernel crash analysis thread issues a prompt and suggests checking the cooling system of the data center and the fans of the server;
in these cases the cause of the system crash is a hardware fault;
other causes of hardware failure include bent CPU pins, a large number of uncorrectable ECC errors in memory, damaged disks or memory, and so on.
The kernel crash analysis thread uses kdump;
the kernel crash analysis thread starts kdump;
when the system crashes, kdump boots a capture kernel that records the current running information, and this kernel collects all running state and data in memory at that moment into a vmcore (virtual core) file;
the system crash recorded in the vmcore file is diagnosed with the Kernel Oops Analyzer to determine the cause of the crash.
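As a small illustration of the first step an engineer would take after a crash, the sketch below lists the dump files that kdump left behind. It assumes the dumps live under /var/crash as described in the Background section; the function name `find_vmcores` and the `vmcore*` file-name pattern are assumptions made for this example, not part of the patent.

```python
from pathlib import Path


def find_vmcores(crash_dir="/var/crash"):
    """Return all files whose name starts with 'vmcore' below crash_dir.

    The directory layout is an assumption; kdump configurations differ
    between distributions.
    """
    root = Path(crash_dir)
    if not root.is_dir():
        # No crash directory means no dump has been written (or kdump is off).
        return []
    return sorted(p for p in root.rglob("vmcore*") if p.is_file())
```

Calling `find_vmcores()` with no argument scans the default location; an empty list simply means no dump was found there.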
Checking the system log through a kernel crash analysis thread;
if the system log shows that:
"kernel: INFO: task xxx:60 blocked for more than 120 seconds."
the fault is caused by system task blocking;
if the system log shows that:
BUG: soft lockup - CPU#2 stuck for 67s! [vmmemctl:894]
BUG: soft lockup - CPU#5 stuck for 67s! [bdi-default:49]
BUG: soft lockup - CPU#3 stuck for 67s! [irqbalance:1351]
BUG: soft lockup - CPU#4 stuck for 67s! [swapper:0]
BUG: soft lockup - CPU#6 stuck for 67s! [watchdog/6:30]
BUG: soft lockup - CPU#5 stuck for 67s! [vmmemctl:894]
BUG: soft lockup - CPU#0 stuck for 67s! [events/0:35]
BUG: soft lockup - CPU#7 stuck for 67s! [lldpad:1459]
BUG: soft lockup - CPU#6 stuck for 67s! [mpt_poll_0:376]
BUG: soft lockup - CPU#4 stuck for 67s! [ksoftirqd/4:21]
this indicates that a driver in the system has a problem or that CPU resources are insufficient: the watchdog is too busy to run on time and cannot collect the usage data of each running logical CPU, so a soft lockup error is raised;
in these cases the cause of the system crash is a software fault;
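The three-way classification described above (user, hardware, software) can be sketched as a pattern match over log lines. This is only a minimal illustration built from the example messages quoted in this description; the pattern table and the function name `classify_crash` are assumptions, and a real analysis thread would need a far larger set of signatures.

```python
import re

# Signatures taken from the example log lines in this description.
# Whitespace is matched loosely because log formatting varies.
CAUSE_PATTERNS = [
    ("user", re.compile(
        r"shutting\s+down\s+for\s+system\s+reboot|Got\s+SIGTERM", re.I)),
    ("hardware", re.compile(
        r"Machine\s+Check\s+Exception|Temperature\s+above\s+threshold", re.I)),
    ("software", re.compile(
        r"blocked\s+for\s+more\s+than\s+\d+\s*seconds|soft\s+lockup", re.I)),
]


def classify_crash(log_lines):
    """Return the set of suspected crash causes found in log_lines."""
    causes = set()
    for line in log_lines:
        for cause, pattern in CAUSE_PATTERNS:
            if pattern.search(line):
                causes.add(cause)
    return causes
```

Returning a set rather than a single label is a deliberate choice here: one log may contain evidence of more than one cause, and the sketch leaves the final decision to the caller.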
to solve the system task blocking that causes software downtime, a kernel crash avoidance thread is created; the thread first performs a test experiment on a software-caused system crash, triggering a crash caused by system task blocking, and then reads the log code generated when the system crashes;
the log code generated when the system crashes is "kernel: INFO: task xxx:60 blocked for more than 120 seconds", and the task process xxx shown in the log code is the process that caused the system crash;
the task processes in the log code are read, and the task processes and the number of the task processes are written into the hung_task function for protection and shielding, so that these task processes are not monitored;
the kernel crash avoidance thread then packages the hung_task function into the Linux kernel, restarts the operating system, and re-enters the Linux kernel; the newly compiled Linux kernel avoids system crashes caused by task blocking.
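Reading the task process out of the log code can be sketched as follows. The regular expression follows the message quoted above, where the name before the colon is the task and the number after it is, in standard kernel messages, its process ID. The names `parse_hung_task` and `summarize` are assumptions for illustration, and because the description does not say how the result is written into the hung_task function, that step is not shown.

```python
import re

# Matches e.g. "INFO: task xxx:60 blocked for more than 120 seconds".
# \s* tolerates log lines whose spacing was lost in transcription.
HUNG_TASK_RE = re.compile(
    r"INFO:\s*task\s+(?P<name>\S+?):(?P<pid>\d+)\s*"
    r"blocked\s+for\s+more\s+than\s+(?P<secs>\d+)\s*seconds"
)


def parse_hung_task(line):
    """Return (task_name, pid, seconds) or None if the line does not match."""
    m = HUNG_TASK_RE.search(line)
    if m is None:
        return None
    return m.group("name"), int(m.group("pid")), int(m.group("secs"))


def summarize(lines):
    """Return the list of blocked tasks found in lines and how many there are."""
    tasks = [t for t in map(parse_hung_task, lines) if t is not None]
    return tasks, len(tasks)
```

`summarize` returns both the individual tasks and their count, since the description mentions writing "the task processes and the number of the task processes".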
Packaging the hung_task function into the Linux kernel comprises the following steps:
# make mrproper clears all intermediate files generated during compilation, including any previously created kernel configuration file (.config); that is, the old configuration file is deleted before a new build so that it does not affect the new kernel compilation;
# make clean clears the object files (suffix .o) and executable files generated by the previous build;
after waiting a few seconds, # make menuconfig turns the terminal into a graphical (menu-based) kernel configuration interface, in which the modified function module (the hung_task function) is selected and compiled into the kernel;
# make -j2 compiles the kernel; use -j4 if the computer is quad-core or -j8 if it is eight-core. A larger number after -j generally shortens compilation, up to roughly the number of CPU cores. This step generates the kernel modules, vmlinux and initrd.
After compilation succeeds, installing the kernel modules creates a subdirectory under the /lib/modules directory in which all loadable modules of the new kernel are stored (that is, the compiled modules are copied to /lib/modules).
# make install installs the kernel, that is, copies the .config, vmlinux, initrd.img and System.map files to the /boot directory and updates GRUB. The three GRUB files of the RedHat system are updated automatically, and the new kernel is booted by default.
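The build sequence above can be written down as an ordered command list. The sketch below only constructs and returns the commands, it does not run them. Two points are assumptions rather than statements from the text: the cleanup target is spelled `mrproper`, which is the standard kernel Makefile name, and the module installation step, which the description does not name, is shown as the standard `modules_install` target. The function name `kernel_build_commands` is hypothetical.

```python
import os


def kernel_build_commands(jobs=None):
    """Return the kernel build steps as argv lists, in execution order.

    jobs defaults to the number of CPUs reported by the OS, mirroring the
    -j2 / -j4 / -j8 advice in the description.
    """
    if jobs is None:
        jobs = os.cpu_count() or 1
    return [
        ["make", "mrproper"],         # remove old intermediates and .config
        ["make", "clean"],            # remove object files and executables
        ["make", "menuconfig"],       # interactive: select hung_task support
        ["make", f"-j{jobs}"],        # compile kernel and modules
        ["make", "modules_install"],  # assumed target; copies to /lib/modules
        ["make", "install"],          # copy kernel to /boot, update GRUB
    ]
```

Each entry could be passed to `subprocess.run(cmd, check=True)` from inside the kernel source tree; `menuconfig` is interactive, so a fully unattended build would need a prepared configuration instead.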
kdump, the memory dump tool, is a tool and service for dumping memory running parameters when the system crashes or deadlocks.
Kernel Oops Analyzer is a kernel crash analysis tool. The hung_task function is a self-protection module that detects whether any process in the system has stayed in the D state for longer than a specific (configurable) time; if such a process exists, a kernel panic is triggered, causing the server to crash and restart. The hung_task function checks all processes in a loop.
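To make the D state concrete, the sketch below parses records in the format of /proc/<pid>/stat and picks out processes in uninterruptible sleep. It is a user-space analogue for illustration only: the kernel's own detector runs inside the kernel and also tracks how long a task has been stuck, which this sketch does not attempt. In mainline kernels the timeout is exposed as the sysctl `kernel.hung_task_timeout_secs`. The function names are assumptions.

```python
def parse_proc_stat(text):
    """Parse one /proc/<pid>/stat record into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the first '(' and the last ')'.
    """
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left].strip())
    comm = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, comm, state


def uninterruptible(records):
    """Return (pid, comm) for every record whose state is D."""
    return [
        (pid, comm)
        for pid, comm, state in map(parse_proc_stat, records)
        if state == "D"
    ]
```

On a live system the records would come from reading each /proc/<pid>/stat file; a process seen in state D once is not necessarily hung, so a real check would sample repeatedly and compare against the configured timeout.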
Linux provides a watchdog implementation for monitoring system operation, consisting of a kernel watchdog module and a user-space watchdog program. The kernel watchdog module communicates with user space through the /dev/watchdog character device. Once the user-space program opens the /dev/watchdog device, a 1-minute timer is started in the kernel, and the program must then write data to the device within every 1-minute window. Each write resets the timer; if the user-space program performs no write within 1 minute, the timer expires and the system reboots.
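The user-space side of this protocol amounts to a periodic write. The following sketch shows that loop against any writable binary file object, so it can be exercised without touching real hardware. The function name `keep_alive`, the default interval and the beat count are assumptions; the interval only needs to be comfortably shorter than the 1-minute timeout described above.

```python
import time


def keep_alive(device, interval_s=10.0, beats=6, sleep=time.sleep):
    """Write to an open watchdog device `beats` times, `interval_s` apart.

    `device` is any binary file-like object; `sleep` is injectable so the
    loop can be tested without waiting.
    """
    for i in range(beats):
        device.write(b"\0")  # any write resets the kernel timer
        device.flush()
        if i < beats - 1:
            sleep(interval_s)
```

On a test machine this could be driven with `open("/dev/watchdog", "wb", buffering=0)`. What happens when the device is closed depends on the driver and kernel configuration, so it should not be tried on a production server.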
In a second aspect, based on the same inventive concept as the Linux system crash control method in the foregoing embodiment, an embodiment of this specification further provides a Linux system crash control system comprising an analysis module, an avoidance module and a packaging module;
the analysis module checks the system log and determines whether the system crash was caused by a user, by hardware, or by software;
the avoidance module performs a test experiment on a software-caused system crash and triggers that crash; it then reads the log code generated when the system crashes, reads the task processes in the log code, and writes the task processes and the number of the task processes into the hung_task function for protection and shielding, so that they are not monitored;
and the packaging module packages the hung_task function from the avoidance module into the Linux kernel.
In a third aspect, based on the same inventive concept as the Linux system crash control method in the foregoing embodiments, the present specification embodiment further provides a computer readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the Linux system crash control method.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (6)
1. A Linux system crash control method is characterized by comprising the steps of creating a kernel crash analysis thread and analyzing the cause of system crash; the reasons for system crash include system crash caused by user, system crash caused by hardware and system crash caused by software;
creating a kernel crash avoidance thread, carrying out a test experiment on system crash caused by software, triggering the system crash, reading a log code generated when the system crashes, and writing the log code into a suspended task function;
and packaging the suspended task function into a Linux kernel, restarting the operating system, and re-entering the Linux kernel.
2. The Linux system crash control method of claim 1, wherein: the log code generated when the system crashes comprises a task process; causing a system crash when the system runs a task process.
3. The Linux system crash control method of claim 2, wherein: the writing suspension task function comprises the following steps:
reading task processes and the number of the task processes in a log code generated when a system crashes;
and writing the task processes and the number of the task processes into a suspended task function.
4. The Linux system crash control method of claim 2, wherein: the step of packaging the task suspending function into the Linux kernel comprises the following steps:
clearing a compiling file and a configuration file generated in the compiling process of the Linux kernel;
clearing object files and executable files generated in the Linux kernel compiling process;
using an interface command to change a kernel configuration interface into a graphical mode, selecting a task suspending function, and compiling the task suspending function into a Linux kernel;
compiling the Linux kernel through a compiling kernel command;
installing a Linux kernel driver module by using an installation command;
and installing a Linux kernel.
5. A Linux system crash control system, comprising: an analysis module, an avoidance module and a packaging module;
the analysis module checks the system log and determines whether the system crash was caused by a user, by hardware, or by software;
the avoidance module performs a test experiment on a software-caused system crash, triggers the system crash, reads a log code generated when the system crashes, and writes the log code into a suspended task function;
and the packaging module packages the suspended task function in the avoidance module into a Linux kernel.
6. A computer-readable storage medium having a computer program stored thereon, wherein the computer program is executed by a processor to perform the steps of the Linux system crash control method of any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011462215.7A CN112650610B (en) | 2020-12-11 | 2020-12-11 | Linux system crash control method, system and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011462215.7A CN112650610B (en) | 2020-12-11 | 2020-12-11 | Linux system crash control method, system and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112650610A true CN112650610A (en) | 2021-04-13 |
CN112650610B CN112650610B (en) | 2023-01-10 |
Family
ID=75353715
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011462215.7A Active CN112650610B (en) | 2020-12-11 | 2020-12-11 | Linux system crash control method, system and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112650610B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114706708A (en) * | 2022-05-24 | 2022-07-05 | 北京拓林思软件有限公司 | Fault analysis method and system for Linux operating system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102929747A (en) * | 2012-11-05 | 2013-02-13 | 中标软件有限公司 | Method for treating crash dump of Linux operation system based on loongson server |
CN106339285A (en) * | 2016-08-19 | 2017-01-18 | 浪潮电子信息产业股份有限公司 | Analysis method for accidental restart of LINUX system |
CN106959909A (en) * | 2017-03-27 | 2017-07-18 | 西安电子科技大学 | A kind of application software abnormal restoring method in android system |
-
2020
- 2020-12-11 CN CN202011462215.7A patent/CN112650610B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102929747A (en) * | 2012-11-05 | 2013-02-13 | 中标软件有限公司 | Method for treating crash dump of Linux operation system based on loongson server |
CN106339285A (en) * | 2016-08-19 | 2017-01-18 | 浪潮电子信息产业股份有限公司 | Analysis method for accidental restart of LINUX system |
CN106959909A (en) * | 2017-03-27 | 2017-07-18 | 西安电子科技大学 | A kind of application software abnormal restoring method in android system |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114706708A (en) * | 2022-05-24 | 2022-07-05 | 北京拓林思软件有限公司 | Fault analysis method and system for Linux operating system |
CN114706708B (en) * | 2022-05-24 | 2022-08-30 | 北京拓林思软件有限公司 | Fault analysis method and system for Linux operating system |
Also Published As
Publication number | Publication date |
---|---|
CN112650610B (en) | 2023-01-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022160756A1 (en) | Server fault positioning method, apparatus and system, and computer-readable storage medium | |
KR101036702B1 (en) | Method, system, and apparatus for providing custom product support for a software program based upon states of program execution instability | |
US7243267B2 (en) | Automatic failure detection and recovery of applications | |
US6457142B1 (en) | Method and apparatus for target application program supervision | |
US5948112A (en) | Method and apparatus for recovering from software faults | |
US6502208B1 (en) | Method and system for check stop error handling | |
US7363546B2 (en) | Latent fault detector | |
KR20060046281A (en) | Method, system, and apparatus for identifying unresponsive portions of a computer program | |
US8984335B2 (en) | Core diagnostics and repair | |
WO1995022794A1 (en) | System for automatic recovery from software problems that cause computer failure | |
CN108292342B (en) | Notification of intrusions into firmware | |
Murphy | Automating Software Failure Reporting: We can only fix those bugs we know about. | |
US11314610B2 (en) | Auto-recovery for software systems | |
CN110457907B (en) | Firmware program detection method and device | |
CN112650610B (en) | Linux system crash control method, system and medium | |
US7340594B2 (en) | Bios-level incident response system and method | |
KR20180134677A (en) | Method and apparatus for fault injection test | |
US9009671B2 (en) | Crash notification between debuggers | |
KR100358278B1 (en) | Method of Self-Diagnosis and Self-Restoration of System Error and A Computer System Using The Same | |
JPH02294739A (en) | Fault detecting system | |
Tröger et al. | WAP: What activates a bug? A refinement of the Laprie terminology model | |
CN113127245B (en) | Method, system and device for processing system management interrupt | |
JP4269362B2 (en) | Computer system | |
CN115599645A (en) | Method and device for testing stability of linux drive module | |
CN114217925A (en) | Business program operation monitoring method and system for realizing abnormal automatic restart |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||