CN112650610A - Linux system crash control method, system and medium

Info

Publication number: CN112650610A
Application number: CN202011462215.7A
Authority: CN
Inventors: 史慧娟
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2020-12-11
Filing date: 2020-12-11
Publication date: 2021-04-13
Anticipated expiration: 2040-12-11
Also published as: CN112650610B

Abstract

The invention discloses a Linux system crash control method comprising: creating a kernel crash analysis thread that analyzes the cause of a system crash and determines whether the crash was caused by a user, by hardware, or by software; creating a kernel crash avoidance thread that runs a test experiment to trigger a software-induced system crash; reading the log code generated when the system crashes and writing it into a suspended task function for protection and shielding; and packaging the suspended task function into the Linux kernel, restarting the operating system, and re-entering the Linux kernel. In this way, the system log can be obtained through Linux commands, crashes triggered by hardware problems can be distinguished from crashes triggered by software problems by analyzing the system log, and tasks blocked by heavy service volume are protected and shielded so that they are not monitored. Heavy service access loads can therefore be handled properly and software faults avoided.

Description

Linux system crash control method, system and medium

Technical Field

The invention relates to the field of system anomaly analysis, and in particular to a method, a system, and a medium for controlling crashes of a Linux system.

Background

When a server is in use, it may go down abnormally or crash because a Linux kernel panic is triggered. The cause may be a hardware problem in the server; an external environmental trigger, such as an ambient temperature that is too high or too low and trips the server's self-protection threshold; a virus from the external environment; or system downtime caused by task blocking. Whatever the cause, an abnormal restart or outage can seriously affect the customer's experience and use of the system.

kdump is a tool and service for dumping the runtime state held in memory when the system crashes, deadlocks, or hangs. When a kernel panic is triggered, the system generates a vmcore file under /var/crash, and a Linux engineer analyzes the cause of the crash from the generated vmcore-dmesg file and the vmcore file.

However, there is currently no good solution for system crashes in customer deployments, such as banking services, where heavy data access and excessive load cause Linux tasks to block and trigger a kernel panic.

Disclosure of Invention

The main technical problem solved by the invention is to provide a Linux system crash control method, system, and medium that can obtain the system log through Linux commands, distinguish crashes triggered by hardware problems from crashes triggered by software problems by analyzing the system log, and establish a protective barrier around tasks blocked by heavy service volume so that they are not monitored. Heavy service access loads can thus be handled properly and software faults avoided.

To solve the above technical problem, the invention adopts the following technical solution: a Linux system crash control method comprising the following steps: creating a kernel crash analysis thread and analyzing the cause of the system crash:

if system crash information caused by user behavior exists in the system log, defining the system crash reason as the system crash caused by the user;

if the system log contains system crash information caused by a hardware failure in the server, defining the cause of the system crash as a hardware-induced system crash;

if the system log contains fault information caused by system task blocking or soft lockup error information, defining the cause of the system crash as a software-induced system crash;

creating a kernel crash avoidance thread, carrying out a test experiment on system crash caused by software, triggering the system crash, reading a log code generated when the system crashes, and writing the log code into a suspended task function;

and packaging the suspended task function into a Linux kernel, restarting the operating system, and re-entering the Linux kernel.
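The three-way classification described in the steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the marker strings are taken from the log excerpts quoted later in this description, while the names `CrashCause` and `classify_crash` and the priority order (hardware, then software, then user) are assumptions made for the example.

```python
from enum import Enum


class CrashCause(Enum):
    USER = "user"
    HARDWARE = "hardware"
    SOFTWARE = "software"
    UNKNOWN = "unknown"


# Marker substrings copied from the log excerpts in this description.
# Grouping them this way, and the order in which they are checked,
# are assumptions made for illustration only.
HARDWARE_MARKERS = ("Machine Check Exception", "Temperature above threshold")
SOFTWARE_MARKERS = ("blocked for more than", "soft lockup")
USER_MARKERS = ("shutting down for system reboot", "Switching to runlevel")


def classify_crash(log_lines):
    """Classify the crash cause from an iterable of system log lines."""
    text = "\n".join(log_lines)
    if any(marker in text for marker in HARDWARE_MARKERS):
        return CrashCause.HARDWARE
    if any(marker in text for marker in SOFTWARE_MARKERS):
        return CrashCause.SOFTWARE
    if any(marker in text for marker in USER_MARKERS):
        return CrashCause.USER
    return CrashCause.UNKNOWN
```

In practice the lines would come from the system log file; a real analysis would likely also consider timestamps so that only entries near the crash are examined.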

Further, the log code generated when the system crashes identifies a task process, namely the process whose execution caused the system crash.

Further, writing to the suspended task function comprises the following steps:

reading task processes and the number of the task processes in a log code generated when a system crashes;

and writing the task processes and the number of the task processes into a suspended task function.

Further, the step of packaging the suspended task function into the Linux kernel comprises the following steps:

clearing the compilation files and configuration files generated during compilation of the Linux kernel;

clearing the object files and executable files generated during compilation of the Linux kernel;

using an interface command to switch the kernel configuration interface to graphical mode, selecting the suspended task function, and compiling the suspended task function into the Linux kernel;

compiling the Linux kernel through a compiling kernel command;

installing a Linux kernel driver module by using an installation command;

and installing a Linux kernel.

A Linux system crash control system comprises an analysis module, an avoidance module, and a packaging module;

the analysis module checks the system log and determines whether the system crash was caused by a user, by hardware, or by software;

the avoidance module runs a test experiment for software-induced system crashes, triggers a system crash, reads the log code generated when the system crashes, and writes the log code into the suspended task function.

A computer-readable storage medium has a computer program stored thereon which, when executed by a processor, performs the steps of the Linux system crash control method described above.

The beneficial effects of the invention are as follows: the cause of a system crash triggered by heavy service data access can be analyzed and located more effectively; restarts triggered by hardware problems can be distinguished from restarts triggered by software problems; and a protective barrier is established around tasks blocked by heavy service volume so that they are not monitored. Software faults, and the system crashes they cause, can thus be avoided when the service access load is heavy.

Drawings

FIG. 1 is a flowchart illustrating a method for controlling a crash of a Linux system according to a preferred embodiment of the present invention;

FIG. 2 is a schematic diagram of a Linux system crash control system according to the present invention.

Detailed Description

The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the invention can be more readily understood by those skilled in the art and the scope of protection of the invention can be defined more clearly.

The embodiment of the invention comprises the following steps:

in a first aspect, referring to fig. 1, a Linux system crash control method includes:

creating a kernel crash analysis thread, which uses the Linux command # more /var/log/messages to display the time of each event, the host name of the system where the event originated, and the name of the program that generated the log message, and analyzing the cause of the system crash;

if the kernel crash analysis thread finds that the system log has the following display:

shutdown: shutting down for system reboot

init: Switching to runlevel: 6

exiting on signal 15

Got SIGTERM, quitting

the cause of the system crash is a restart initiated by the user;

the kernel crash analysis thread then uses commands to find what the last user specifically executed:

the kernel crash analysis thread displays the commands previously run by users through lastcomm and checks the commands previously executed by the root user through lastcomm root; if analysis of the system log shows that a command run by the user caused the crash, the system crash is attributed to the user.

Checking the system log through a kernel crash analysis thread;

if the system log shows that:

CPU 1: Machine Check Exception: 4 Bank 4: ba00000000070f0f

Kernel panic - not syncing: Machine check

the "Kernel panic - not syncing" message indicates a system restart caused by a CPU hardware failure;

if the system log shows that:

kernel: CPUX: Temperature above threshold, cpu clock throttled

kernel: CPUX: Core power limit notification (total events = 1)

Power Button Pressed received event "button/power PWRF 00000000 00000000"

the restart was caused by the server overheating; the kernel crash analysis thread issues a prompt recommending that the data center cooling system and the server fans be checked;

in these cases the cause of the system crash is a hardware failure;

other causes of hardware failure include bent CPU pins, a large number of uncorrectable ECC errors in memory, and disk or memory damage.
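The Machine Check Exception line quoted above carries the CPU number, the bank number, and a hexadecimal status word, which can be extracted as in the sketch below. The function name `parse_mce` and the returned field names are hypothetical. The two decoded flags assume the x86 machine-check bank status layout, in which bit 63 marks the entry as valid and bit 61 marks an uncorrected error; treat that decoding as an assumption if the log comes from another architecture.

```python
import re

# Tolerates missing spaces, since log excerpts are sometimes run together.
_MCE_RE = re.compile(
    r"CPU\s*(\d+)\s*:\s*Machine Check Exception\s*:\s*([0-9a-fA-F]+?)"
    r"\s*Bank\s*(\d+)\s*:\s*([0-9a-fA-F]+)"
)


def parse_mce(line):
    """Extract fields from a Machine Check Exception log line, or return None."""
    match = _MCE_RE.search(line)
    if match is None:
        return None
    status = int(match.group(4), 16)
    return {
        "cpu": int(match.group(1)),
        "mcg_status": int(match.group(2), 16),
        "bank": int(match.group(3)),
        "status": status,
        # Assumed x86 layout: bit 63 = valid entry, bit 61 = uncorrected error.
        "valid": bool(status >> 63 & 1),
        "uncorrected": bool(status >> 61 & 1),
    }
```

For the line quoted in this description, the status word ba00000000070f0f has both bits set, which is consistent with a fatal hardware error rather than a corrected one.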

The kernel crash analysis thread uses kdump;

the kernel crash analysis thread starts the kdump service;

when the system crashes, kdump boots a capture kernel, which collects all of the runtime state and data held in memory at that moment into a vmcore file;

the crash recorded in the vmcore file is then diagnosed with the Kernel Oops Analyzer to determine the cause of the system crash.
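kdump can only capture a vmcore if memory was reserved for the capture kernel at boot through the `crashkernel=` kernel parameter. The sketch below checks a kernel command line for that parameter; the helper names are hypothetical, and reading `/proc/cmdline` is shown only as one possible way to obtain the command line.

```python
def crashkernel_reservation(cmdline):
    """Return the value of the crashkernel= boot parameter, or None if absent."""
    prefix = "crashkernel="
    for token in cmdline.split():
        if token.startswith(prefix):
            return token[len(prefix):]
    return None


def read_kernel_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

A check such as `crashkernel_reservation(read_kernel_cmdline()) is None` could be used to warn that no vmcore will be produced on the next crash.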

Checking the system log through a kernel crash analysis thread;

if the system log shows that:

"kernel: INFO: task xxx:60 blocked for more than 120 seconds."

the fault was caused by system task blocking;

if the system log shows that:

BUG: soft lockup - CPU#2 stuck for 67s! [vmmemctl:894]

BUG: soft lockup - CPU#5 stuck for 67s! [bdi-default:49]

BUG: soft lockup - CPU#3 stuck for 67s! [irqbalance:1351]

BUG: soft lockup - CPU#4 stuck for 67s! [swapper:0]

BUG: soft lockup - CPU#6 stuck for 67s! [watchdog/6:30]

BUG: soft lockup - CPU#5 stuck for 67s! [vmmemctl:894]

BUG: soft lockup - CPU#0 stuck for 67s! [events/0:35]

BUG: soft lockup - CPU#7 stuck for 67s! [lldpad:1459]

BUG: soft lockup - CPU#6 stuck for 67s! [mpt_poll_0:376]

BUG: soft lockup - CPU#4 stuck for 67s! [ksoftirqd/4:21]

this indicates that a driver in the system is faulty or that CPU resources are insufficient: the watchdog is too busy to run in time and cannot collect usage data for each running logical CPU, so a soft lockup error is reported;

in both of these cases the cause of the system crash is a software fault.
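The soft lockup lines above follow a regular pattern (CPU number, stuck duration, process name, process ID), so they can be parsed as in the sketch below. The names `SoftLockup` and `parse_soft_lockups` are hypothetical, and the pattern is deliberately tolerant of missing spaces and of the full-width exclamation mark that appears in some copies of these logs.

```python
import re
from typing import NamedTuple


class SoftLockup(NamedTuple):
    cpu: int
    stuck_secs: int
    process: str
    pid: int


_LOCKUP_RE = re.compile(
    r"BUG:\s*soft lockup\s*-\s*CPU#(\d+)\s*stuck for\s*(\d+)s[!\uff01]\s*"
    r"\[(.+):(\d+)\]"
)


def parse_soft_lockups(log_lines):
    """Return one SoftLockup record per matching log line, in input order."""
    records = []
    for line in log_lines:
        match = _LOCKUP_RE.search(line)
        if match:
            cpu, secs, process, pid = match.groups()
            records.append(SoftLockup(int(cpu), int(secs), process, int(pid)))
    return records
```

Grouping the records by `cpu` or by `process` then shows whether the lockups are concentrated on one CPU or one driver thread, or spread across the whole machine as in the excerpt above.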

To address software-induced crashes caused by system task blocking, a kernel crash avoidance thread is created. The thread first runs a test experiment that triggers system task blocking and thereby causes a system crash, and then reads the log code generated when the system crashes;

the log code generated when the system crashes is "kernel: INFO: task xxx:60 blocked for more than 120 seconds.", and the task process xxx shown in the log code is the process that caused the system crash;

the task processes in the log code are read, and the task processes and their task process numbers are written into the hung_task function for protection and shielding, so that these task processes are no longer monitored;

the kernel crash avoidance thread then packages the hung_task function into the Linux kernel, restarts the operating system, and re-enters the Linux kernel; the newly compiled Linux kernel avoids system crashes caused by task blocking.
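Reading the blocked task out of the log code can be sketched as below. The sketch assumes that the "task process number" is the process ID printed after the colon in the log line (60 in the example); the function name `blocked_tasks` is hypothetical, and how the resulting list is written into the hung_task function is not shown because the description does not specify it.

```python
import re

_HUNG_RE = re.compile(
    r"INFO:\s*task\s+(\S+?):(\d+)\s*blocked for more than\s*(\d+)\s*seconds"
)


def blocked_tasks(log_lines):
    """Return the sorted, de-duplicated (task name, pid) pairs reported as blocked."""
    found = set()
    for line in log_lines:
        match = _HUNG_RE.search(line)
        if match:
            found.add((match.group(1), int(match.group(2))))
    return sorted(found)
```

The returned pairs are the candidates that the avoidance thread would shield from hung-task monitoring.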

Packaging the hung_task function into the Linux kernel comprises the following steps:

# make mrproper: clears all intermediate files generated during previous builds, including the previously configured kernel configuration file .config; that is, the old configuration file is deleted before a new build so that it does not affect compilation of the new kernel;

# make clean: clears the object files (suffix .o) and executable files generated by the previous build;

# make menuconfig: after a few seconds the terminal shows the menu-driven graphical kernel configuration interface; select the modified function module (the hung_task function) and compile it into the kernel;

# make -j2: compiles the kernel; use -j4 on a quad-core computer or -j8 on an eight-core computer. A larger number after -j allows more parallel jobs and generally shortens compilation, up to the number of available cores. This step generates the kernel modules, vmlinux, and initrd;

# make modules_install: after compilation succeeds, a subdirectory is created under the /lib/modules directory, and all loadable modules of the new kernel are stored there (that is, the compiled modules are copied to /lib/modules);

# make install: installs the kernel, that is, copies the .config, vmlinux, initrd.img, and System.map files to the /boot directory and updates GRUB. On a RedHat system the three GRUB files are updated automatically, and the new kernel is booted by default.
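The build sequence above can be assembled programmatically, with the `-j` value derived from the number of CPUs instead of being hard-coded. The sketch below only builds the command list and does not run anything; the function name `kernel_build_commands` is hypothetical, and the inclusion of `make modules_install` assumes that it is the module installation command meant in this description.

```python
import os


def kernel_build_commands(jobs=None):
    """Return the ordered make invocations for rebuilding and installing a kernel."""
    if jobs is None:
        # os.cpu_count() may return None on unusual platforms; fall back to 1.
        jobs = os.cpu_count() or 1
    return [
        ["make", "mrproper"],         # remove old build output and .config
        ["make", "clean"],            # remove object files and executables
        ["make", "menuconfig"],       # interactive configuration
        ["make", f"-j{jobs}"],        # parallel build
        ["make", "modules_install"],  # copy modules under /lib/modules
        ["make", "install"],          # copy kernel to /boot, update boot loader
    ]
```

Each entry could be passed to `subprocess.run(cmd, cwd=source_dir, check=True)` from the root of a kernel source tree, where `source_dir` is a placeholder; the two install steps normally require root privileges, and `menuconfig` needs an interactive terminal.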

kdump is a memory dump tool and service that dumps the runtime state held in memory when the system crashes, deadlocks, or hangs.

Kernel Oops Analyzer is a kernel crash analysis tool. The hung_task function is a self-protection module that detects whether any process in the system has remained in the D (uninterruptible sleep) state for longer than a specified, configurable time; if such a process exists, a kernel panic is triggered, causing the server to crash and restart. The hung_task function checks all processes in a loop.
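The detection loop described above, together with the shielding proposed in this description, can be modeled in user space as follows. This is a simplified simulation and not kernel code: the `Task` record, the `find_hung_tasks` function, and the `shielded` set of task names are all hypothetical, and the 120-second default is taken from the log message quoted earlier.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    pid: int
    state: str           # "D" = uninterruptible sleep, "R" = running, "S" = sleeping
    blocked_secs: float  # time spent continuously in the current state


def find_hung_tasks(tasks, timeout_secs=120, shielded=frozenset()):
    """Return tasks stuck in the D state beyond the timeout, skipping shielded names."""
    hung = []
    for task in tasks:
        if task.name in shielded:
            continue  # shielded tasks are not monitored
        if task.state == "D" and task.blocked_secs > timeout_secs:
            hung.append(task)
    return hung
```

With an empty `shielded` set, a task such as xxx that has been blocked for 130 seconds is reported and would lead to a panic; once xxx is added to the set, the same task list produces no report.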

Linux provides a watchdog implementation for monitoring system operation, comprising a kernel watchdog module and a user-space watchdog program. The kernel watchdog module communicates with user space through the /dev/watchdog character device. Once the user-space program opens the /dev/watchdog device, a 1-minute timer is started in the kernel; the user-space program must then write data to the device within every 1-minute window, since each write resets the timer. If the user-space program performs no write within 1 minute, the timer expires and the system reboots.
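The timer behavior described above can be illustrated with a small user-space model. It does not touch /dev/watchdog; the class name `SoftWatchdog` and its methods are hypothetical, and the injectable `clock` parameter exists only so that the behavior can be exercised without waiting a full minute.

```python
import time


class SoftWatchdog:
    """Model of a watchdog timer that must be kicked before it expires."""

    def __init__(self, timeout_secs=60.0, clock=time.monotonic):
        self._timeout = timeout_secs
        self._clock = clock
        # The timer starts when the device is opened (here: on construction).
        self._deadline = clock() + timeout_secs

    def kick(self):
        """Equivalent of a write to the device: restart the countdown."""
        self._deadline = self._clock() + self._timeout

    def expired(self):
        """True once the countdown has run out; the real system would reboot."""
        return self._clock() >= self._deadline
```

A monitoring loop would call `kick()` periodically, well inside the timeout; if the loop stalls, `expired()` becomes true, which corresponds to the point at which the kernel timer would reboot the machine.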

In a second aspect, based on the same inventive concept as the Linux system crash control method in the foregoing embodiment, an embodiment of this specification further provides a Linux system crash control system comprising an analysis module, an avoidance module, and a packaging module;

the analysis module checks the system log and determines whether the system crash was caused by a user, by hardware, or by software;

the avoidance module runs a test experiment for software-induced system crashes and triggers a software-induced system crash; it then reads the log code generated when the system crashes, reads the task process in the log code, and writes the task process and the task process number into the hung_task function for protection and shielding, so that they are no longer monitored;

and the packaging module packages the hung_task function from the avoidance module into the Linux kernel.

In a third aspect, based on the same inventive concept as the Linux system crash control method in the foregoing embodiment, an embodiment of this specification further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the steps of the Linux system crash control method.

The above description is only an embodiment of the present invention and is not intended to limit its scope. Any equivalent structure or equivalent process derived from this specification and the drawings, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the present invention.

Claims

1. A Linux system crash control method is characterized by comprising the steps of creating a kernel crash analysis thread and analyzing the cause of system crash; the reasons for system crash include system crash caused by user, system crash caused by hardware and system crash caused by software;

2. The Linux system crash control method of claim 1, wherein: the log code generated when the system crashes identifies a task process, namely the process whose execution caused the system crash.

3. The Linux system crash control method of claim 2, wherein: writing to the suspended task function comprises the following steps:

4. The Linux system crash control method of claim 2, wherein: the step of packaging the suspended task function into the Linux kernel comprises the following steps:

compiling the Linux kernel through a compiling kernel command;

installing a Linux kernel driver module by using an installation command;

and installing a Linux kernel.

5. A Linux system crash control system, comprising: an analysis module, an avoidance module, and a packaging module;

the avoidance module runs a test experiment for software-induced system crashes, triggers a system crash, reads the log code generated when the system crashes, and writes the log code into a suspended task function;

and the packaging module packages the suspended task function from the avoidance module into the Linux kernel.

6. A computer-readable storage medium having a computer program stored thereon, wherein the computer program is executed by a processor to perform the steps of the Linux system crash control method of any one of claims 1-4.