CN108108259A - Kernel fault locating method and device - Google Patents
- Publication number
- CN108108259A (application number CN201810026869.1A)
- Authority
- CN
- China
- Prior art keywords
- failure
- kernels
- hardware
- fault
- deadlock
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0721—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/073—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3024—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a central processing unit [CPU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3051—Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The present invention provides a kernel fault locating method and device. The server's system and hardware are monitored for faults; when a system fault or a hardware fault occurs on the server, the memory information of the faulty system is collected through the BMC and the memory data is analyzed, so that the cause of the fault is determined quickly, the fault is located, and a way to resolve it is found. The invention helps ensure that services on the server recover quickly, reducing losses.
Description
Technical field
The present invention relates to the technical field of servers, and in particular to a kernel fault locating method and device.
Background art
As customers' business demands keep growing, server performance must keep increasing and server hardware configurations keep being upgraded; for example, a CPU may reach more than a thousand cores and memory may exceed a terabyte. Adding hardware also raises the failure rate. Operating systems are becoming more complex as well, and as hardware is added the number of drivers grows accordingly, which introduces more bugs. When a server fails, the cause must be analyzed quickly and a solution found, which requires saving or obtaining the relevant data for analysis. Especially when critical services are deployed on the server, handling the problem quickly reduces the customer's economic loss and ensures rapid service recovery.
In the prior art, a common fault locating method is as follows. The K-UX operating system is installed and run on the server, and under normal conditions the system runs in the K-UX kernel. When a serious fault occurs, the K-UX kernel hangs and the crash kernel is started (crash kernel: a small Linux kernel used mainly to save the K-UX kernel's memory data to disk). The crash kernel saves the memory data used by the K-UX kernel to disk so that the problem can be analyzed and located after the next restart. After the crash kernel has collected the K-UX kernel's memory information, the system restarts and enters the BIOS, which performs hardware initialization and other operations; in its final stage the BIOS loads the K-UX kernel to start the system. Once the K-UX system is running, the memory data that the crash kernel saved to disk is analyzed (as shown in Fig. 2). The prior art has the following drawbacks: 1. the user must configure the crash kernel and allocate memory for it, which wastes some memory space; 2. saving the memory data requires a large amount of disk space; 3. many users do not configure the crash kernel when installing K-UX, which makes later fault locating very difficult.
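The third drawback can be checked on a Linux-like system by looking for the `crashkernel=` boot parameter, which reserves memory for the crash kernel. The sketch below is a minimal illustration that assumes K-UX exposes its boot parameters through `/proc/cmdline` in the same way as mainline Linux; the function names are hypothetical and are not taken from the patent.

```python
from pathlib import Path
from typing import Optional


def crashkernel_setting(cmdline: str) -> Optional[str]:
    """Return the value of the crashkernel= boot parameter, or None if absent."""
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            return token[len("crashkernel="):]
    return None


def read_boot_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the kernel command line (assumes a Linux-style /proc file system)."""
    return Path(path).read_text()


# Example with a literal command line, so the sketch runs on any platform.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 crashkernel=256M quiet"
print(crashkernel_setting(example))
```

A result of `None` corresponds to the situation described above: no crash kernel is configured, so no memory dump is available after a serious fault.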
Summary of the invention
In view of the above problems, the present invention proposes a kernel fault locating method and device that collect the memory information of the faulty system through the BMC, so that the cause of the fault can be analyzed quickly and the fault located.
The present invention provides the following technical solution:
In one aspect, the present invention provides a kernel fault locating method, including:
Step 101: monitor whether the K-UX kernel and/or the hardware has failed;
Step 102: if the K-UX kernel and/or the hardware has failed, enter the BMC system and obtain the memory information of the faulty system;
Step 103: analyze the memory information of the faulty system and locate the fault.
After the fault is located, the method further includes resolving the fault and restoring normal operation of the server.
The faulty system is the K-UX system or the hardware system.
The K-UX kernel fault includes at least one of a null pointer, an array out-of-bounds access, a soft deadlock, and a hard deadlock; the hardware fault includes at least one of a disk sector that cannot be read or written and a CPU core that cannot work normally.
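Steps 101 to 103 can be sketched as a small pipeline. This is an illustration under stated assumptions, not the patent's implementation: the patent does not specify how the BMC reads host memory or how the analysis is performed, so the `fetch_memory` and `analyze` callables and the `FaultReport` type are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FaultReport:
    source: str  # "kernel" or "hardware"
    cause: str   # cause found by the analysis step


def locate_fault(
    kernel_failed: bool,
    hardware_failed: bool,
    fetch_memory: Callable[[], bytes],
    analyze: Callable[[bytes], str],
) -> Optional[FaultReport]:
    # Step 101: the monitoring result is passed in as two flags.
    if not (kernel_failed or hardware_failed):
        return None
    # Step 102: obtain the faulty system's memory information (via the BMC).
    memory = fetch_memory()
    # Step 103: analyze the memory information and locate the fault.
    source = "kernel" if kernel_failed else "hardware"
    return FaultReport(source=source, cause=analyze(memory))


# Usage with stand-in callables (both are hypothetical).
report = locate_fault(
    kernel_failed=True,
    hardware_failed=False,
    fetch_memory=lambda: b"... NULL pointer dereference ...",
    analyze=lambda m: "null pointer" if b"NULL pointer" in m else "unknown",
)
print(report)
```

Passing the acquisition and analysis steps in as callables keeps the control flow of the three steps visible without committing to any particular BMC interface.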
In addition, the present invention provides a kernel fault locating device, the device including:
a monitoring module, configured to monitor whether the K-UX kernel and/or the hardware has failed;
an acquisition module, configured to enter the BMC system when the K-UX kernel and/or the hardware has failed and to obtain the memory information of the faulty system;
a locating module, configured to analyze the memory information of the faulty system and locate the fault.
After the fault is located, the fault is further resolved and normal operation of the server is restored.
The faulty system is the K-UX system or the hardware system.
The K-UX kernel fault includes at least one of a null pointer, an array out-of-bounds access, a soft deadlock, and a hard deadlock; the hardware fault includes at least one of a disk sector that cannot be read or written and a CPU core that cannot work normally.
The present invention provides a kernel fault locating method and device. The server's system and hardware are monitored for faults; when a system fault or a hardware fault occurs on the server, the memory information of the faulty system is collected through the BMC and the memory data is analyzed, so that the cause of the fault is determined quickly, the fault is located, and a way to resolve it is found. The invention helps ensure that services on the server recover quickly, reducing losses.
Description of the drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 is the flow chart of the prior art.
Detailed description of the embodiments
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Based on the above, in one aspect, an embodiment of the present invention provides a kernel fault locating method. Fig. 1 is a flow chart of the present invention. The method includes:
Step 101: monitor whether the K-UX kernel and/or the hardware has failed.
K-UX: the Inspur operating system, which is Linux-like. The K-UX operating system is installed on the server and runs normally, and the K-UX kernel and the hardware are monitored for faults.
Step 102: if the K-UX kernel and/or the hardware has failed, enter the BMC system and obtain the memory information of the faulty system.
When the K-UX kernel fails, log in to the BMC system and obtain the K-UX memory information. Here, BMC stands for Baseboard Management Controller. It runs a small operating system that is independent of the server's own system, and its purpose is to facilitate operations such as remote management, monitoring, installation, and restarting of the server. Serious K-UX kernel faults are faults that prevent the K-UX system from continuing to work, such as null pointers, array out-of-bounds accesses, soft deadlocks, and hard deadlocks. Hardware faults are faults that make hardware unusable, for example some disk sectors cannot be read or written, or some CPU cores cannot work normally.
Step 103: analyze the memory information of the faulty system and locate the fault.
Analyze the obtained K-UX memory information and locate the cause of the fault; then resolve the fault and restore normal operation of the server.
The K-UX kernel fault includes at least one of a null pointer, an array out-of-bounds access, a soft deadlock, and a hard deadlock; the hardware fault includes at least one of a disk sector that cannot be read or written and a CPU core that cannot work normally.
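One way to map recovered memory contents to the kernel fault types listed above is to search the kernel log text found in the memory image. The sketch below assumes that K-UX, being Linux-like, emits messages similar to those of mainline Linux, and that the soft and hard deadlocks named here correspond to Linux soft and hard lockups. The patterns and the function name are illustrative assumptions and would need to be checked against real K-UX output.

```python
import re
from typing import List

# Message fragments modeled on mainline Linux kernel logs (an assumption for K-UX).
FAULT_PATTERNS = {
    "null pointer": re.compile(r"NULL pointer dereference"),
    "array out of bounds": re.compile(r"array-index-out-of-bounds"),
    "soft deadlock": re.compile(r"soft lockup"),
    "hard deadlock": re.compile(r"hard LOCKUP"),
}


def classify_kernel_faults(log_text: str) -> List[str]:
    """Return the fault types whose pattern appears in the log text."""
    return [
        name
        for name, pattern in FAULT_PATTERNS.items()
        if pattern.search(log_text)
    ]


sample = "BUG: soft lockup - CPU#3 stuck for 22s!"
print(classify_kernel_faults(sample))
```

Text matching only narrows down the fault type; locating the root cause would still require inspecting the call stack and data structures in the memory image.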
The present invention provides a kernel fault locating method. The server's system and hardware are monitored for faults; when a system fault or a hardware fault occurs on the server, the memory information of the faulty system is collected through the BMC and the memory data is analyzed, so that the cause of the fault is determined quickly, the fault is located, and a way to resolve it is found. The invention helps ensure that services on the server recover quickly, reducing losses.
In another aspect, an embodiment of the present invention provides a kernel fault locating device, the device including:
Monitoring module 201, configured to monitor whether the K-UX kernel and/or the hardware has failed.
K-UX: the Inspur operating system, which is Linux-like. The K-UX operating system is installed on the server and runs normally, and the K-UX kernel and the hardware are monitored for faults.
Acquisition module 202, configured to enter the BMC system when the K-UX kernel and/or the hardware has failed and to obtain the memory information of the faulty system.
When the K-UX kernel fails, log in to the BMC system and obtain the K-UX memory information. Here, BMC stands for Baseboard Management Controller. It runs a small operating system that is independent of the server's own system, and its purpose is to facilitate operations such as remote management, monitoring, installation, and restarting of the server. Serious K-UX kernel faults are faults that prevent the K-UX system from continuing to work, such as null pointers, array out-of-bounds accesses, soft deadlocks, and hard deadlocks. Hardware faults are faults that make hardware unusable, for example some disk sectors cannot be read or written, or some CPU cores cannot work normally.
Locating module 203, configured to analyze the memory information of the faulty system and locate the fault.
Analyze the obtained K-UX memory information and locate the cause of the fault; then resolve the fault and restore normal operation of the server.
The K-UX kernel fault includes at least one of a null pointer, an array out-of-bounds access, a soft deadlock, and a hard deadlock; the hardware fault includes at least one of a disk sector that cannot be read or written and a CPU core that cannot work normally.
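Modules 201 to 203 can be pictured as cooperating objects. The sketch below shows one possible arrangement and is not the patent's design: the `BmcClient` protocol, its `read_host_memory` method, and the `FaultLocatingDevice` wrapper are hypothetical, since the patent does not describe the BMC interface, and the analysis is a placeholder.

```python
from typing import Optional, Protocol


class BmcClient(Protocol):
    """Hypothetical BMC interface; the patent does not define one."""

    def read_host_memory(self) -> bytes: ...


class MonitoringModule:  # module 201
    def __init__(self) -> None:
        self.kernel_failed = False
        self.hardware_failed = False

    def fault_detected(self) -> bool:
        return self.kernel_failed or self.hardware_failed


class AcquisitionModule:  # module 202
    def __init__(self, bmc: BmcClient) -> None:
        self.bmc = bmc

    def acquire(self) -> bytes:
        return self.bmc.read_host_memory()


class LocatingModule:  # module 203
    def locate(self, memory: bytes) -> str:
        # Placeholder analysis; a real locator would parse the memory image.
        return "null pointer" if b"NULL pointer" in memory else "unknown"


class FaultLocatingDevice:
    """Wires the three modules together (hypothetical wrapper)."""

    def __init__(self, bmc: BmcClient) -> None:
        self.monitor = MonitoringModule()
        self.acquirer = AcquisitionModule(bmc)
        self.locator = LocatingModule()

    def run_once(self) -> Optional[str]:
        if not self.monitor.fault_detected():
            return None
        return self.locator.locate(self.acquirer.acquire())
```

Because the acquisition module depends only on the `BmcClient` protocol, a stub BMC that returns canned bytes is enough to exercise the whole device in a test.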
The present invention provides a kernel fault locating device. The server's system and hardware are monitored for faults; when a system fault or a hardware fault occurs on the server, the memory information of the faulty system is collected through the BMC and the memory data is analyzed, so that the cause of the fault is determined quickly, the fault is located, and a way to resolve it is found. The invention helps ensure that services on the server recover quickly, reducing losses.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (8)
1. A kernel fault locating method, characterized by comprising:
Step 101: monitoring whether the K-UX kernel and/or the hardware has failed;
Step 102: if the K-UX kernel and/or the hardware has failed, entering the BMC system and obtaining the memory information of the faulty system;
Step 103: analyzing the memory information of the faulty system and locating the fault.
2. The method according to claim 1, characterized in that: after the fault is located, the method further comprises resolving the fault and restoring normal operation of the server.
3. The method according to claim 1, characterized in that: the faulty system is the K-UX system or the hardware system.
4. The method according to claim 1, characterized in that: the K-UX kernel fault includes at least one of a null pointer, an array out-of-bounds access, a soft deadlock, and a hard deadlock; the hardware fault includes at least one of a disk sector that cannot be read or written and a CPU core that cannot work normally.
5. A kernel fault locating device, characterized in that the device comprises:
a monitoring module, configured to monitor whether the K-UX kernel and/or the hardware has failed;
an acquisition module, configured to enter the BMC system when the K-UX kernel and/or the hardware has failed and to obtain the memory information of the faulty system;
a locating module, configured to analyze the memory information of the faulty system and locate the fault.
6. The device according to claim 5, characterized in that: after the fault is located, the fault is further resolved and normal operation of the server is restored.
7. The device according to claim 5, characterized in that: the faulty system is the K-UX system or the hardware system.
8. The device according to claim 5, characterized in that: the K-UX kernel fault includes at least one of a null pointer, an array out-of-bounds access, a soft deadlock, and a hard deadlock; the hardware fault includes at least one of a disk sector that cannot be read or written and a CPU core that cannot work normally.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810026869.1A CN108108259A (en) | 2018-01-11 | 2018-01-11 | Kernel fault locating method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810026869.1A CN108108259A (en) | 2018-01-11 | 2018-01-11 | Kernel fault locating method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108108259A (en) | 2018-06-01 |
Family
ID=62219541
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810026869.1A Pending CN108108259A (en) | 2018-01-11 | 2018-01-11 | Kernel fault locating method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108108259A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021056912A1 (en) * | 2019-09-29 | 2021-04-01 | 苏州浪潮智能科技有限公司 | Method and device for detecting memory downgrade error |
CN112799917A (en) * | 2021-02-08 | 2021-05-14 | 联想(北京)有限公司 | Data processing method, device and equipment |
CN114706708A (en) * | 2022-05-24 | 2022-07-05 | 北京拓林思软件有限公司 | Fault analysis method and system for Linux operating system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104598346A (en) * | 2015-02-15 | 2015-05-06 | 浪潮电子信息产业股份有限公司 | Monitoring and management device and method for quick fault positioning in server system |
CN105183575A (en) * | 2015-08-24 | 2015-12-23 | 浪潮(北京)电子信息产业有限公司 | Processor fault diagnosis method, device and system |
CN105659215A (en) * | 2014-06-24 | 2016-06-08 | 华为技术有限公司 | Fault processing method, related device and computer |
CN106293984A (en) * | 2016-08-11 | 2017-01-04 | 浪潮(北京)电子信息产业有限公司 | A kind of computer glitch automatically processes mode and device |
CN107357684A (en) * | 2017-07-07 | 2017-11-17 | 郑州云海信息技术有限公司 | A kind of kernel failure method for restarting and device |
CN107368385A (en) * | 2017-07-26 | 2017-11-21 | 郑州云海信息技术有限公司 | A kind of method and system of expansible more memory failure fast positionings based on BMC controls |
-
2018
- 2018-01-11 CN CN201810026869.1A patent/CN108108259A/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105659215A (en) * | 2014-06-24 | 2016-06-08 | 华为技术有限公司 | Fault processing method, related device and computer |
CN104598346A (en) * | 2015-02-15 | 2015-05-06 | 浪潮电子信息产业股份有限公司 | Monitoring and management device and method for quick fault positioning in server system |
CN105183575A (en) * | 2015-08-24 | 2015-12-23 | 浪潮(北京)电子信息产业有限公司 | Processor fault diagnosis method, device and system |
CN106293984A (en) * | 2016-08-11 | 2017-01-04 | 浪潮(北京)电子信息产业有限公司 | A kind of computer glitch automatically processes mode and device |
CN107357684A (en) * | 2017-07-07 | 2017-11-17 | 郑州云海信息技术有限公司 | A kind of kernel failure method for restarting and device |
CN107368385A (en) * | 2017-07-26 | 2017-11-21 | 郑州云海信息技术有限公司 | A kind of method and system of expansible more memory failure fast positionings based on BMC controls |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021056912A1 (en) * | 2019-09-29 | 2021-04-01 | 苏州浪潮智能科技有限公司 | Method and device for detecting memory downgrade error |
US11853150B2 (en) | 2019-09-29 | 2023-12-26 | Inspur Suzhou Intelligent Technology Co., Ltd. | Method and device for detecting memory downgrade error |
CN112799917A (en) * | 2021-02-08 | 2021-05-14 | 联想(北京)有限公司 | Data processing method, device and equipment |
CN112799917B (en) * | 2021-02-08 | 2024-01-23 | 联想(北京)有限公司 | Data processing method, device and equipment |
CN114706708A (en) * | 2022-05-24 | 2022-07-05 | 北京拓林思软件有限公司 | Fault analysis method and system for Linux operating system |
CN114706708B (en) * | 2022-05-24 | 2022-08-30 | 北京拓林思软件有限公司 | Fault analysis method and system for Linux operating system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7017085B2 (en) | Systems and methods for remote tracking of reboot status | |
US6907419B1 (en) | Method, system, and product for maintaining within a virtualization system a historical performance database for physical devices | |
US20150154079A1 (en) | Fault tolerant architecture for distributed computing systems | |
US20080086515A1 (en) | Method and System for a Soft Error Collection of Trace Files | |
CN105518629A (en) | Cloud deployment infrastructure validation engine | |
CN103415840A (en) | Error management across hardware and software layers | |
US20120110378A1 (en) | Firmware recovery system and method of baseboard management controller of computing device | |
KR101331935B1 (en) | Method and system of fault diagnosis and repair using based-on tracepoint | |
CN108536548B (en) | Method and device for processing bad track of disk and computer storage medium | |
US20110154097A1 (en) | Field replaceable unit failure determination | |
US8930761B2 (en) | Test case result processing | |
CN110879742B (en) | Method, device and storage medium for asynchronously creating internal snapshot by virtual machine | |
KR970066876A (en) | Calculator system and its software recovery method | |
CN108108259A (en) | Kernel fault locating method and device | |
US20040148542A1 (en) | Method and apparatus for recovering from a failed I/O controller in an information handling system | |
US20050177763A1 (en) | System and method for improving network reliability | |
US10255124B1 (en) | Determining abnormal conditions of host state from log files through Markov modeling | |
KR101643729B1 (en) | System and method of data managing for time base data backup, restoring, and mounting | |
US7003617B2 (en) | System and method for managing target resets | |
CN101145983B (en) | A self-diagnosis and self-discovery subsystem and method of network management system | |
US11263069B1 (en) | Using unsupervised learning to monitor changes in fleet behavior | |
CN110737924A (en) | method and equipment for data protection | |
US9250942B2 (en) | Hardware emulation using on-the-fly virtualization | |
CN108762999A (en) | A kind of kernel failure collection method and device | |
CN104020963A (en) | Method and device for preventing misjudgment of hard disk read-write errors |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180601 |