CN107608813A

CN107608813A - A method for automatically analyzing faults based on Linux operating system information

Info

Publication number: CN107608813A
Application number: CN201710827649.4A
Authority: CN
Inventors: 郭美思; 周国浪
Original assignee: Zhengzhou Yunhai Information Technology Co Ltd
Current assignee: Zhengzhou Yunhai Information Technology Co Ltd
Priority date: 2017-09-14
Filing date: 2017-09-14
Publication date: 2018-01-19

Abstract

The present invention relates in particular to a method for automatically analyzing faults based on Linux operating system information. The method first obtains Linux operating system information and builds a fault rule base organized by fault category and faulty component. The fault rules in the rule base are then used to analyze the operating system information automatically; when a fault rule matches, the method outputs a problem description and a solution and saves the analysis result. Because the rule base is built from the patterns and handling methods of routine faults, the corresponding solution can be found by checking the information against the rule base when a Linux operating system fails, which greatly improves the efficiency of troubleshooting.

Description

A method for automatically analyzing faults based on Linux operating system information

Technical field

The present invention relates to the field of computer application technology, and in particular to a method for automatically analyzing faults based on Linux operating system information.

Background art

With the passage of time and rising living standards, the way people live and work has changed, and computers have become indispensable in daily life.

Users work on computers through application software. Application software can only run with the support of an operating system, which is the interface between the user and the computer and also the interface between the computer hardware and other software. Operating system data can be provided to users to help analyze and solve problems.

However, because operating system components are complex, faults can have many causes, and the volume of operating system information is enormous. When a computer fails, technicians must inspect and analyze the operating system information manually to find the relevant fault information and resolve the problem, so it is very difficult for them to determine the cause of a fault quickly.

Manually analyzing large volumes of operating system information is time-consuming, labor-intensive, costly, and inefficient. To address this, the present invention provides a method for automatically analyzing faults based on Linux operating system information.

Summary of the invention

To remedy the deficiencies of the prior art, the present invention provides a simple and efficient method for automatically analyzing faults based on Linux operating system information.

The present invention is achieved through the following technical solution:

A method for automatically analyzing faults based on Linux operating system information, characterized in that it comprises the following steps:

(1) Obtain Linux operating system information;

(2) Build a fault rule base organized by fault category and faulty component;

(3) Use the fault rules in the fault rule base to analyze the operating system information automatically; when a fault rule matches, output the problem description and the solution, and save the analysis result.

In step (1), the Linux operating system information includes CPU information, memory information, BIOS information, disk information, driver information, network interface card (NIC) information, BMC information, and RAID information.

CPU information, both summary and detailed, is collected with the lscpu, dmidecode -t processor, and cat /proc/cpuinfo commands; memory information with the free, dmidecode -t memory, and cat /proc/meminfo commands; BIOS information with the dmidecode -t bios command; disk information with the lsblk, lsscsi, df -h, mount, fdisk -l, and smartctl commands; driver information with the lsmod command; NIC information with the ifconfig and lspci commands; and BMC information with the ipmitool command. RAID information is collected with the tool specific to each RAID type.
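
The collection step can be illustrated with a minimal Python sketch. It is not part of the patent text: the names COLLECTION_COMMANDS, run_command, and collect_system_info are hypothetical, and the grouping of commands into categories simply mirrors the paragraph above. smartctl, ipmitool, and the RAID tools are left out because the description does not give their arguments.

```python
import shutil
import subprocess

# Commands named in the description, grouped by information category.
# smartctl, ipmitool and RAID tools are omitted: the text gives no arguments for them.
COLLECTION_COMMANDS = {
    "cpu": [["lscpu"], ["dmidecode", "-t", "processor"], ["cat", "/proc/cpuinfo"]],
    "memory": [["free"], ["dmidecode", "-t", "memory"], ["cat", "/proc/meminfo"]],
    "bios": [["dmidecode", "-t", "bios"]],
    "disk": [["lsblk"], ["lsscsi"], ["df", "-h"], ["mount"], ["fdisk", "-l"]],
    "driver": [["lsmod"]],
    "network": [["ifconfig"], ["lspci"]],
}


def run_command(argv, timeout=30):
    """Run one command; return its stdout, or None if it is missing or fails."""
    if shutil.which(argv[0]) is None:
        return None  # tool not installed on this host
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


def collect_system_info(commands=COLLECTION_COMMANDS, runner=run_command):
    """Return {category: {command line: output or None}} for every listed command."""
    report = {}
    for category, argv_list in commands.items():
        report[category] = {" ".join(argv): runner(argv) for argv in argv_list}
    return report
```

Calling collect_system_info() on a Linux host returns a nested dictionary keyed by category and command line. Commands such as dmidecode and fdisk -l normally need root privileges, so in this sketch their entries are None when they cannot be run.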

In step (2), fault information and solutions are gathered on an ongoing basis, and the fault rule base fields are extracted from them. A random forest algorithm is then used to identify faults automatically and to mine the relationships between fault symptoms and fault rules. The automatically identified faults are reviewed by experts, and valid fault symptoms and their handling schemes are turned into fault rules and stored in the fault rule base.

The fault rule base fields are distilled from the fault information and solutions obtained from customer sites, the R&D department, the test department, and operations and maintenance personnel. In addition, the data in the training set is used to pinpoint the specific device and to analyze the cause of the fault in depth.

The fault rule base fields include machine model, operating system, fault category, faulty component, log level, log details, keywords, log path, problem description, and solution.
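
One possible way to represent a rule-base record with these ten fields is sketched below in Python. The class name FaultRule and the attribute names are hypothetical. The example values are taken from the memory-error example given for step (3), except machine_model, operating_system, and log_details, which the description does not specify and which are placeholders here.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class FaultRule:
    """One record of the fault rule base, one attribute per listed field."""
    machine_model: str
    operating_system: str
    fault_category: str
    faulty_component: str
    log_level: str
    log_details: str
    keywords: List[str]
    log_path: str
    problem_description: str
    solution: str


memory_rule = FaultRule(
    machine_model="*",          # placeholder: not given in the description
    operating_system="Linux",   # placeholder: not given in the description
    fault_category="system",
    faulty_component="Memory",
    log_level="critical",
    log_details="",             # placeholder: not given in the description
    keywords=["Memory Controller", "Err"],
    log_path="/var/log/mcelog",
    problem_description="Memory controller failure",
    solution="Memory fault; replace the memory after confirming the specific memory location.",
)

# asdict() gives a plain dictionary that can be stored in whatever backend holds the rule base.
record = asdict(memory_rule)
```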

When the training-set data is used to pinpoint CPU and memory faults, the CPU events and memory events are read and mcelog is parsed to locate the faulty CPU and memory position. To locate PCIe faults, the PCIe events are read and matched to the corresponding slot information using the machine's silkscreen reference table. To locate the code segment that reported a Call Trace fault, the Call Trace event log is analyzed and the function call stack is extracted so that the cause of the fault can be analyzed in depth.
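
As an illustration of the Call Trace step only, the sketch below pulls function names out of a kernel log excerpt. It assumes the common kernel frame layout (an optional bracketed address, an optional "?" marker, then symbol+offset/size); the function name extract_call_stack and the sample lines are hypothetical, and the patent does not describe its own parsing logic.

```python
import re

# Matches frames such as "[<ffffffff8107e8a5>] ? warn_slowpath_common+0x85/0xc0"
# and the address-free form "dump_stack+0x63/0x90". Group 1 is the function name.
FRAME_RE = re.compile(
    r"(?:\[<[0-9a-f]+>\]\s+)?(?:\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def extract_call_stack(log_lines):
    """Return the function names that follow the first 'Call Trace:' marker.

    Frames marked with '?' (unreliable entries) are kept for simplicity.
    Parsing stops at the first line after the marker that is not a frame.
    """
    stack = []
    in_trace = False
    for line in log_lines:
        if not in_trace:
            in_trace = "Call Trace:" in line
            continue
        match = FRAME_RE.search(line)
        if match is None:
            break
        stack.append(match.group(1))
    return stack


# Synthetic sample written for this sketch, not output captured from a real system.
SAMPLE = [
    "[  123.456789] Call Trace:",
    "[  123.456790]  [<ffffffff8107e8a5>] ? warn_slowpath_common+0x85/0xc0",
    "[  123.456791]  [<ffffffff81234567>] do_page_fault+0x1f/0x80",
    "[  123.456792] Code: 00 00",
]
call_stack = extract_call_stack(SAMPLE)
```

The resulting list, innermost frame first, is the function call stack that a rule or a reviewer would then inspect.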

The random forest algorithm generates a forest of decision trees and merges the fault information; the decision trees vote on the fault symptom to determine the fault, and the corresponding solution is applied.
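
The voting stage can be sketched as follows. This shows only majority voting across decision trees: the three trees are hand-written stand-ins rather than trees learned from data, so bootstrap sampling and random feature selection, which a real random forest uses during training, are not shown. All feature names and labels are hypothetical.

```python
from collections import Counter


def tree_a(f):
    # Splits on the "Memory Controller" keyword first, then on "Err".
    if f["memory_controller"]:
        return "memory" if f["err"] else "none"
    return "none"


def tree_b(f):
    # Splits on the log level, then on where the log came from.
    if f["level_critical"]:
        return "memory" if f["from_mcelog"] else "pcie"
    return "none"


def tree_c(f):
    # Splits on the log source, then on the presence of a PCIe event.
    if f["from_mcelog"]:
        return "memory"
    return "pcie" if f["pcie_event"] else "none"


TREES = [tree_a, tree_b, tree_c]


def forest_vote(trees, features):
    """Return (winning label, share of trees that voted for it)."""
    votes = Counter(tree(features) for tree in trees)
    label, count = votes.most_common(1)[0]
    return label, count / len(trees)


memory_case = {"memory_controller": True, "err": True, "level_critical": True,
               "from_mcelog": True, "pcie_event": False}
pcie_case = {"memory_controller": False, "err": True, "level_critical": True,
             "from_mcelog": False, "pcie_event": True}

memory_result = forest_vote(TREES, memory_case)  # all three trees agree
pcie_result = forest_vote(TREES, pcie_case)      # two of the three trees agree
```

The vote share gives a rough confidence figure that could be shown to the expert reviewer alongside the proposed fault label.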

In step (3), when a memory error appears in the operating system information, the fault category is system; the faulty component is Memory; the log level is critical; the keywords are Memory Controller and Err; the log path is /var/log/mcelog; the problem description is memory controller failure; and the solution is: memory fault, replace the memory after confirming the specific memory location.
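
A minimal sketch of this matching step is given below, assuming that a rule matches when a single log line contains all of its keywords; the patent does not state the exact matching criterion. The names MEMORY_RULE, match_rule, and analyze are hypothetical, the sample log line is synthetic rather than real mcelog output, and JSON is only one possible format for the saved result.

```python
import json

# Rule values taken from the memory-error example in the description.
MEMORY_RULE = {
    "fault_category": "system",
    "faulty_component": "Memory",
    "log_level": "critical",
    "keywords": ["Memory Controller", "Err"],
    "log_path": "/var/log/mcelog",
    "problem_description": "Memory controller failure",
    "solution": "Memory fault; replace the memory after confirming the specific memory location.",
}


def match_rule(rule, log_lines):
    """Return the log lines that contain every keyword of the rule (assumed criterion)."""
    return [line for line in log_lines if all(k in line for k in rule["keywords"])]


def analyze(rules, logs_by_path):
    """Apply each rule to the log collected from its log path and gather the matches."""
    results = []
    for rule in rules:
        hits = match_rule(rule, logs_by_path.get(rule["log_path"], []))
        if hits:
            results.append({
                "faulty_component": rule["faulty_component"],
                "problem_description": rule["problem_description"],
                "solution": rule["solution"],
                "matched_lines": hits,
            })
    return results


# Synthetic input for illustration only.
logs = {"/var/log/mcelog": ["synthetic line: Memory Controller reported Err", "unrelated line"]}
results = analyze([MEMORY_RULE], logs)
saved = json.dumps(results, indent=2)  # the analysis result, ready to be written to a file
```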

Beneficial effects of the present invention: the method obtains Linux operating system information and builds a fault rule base from the patterns and handling methods of routine faults. When a Linux operating system fails, the corresponding solution can be found by checking the information against the fault rule base, which greatly improves the efficiency of troubleshooting.

Brief description of the drawings

Figure 1 is a schematic diagram of the method of the present invention for automatically analyzing faults based on Linux operating system information.

Detailed description of the embodiments

To make the technical problem to be solved, the technical solution, and the advantages of the present invention clearer, the invention is described in detail below with reference to the drawing and an embodiment. It should be noted that the specific embodiment described here is intended only to explain the present invention and not to limit it.

The method for automatically analyzing faults based on Linux operating system information comprises the following steps:

(1) Obtain Linux operating system information;

Claims

1. A method for automatically analyzing faults based on Linux operating system information, characterized in that it comprises the following steps:

(1) Obtain Linux operating system information;

(2) Build a fault rule base organized by fault category and faulty component;

(3) Use the fault rules in the fault rule base to analyze the operating system information automatically; when a fault rule matches, output the problem description and the solution, and save the analysis result.
2. The method for automatically analyzing faults based on Linux operating system information according to claim 1, characterized in that: in step (1), the Linux operating system information includes CPU information, memory information, BIOS information, disk information, driver information, network interface card (NIC) information, BMC information, and RAID information.
3. The method for automatically analyzing faults based on Linux operating system information according to claim 2, characterized in that: CPU information, both summary and detailed, is collected with the lscpu, dmidecode -t processor, and cat /proc/cpuinfo commands; memory information with the free, dmidecode -t memory, and cat /proc/meminfo commands; BIOS information with the dmidecode -t bios command; disk information with the lsblk, lsscsi, df -h, mount, fdisk -l, and smartctl commands; driver information with the lsmod command; NIC information with the ifconfig and lspci commands; and BMC information with the ipmitool command; RAID information is collected with the tool specific to each RAID type.
4. The method for automatically analyzing faults based on Linux operating system information according to claim 1, characterized in that: in step (2), fault information and solutions are gathered on an ongoing basis, and the fault rule base fields are extracted from them; a random forest algorithm is then used to identify faults automatically and to mine the relationships between fault symptoms and fault rules; the automatically identified faults are reviewed by experts, and valid fault symptoms and their handling schemes are turned into fault rules and stored in the fault rule base.
5. The method for automatically analyzing faults based on Linux operating system information according to claim 4, characterized in that: the fault rule base fields are distilled from the fault information and solutions obtained from customer sites, the R&D department, the test department, and operations and maintenance personnel; in addition, the data in the training set is used to pinpoint the specific device and to analyze the cause of the fault in depth.
6. The method for automatically analyzing faults based on Linux operating system information according to claim 4 or 5, characterized in that: the fault rule base fields include machine model, operating system, fault category, faulty component, log level, log details, keywords, log path, problem description, and solution.
7. The method for automatically analyzing faults based on Linux operating system information according to claim 5, characterized in that: when the training-set data is used to pinpoint CPU and memory faults, the CPU events and memory events are read and mcelog is parsed to locate the faulty CPU and memory position; to locate PCIe faults, the PCIe events are read and matched to the corresponding slot information using the machine's silkscreen reference table; to locate the code segment that reported a Call Trace fault, the Call Trace event log is analyzed and the function call stack is extracted so that the cause of the fault can be analyzed in depth.
8. The method for automatically analyzing faults based on Linux operating system information according to claim 4, characterized in that: the random forest algorithm generates a forest of decision trees and merges the fault information; the decision trees vote on the fault symptom to determine the fault, and the corresponding solution is applied.
9. The method for automatically analyzing faults based on Linux operating system information according to claim 1, characterized in that: in step (3), when a memory error appears in the operating system information, the fault category is system; the faulty component is Memory; the log level is critical; the keywords are Memory Controller and Err; the log path is /var/log/mcelog; the problem description is memory controller failure; and the solution is: memory fault, replace the memory after confirming the specific memory location.