CN113296994A - Fault diagnosis system and method based on domestic computing platform - Google Patents

Fault diagnosis system and method based on domestic computing platform

Info

Publication number
CN113296994A
Authority
CN
China
Prior art keywords
fault
data
characteristic
module
computing platform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110540400.1A
Other languages
Chinese (zh)
Inventor
赵博颖
张力
郭申
张琨
孟飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Computer Technology and Applications
Original Assignee
Beijing Institute of Computer Technology and Applications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Computer Technology and Applications
Priority to CN202110540400.1A
Publication of CN113296994A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0775 Content or structure details of the error report, e.g. specific table structure, specific error fields
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Animal Behavior & Ethology (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

The invention relates to a fault diagnosis system and method based on a domestic computing platform, in the technical field of computer fault diagnosis. The invention acquires platform characteristic data through the standard IPMI protocol and a custom protocol, processes the characteristic data using several fusion strategies, and diagnoses faults using a fault knowledge base, combining several optimization strategies into a complete system and method for diagnosing computing platform faults.

Description

Fault diagnosis system and method based on domestic computing platform
Technical Field
The invention relates to the technical field of information security, in particular to a fault diagnosis system and method based on a domestic computing platform.
Background
At present, the complexity, integration and intelligence of domestic computing platform systems are continuously increasing, and the costs of development, production, and especially maintenance and support keep rising. Meanwhile, the growing number of constituent components and influencing factors gradually raises the probability of faults and functional failures across the whole computing platform. Against this background, almost all domestic computing systems urgently need accurate fault diagnosis of their equipment. However, the application of fault diagnosis technology to domestic computing platforms remains at a fairly basic level, while these platforms place high demands on the speed and accuracy of equipment fault diagnosis. A complete fault diagnosis system and method based on a domestic computing platform is therefore needed.
Disclosure of Invention
(I) Technical problem to be solved
The technical problem to be solved by the invention is: how to design a complete fault diagnosis system and method based on a domestic computing platform.
(II) Technical scheme
In order to solve the technical problem, the invention provides a fault diagnosis system based on a domestic computing platform, which comprises a platform characteristic data acquisition module, a data fusion processing module and a state monitoring and fault diagnosis module;
the platform characteristic data acquisition module is used for acquiring data of a specific part of a domestic computing platform in real time through a sensor;
the data fusion processing module is used for analyzing and processing the collected characteristic data and removing redundant meaningless characteristic parameters;
and the state monitoring and fault diagnosis module classifies the parameter indexes of the domestic computing platform to be monitored according to the data information obtained after fusion processing and the type information related to the domestic computing platform, performs similarity matching calculation on the classified information, and takes the information with the maximum matching degree as the analysis result.
Preferably, before feature extraction, the data fusion processing module first comprehensively processes the characteristic data of each module of the domestic computing platform, then determines the parameter indexes to be monitored so that they cover, to the greatest extent possible, the parameter features that can indicate whether equipment is faulty, and then starts characteristic attribute selection.
Preferably, the state monitoring and fault diagnosis module borrows the concept of an expert system and uses it to evaluate the health condition of the domestic computing platform from the state information data of each module of the platform, wherein the state monitoring and fault diagnosis module comprises a fault knowledge database, a platform fault reasoning submodule and a platform fault database management submodule; the fault knowledge database stores the data information and engineering technical data required for state monitoring and fault diagnosis of the computing platform, including fault cases, characteristic parameters, fault-related factors, fault phenomena and fault handling operations; the platform fault reasoning submodule implements fault-case-based reasoning, with different strategies set according to the damage mechanisms and data analysis requirements of the different modules of the domestic computing platform; the platform fault database management submodule provides management operations on the fault knowledge data, including addition, deletion, modification and query of the fault knowledge database.
Preferably, the platform characteristic data acquisition module defines a custom communication protocol on the operating system of the domestic computing platform, is responsible for extracting and packaging IPMI data, parses the data, classifies the parsed data, and transmits the result to the domestic computing platform.
The invention also provides a method for realizing fault diagnosis by using the fault diagnosis system, which comprises the following steps:
step one, constructing a fault knowledge database
The fault knowledge database comprises the following five parts: fault case number, fault phenomenon, fault location, system to which the fault belongs, and fault handling operation. During the design of each module of the domestic computing platform, the software, hardware and environmental factors that may cause, or are known to cause, a fault of a board card or of the whole machine are analyzed, a relational logic diagram is drawn, the fault cases and fault characteristic factors of each module are analyzed, and the fault knowledge is stored in the platform fault knowledge database in list form;
step two, when the domestic computing platform operates abnormally, the platform characteristic data acquisition module acquires the characteristic data of the domestic computing platform in real time through IPMI and the custom communication protocol, and the data fusion processing module fuses the characteristic data and then abstracts it into a list of related characteristic attributes, the data fusion process comprising two key processing operations: first, word segmentation, in which characteristic data in the form of complete character strings is decomposed into independent words and redundant data or features among the words are deleted; second, substitution, in which each entry produced by word segmentation is replaced by the corresponding entry in the established professional terminology;
step three, the state monitoring and fault diagnosis module classifies the characteristic attributes obtained in step two according to the fault data information and the type information related to the domestic computing platform, and then performs a preliminary matching retrieval; because fault phenomena are stored in the fault knowledge database as character strings, the preliminary matching uses a fuzzy query algorithm based on character string matching to search the platform fault knowledge database for fault cases similar to the current data information;
fourthly, the state monitoring and fault diagnosis module carries out similarity matching calculation on the classified information to obtain a similarity matrix b;
and step five, multiplying the obtained similarity matrix b by the weight value of the corresponding characteristic attribute by the state monitoring and fault diagnosis module to obtain a final result.
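Steps two and three can be illustrated with a minimal Python sketch. It is not the patented implementation: the glossary, the stop-token list, the sample fault cases and the 0.5 cutoff are all hypothetical, and the standard-library `difflib.SequenceMatcher` stands in for the unspecified fuzzy query algorithm based on character string matching.

```python
from __future__ import annotations

import re
from difflib import SequenceMatcher

# Hypothetical glossary mapping raw tokens to established professional terms.
TERM_GLOSSARY = {"temp": "temperature", "volt": "voltage", "psu": "power"}
# Hypothetical tokens treated as redundant and dropped after segmentation.
STOP_TOKENS = {"the", "is", "of", "at", "a"}


def segment(raw: str) -> list[str]:
    """Word segmentation: split a raw string into tokens and drop redundant ones."""
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    return [t for t in tokens if t not in STOP_TOKENS]


def substitute(tokens: list[str]) -> list[str]:
    """Substitution: replace each token by its glossary term when one exists."""
    return [TERM_GLOSSARY.get(t, t) for t in tokens]


def fuzzy_search(query: str, cases: dict[str, str], cutoff: float = 0.5):
    """Preliminary retrieval: rank stored fault phenomena by string similarity."""
    scored = [
        (case_id, SequenceMatcher(None, query, phenomenon).ratio())
        for case_id, phenomenon in cases.items()
    ]
    kept = [pair for pair in scored if pair[1] >= cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical fault phenomena stored as character strings.
cases = {
    "F001": "power module voltage too low",
    "F002": "fan speed abnormal",
}

attrs = substitute(segment("The PSU volt is too low"))
hits = fuzzy_search(" ".join(attrs), cases)
```

Here `attrs` is the list of normalized characteristic attributes and `hits` holds the candidate fault cases passed on to the similarity calculation of step four.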
Preferably, the principle by which the state monitoring and fault diagnosis module performs similarity matching calculation on the classified information is as follows: the characteristic attributes in the fault knowledge database are assumed to be distributed as points in an n-dimensional feature space according to a certain rule; a retrieval system built on these points finds similar points, i.e. similar characteristic information, by spatial distance once a characteristic attribute is input; the distance between points in the space serves as the evaluation scale, and weighting is used as the judgment method to determine the similarity between a new fault feature and the existing fault features in the fault knowledge database.
Preferably, the specific implementation manner of the state monitoring and fault diagnosing module performing similarity matching calculation on the classified information is as follows:
comparing the characteristic attributes of the classified information with the characteristic attributes of n fault cases in a fault knowledge database one by one to obtain a similarity matrix b:
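$$b = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{n1} & b_{n2} & \cdots & b_{nm} \end{bmatrix}$$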
n represents the nth fault case in the fault knowledge database, and m represents the mth characteristic attribute of fault case n, wherein the similarity $b_{nm}$ is calculated as follows:
let $x = \{x_1, x_2, \ldots, x_n\}$, where x is the set of all characteristic attributes of a fault case and $x_i$ ($1 \le i \le n$) is the ith characteristic attribute of that fault case; for two points (i.e. two fault cases) $x = \{x_1, x_2, \ldots, x_n\}$ and $y = \{y_1, y_2, \ldots, y_n\}$ in the n-dimensional feature space, the distance in feature space is:
Figure BDA0003071548480000042
$w_i$ represents the ith weight value in the fault case, and $x_i$ and $y_i$ are the ith characteristic attributes of fault cases x and y, respectively; when $x_i \ne y_i$, $a(x_i, y_i)$ takes the value 1, and when $x_i = y_i$, $a(x_i, y_i)$ takes the value 0; the distance formula gives the distance between the characteristic attribute $x_i$ of the classified fault case x and the characteristic attribute $y_i$ of the fault case y in the fault knowledge database; the similarity between x and y is finally calculated as:
Figure BDA0003071548480000043
replacing x with n and y with m gives $b_{nm}$.
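The distance and similarity formulas appear only as images in the source, so their exact form is not recoverable here. The sketch below therefore rests on two loudly stated assumptions: the distance is taken as the weighted sum of the 0/1 mismatch indicators $a(x_i, y_i)$, and the similarity is taken as one minus the distance normalized by the total weight. Only the definition of $a(x_i, y_i)$ and the use of weights $w_i$ come from the text.

```python
def mismatch(xi, yi) -> int:
    """a(x_i, y_i): 1 when the two attributes differ, 0 when they are equal."""
    return 0 if xi == yi else 1


def weighted_distance(x, y, w) -> float:
    """ASSUMED form: sum of w_i * a(x_i, y_i) over all attributes."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    return sum(wi * mismatch(xi, yi) for xi, yi, wi in zip(x, y, w))


def similarity(x, y, w) -> float:
    """ASSUMED form: 1 - distance / total weight, so identical cases score 1."""
    total = sum(w)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return 1.0 - weighted_distance(x, y, w) / total


# Hypothetical attribute tuples and weights.
new_case = ("power", "voltage", "low")
stored_case = ("power", "voltage", "high")
weights = (0.5, 0.3, 0.2)

dist = weighted_distance(new_case, stored_case, weights)
sim = similarity(new_case, stored_case, weights)
```

With these illustrative values only the third attribute differs, so the distance equals that attribute's weight; a different normalization would change the scale but not the ordering of candidate cases.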
Preferably, in step five, the state monitoring and fault diagnosing module multiplies the obtained similarity matrix b by the weight value of the corresponding characteristic attribute, and the final result is obtained as follows:
Figure BDA0003071548480000044
the matrix t is then statistically sorted to obtain X with the maximum matching degree as the analysis result, namely:
Figure BDA0003071548480000051
the result X is the best matching solution, and the fault case with the highest similarity can be extracted at this time.
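The weighting and ranking of step five can be sketched as follows. Because the formulas for t and X are images in the source, this assumes that each row of b holds the per-attribute similarities of one fault case, that t is the weighted row sum, and that X is the largest entry of t; the case identifiers and numbers are hypothetical.

```python
def rank_cases(b, w, case_ids):
    """Weight each row of the similarity matrix and sort cases by score."""
    if any(len(row) != len(w) for row in b):
        raise ValueError("every row of b must have one entry per weight")
    scores = [sum(bij * wj for bij, wj in zip(row, w)) for row in b]
    return sorted(zip(case_ids, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical 3 x 3 similarity matrix: rows are fault cases, columns attributes.
b = [
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
]
w = [0.5, 0.3, 0.2]

ranked = rank_cases(b, w, ["F001", "F002", "F003"])
best_id, best_score = ranked[0]
```

The first element of `ranked` plays the role of X, and its case identifier is the key used to look up the handling measures in the fault knowledge database.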
Preferably, after step five, the processing measures corresponding to the fault case are retrieved from the fault knowledge database, similar faults are analyzed, judged and modified, corrections are made according to the judgment conclusion and the proposed processing measures, and a solution to the current fault problem is found.
Preferably, if the fault information is not found in the fault knowledge database, correspondingly executing an adding operation, and adding corresponding new knowledge to the fault knowledge database by using an instance learning method, wherein the operation steps are as follows:
step 1, adding a new fault in a fault knowledge database;
step 2, finding a new fault diagnosis action interface;
step 3, adding a new diagnosis state for the diagnosis action;
step 4, adding a new diagnosis action, and circularly executing the step 3;
and step 5, adding a fault processing action.
(III) Advantageous effects
The invention acquires platform characteristic data through the standard IPMI protocol and a custom protocol, processes the characteristic data using several fusion strategies, and diagnoses faults using a fault knowledge base, combining several optimization strategies into a complete system and method for diagnosing computing platform faults.
Drawings
FIG. 1 is a schematic diagram of a fault diagnosis system of the present invention;
FIG. 2 is a schematic diagram of a fault knowledge base implementation of the present invention.
Detailed Description
In order to make the objects, contents, and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.
As shown in fig. 1 and fig. 2, the fault diagnosis system of the present invention performs fault diagnosis on a domestic computing platform, wherein the domestic computing platform is built on the domestic Phytium (Feiteng) platform and the Galaxy Kylin operating system, and is fully adapted to other domestic software and hardware platforms; the system comprises a platform characteristic data acquisition module, a data fusion processing module and a state monitoring and fault diagnosis module;
the platform characteristic data acquisition module acquires data of a specific part of a domestic computing platform in real time through a sensor;
the data fusion processing module analyzes and processes the collected characteristic data and removes redundant meaningless characteristic parameters;
and the state monitoring and fault diagnosis module is used for rapidly classifying the parameter indexes of the domestic computing platform to be monitored according to the data information after fusion processing and the type information related to the domestic computing platform and performing similarity matching calculation to obtain the parameter indexes with the maximum matching degree as an analysis result.
The domestic computing platform comprises a computing module, a power supply module, a switching module and a Baseboard Management Controller (BMC). The platform characteristic data acquisition module also transmits the acquired data to a specified location through a protocol. Acquisition of the characteristic data of each module of the domestic computing platform relies on the standard IPMI protocol and other custom protocols. The BMC plays the main role: it manages communication between the other modules of the domestic computing platform and the platform characteristic data acquisition module of the fault diagnosis system, and provides monitoring and control functions for the computing module, the power supply module and the switching module. Through the BMC, the platform characteristic data acquisition module can acquire the temperature, voltage, current, load, fan and other information of the computing module, the power supply module and the switching module in real time.
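As a hedged illustration of turning BMC sensor data into structured readings, the sketch below parses a pipe-separated text dump. The sample text, the column layout and the `SensorReading` type are assumptions made for this example; the source does not specify how the acquisition module receives or encodes IPMI sensor data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SensorReading:
    name: str
    value: Optional[float]  # None when the sensor reports no numeric value
    unit: str
    status: str


# Hypothetical dump: name | value | unit | status
SAMPLE_DUMP = """\
CPU Temp         | 45.000     | degrees C  | ok
12V              | 12.100     | Volts      | ok
FAN1             | na         | RPM        | na
"""


def parse_sensor_dump(text: str) -> List[SensorReading]:
    """Parse each non-empty line into a SensorReading, skipping malformed lines."""
    readings = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue
        name, raw_value, unit, status = fields[:4]
        try:
            value = float(raw_value)
        except ValueError:
            value = None  # e.g. "na" for an absent or unreadable sensor
        readings.append(SensorReading(name, value, unit, status))
    return readings


readings = parse_sensor_dump(SAMPLE_DUMP)
```

Keeping unreadable sensors as `None` rather than dropping them preserves the fact that a reading was expected, which may itself be a fault symptom.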
The data fusion processing module analyzes and processes characteristic data according to two strategies:
first, in a complex domestic computing platform the modules to be monitored have many characteristic parameter indexes, and without preprocessing the subsequent computation becomes more complex; therefore, before feature extraction, the characteristic data of each module of the domestic computing platform is comprehensively processed and redundant characteristic parameters are removed, then the parameter indexes to be monitored are determined so that they cover, to the greatest extent possible, the parameter features that can indicate whether equipment is faulty, and then feature selection starts;
second, several identical pieces of characteristic data sometimes appear during data extraction, and recording all of them in the log causes data redundancy; merging the identical pieces into one facilitates generation and analysis of the subsequent system log. In addition, when the system detects fault data never encountered before during monitoring, the new information is fused with the previous fault information, that is, the new fault category is written into the database and matched with a corresponding solution, so as to achieve intelligent operation and maintenance.
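The two strategies above can be sketched in a few lines of Python. The example assumes that a "redundant" characteristic parameter is one whose value never varies across the collected records, which is only one possible reading of the text; the record fields are hypothetical.

```python
from collections import Counter


def drop_constant_features(records):
    """Strategy one (assumed criterion): remove fields that never vary."""
    if not records:
        return []
    varying = [
        key for key in records[0]
        if len({record[key] for record in records}) > 1
    ]
    return [{key: record[key] for key in varying} for record in records]


def merge_duplicates(records):
    """Strategy two: merge identical records into one entry with a count."""
    counts = Counter(tuple(sorted(record.items())) for record in records)
    return [{**dict(items), "count": n} for items, n in counts.items()]


# Hypothetical characteristic data extracted from the platform.
raw_records = [
    {"module": "power", "sensor": "12V", "status": "low", "vendor": "X"},
    {"module": "power", "sensor": "12V", "status": "low", "vendor": "X"},
    {"module": "compute", "sensor": "CPU Temp", "status": "ok", "vendor": "X"},
]

fused = merge_duplicates(drop_constant_features(raw_records))
```

In this sample the `vendor` field is dropped because it is identical everywhere, and the two identical power records collapse into a single entry with `count` equal to 2.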
The state monitoring and fault diagnosis module borrows the concept of an expert system and uses it to evaluate the health condition of the domestic computing platform from the state information data of each module of the platform; it mainly comprises a platform fault knowledge database, a platform fault reasoning submodule and a platform fault database management submodule.
During operation of the computing platform, the BMC initializes the data acquisition sensors, and monitors and records data such as the internal hardware temperature, fan operation and CPU occupancy of the domestic computing platform. The system event log records all states within the observable time range, and the system can query the event log information periodically. Meanwhile, the platform characteristic data acquisition module defines a custom communication protocol on the operating system of the domestic computing platform, is responsible for extracting and packaging IPMI data, parses the data, classifies the parsed information, and transmits the result to the domestic computing platform.
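The source does not describe the layout of the custom communication protocol, so the following is a purely hypothetical frame format, shown only to make the "extract, package, parse" cycle concrete. The magic number, the field order and the field widths are all invented for this sketch.

```python
import struct

# Hypothetical frame: magic (2 bytes), module id (1), sensor type (1),
# value (4-byte float), all in network byte order with no padding.
FRAME_FORMAT = "!HBBf"
MAGIC = 0xFD01


def pack_reading(module_id: int, sensor_type: int, value: float) -> bytes:
    """Package one sensor reading into a fixed-size frame."""
    return struct.pack(FRAME_FORMAT, MAGIC, module_id, sensor_type, value)


def unpack_reading(frame: bytes):
    """Parse a frame back into (module_id, sensor_type, value)."""
    magic, module_id, sensor_type, value = struct.unpack(FRAME_FORMAT, frame)
    if magic != MAGIC:
        raise ValueError("unexpected frame header")
    return module_id, sensor_type, value


frame = pack_reading(module_id=2, sensor_type=1, value=45.5)
decoded = unpack_reading(frame)
```

A real protocol would likely also carry a length field, a timestamp and a checksum; they are omitted here to keep the sketch short.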
FIG. 2 is a flow chart of the operation of the condition monitoring and fault diagnosis module. The fault knowledge database stores data information and engineering technical data required by state monitoring and fault diagnosis of the computing platform, wherein the data information and the engineering technical data comprise fault cases, characteristic parameters, fault related factors, fault phenomena, fault processing operation and the like; the platform fault reasoning submodule is a reasoning method with different strategies set according to damage mechanisms and data analysis requirements of different modules of a domestic computing platform, and the reasoning method based on fault cases is mainly adopted; the platform fault database management sub-module mainly provides management operations for the fault knowledge database, including addition, deletion, modification, query and the like of the fault knowledge database.
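A minimal sketch of the fault knowledge database and its management operations is given below, using Python's standard `sqlite3` module. The table and column names are assumptions; the five columns follow the five parts named in step one of the method (fault case number, fault phenomenon, fault location, system to which the fault belongs, and fault handling operation). The source does not say which database engine is used.

```python
import sqlite3


def open_knowledge_base(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) fault_case table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS fault_case (
               case_no    TEXT PRIMARY KEY,
               phenomenon TEXT NOT NULL,
               location   TEXT,
               subsystem  TEXT,
               handling   TEXT
           )"""
    )
    return conn


def add_case(conn, case_no, phenomenon, location, subsystem, handling):
    conn.execute(
        "INSERT INTO fault_case VALUES (?, ?, ?, ?, ?)",
        (case_no, phenomenon, location, subsystem, handling),
    )
    conn.commit()


def update_handling(conn, case_no, handling):
    conn.execute(
        "UPDATE fault_case SET handling = ? WHERE case_no = ?",
        (handling, case_no),
    )
    conn.commit()


def delete_case(conn, case_no):
    conn.execute("DELETE FROM fault_case WHERE case_no = ?", (case_no,))
    conn.commit()


def find_cases(conn, keyword):
    """Query cases whose phenomenon contains the keyword."""
    cursor = conn.execute(
        "SELECT case_no, phenomenon, handling FROM fault_case "
        "WHERE phenomenon LIKE ? ORDER BY case_no",
        (f"%{keyword}%",),
    )
    return cursor.fetchall()


kb = open_knowledge_base()
add_case(kb, "F001", "12V rail voltage low", "power board",
         "power supply module", "replace power module")
add_case(kb, "F002", "fan speed zero", "fan tray",
         "computing module", "check fan connector")
update_handling(kb, "F002", "replace fan")
voltage_hits = find_cases(kb, "voltage")
```

The `LIKE` query is only a crude stand-in for the fuzzy string matching used in the preliminary retrieval; it is included to show the query operation of the management submodule.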
The fault diagnosis method realized by the fault diagnosis system comprises the following steps:
step one, constructing a fault knowledge database. The computing platform fault knowledge database mainly comprises the following five parts: fault case number, fault phenomenon, fault location, system to which the fault belongs, and fault handling operation. In the design process of each module of a domestic computing platform, various software and hardware, environments and other factors which may or are known to cause the fault of a board card or a whole computer are analyzed, a relational logic diagram is drawn, common fault cases and fault characteristic factors of each module are analyzed, and fault knowledge is stored in a platform fault knowledge database in a list form;
step two, when the domestic computing platform operates abnormally, the platform characteristic data acquisition module acquires the characteristic data of the domestic computing platform in real time through IPMI and the custom communication protocol, and the data fusion processing module fuses the characteristic data and then abstracts it into a list of related characteristic attributes. The data fusion process mainly comprises two key processing operations: first, word segmentation, in which characteristic data such as complete character strings is decomposed into independent words and redundant data or features among the words are deleted; second, substitution, in which each entry produced by word segmentation is replaced by the corresponding entry in the established professional terminology;
step three, the state monitoring and fault diagnosis module rapidly classifies the characteristic attributes obtained in step two according to the fault data information and the type information related to the domestic computing platform, and then performs a preliminary matching retrieval that focuses on the similarity of fault phenomena. Because fault phenomena are stored in the fault knowledge database as character strings, the preliminary matching uses a fuzzy query algorithm based on character string matching to search the platform fault knowledge database for fault cases similar to the current data information;
step four, the state monitoring and fault diagnosis module performs similarity matching calculation on the classified information. The main principle is as follows: the characteristic attributes in the fault knowledge database are assumed to be distributed as points in an n-dimensional feature space according to a certain rule, and a retrieval system is built on these points; once a characteristic attribute is input, similar points, i.e. similar characteristic information, are quickly found by spatial distance. The distance between points in the space serves as the evaluation scale, and weighting is used as the judgment method to determine the similarity between the new fault feature and the existing fault features in the fault knowledge database. The specific implementation is as follows:
comparing the characteristic attributes of the classified information with the characteristic attributes of n fault cases in a fault knowledge database one by one to obtain a similarity matrix b:
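$$b = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{n1} & b_{n2} & \cdots & b_{nm} \end{bmatrix}$$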
n represents the nth fault case in the fault knowledge database, and m represents the mth characteristic attribute of fault case n. The similarity $b_{nm}$ is calculated as follows:
let $x = \{x_1, x_2, \ldots, x_n\}$, where x is the set of all characteristic attributes of a fault case and $x_i$ ($1 \le i \le n$) is the ith characteristic attribute of that fault case. For two points (i.e., two fault cases) $x = \{x_1, x_2, \ldots, x_n\}$ and $y = \{y_1, y_2, \ldots, y_n\}$ in the n-dimensional feature space, the distance in feature space is:
Figure BDA0003071548480000092
$w_i$ represents the ith weight value in the fault case, and $x_i$ and $y_i$ ($1 \le i \le n$) are the ith characteristic attributes of fault cases x and y, respectively. When $x_i \ne y_i$, $a(x_i, y_i)$ takes the value 1; when $x_i = y_i$, $a(x_i, y_i)$ takes the value 0. The distance formula gives the distance between the characteristic attribute $x_i$ of the classified fault case x and the characteristic attribute $y_i$ of the fault case y in the fault knowledge database. The final similarity between x and y is calculated as:
Figure BDA0003071548480000093
replacing x with n and y with m gives $b_{nm}$.
Step five, the state monitoring and fault diagnosis module multiplies the obtained similarity matrix b by the weight value of the corresponding characteristic attribute to obtain a final result:
Figure BDA0003071548480000094
the matrix t is then statistically sorted to obtain X with the maximum matching degree as the analysis result, namely:
Figure BDA0003071548480000101
The result X is the best matching solution, and the fault case with the highest similarity can then be extracted. The processing measures corresponding to that fault case are then retrieved from the fault knowledge database, similar faults are analyzed, judged and modified, corrections are made according to the judgment conclusion and the proposed processing measures, and a solution to the current fault problem is found.
Meanwhile, the platform fault database management submodule mainly adds to, modifies and deletes from the fault knowledge database. First, an initial fault knowledge database is established, and information such as historical debugging data and engineering experience is entered. When the system starts diagnosis, the processed characteristic attributes of each module of the domestic computing platform are input to the platform fault reasoning submodule, which runs reasoning and judgment as the diagnosis process requires, performs similarity matching against the relevant fault information in the fault knowledge database, and evaluates the health condition of each module of the domestic computing platform. If a fault occurs and the fault knowledge database holds no information on it, an add operation is executed, and the corresponding new knowledge is added to the fault knowledge database by an example learning method. The main operation steps are as follows:
step 1, adding a new fault in a fault knowledge database;
step 2, finding a new fault diagnosis action interface;
step 3, adding a new diagnosis state for the diagnosis action;
step 4, adding a new diagnosis action, and circularly executing the step 3;
and step 5, adding a fault processing action.
The modify and delete operations in the fault knowledge database are similar to the above operational steps.
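The "match, otherwise add" behaviour described above can be sketched as follows. This is a simplification rather than the five-step procedure itself: the diagnosis actions and diagnosis states are collapsed into a single handling string, the similarity is a plain fraction of matching attributes, and the 0.6 threshold for deciding that a fault is unknown is an assumption not found in the source.

```python
def attribute_overlap(a, b) -> float:
    """Fraction of positions at which two attribute tuples agree."""
    longest = max(len(a), len(b)) or 1
    return sum(1 for x, y in zip(a, b) if x == y) / longest


def diagnose_or_learn(knowledge, attrs, handling, threshold=0.6):
    """Return (case_id, learned). Add a new case when nothing matches well."""
    attrs = tuple(attrs)
    best_id, best_score = None, 0.0
    for case_id, case in knowledge.items():
        score = attribute_overlap(attrs, case["attrs"])
        if score > best_score:
            best_id, best_score = case_id, score
    if best_id is not None and best_score >= threshold:
        return best_id, False
    # No sufficiently similar case: store the new fault and its handling.
    new_id = f"F{len(knowledge) + 1:03d}"
    knowledge[new_id] = {"attrs": attrs, "handling": handling}
    return new_id, True


# Hypothetical in-memory knowledge base with one known case.
knowledge = {
    "F001": {"attrs": ("power", "voltage", "low"),
             "handling": "replace power module"},
}

known = diagnose_or_learn(knowledge, ("power", "voltage", "low"), handling="")
learned = diagnose_or_learn(knowledge, ("fan", "speed", "zero"),
                            handling="check fan connector")
```

The first call finds the existing case, while the second falls below the threshold and is stored as a new case, which is the situation the add operation is meant to handle.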
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A fault diagnosis system based on a domestic computing platform is characterized by comprising a platform characteristic data acquisition module, a data fusion processing module and a state monitoring and fault diagnosis module;
the platform characteristic data acquisition module is used for acquiring data of a specific part of a domestic computing platform in real time through a sensor;
the data fusion processing module is used for analyzing and processing the collected characteristic data and removing redundant meaningless characteristic parameters;
and the state monitoring and fault diagnosis module classifies the parameter indexes of the domestic computing platform to be monitored according to the data information obtained after fusion processing and the type information related to the domestic computing platform, performs similarity matching calculation on the classified information, and takes the information with the maximum matching degree as the analysis result.
2. The system of claim 1, wherein, before feature extraction, the data fusion processing module first comprehensively processes the characteristic data of each module of the domestic computing platform, then determines the parameter indexes to be monitored so that they cover, to the maximum extent, the parameter features capable of indicating whether the equipment is faulty, and then starts characteristic attribute selection.
3. The system of claim 1, wherein the state monitoring and fault diagnosis module uses an expert-system approach to evaluate the health condition of the domestic computing platform according to the state information data of each of its modules, and comprises a fault knowledge database, a platform fault reasoning submodule and a platform fault database management submodule; the fault knowledge database stores the data information and engineering technical data required for state monitoring and fault diagnosis of the computing platform, including fault cases, characteristic parameters, fault related factors, fault phenomena and fault processing operations; the platform fault reasoning submodule applies reasoning based on fault cases, with different strategies set according to the damage mechanisms and data analysis requirements of the different modules of the domestic computing platform; the platform fault database management submodule provides management operations for the fault knowledge data, including addition, deletion, modification and query of the fault knowledge database.
4. The system of claim 1, wherein the platform characteristic data acquisition module defines a communication protocol on the operating system of the domestic computing platform, is responsible for extracting and packaging IPMI data, parses the data and classifies the parsed information, and transmits the result to the domestic computing platform.
5. A method for performing fault diagnosis using the fault diagnosis system according to any one of claims 1 to 4, comprising the steps of:
step one, constructing a fault knowledge database
The fault knowledge database comprises the following five parts: in the design process of each module of the domestic computing platform, the various software, hardware and environmental factors that may cause, or are known to cause, a fault of a board card or of the whole machine are analyzed, a relational logic diagram is drawn, the fault cases and fault characteristic factors of each module are analyzed, and the fault knowledge is stored in list form in the platform fault knowledge database;
step two, when the domestic computing platform operates abnormally, the platform characteristic data acquisition module acquires characteristic data information of the domestic computing platform in real time through IPMI and a custom communication protocol, and the data fusion processing module performs data abstraction on the characteristic data after fusion processing to obtain a list of related characteristic attributes, wherein the data fusion process comprises two key processing operations: first, word segmentation processing, in which the characteristic data in the form of a complete character string are decomposed into independent words and redundant data or features among those words are deleted; second, substitution processing, in which the entries produced by word segmentation are replaced by the corresponding entries in the established set of professional terms;
step three, the state monitoring and fault diagnosis module classifies the characteristic attributes obtained in step two according to the fault data information and the type information related to the domestic computing platform, and then performs preliminary matching retrieval; considering that fault phenomena are stored in the fault knowledge database as character strings, the preliminary matching process uses a fuzzy query algorithm based on character string matching to search the platform fault knowledge database for fault cases similar to the current data information;
step four, the state monitoring and fault diagnosis module performs similarity matching calculation on the classified information to obtain a similarity matrix b;
step five, the state monitoring and fault diagnosis module multiplies the obtained similarity matrix b by the weight values of the corresponding characteristic attributes to obtain the final result.
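The word segmentation, substitution, and preliminary fuzzy retrieval of steps two and three might look roughly like the sketch below. It is a hedged illustration only: the term table `TERM_MAP`, the redundant-token set `STOP_TOKENS`, the function names, and the sample strings are all hypothetical, and Python's standard `difflib.get_close_matches` merely stands in for the unspecified "fuzzy query algorithm based on character string matching".

```python
import difflib
import re

# Hypothetical term table: raw token -> established professional term.
TERM_MAP = {"temp": "temperature", "volt": "voltage", "cpu0": "cpu"}
# Hypothetical tokens treated as redundant and removed after segmentation.
STOP_TOKENS = {"the", "is", "of", "a"}


def abstract_features(raw: str) -> list[str]:
    """Segment a raw feature string, drop redundant tokens, substitute terms."""
    tokens = re.findall(r"[a-z0-9]+", raw.lower())  # word segmentation
    attributes: list[str] = []
    for token in tokens:
        if token in STOP_TOKENS:                     # delete redundant words
            continue
        term = TERM_MAP.get(token, token)            # substitution processing
        if term not in attributes:                   # delete duplicate features
            attributes.append(term)
    return attributes


def preliminary_match(phenomenon: str, stored: list[str],
                      cutoff: float = 0.5) -> list[str]:
    """Return stored fault phenomena whose strings are close to the query."""
    return difflib.get_close_matches(phenomenon, stored, n=3, cutoff=cutoff)


attributes = abstract_features("The CPU0 temp is high")
candidates = preliminary_match(
    " ".join(attributes),
    ["cpu temperature too high", "fan speed zero"],
)
```

The candidates returned by the preliminary retrieval would then be passed to the similarity calculation of step four; the cutoff value is a tuning parameter chosen here only for illustration.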
6. The method of claim 5, wherein the similarity matching calculation performed on the classified information by the state monitoring and fault diagnosis module is implemented according to the following principle: the characteristic attributes in the fault knowledge database are assumed to be distributed as points in an n-dimensional feature space according to a certain rule; after characteristic attributes are input, a retrieval system built on this point-based characteristic information finds similar points, i.e., similar characteristic information, according to spatial distance, wherein the distance between points in the space serves as the evaluation scale and weighting serves as the judgment method for determining the similarity between a new fault characteristic and the existing fault characteristics in the fault knowledge database.
7. The method of claim 6, wherein the similarity matching calculation of the classified information by the condition monitoring and fault diagnosis module is implemented as follows:
comparing the characteristic attributes of the classified information with the characteristic attributes of the n fault cases in the fault knowledge database one by one to obtain a similarity matrix b:
Figure FDA0003071548470000031
where n denotes the n-th fault case in the fault knowledge database, m denotes the m-th characteristic attribute of fault case n, and the similarity $b_{nm}$ is calculated as follows:
let $x = \{x_1, x_2, \ldots, x_n\}$, where $x$ is the set of all characteristic attributes of a fault case and $x_i$ is the $i$-th characteristic attribute of the fault case, $1 \le i \le n$; for two points (i.e., two fault cases) $x = \{x_1, x_2, \ldots, x_n\}$ and $y = \{y_1, y_2, \ldots, y_n\}$ in the n-dimensional feature space, the distance in the feature space is:
Figure FDA0003071548470000032
$w_i$ denotes the $i$-th weight value in the fault case, and $x_i$ and $y_i$ are the $i$-th characteristic attributes of fault cases $x$ and $y$, respectively; when $x_i \neq y_i$, $a(x_i, y_i)$ takes the value 1, and when $x_i = y_i$, $a(x_i, y_i)$ takes the value 0; the distance formula gives the distance between the characteristic attribute $x_i$ of the classified fault case $x$ and the characteristic attribute $y_i$ of fault case $y$ in the fault knowledge database, and the similarity between $x$ and $y$ is finally calculated as:
Figure FDA0003071548470000041
replacing $x$ with $n$ and $y$ with $m$ yields $b_{nm}$.
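A weighted mismatch distance of the kind described in this claim can be sketched as follows. The formula images for the distance and the similarity are not reproduced in this text, so their exact form is not known here. The sketch therefore assumes, purely for illustration, a weighted sum $d(x, y) = \sum_i w_i \, a(x_i, y_i)$ and a normalized similarity $1 - d(x, y) / \sum_i w_i$. Only the indicator $a(x_i, y_i)$ and the weights $w_i$ come from the claim text; the function names are hypothetical.

```python
def mismatch(xi, yi) -> int:
    """Indicator a(x_i, y_i): 1 when the attributes differ, 0 when they are equal."""
    return 0 if xi == yi else 1


def weighted_distance(x, y, w) -> float:
    """Assumed distance: sum of the weights of the attributes that differ."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    return sum(wi * mismatch(xi, yi) for xi, yi, wi in zip(x, y, w))


def similarity(x, y, w) -> float:
    """Assumed similarity in [0, 1]: 1 minus the distance normalized by total weight."""
    total = sum(w)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return 1.0 - weighted_distance(x, y, w) / total


new_case = ["cpu", "temperature", "high"]
known_case = ["cpu", "temperature", "low"]
weights = [5, 3, 2]  # hypothetical attribute weights
score = similarity(new_case, known_case, weights)
```

Under these assumptions, identical cases score 1 and cases that differ in every attribute score 0, so a larger value means a closer match.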
8. The method according to claim 7, wherein, in step five, the state monitoring and fault diagnosis module multiplies the obtained similarity matrix b by the weight value of the corresponding characteristic attribute to obtain the following final result:
Figure FDA0003071548470000042
the matrix t is then statistically sorted to obtain X, the entry with the maximum matching degree, as the analysis result, namely:
Figure FDA0003071548470000043
the result X is the best matching solution, and the fault case with the highest similarity can be extracted at this time.
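The weighting and selection in this claim can be illustrated with a short sketch. Because the formula images for t and X are not reproduced here, the sketch assumes that each row of the similarity matrix b holds the per-attribute similarities of one fault case, that t is the weighted sum of each row, and that X is the largest entry of t. The function name `rank_cases` and the sample numbers are hypothetical.

```python
def rank_cases(b, weights):
    """Weight each row of similarity matrix b and return (best_index, scores)."""
    if not b:
        raise ValueError("similarity matrix is empty")
    scores = []
    for row in b:
        if len(row) != len(weights):
            raise ValueError("each row needs one entry per weight")
        # Assumed form of t: weighted sum of the per-attribute similarities.
        scores.append(sum(s * w for s, w in zip(row, weights)))
    # Assumed form of X: the fault case with the highest weighted score.
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


b = [
    [1, 0, 1],  # fault case 0: attributes 1 and 3 match
    [1, 1, 0],  # fault case 1: attributes 1 and 2 match
    [0, 0, 1],  # fault case 2: only attribute 3 matches
]
best, t = rank_cases(b, [5, 3, 2])
```

In this example the second fault case wins because its matching attributes carry the largest combined weight; ties are resolved in favor of the earlier case, which is a choice of this sketch rather than something the source specifies.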
9. The method of claim 7, further comprising, after step five, retrieving from the fault knowledge database the handling measures for the corresponding fault case, analyzing and judging the fault, correcting it based on that judgment and the proposed handling measures, and thereby finding a solution to the current fault problem.
10. The method as claimed in claim 5, wherein, if the fault information is not found in the fault knowledge database, an add operation is performed and corresponding new knowledge is added to the fault knowledge database by an instance learning method, with the following operation steps:
step 1, adding a new fault in a fault knowledge database;
step 2, finding a new fault diagnosis action interface;
step 3, adding a new diagnosis state for the diagnosis action;
step 4, adding a new diagnosis action, and circularly executing the step 3;
step 5, adding a fault processing action.
CN202110540400.1A 2021-05-18 2021-05-18 Fault diagnosis system and method based on domestic computing platform Pending CN113296994A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110540400.1A CN113296994A (en) 2021-05-18 2021-05-18 Fault diagnosis system and method based on domestic computing platform

Publications (1)

Publication Number Publication Date
CN113296994A true CN113296994A (en) 2021-08-24

Family

ID=77322653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110540400.1A Pending CN113296994A (en) 2021-05-18 2021-05-18 Fault diagnosis system and method based on domestic computing platform

Country Status (1)

Country Link
CN (1) CN113296994A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113838565A (en) * 2021-09-24 2021-12-24 中国科学院近代物理研究所 Intelligent operation and maintenance device and method for controlling medical heavy ion accelerator

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102765643A (en) * 2012-05-31 2012-11-07 天津大学 Elevator fault diagnosis and early-warning method based on data drive
US20160041070A1 (en) * 2014-08-05 2016-02-11 01dB-METRAVIB, Société par Actions Simplifiée Automatic Rotating-Machine Fault Diagnosis With Confidence Level Indication
WO2019228317A1 (en) * 2018-05-28 2019-12-05 华为技术有限公司 Face recognition method and device, and computer readable medium
CN112016471A (en) * 2020-08-27 2020-12-01 杭州电子科技大学 Rolling bearing fault diagnosis method under incomplete sample condition
CN112202741A (en) * 2020-09-23 2021-01-08 山西省工业设备安装集团有限公司 Gateway device based on small signal analysis and automatic identification communication bus and protocol
CN112270312A (en) * 2020-11-26 2021-01-26 中南林业科技大学 Fan bearing fault diagnosis method and system, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination