CN117131859A

CN117131859A - Equipment fault mode extraction and identification method based on text mining technology

Info

Publication number: CN117131859A
Application number: CN202310991214.9A
Authority: CN
Inventors: 杨军; 王宁
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2023-08-08
Filing date: 2023-08-08
Publication date: 2023-11-28

Abstract

The invention provides an equipment fault mode extraction and identification method based on text mining technology. First, maintenance text data are preprocessed by unifying the data encoding format, removing useless characters, segmenting words, and similar operations. Second, the preprocessed fault mode data are taken as input, the TF-IDF algorithm is used for text vectorization and feature extraction, and the K-Means clustering algorithm is used to obtain fault mode category labels. Then, the preprocessed fault phenomenon data are taken as input, the TF-IDF algorithm is again used for text vectorization and feature extraction, and an equipment fault mode identification framework is constructed based on six machine learning classification algorithms, thereby establishing the correspondence between fault phenomena and fault modes. The method makes full use of equipment maintenance text, uses text mining to extract the valuable information contained in text files, and can effectively overcome the shortage of usable quantitative data in the equipment development stage, which makes fault mode analysis difficult to carry out.

Description

Equipment fault mode extraction and identification method based on text mining technology

Technical Field

The invention provides an equipment fault mode extraction and identification method based on text mining technology. It addresses the shortage of usable quantitative data in the development stage of large equipment, which makes fault mode analysis difficult to carry out effectively. The method makes full use of equipment maintenance text and uses text mining to perform fault mode extraction and machine-learning-based fault mode identification, so that equipment faults can be located quickly and maintenance scheduling decisions can be made effectively. The method is applicable to equipment fault mode analysis and related fields.

Background

Reliability is an important technical attribute for measuring the operational effectiveness of weapon equipment and complex systems. With the rapid development of science and technology, high reliability and long service life have become general requirements for the development, production, and service of equipment. Advanced and rigorous reliability verification and comprehensive evaluation have therefore become a fundamental basis for equipment design finalization and, in turn, for life cycle management decisions. Fault mode analysis and fault mode identification are important links in reliability assessment: they provide the fault type information that the assessment requires, and their accuracy and efficiency directly determine the validity of the assessment result. However, large equipment systems are characterized by high test cost, long development cycles, and scarce field test data, so the quantitative data that can be used directly for fault mode analysis are insufficient. How to extract other data quickly and accurately, and thereby alleviate the shortage of field test data in the equipment test and evaluation stage, is therefore of great significance for carrying out equipment fault mode analysis and identification effectively.

During the development of equipment systems, a large amount of data has not been used effectively. In fact, an equipment system accumulates a large amount of text in the development stage, and content such as maintenance reports provides usable data for fault mode analysis. However, how to extract valid fault modes from the recorded text and establish the relationship between fault phenomena and fault modes, so as to guide maintenance personnel in quickly locating the fault, has become a new challenge. In recent years, thanks to the development of big data analysis technology, text mining methods have been widely studied and developed. Such methods use intelligent algorithms to extract the valuable information contained in text files and to organize it, thereby mining knowledge from text. Text mining therefore provides a powerful tool for problems such as insufficient test data and the unknown association between fault phenomena and fault modes in equipment fault mode analysis.

Therefore, to address the shortage of usable quantitative data in the equipment development stage and the resulting difficulty of fault mode analysis, the invention makes full use of maintenance text and uses text mining to perform fault mode extraction and machine-learning-based fault mode identification. By effectively extracting the fault information in the text, the correspondence between fault phenomena and fault modes is established, so that the relevant personnel can be guided to locate faults quickly and arrange maintenance activities.

Disclosure of Invention

The purpose of the invention is as follows: to address the shortage of usable quantitative data in the equipment development stage and the resulting difficulty of fault mode analysis, the invention makes full use of maintenance text, applies an intelligent extraction algorithm and machine learning technology, and provides an equipment fault mode extraction and identification method based on text mining technology, thereby establishing the correspondence between fault phenomena and fault modes and supporting rapid fault location and maintenance scheduling decisions.

The technical scheme is as follows:

Based on the above approach, the invention provides an equipment fault mode extraction and identification method based on text mining technology, with the following specific implementation steps:

Step one: preprocessing text data;

Unlike English text, Chinese text has no spaces between words and contains a large amount of useless information. Therefore, to mine the text effectively, the maintenance text is first cleaned and preprocessed to obtain structured text data. The specific flow is as follows:

(1) Unifying data encoding formats;

Chinese text often comes in several encoding formats, which seriously reduces data processing efficiency. Because UTF-8 is widely supported across programming languages, the invention uses UTF-8 as the standard encoding format and converts all Chinese text to it.

(2) Removing useless characters;

Punctuation marks are usually used to break sentences in Chinese text, Arabic numerals carry model information and some quantitative information, and English letters carry information on model-specific components. These three kinds of characters are therefore retained, and all other useless characters in the maintenance text are removed.

(3) Word segmentation and part-of-speech tagging;

Word segmentation and part-of-speech tagging are performed on the maintenance record text using Jieba. To improve the segmentation quality and the accuracy of part-of-speech tagging, the invention combines the stop word list published by Harbin Institute of Technology with the built-in vocabulary of Jieba to serve as the stop word list and the custom dictionary.

Applying the above steps to the maintenance record text of the equipment system yields the preprocessing results for the fault phenomenon and fault mode data, which serve as input to the subsequent fault mode extraction and identification.
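As an illustration only, operations (1) and (2) of step one might be sketched in Python as follows. The list of candidate encodings, the set of retained punctuation marks, and the function names `to_unified_text` and `strip_useless_chars` are assumptions made for this sketch and are not specified by the invention. Word segmentation with Jieba is not shown.

```python
import re

# Encodings to try in order. This list is an assumption for illustration.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030")

# Matches every character that is NOT a CJK ideograph, an English letter,
# an Arabic numeral, or one of a small set of punctuation marks.
# The punctuation set is an assumption for illustration.
USELESS_CHARS = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9，。、；：？！,.;:?!]")


def to_unified_text(raw: bytes) -> str:
    """Decode raw bytes by trying each candidate encoding in turn.

    The returned str can then be written out uniformly as UTF-8.
    """
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes.
    return raw.decode("utf-8", errors="replace")


def strip_useless_chars(text: str) -> str:
    """Remove every character matched by USELESS_CHARS."""
    return USELESS_CHARS.sub("", text)
```

For example, `strip_useless_chars(to_unified_text(raw_bytes))` would give a cleaned string ready for segmentation, where `raw_bytes` is a hypothetical maintenance record read in binary mode.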

Step two: clustering fault modes;

(1) Fault mode word vector conversion;

Text data is unstructured and cannot be processed directly by a computer; it must first be converted into a set of vectors that express the semantics of the text. Word vector conversion is therefore first performed on the fault mode text based on the TF-IDF algorithm, with the following specific steps:

(a) The term frequency is first extracted from the word segmentation result. It represents the number of times a word appears in a document; to eliminate the influence of the document's length, it is generally defined as:

TF(t,d)＝df(t)/N, (1)

where TF(t,d) represents the term frequency of word t in document d, df(t) represents the number of times word t appears in the document, and N represents the total number of words in the document.

(b) The inverse document frequency is calculated; this parameter represents the importance of a word. The more common a word is, the larger the denominator and the closer the inverse document frequency is to 0. The calculation is as follows:

where IDF(t) represents the inverse document frequency of word t and n represents the total number of documents in the corpus.

(c) The TF-IDF is calculated as follows:

TF-IDF(t)＝TF(t,d)×IDF(t). (3)

Based on the above steps, word vector conversion is performed on the preprocessed fault mode text to obtain the fault mode feature matrix, which serves as input to the subsequent fault mode clustering.
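Steps (a) to (c) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the invention's implementation: the inverse document frequency is taken here to be IDF(t) = log(n / n_t), where n_t, the number of documents containing word t, is a symbol introduced for this sketch. Smoothed variants such as log(n / (n_t + 1)) are also common, and the exact form used by the invention may differ. The function name `tf_idf_matrix` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_matrix(documents):
    """Build a TF-IDF feature matrix from a list of token lists.

    Returns (vocabulary, matrix), where matrix[i][j] is the TF-IDF
    weight of vocabulary[j] in documents[i].
    """
    n = len(documents)
    vocabulary = sorted({token for doc in documents for token in doc})

    # n_t: number of documents that contain each term (assumed IDF denominator).
    doc_count = Counter(token for doc in documents for token in set(doc))
    idf = {t: math.log(n / doc_count[t]) for t in vocabulary}

    matrix = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        # TF = occurrences of t in the document / total words in the document.
        row = [(counts[t] / total) * idf[t] if total else 0.0 for t in vocabulary]
        matrix.append(row)
    return vocabulary, matrix
```

Under this assumed form, a word that occurs in every document receives a weight of 0, which matches the description that frequent words have an inverse document frequency close to 0.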

(2) K-Means-based fault mode clustering;

The fault mode feature matrix obtained in (1) is clustered using the K-Means clustering algorithm to obtain the potential fault mode categories. The specific process is as follows:

(a) Selecting the number k of categories to be clustered, and selecting k center points;

(b) For each sample point, finding the nearest center point, and grouping the points nearest to the same center point into one class, which completes one round of clustering;

(c) Judging whether the class assignments of the sample points are the same before and after this round of clustering; if so, the algorithm stops; otherwise, proceeding to step (d);

(d) For each class, calculating the center point of its sample points and taking it as the new center point of that class; then returning to step (b);

Through the above steps, the potential fault mode categories of the equipment are obtained and used as the class labels for the corresponding fault phenomenon feature matrix.
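Steps (a) to (d) can be sketched in plain Python as follows. The sketch makes several assumptions that the text does not specify: the initial center points are drawn at random from the samples, distance is squared Euclidean distance, and a class that becomes empty keeps its previous center. The names `k_means`, `seed`, and `max_iter` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, seed=0, max_iter=100):
    """Cluster points into k classes; returns (labels, centers)."""
    rng = random.Random(seed)
    # (a) Select k initial center points (here: sampled from the data).
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # (b) Assign every sample point to its nearest center point.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # (c) Stop when the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # (d) Recompute each center as the mean of its class members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

In practice each row of the fault mode feature matrix would be passed as one point, and the returned labels would serve as the fault mode category labels.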

Step three: machine-learning-based fault mode identification;

After fault mode clustering is complete, the goal is for the relevant personnel to be able to determine the fault mode type quickly once a fault phenomenon is observed, so that maintenance management can be carried out efficiently. Based on the fault phenomena corresponding to each fault mode, the invention mines and establishes the relationship between the two to provide effective maintenance guidance. The specific operations are as follows:

(1) Fault phenomenon word vector conversion;

To process fault phenomena effectively, they must likewise be converted by text mining into a set of vectors that express the semantics of the text. Here, TF-IDF is again applied to the preprocessed fault phenomenon data for word vector conversion and feature extraction. The operation flow is the same as operation (1) of step two and is not repeated here.

Based on this operation, the fault phenomenon feature matrix is obtained, which serves as input for building the subsequent fault mode identification framework.

(2) Constructing a fault mode identification framework;

After the fault phenomenon feature matrix has been extracted, it is combined with the corresponding fault mode types to build a fault mode identification framework based on machine learning classification algorithms. Classifiers commonly used in machine learning include KNN, SVM, decision tree, naive Bayes, random forest, and AdaBoost. The invention therefore builds on these six classifiers to develop the optimal fault mode identification framework.
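To make the idea of mapping a fault phenomenon feature vector to a fault mode label concrete, the simplest of the six classifiers, KNN, can be sketched in plain Python. This is only an illustrative sketch: the function name `knn_predict`, the default k = 3, and the use of squared Euclidean distance are assumptions, and the invention does not state how its classifiers are implemented or configured.

```python
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Predict the label of query by majority vote among its k nearest neighbors."""
    ranked = sorted(
        zip(train_features, train_labels),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Here `train_features` would hold rows of the fault phenomenon feature matrix and `train_labels` the fault mode categories obtained from clustering.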

To verify and compare the effect of the proposed algorithms, three indexes most commonly used in machine learning classification are adopted: accuracy, recall, and F1-score. They are defined as follows:

Accuracy is the proportion of correctly classified test cases among all test cases. Recall is the proportion of correctly classified positive examples among the actual positive examples. Precision is the proportion of correctly classified positive examples among the examples classified as positive. F1-score is the harmonic mean of recall and precision, that is, it evaluates the two together.

Taking the fault phenomenon feature matrix as the feature input and the corresponding fault mode type as the class label output, a fault mode identification framework is constructed with each of the six machine learning classification algorithms in turn. Finally, the prediction effect of each framework is verified using accuracy, recall, and F1-score.
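The verbal definitions of accuracy, precision, recall, and F1-score can be sketched in Python as follows, treating one fault mode category as the positive class. The function name `classification_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions of this sketch. How the per-class values are aggregated across the fault mode categories is not stated in the text and is not shown.

```python
def classification_metrics(y_true, y_pred, positive):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Accuracy: correctly classified cases / all cases.
    accuracy = correct / len(pairs) if pairs else 0.0
    # Precision: correct positives / cases classified as positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: correct positives / actual positives.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```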

The invention has the advantages that:

(1) To address the shortage of usable quantitative data in the equipment development stage and the resulting difficulty of fault mode analysis, the invention makes full use of maintenance text, applies an intelligent extraction algorithm and machine learning technology, and provides an equipment fault mode extraction and identification method based on text mining technology. The method establishes the correspondence between fault phenomena and fault modes, thereby providing a reference basis for rapid fault location and maintenance scheduling decisions.

(2) The method is grounded in engineering practice. The model is simple to construct and easy to optimize and train, requires no intervention of expert experience, is convenient for engineering technicians to apply, and is methodologically sound and well specified.

Drawings

Fig. 1 is a flow chart of the method of the present invention.

Fig. 2 is a word cloud of the fault mode text of the steering engine equipment.

Fig. 3 shows the cluster analysis result of the fault modes of the steering engine equipment.

Detailed Description

The invention relates to an equipment fault mode extraction and identification method based on text mining technology; its technical flow chart is shown in Fig. 1. The specific implementation steps are described in detail below, taking the steering engine maintenance text data of a certain type of commercial ship as an example.

Step one: preprocessing text data;

Through unifying data encoding formats, removing useless characters, word segmentation and part-of-speech tagging, and related steps, the text data preprocessing results for the fault phenomena and fault modes of the steering engine equipment are obtained, as shown in Table 1 and Table 2 respectively.

TABLE 1 fault phenomenon data preprocessing results

TABLE 2 fault mode data preprocessing results

To visualize the word segmentation result of the fault text more intuitively, the invention takes the fault mode text as an example and generates the corresponding word cloud, shown in Fig. 2.

The data preprocessing results are then taken as input for the subsequent fault mode extraction and identification.

Step two: clustering fault modes;

First, word vector conversion is performed with the TF-IDF algorithm on the preprocessed fault mode data in Table 2 to obtain the fault mode feature matrix, as shown in Table 3.

TABLE 3 fault mode feature matrix

In the table, the numbers 1 to 58 represent the 58 fault mode descriptions recorded in the maintenance text, and the remaining data are the extracted features. Based on this information, fault mode cluster analysis can be performed to further extract the main fault mode types.

Next, the fault mode feature matrix is clustered with the K-Means clustering algorithm to obtain the potential fault mode categories. Experiments show that when k is 4, each category contains at least two maintenance records. Accordingly, 4 fault mode categories are set; the fault mode represented by each category and the numbers of the maintenance records it contains are shown in Table 4. To display the clustering effect visually, t-SNE is used to reduce the dimensionality of the features; the result is shown in Fig. 3. It can be seen that the clustering model achieves a good clustering effect.
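The condition reported for k = 4, namely that every category contains at least two maintenance records, can be checked with a small helper such as the one below. The function names and the example labels in the usage note are hypothetical; the text does not describe how the check was carried out.

```python
from collections import Counter


def smallest_cluster_size(labels):
    """Return the number of records in the least populated category."""
    return min(Counter(labels).values())


def every_cluster_has_at_least(labels, minimum=2):
    """True if every category contains at least `minimum` records."""
    return smallest_cluster_size(labels) >= minimum
```

For instance, `every_cluster_has_at_least([0, 0, 1, 1, 2, 2, 3, 3])` holds for these made-up labels, whereas a labeling with a single-record category would fail the check.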

TABLE 4 fault mode cluster analysis results

Combining Fig. 3 and Table 4, the main fault modes of the steering engine equipment fall into two major types: system faults caused by circuit problems and faults caused by system hardware problems. In addition, faults caused by external environmental interference and faults caused by system software problems also account for a considerable proportion.

Step three: machine-learning-based fault mode identification;

First, as with the fault mode text, word vector conversion is performed with the TF-IDF algorithm on the preprocessed fault phenomenon data in Table 1 to obtain the fault phenomenon feature matrix, as shown in Table 5.

TABLE 5 fault phenomenon feature matrix

In Table 5, the numbers 1 to 58 represent the 58 fault phenomenon descriptions recorded in the maintenance text, and the data in the table form the extracted fault phenomenon feature matrix.

Then, with the fault phenomenon feature matrix as input and the fault mode clustering result as class output, a fault mode identification framework is constructed with each of the six common machine learning algorithms. The classification results are shown in Table 6.

TABLE 6 classification results of the six steering engine fault mode identification frameworks

It can be seen that the steering engine fault mode identification frameworks based on KNN, SVM, decision tree, naive Bayes, and random forest all achieve excellent results. Among them, the frameworks based on decision tree and random forest achieve the best classification performance, reaching 100%. It is therefore recommended to construct the steering engine fault mode identification framework using the decision tree or random forest classification algorithm.

Claims

1. An equipment fault mode extraction and identification method based on text mining technology, characterized in that the method comprises the following steps:

step one: preprocessing text data; performing data cleaning and preprocessing on the maintenance text to obtain structured text data, including: unifying data encoding formats, removing useless characters, and word segmentation and part-of-speech tagging;

step two: clustering fault modes; including: fault mode word vector conversion and K-Means-based fault mode clustering;

step three: machine-learning-based fault mode identification; including: fault phenomenon word vector conversion and construction of a fault mode identification framework.

2. The equipment fault mode extraction and identification method based on text mining technology according to claim 1, wherein UTF-8 is used as the standard encoding format and all Chinese text is converted to it.

3. The equipment fault mode extraction and identification method based on text mining technology according to claim 1, wherein punctuation marks, Arabic numerals, and English letters in the maintenance text are retained and all other useless characters are removed.

4. The equipment fault mode extraction and identification method based on text mining technology according to claim 1, wherein the stop word list published by Harbin Institute of Technology and the built-in vocabulary of Jieba are used as the stop word list and the custom dictionary.

5. The equipment fault mode extraction and identification method based on text mining technology according to any one of claims 1 to 4, wherein word vector conversion is performed on the fault mode text based on the TF-IDF algorithm, with the following specific steps:

(a) The term frequency is extracted from the word segmentation result. It represents the number of times a word appears in a document; to eliminate the influence of the document's length, it is defined as:

TF(t,d)＝df(t)/N, (1)

where TF(t,d) represents the term frequency of word t in document d, df(t) represents the number of times word t appears in the document, and N represents the total number of words in the document;

(b) The inverse document frequency, which represents the importance of a word, is calculated; the more common a word is, the larger the denominator and the closer the inverse document frequency is to 0; the calculation is as follows:

where IDF(t) represents the inverse document frequency of word t, and n represents the total number of documents in the corpus;

(c) The TF-IDF is calculated as follows:

TF-IDF(t)＝TF(t,d)×IDF(t). (3)

6. The equipment fault mode extraction and identification method based on text mining technology according to claim 5, wherein the fault mode feature matrix is clustered using the K-Means clustering algorithm to obtain the potential fault mode categories, and the specific process is as follows:

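(a) Selecting the number k of categories to be clustered, and selecting k center points;

(b) For each sample point, finding the nearest center point, and grouping the points nearest to the same center point into one class, which completes one round of clustering;

(c) Judging whether the class assignments of the sample points are the same before and after this round of clustering; if so, the algorithm stops; otherwise, proceeding to step (d);

(d) For each class, calculating the center point of its sample points and taking it as the new center point of that class; then returning to step (b).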

7. The equipment fault mode extraction and identification method based on text mining technology according to claim 1, wherein after the fault phenomenon feature matrix has been extracted, it is combined with the corresponding fault mode types to build a fault mode identification framework based on machine learning classification algorithms.

8. The equipment fault mode extraction and identification method based on text mining technology according to claim 7, wherein the classifiers in machine learning include KNN, SVM, decision tree, naive Bayes, random forest, and AdaBoost.

9. The equipment fault mode extraction and identification method based on text mining technology according to claim 8, wherein verification is performed with accuracy, recall, and F1-score, as follows: