CN113281048A - Rolling bearing fault diagnosis method and system based on relational knowledge distillation - Google Patents

Rolling bearing fault diagnosis method and system based on relational knowledge distillation

Info

Publication number
CN113281048A
Authority
CN
China
Prior art keywords
model
distillation
representing
loss
teacher
Prior art date
Legal status
Granted
Application number
CN202110716619.2A
Other languages
Chinese (zh)
Other versions
CN113281048B (en)
Inventor
朱海平
王慧
陈志鹏
石海彬
冯世元
程佳欣
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN202110716619.2A
Publication of CN113281048A
Application granted
Publication of CN113281048B
Active legal status
Anticipated expiration legal status

Classifications

    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01M: TESTING STATIC OR DYNAMIC BALANCE OF MACHINES OR STRUCTURES; TESTING OF STRUCTURES OR APPARATUS, NOT OTHERWISE PROVIDED FOR
    • G01M13/00: Testing of machine parts
    • G01M13/04: Bearings
    • G01M13/045: Acoustic or vibration analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00: Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/12: Classification; Matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Testing Of Devices, Machine Parts, Or Other Structures Thereof (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

The invention discloses a rolling bearing fault diagnosis method and system based on relational knowledge distillation, belonging to the technical field of fault diagnosis. After the original vibration signals of the bearing are collected, a time-frequency map is constructed from each processing sample as a fault sample, and the fault samples are used as the input of the fault diagnosis system. The student model simultaneously learns the soft labels output by the Softmax of the teacher model and the multivariate relations among the outputs of multiple samples at the last pooling layer; that is, the student network learns both from the structure of the teacher network and from the teacher's single-sample outputs, which effectively improves the classification performance of the fault diagnosis system without increasing memory usage or training time. The invention realizes bearing fault diagnosis through transfer learning based on relational knowledge distillation, and effectively reduces computational complexity by replacing a large model with a small one.

Description

Rolling bearing fault diagnosis method and system based on relational knowledge distillation
Technical Field
The invention belongs to the technical field of fault diagnosis, and particularly relates to a rolling bearing fault diagnosis method and system based on relational knowledge distillation.
Background
Rolling bearings are key components of rotating machinery and also among its most failure-prone elements; according to incomplete statistics, 30% of failures of rotating equipment are caused by rolling bearing faults. Condition monitoring and fault diagnosis of rolling bearings play an important role in understanding equipment operating performance and discovering potential faults, and can effectively improve the management level and maintenance efficiency of mechanical equipment.
At present, a new wave of artificial intelligence technology represented by deep learning has made building end-to-end, deeply integrated intelligent fault diagnosis methods a new goal of the industrial-intelligence era. Compared with traditional models, deep learning models have more network layers and strong nonlinear computing capability, can better approximate complex functional relationships, and have been applied successfully in the field of fault diagnosis. However, the success of deep-learning-based fault diagnosis depends on a large amount of labeled, high-quality data. When training on large-scale data sets, the number of network layers is often increased to handle complex data distributions, and the number of model parameters can reach millions, so a large amount of computing power and resources is consumed during training to achieve high accuracy. Such models are large and costly to train from the outset and, limited by computing resources and response-speed requirements, are difficult to deploy in actual engineering.
Patent CN110162018A discloses an incremental equipment fault diagnosis method based on knowledge distillation and hidden-layer sharing, whose main idea is as follows: knowledge distillation and hidden-layer sharing techniques ensure that a shallow equipment fault diagnosis model retains good data feature extraction capability and improve its fault classification performance; in view of the continuous growth of industrial data and the need to update fault diagnosis models on edge devices, incremental learning is realized through effective sample identification, data set reconstruction, and fine-tuning of the pre-trained model. That method alleviates the demands on network bandwidth and network latency when transmitting massive real-time industrial equipment data, improves the accuracy of shallow fault diagnosis models, and supports incremental learning; simulation experiments on bearing running-state data show that, under limited computing resources, it improves edge-cloud cooperative data transmission efficiency, achieves accurate fault prediction and classification, and supports data learning and processing. Patent CN112504678A discloses a motor bearing fault diagnosis method based on knowledge distillation, whose main idea is as follows: a model trained on vibration signals serves as the teacher model, current and rotating-speed signals are the inputs of the student model, and the student model is trained with the dark knowledge provided by the teacher model so that it converges stably and performs effective diagnosis.
However, the prior art has the following drawbacks: 1) most methods analyze the collected time-series signals to extract effective features, for example by preprocessing the collected vibration signals and performing multiple rounds of feature screening to extract the most fault-relevant features as the input of the fault classifier; such manual screening is time-consuming and easily loses useful information; 2) the student model only learns the Softmax output at the end of the teacher model, i.e., only the behavior of single samples on the teacher model is considered, so the fault diagnosis accuracy of the student model is low.
Disclosure of Invention
In view of the above defects and improvement needs of the prior art, the invention provides a rolling bearing fault diagnosis method and system based on relational knowledge distillation, aiming to improve the real-time response efficiency and accuracy of the fault diagnosis model.
To achieve the above object, according to a first aspect of the present invention, there is provided a rolling bearing failure diagnosis method based on relational knowledge distillation, the method including:
a preparation stage:
acquiring vibration signal sections of a rolling bearing in a normal state and a fault state, and measuring each state for multiple times; taking a plurality of continuous sampling points with the same state type as a processing sample, constructing each processing sample into a time-frequency graph, and taking the < time-frequency graph and the corresponding state type > as training samples to obtain a training sample set; constructing a teacher model-student model;
a training stage:
pre-training a teacher model by using a training sample set; simultaneously inputting a plurality of training samples into a pre-trained teacher model to obtain a plurality of corresponding features output by the last pooling layer of the teacher model, and taking the features as a feature set T;
randomly initializing a student model; simultaneously inputting a plurality of training samples into the initialized student model to obtain a plurality of corresponding features output by the last pooling layer of the student model as a feature set S;
calculating the binary distance and the ternary angle between elements in the feature set T, and calculating the binary distance and the ternary angle between the elements in the feature set S;
constructing distance distillation loss based on binary distances between elements in the feature sets T and S, and constructing angle distillation loss based on ternary angles between elements in the feature sets T and S;
incorporating distance and angle distillation losses into the overall loss function of the entire model;
training a teacher model-student model by taking the minimization of the total loss function as a target to obtain a trained teacher model-student model;
an application stage:
acquiring a vibration signal section of the rolling bearing to be detected, constructing its time-frequency diagram, and inputting it into the trained student model to obtain the diagnosis result.
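For illustration only, a minimal sketch of this application stage is given below, assuming a trained `student` network and a `segment_to_cwt_image` preprocessing helper such as the one sketched later in the detailed description; the function name and the class-name ordering are assumptions, not part of the patent.

```python
import torch

def diagnose(student, segment,
             class_names=("normal", "inner-race damage", "outer-race damage")):
    """Build the time-frequency map for one raw vibration segment and classify it."""
    img = segment_to_cwt_image(segment)                        # (32, 32, 3) scalogram
    x = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)    # -> (1, 3, 32, 32) tensor
    student.eval()
    with torch.no_grad():
        pred = student(x).argmax(dim=1).item()
    return class_names[pred]
```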
Preferably, continuous wavelet analysis is carried out on the normalized one-dimensional vibration signal segment to generate a continuous three-channel wavelet time-frequency map.
Beneficial effects: continuous wavelet analysis is preferably performed on the normalized one-dimensional vibration signal segment to generate a continuous three-channel wavelet time-frequency map. Generating the time-frequency map directly requires no feature screening of the signal, which reduces the information loss caused by time-frequency-domain feature extraction and improves the performance of the fault diagnosis model when a large number of fault samples are processed.
Preferably, the distance distillation loss is constructed based on the binary distances between the elements in the feature sets T and S, specifically as follows:
$$L_{RKD\text{-}D}=\sum_{(x_i,x_j)\in\chi^2} l_\delta\!\left(\Psi_D(t_i,t_j),\ \Psi_D(s_i,s_j)\right)$$

wherein L_RKD-D denotes the distance distillation loss, x_i and x_j denote the i-th and j-th training samples, χ² denotes the set of binary sample relations, l_δ(·) denotes the Huber loss function, Ψ_D(t_i, t_j) denotes the distance between t_i and t_j, Ψ_D(s_i, s_j) denotes the distance between s_i and s_j, t_i and t_j denote the features output by the last pooling layer of the teacher model for the i-th and j-th training samples respectively, and s_i and s_j denote the features output by the last pooling layer of the student model for the i-th and j-th training samples respectively.
Beneficial effects: constructing the distance distillation loss in this way is preferred because the distance-wise distillation loss realizes transfer learning between the student model and the teacher model by penalizing distance differences in the feature space. Instead of forcing the student model to match the teacher's outputs directly, it encourages the student model to learn the distance structure of the teacher's outputs, so that the fault diagnosis performance of the student model approaches that of the teacher model.
Preferably, the angle distillation loss is constructed based on the ternary angles between the elements in the feature sets T and S, specifically as follows:
$$L_{RKD\text{-}A}=\sum_{(x_i,x_j,x_k)\in\chi^3} l_\delta\!\left(\Psi_A(t_i,t_j,t_k),\ \Psi_A(s_i,s_j,s_k)\right)$$

wherein L_RKD-A denotes the angle distillation loss, x_i, x_j and x_k denote the i-th, j-th and k-th training samples, χ³ denotes the set of ternary sample relations, l_δ(·) denotes the Huber loss function, t_i, t_j and t_k denote the features output by the last pooling layer of the teacher model for the i-th, j-th and k-th training samples respectively, s_i, s_j and s_k denote the features output by the last pooling layer of the student model for the i-th, j-th and k-th training samples respectively, Ψ_A(t_i, t_j, t_k) denotes the ternary angular relation among the teacher model output features t_i, t_j, t_k, and Ψ_A(s_i, s_j, s_k) denotes the ternary angular relation among the student model output features s_i, s_j, s_k.
Beneficial effects: constructing the angle distillation loss in this way is preferred because the angle-wise distillation loss realizes transfer learning of the embedding relations of the training samples between the student model and the teacher model by penalizing angle differences in the feature space. Since an angle is a higher-order property than a distance, the angle distillation loss transmits relational information effectively and gives the student model more flexibility during training, leading to faster convergence and better performance.
Preferably,

$$\Psi_A(t_i,t_j,t_k)=\cos\angle t_i t_j t_k=\langle e_{ij},\ e_{jk}\rangle,\qquad e_{ij}=\frac{t_i-t_j}{\lVert t_i-t_j\rVert_2},\quad e_{jk}=\frac{t_k-t_j}{\lVert t_k-t_j\rVert_2}$$

wherein ∠t_i t_j t_k denotes the angle formed by the ternary features t_i, t_j, t_k at the vertex t_j, e_ij denotes the unit vector along t_i − t_j, e_jk denotes the unit vector along t_k − t_j, ⟨·,·⟩ denotes the cosine of the angle between e_ij and e_jk, and t_i, t_j, t_k denote the features output by the last pooling layer of the teacher model for the i-th, j-th and k-th training samples respectively.
Beneficial effects: transferring the relational knowledge of features in a higher-order space in this way is preferred because, even when the teacher model and the student model output features of different dimensions, the angle potential computed above is invariant between the higher-order and lower-order feature spaces. Higher-order potentials are stronger at capturing higher-order structure but are expensive to compute, so the simple and effective ternary angle relation measures the relational knowledge of features in a higher-order space at low computational cost.
Preferably, the total loss function is calculated as follows:
$$L=\alpha\cdot L_{KD}+\beta\cdot\left(\omega_1 L_{RKD\text{-}D}+\omega_2 L_{RKD\text{-}A}\right)$$

wherein L_KD denotes the knowledge distillation loss, L_RKD-D denotes the distance distillation loss, L_RKD-A denotes the angle distillation loss, α and β denote the weight coefficients of the loss terms, and ω_1 and ω_2 denote the weights of the distance distillation loss and the angle distillation loss.
Beneficial effects: calculating the total loss in this way is preferred because the added penalties on the binary distance distillation loss and the ternary angle distillation loss allow the student model to learn a stronger feature representation capability from the teacher model, thereby improving the fault diagnosis performance of the student model.
To achieve the above object, according to a second aspect of the present invention, there is provided a rolling bearing failure diagnosis system based on relational knowledge distillation, comprising: a computer-readable storage medium and a processor;
the computer-readable storage medium is used for storing executable instructions;
the processor is configured to read executable instructions stored in the computer-readable storage medium and execute the rolling bearing fault diagnosis method based on the relational knowledge distillation according to the first aspect.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) In existing bearing fault diagnosis methods based on knowledge distillation, the student model only learns the Softmax output at the end of the teacher model; that is, only the behavior of single samples on the teacher model is considered, so the fault diagnosis accuracy of the student model is low. To address this problem, the invention makes the student model simultaneously learn the Softmax outputs of the teacher model and the multivariate relations among the outputs of multiple samples at the last pooling layer. In other words, the student model also learns the structural information contained in the teacher network: input samples of the same mini-batch are learned cooperatively, and the student network learns both from the teacher's structure and from the teacher's single-sample outputs, which effectively improves the classification performance of the fault diagnosis system without increasing memory usage or training time.
(2) In the prior art, effective features are extracted by analyzing the acquired time-series signals, for example by preprocessing the collected vibration signals and performing multiple rounds of feature screening to obtain the most fault-relevant features as the input of the fault classifier; such manual screening is time-consuming and easily loses useful information. To address this problem, after the original vibration signals of the rolling bearing are collected, the invention takes every 1000 sampling points as one processing sample and constructs a time-frequency map for each processing sample as a fault sample, which serves as the input of the teacher model. Because the time-frequency map contains the complete time-frequency information of the vibration signal, the real-time response efficiency and accuracy of the fault diagnosis model are improved.
(3) The method realizes bearing fault diagnosis through transfer learning based on relational knowledge distillation, and effectively reduces computational complexity by replacing a large model with a small one. Under the premise of guaranteed accuracy, it trains a simple model that is better suited to actual engineering deployment and improves the response efficiency of the terminal model.
Drawings
FIG. 1 is a flow chart of a bearing fault diagnosis system based on relationship knowledge distillation provided by the present invention.
Fig. 2 is a schematic diagram of a network structure of a system model according to an embodiment of the present invention.
FIG. 3 is the pseudo-code for batch computation of the knowledge distillation loss functions provided by the present invention.
FIG. 4 is a schematic diagram of a bearing fault diagnosis system based on relationship knowledge distillation provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in fig. 1, the present invention provides a rolling bearing fault diagnosis method based on relational knowledge distillation, including:
step 1: the method comprises the steps of collecting and marking a sensor signal installed on a rolling bearing, wherein the signal is a vibration signal capable of reflecting the running characteristics of the bearing, the original tag hardtarget of a data set is a one-hot tag, namely the positive tag is 1, and the negative tag is 0.
Step 2: signal preprocessing and data transformation: the continuous wavelet time-frequency maps of the original rolling bearing signals are taken as the model input, and the generated time-frequency maps are divided into a training set and a test set.
Specifically, the original vibration signal is segmented with a sample length of 1000 data points, the mexh wavelet is used as the basic wavelet of the continuous wavelet analysis, and a 32 × 32 × 3 three-channel wavelet time-frequency map is generated for each sample to form a new data set. The time-frequency map samples of each category are randomly shuffled, with 80% selected as the training set and 20% as the test set.
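As a concrete illustration of this preprocessing step, the following is a minimal sketch of how one 1000-point segment could be turned into a 32 × 32 × 3 scalogram image with PyWavelets. The patent does not state how the three channels are produced; mapping the CWT magnitude through a colormap (here matplotlib's jet) is an assumption, as are the helper name and the simple column down-sampling.

```python
import numpy as np
import pywt
from matplotlib import cm

def segment_to_cwt_image(segment, n_scales=32, out_width=32, wavelet="mexh"):
    """Turn one 1000-point vibration segment into a (32, 32, 3) scalogram image."""
    segment = np.asarray(segment, dtype=float)
    segment = (segment - segment.mean()) / (segment.std() + 1e-12)   # normalise the segment
    scales = np.arange(1, n_scales + 1)
    coeffs, _ = pywt.cwt(segment, scales, wavelet)                    # shape (32, 1000)
    cols = np.linspace(0, coeffs.shape[1] - 1, out_width).astype(int) # down-sample time axis
    mag = np.abs(coeffs[:, cols])                                     # shape (32, 32)
    mag = (mag - mag.min()) / (mag.max() - mag.min() + 1e-12)         # scale to [0, 1]
    return cm.jet(mag)[..., :3].astype(np.float32)                    # drop alpha -> 3 channels
```

Each labeled segment processed this way becomes one <time-frequency map, state type> training sample.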
The Continuous Wavelet Transform (CWT) expands an arbitrary function f(t) in the space L²(R) on a wavelet basis; it is essentially a projection of the time-domain function onto the time-scale phase plane, expressed as:

$$WT_f(a,\tau)=\langle f(t),\,\psi_{a,\tau}(t)\rangle=\frac{1}{\sqrt{a}}\int_{R} f(t)\,\psi^{*}\!\left(\frac{t-\tau}{a}\right)dt$$

wherein WT_f(a, τ) is the wavelet transform coefficient, ⟨·,·⟩ denotes the inner product operation, and the wavelet basis ψ_{a,τ}(t) has two parameters, the scale a and the translation τ, both taking continuously varying values.
A wavelet is a wave ψ(t) that exists only in a small region, with ψ(t) ∈ L²(R). If its Fourier transform Ψ(ω) satisfies the following admissibility condition, ψ(t) is a basic wavelet:

$$C_{\psi}=\int_{R}\frac{|\Psi(\omega)|^{2}}{|\omega|}\,d\omega<\infty$$
The invention selects the mexh (Mexican hat) wavelet as the basic wavelet of the CWT; its expression and Fourier transform are:

$$\psi(t)=\frac{2}{\sqrt{3}\,\pi^{1/4}}\left(1-t^{2}\right)e^{-t^{2}/2}$$

$$\Psi(\omega)=\sqrt{\frac{8}{3}}\,\pi^{1/4}\,\omega^{2}e^{-\omega^{2}/2}$$
the mexh wavelet function is a second derivative of a Gaussian function, has good localization in time domain and frequency domain, is used for extracting the edge of a signal and an image, and has no scale function and therefore has no orthogonality.
Step 3: pre-train the teacher model. Train the teacher network with the time-frequency map training set and the true data labels, and save the optimal model as the teacher model.
Specifically, the teacher model training process is as follows:
Step 3.1, establish the teacher network: as shown in fig. 2, the teacher network is a multi-layer ResNet-20 composed of 19 convolutional layers and 1 fully-connected layer (not counting the pooling and BN layers), in which three residual block structures form one layer module. The student network is a ResNet-8 residual network composed of 7 convolutional layers and 1 fully-connected layer, with only one residual block per layer module. The network output uses a global average pooling (GAP) layer instead of the stacked fully-connected layers of a traditional convolutional neural network; the GAP layer is closely associated with each class and can generate one feature map per class, avoiding the black-box operation of fully-connected layers, and since it has no parameters to learn it effectively reduces the number of parameters and avoids overfitting.
The invention uses residual networks as the backbone of the fault diagnosis system: the teacher network adopts a ResNet-20 structure and the student network adopts a ResNet-8 structure. The two networks are structurally similar and their output dimensions are consistent, which facilitates the extraction of feature information and makes training more stable; the residual structure alleviates the degradation and gradient vanishing or exploding problems that conventional CNNs suffer as the number of layers increases, and effectively improves the feature extraction capability for fault diagnosis signals.
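For illustration, the following is a minimal sketch of a ResNet-8 student of the kind described above (7 convolutional layers, a GAP output and 1 fully-connected layer). The 16/32/64 channel widths and the `return_embedding` flag used to expose the pooled features for relational distillation are assumptions; the ResNet-20 teacher would follow the same pattern with three residual blocks per stage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """One residual block: two 3x3 convolutions with a (projected) shortcut."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.down = None
        if stride != 1 or c_in != c_out:          # match shapes on the shortcut path
            self.down = nn.Sequential(
                nn.Conv2d(c_in, c_out, 1, stride, bias=False),
                nn.BatchNorm2d(c_out))

    def forward(self, x):
        identity = x if self.down is None else self.down(x)
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + identity)

class ResNet8(nn.Module):
    """7 convolutional layers, global average pooling, 1 fully-connected layer."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 16, 3, 1, 1, bias=False),
            nn.BatchNorm2d(16),
            nn.ReLU(inplace=True))
        self.layer1 = BasicBlock(16, 16, stride=1)
        self.layer2 = BasicBlock(16, 32, stride=2)
        self.layer3 = BasicBlock(32, 64, stride=2)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x, return_embedding=False):
        x = self.layer3(self.layer2(self.layer1(self.stem(x))))
        emb = F.adaptive_avg_pool2d(x, 1).flatten(1)   # GAP features fed to the RKD losses
        logits = self.fc(emb)
        return (logits, emb) if return_embedding else logits
```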
Step 3.2, pre-train the teacher model: input the time-frequency map training set generated in step 2 into the ResNet-20 network, and compare the model output with the true sample labels to obtain their difference, which forms the loss function. The network weights are continuously updated by the back-propagation algorithm to reduce the loss function until the model converges, and the model with the highest accuracy on the test set is saved as the final teacher model.
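A minimal pre-training sketch of this step is shown below, assuming `train_loader`, `test_loader` and a ResNet-20 `teacher` already exist; the optimizer settings follow the values reported in the verification experiment (lr = 0.05, momentum = 0.9, weight_decay = 5e-4), and the epoch count is arbitrary.

```python
import torch
import torch.nn as nn

def pretrain_teacher(teacher, train_loader, test_loader, epochs=100, device="cuda"):
    """Cross-entropy pre-training; keeps the checkpoint with the best test accuracy."""
    teacher.to(device)
    opt = torch.optim.SGD(teacher.parameters(), lr=0.05, momentum=0.9, weight_decay=5e-4)
    ce = nn.CrossEntropyLoss()
    best_acc, best_state = 0.0, None
    for _ in range(epochs):
        teacher.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            loss = ce(teacher(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        teacher.eval()                                   # evaluate on the test set
        correct = total = 0
        with torch.no_grad():
            for x, y in test_loader:
                pred = teacher(x.to(device)).argmax(dim=1).cpu()
                correct += (pred == y).sum().item()
                total += y.numel()
        acc = correct / total
        if acc > best_acc:                               # save the best model so far
            best_acc = acc
            best_state = {k: v.detach().clone() for k, v in teacher.state_dict().items()}
    teacher.load_state_dict(best_state)
    return teacher
```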
Step 4: the student model learns the relational structure information of the pre-trained teacher model during training; the test set is used to validate the student model, and the student model is saved as the finally deployed model when its prediction accuracy is best.
Specifically, the training and learning process of the student model is as follows:

Step 4.1, compute the relational knowledge distillation loss values: a mini-batch of time-frequency map samples {x₁, …, xₙ} is input into the pre-trained teacher model and the initialized student model, and the features f_T and f_S output by their last pooling layers are taken as the features for structure learning; from them the relational knowledge distillation loss functions L_RKD-D and L_RKD-A are computed. The calculation process is shown in fig. 3.

Step 4.2, compute the knowledge distillation loss value: the pre-trained teacher network is distilled at a high temperature T, the soft target values of the teacher model at temperature T are calculated, and they are compared with the student model outputs to obtain Loss_soft. The soft label after knowledge distillation is calculated as follows, where the temperature coefficient T controls the smoothness of the soft label distribution and T = 1 recovers the original Softmax output:

$$q_i=\frac{\exp(z_i/T)}{\sum_{j}\exp(z_j/T)}$$

As shown in fig. 4, when the student model is trained, the student network learns the soft targets generated by the teacher network at the same temperature T; by approaching the soft targets it learns the structural distribution characteristics of the data, and comparing its softened output with the soft target yields Loss_soft. Meanwhile, the student network computes its Softmax prediction and compares it with the hard target to obtain Loss_hard. The total loss function Loss, which serves as the objective, is the λ-weighted sum of the two loss terms:

$$Loss=\lambda\,Loss_{soft}+(1-\lambda)\,Loss_{hard}$$

To make the output of the student model closer to that of the teacher model, the KL divergence is introduced to measure the output distributions of the two models, and distillation is realized by continually minimizing the KL divergence during learning. The knowledge distillation loss L_KD then becomes:

$$L_{KD}=\alpha T^{2}\cdot KLdiv(Q_S,Q_T)+(1-\alpha)\cdot Loss_{hard}$$

wherein Q_S and Q_T are the Softmax outputs of the student model and the teacher model respectively, and α is an adjustment coefficient.
The student model is trained under the supervision of the pre-trained teacher model: the output features of different layers of the teacher network are recombined into structural information, and the student network learns both the structural relations of the teacher network and its single-sample outputs to improve model performance. The structural information is expressed through the binary distance Ψ_D and the ternary angle Ψ_A.
Ψ_D measures the feature distance of a binary sample pair in both the student network and the teacher network:

$$\Psi_D(t_i,t_j)=\frac{1}{\mu}\lVert t_i-t_j\rVert_2$$

wherein μ is the normalization coefficient of the distance, calculated as the mean pairwise distance over the mini-batch:

$$\mu=\frac{1}{|\chi^{2}|}\sum_{(x_i,x_j)\in\chi^{2}}\lVert t_i-t_j\rVert_2$$

The distance distillation loss function L_RKD-D is then:

$$L_{RKD\text{-}D}=\sum_{(x_i,x_j)\in\chi^{2}} l_\delta\!\left(\Psi_D(t_i,t_j),\ \Psi_D(s_i,s_j)\right)$$

wherein l_δ is the Huber loss, defined as:

$$l_\delta(x,y)=\begin{cases}\dfrac{1}{2}(x-y)^{2}, & |x-y|\le 1\\[4pt] |x-y|-\dfrac{1}{2}, & |x-y|>1\end{cases}$$
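A minimal sketch of this distance-wise loss over one mini-batch of pooled features is shown below; the function names are illustrative, and `F.smooth_l1_loss` is used as the Huber penalty l_δ.

```python
import torch
import torch.nn.functional as F

def pairwise_dist(e):
    """Normalized binary distances Psi_D between all rows of e (shape n x d)."""
    d = torch.cdist(e, e, p=2)            # all pairwise Euclidean distances
    mu = d[d > 0].mean()                  # mean distance mu used for normalisation
    return d / (mu + 1e-12)

def rkd_distance_loss(t_emb, s_emb):
    """L_RKD-D: Huber penalty between teacher and student distance structures."""
    with torch.no_grad():
        dt = pairwise_dist(t_emb)         # teacher relations, no gradient
    ds = pairwise_dist(s_emb)             # student relations
    return F.smooth_l1_loss(ds, dt)
```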
Ψ_A measures the angular relation of a ternary sample group in both the student network and the teacher network:

$$\Psi_A(t_i,t_j,t_k)=\cos\angle t_i t_j t_k=\langle e_{ij},\ e_{jk}\rangle$$

wherein

$$e_{ij}=\frac{t_i-t_j}{\lVert t_i-t_j\rVert_2},\qquad e_{jk}=\frac{t_k-t_j}{\lVert t_k-t_j\rVert_2}$$

The angle distillation loss function is then:

$$L_{RKD\text{-}A}=\sum_{(x_i,x_j,x_k)\in\chi^{3}} l_\delta\!\left(\Psi_A(t_i,t_j,t_k),\ \Psi_A(s_i,s_j,s_k)\right)$$

Step 4.3, compute the total loss value of the fault diagnosis system based on relational knowledge distillation:

$$L=\gamma\cdot Loss_{hard}+\alpha T^{2}\cdot Loss_{div}+\beta\cdot\left(\omega_1 L_{RKD\text{-}D}+\omega_2 L_{RKD\text{-}A}\right)$$

wherein Loss_hard and Loss_div are the hard-label loss and the KL divergence loss in knowledge distillation, T is the knowledge distillation temperature, γ, α and β are the weight coefficients of the loss terms, and ω_1 and ω_2 adjust the weights of the distance loss and angle loss values.
Step 4.4, train the student model: initialize the student model and train the student network with the training set and the hard targets. By learning the structural information of the teacher network and the output information of single samples in the teacher network, the performance of the student network approaches that of the teacher network from both aspects. The total loss function is evaluated, the model parameters are updated by gradient updating and error back-propagation, and the model with the highest accuracy on the test set is saved as the finally deployed model.
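Putting the pieces together, the following is a minimal sketch of one student update with the total loss L = γ·Loss_hard + α·T²·Loss_div + β·(ω₁·L_RKD-D + ω₂·L_RKD-A), reusing the `rkd_distance_loss` and `rkd_angle_loss` sketches above and the `return_embedding` flag assumed in the ResNet-8 sketch; the default hyper-parameters follow the verification experiment below (γ = 0.5, α = 0.3, β = 0.2, ω₁ = 2, ω₂ = 5, T = 4).

```python
import torch
import torch.nn.functional as F

def student_step(student, teacher, optimizer, x, y,
                 gamma=0.5, alpha=0.3, beta=0.2, w1=2.0, w2=5.0, T=4.0):
    """One optimization step of the student under relational knowledge distillation."""
    teacher.eval()
    with torch.no_grad():
        t_logits, t_emb = teacher(x, return_embedding=True)
    s_logits, s_emb = student(x, return_embedding=True)

    loss_hard = F.cross_entropy(s_logits, y)                  # hard-label term
    loss_div = F.kl_div(F.log_softmax(s_logits / T, dim=1),   # KL divergence term
                        F.softmax(t_logits / T, dim=1),
                        reduction="batchmean")
    loss = (gamma * loss_hard
            + alpha * T * T * loss_div
            + beta * (w1 * rkd_distance_loss(t_emb, s_emb)
                      + w2 * rkd_angle_loss(t_emb, s_emb)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```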
To further elaborate the invention, it was verified on bearing data of a rotating machine from the rolling bearing condition monitoring experiments at Paderborn University, Germany, which collect current and vibration signals of rolling bearings for condition monitoring. For this verification, the accelerated lifetime test data in the bearing experiments were chosen: the accelerated lifetime tests produce real bearing fault data that can simulate data in actual engineering, so the trained model is more suitable for engineering use. The test rig was operated at 1500 rpm with a load torque of 0.7 Nm and a radial force of F = 1000 N acting on the bearing. Fault data of the 6203-type ball bearing were selected, and the states were divided into outer-ring damage, inner-ring damage, and normal. For each state, 20 measurements of 4 seconds each were recorded. The original vibration signal was segmented with a sample length of 1000 data points, and a 32 × 32 × 3 three-channel CWT time-frequency map was generated for each sample as the new data set. The time-frequency map samples of each category were randomly shuffled, with 80% selected as the training set and 20% as the test set. The initial learning rate of the teacher network was lr = 0.05 with momentum = 0.9 and weight_decay = 5e-4; the initial learning rate of the student network was 0.01, and the SGD optimization algorithm was used with a batch size of 24. The system hyper-parameters were set to γ = 0.5, α = 0.3, β = 0.2, ω_1 = 2, ω_2 = 5, and the distillation temperature T = 4. The experimental results are shown in Table 1 and indicate that the method effectively improves the classification performance of the fault diagnosis system without increasing memory usage or training time.
TABLE 1: comparison of experimental results (the table appears as an image in the original publication and is not reproduced here).
The modules in the above fault diagnosis system based on relational knowledge distillation can be wholly or partially implemented in software, hardware, or a combination thereof. The modules can be embedded in hardware form in, or be independent of, the processor of the computer device, or stored in software form in the memory of the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A rolling bearing fault diagnosis method based on relational knowledge distillation is characterized by comprising the following steps:
a preparation stage:
acquiring vibration signal sections of a rolling bearing in a normal state and a fault state, and measuring each state for multiple times; taking a plurality of continuous sampling points with the same state type as a processing sample, constructing each processing sample into a time-frequency graph, and taking the < time-frequency graph and the corresponding state type > as training samples to obtain a training sample set; constructing a teacher model-student model;
a training stage:
pre-training a teacher model by using a training sample set; simultaneously inputting a plurality of training samples into a pre-trained teacher model to obtain a plurality of corresponding features output by the last pooling layer of the teacher model, and taking the features as a feature set T;
randomly initializing a student model; simultaneously inputting a plurality of training samples into the initialized student model to obtain a plurality of corresponding features output by the last pooling layer of the student model as a feature set S;
calculating the binary distance and the ternary angle between elements in the feature set T, and calculating the binary distance and the ternary angle between the elements in the feature set S;
constructing distance distillation loss based on binary distances between elements in the feature sets T and S, and constructing angle distillation loss based on ternary angles between elements in the feature sets T and S;
incorporating distance and angle distillation losses into the overall loss function of the entire model;
training a teacher model-student model by taking the minimization of the total loss function as a target to obtain a trained teacher model-student model;
an application stage:
acquiring a vibration signal section of a rolling bearing to be detected, and constructing a time-frequency diagram; inputting the result into a trained student model to obtain a diagnosis result.
2. The method of claim 1, wherein the normalized one-dimensional vibratory signal segment is subjected to continuous wavelet analysis to generate a continuous three-channel wavelet time-frequency map.
3. The method according to claim 1 or 2, characterized in that the distance distillation loss is constructed on the basis of the binary distances between the elements in the feature sets T and S, in particular as follows:
$$L_{RKD\text{-}D}=\sum_{(x_i,x_j)\in\chi^2} l_\delta\!\left(\Psi_D(t_i,t_j),\ \Psi_D(s_i,s_j)\right)$$

wherein L_RKD-D denotes the distance distillation loss, x_i and x_j denote the i-th and j-th training samples, χ² denotes the set of binary sample relations, l_δ(·) denotes the Huber loss function, Ψ_D(t_i, t_j) denotes the distance between t_i and t_j, Ψ_D(s_i, s_j) denotes the distance between s_i and s_j, t_i and t_j denote the features output by the last pooling layer of the teacher model for the i-th and j-th training samples respectively, and s_i and s_j denote the features output by the last pooling layer of the student model for the i-th and j-th training samples respectively.
4. The method according to claim 1 or 2, characterized in that the angular distillation loss is constructed on the basis of the ternary angles between the elements in the feature sets T and S, as follows:
$$L_{RKD\text{-}A}=\sum_{(x_i,x_j,x_k)\in\chi^3} l_\delta\!\left(\Psi_A(t_i,t_j,t_k),\ \Psi_A(s_i,s_j,s_k)\right)$$

wherein L_RKD-A denotes the angle distillation loss, x_i, x_j and x_k denote the i-th, j-th and k-th training samples, χ³ denotes the set of ternary sample relations, l_δ(·) denotes the Huber loss function, t_i, t_j and t_k denote the features output by the last pooling layer of the teacher model for the i-th, j-th and k-th training samples respectively, s_i, s_j and s_k denote the features output by the last pooling layer of the student model for the i-th, j-th and k-th training samples respectively, Ψ_A(t_i, t_j, t_k) denotes the ternary angular relation among the teacher model output features t_i, t_j, t_k, and Ψ_A(s_i, s_j, s_k) denotes the ternary angular relation among the student model output features s_i, s_j, s_k.
5. The method of claim 4,
$$\Psi_A(t_i,t_j,t_k)=\cos\angle t_i t_j t_k=\langle e_{ij},\ e_{jk}\rangle,\qquad e_{ij}=\frac{t_i-t_j}{\lVert t_i-t_j\rVert_2},\quad e_{jk}=\frac{t_k-t_j}{\lVert t_k-t_j\rVert_2}$$

wherein ∠t_i t_j t_k denotes the angle formed by the ternary features t_i, t_j, t_k at the vertex t_j, e_ij denotes the unit vector along t_i − t_j, e_jk denotes the unit vector along t_k − t_j, ⟨·,·⟩ denotes the cosine of the angle between the vectors, and t_i, t_j, t_k denote the features output by the last pooling layer of the teacher model for the i-th, j-th and k-th training samples respectively.
6. The method of claim 1 or 2, wherein the total loss function is calculated as follows:
$$L=\alpha\cdot L_{KD}+\beta\cdot\left(\omega_1 L_{RKD\text{-}D}+\omega_2 L_{RKD\text{-}A}\right)$$

wherein L_KD denotes the knowledge distillation loss, L_RKD-D denotes the distance distillation loss, L_RKD-A denotes the angle distillation loss, α and β denote the weight coefficients of the loss terms, and ω_1 and ω_2 denote the weights of the distance distillation loss and the angle distillation loss.
7. A rolling bearing fault diagnosis system based on relational knowledge distillation is characterized by comprising: a computer-readable storage medium and a processor;
the computer-readable storage medium is used for storing executable instructions;
the processor is used for reading executable instructions stored in the computer readable storage medium and executing the rolling bearing fault diagnosis method based on the relational knowledge distillation, which is disclosed by any one of claims 1 to 6.
CN202110716619.2A 2021-06-25 2021-06-25 Rolling bearing fault diagnosis method and system based on relational knowledge distillation Active CN113281048B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110716619.2A CN113281048B (en) 2021-06-25 2021-06-25 Rolling bearing fault diagnosis method and system based on relational knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110716619.2A CN113281048B (en) 2021-06-25 2021-06-25 Rolling bearing fault diagnosis method and system based on relational knowledge distillation

Publications (2)

Publication Number Publication Date
CN113281048A true CN113281048A (en) 2021-08-20
CN113281048B CN113281048B (en) 2022-03-29

Family

ID=77285707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110716619.2A Active CN113281048B (en) 2021-06-25 2021-06-25 Rolling bearing fault diagnosis method and system based on relational knowledge distillation

Country Status (1)

Country Link
CN (1) CN113281048B (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180268292A1 (en) * 2017-03-17 2018-09-20 Nec Laboratories America, Inc. Learning efficient object detection models with knowledge distillation
CN110162018A (en) * 2019-05-31 2019-08-23 天津开发区精诺瀚海数据科技有限公司 The increment type equipment fault diagnosis method that knowledge based distillation is shared with hidden layer
CN111199242A (en) * 2019-12-18 2020-05-26 浙江工业大学 Image increment learning method based on dynamic correction vector
CN112504678A (en) * 2020-11-12 2021-03-16 重庆科技学院 Motor bearing fault diagnosis method based on knowledge distillation
CN112508169A (en) * 2020-11-13 2021-03-16 华为技术有限公司 Knowledge distillation method and system
CN112560631A (en) * 2020-12-09 2021-03-26 昆明理工大学 Knowledge distillation-based pedestrian re-identification method
CN112560693A (en) * 2020-12-17 2021-03-26 华中科技大学 Highway foreign matter identification method and system based on deep learning target detection
CN112712052A (en) * 2021-01-13 2021-04-27 安徽水天信息科技有限公司 Method for detecting and identifying weak target in airport panoramic video

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Wonpyo Park: "Relational Knowledge Distillation", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) *
Yiwei Cheng: "Intelligent fault diagnosis of rotating machinery based on continuous wavelet transform-local binary convolutional neural network", Knowledge-Based Systems *
袁泽昊: "Human pose estimation based on feature knowledge distillation", 《软件》 *
高钦泉: "Compression method for super-resolution convolutional neural networks based on knowledge distillation", 《计算机应用》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022217853A1 (en) * 2021-04-16 2022-10-20 Huawei Technologies Co., Ltd. Methods, devices and media for improving knowledge distillation using intermediate representations
CN113849641A (en) * 2021-09-26 2021-12-28 中山大学 Knowledge distillation method and system for cross-domain hierarchical relationship
CN113849641B (en) * 2021-09-26 2023-10-24 中山大学 Knowledge distillation method and system for cross-domain hierarchical relationship
CN114152441A (en) * 2021-12-13 2022-03-08 山东大学 Rolling bearing fault diagnosis method and system based on shift window converter network
CN114429153A (en) * 2021-12-31 2022-05-03 苏州大学 Lifetime learning-based gearbox increment fault diagnosis method and system
CN114429153B (en) * 2021-12-31 2023-04-28 苏州大学 Gear box increment fault diagnosis method and system based on life learning
CN114092918A (en) * 2022-01-11 2022-02-25 深圳佑驾创新科技有限公司 Model training method, device, equipment and storage medium
CN114722886A (en) * 2022-06-10 2022-07-08 四川大学 Knowledge distillation-based crankshaft internal defect detection method and detection equipment
CN116110022A (en) * 2022-12-10 2023-05-12 河南工业大学 Lightweight traffic sign detection method and system based on response knowledge distillation
CN116110022B (en) * 2022-12-10 2023-09-05 河南工业大学 Lightweight traffic sign detection method and system based on response knowledge distillation
CN116189874A (en) * 2023-03-03 2023-05-30 海南大学 Telemedicine system data sharing method based on federal learning and federation chain
CN116189874B (en) * 2023-03-03 2023-11-28 海南大学 Telemedicine system data sharing method based on federal learning and federation chain

Also Published As

Publication number Publication date
CN113281048B (en) 2022-03-29

Similar Documents

Publication Publication Date Title
CN113281048B (en) Rolling bearing fault diagnosis method and system based on relational knowledge distillation
CN109580215B (en) Wind power transmission system fault diagnosis method based on deep generation countermeasure network
Li et al. Self-attention ConvLSTM and its application in RUL prediction of rolling bearings
Han et al. Intelligent fault diagnosis method for rotating machinery via dictionary learning and sparse representation-based classification
CN110427654B (en) Landslide prediction model construction method and system based on sensitive state
Maschler et al. Continual learning of fault prediction for turbofan engines using deep learning with elastic weight consolidation
CN111458142A (en) Sliding bearing fault diagnosis method based on generation of countermeasure network and convolutional neural network
CN114549925A (en) Sea wave effective wave height time sequence prediction method based on deep learning
CN114004252A (en) Bearing fault diagnosis method, device and equipment
CN113255432B (en) Turbine vibration fault diagnosis method based on deep neural network and manifold alignment
Tian et al. A multilevel convolutional recurrent neural network for blade icing detection of wind turbine
CN113076920B (en) Intelligent fault diagnosis method based on asymmetric domain confrontation self-adaptive model
CN116451150A (en) Equipment fault diagnosis method based on semi-supervised small sample
CN114048688A (en) Method for predicting service life of bearing of wind power generator
Stojanovic et al. Semi-supervised learning for structured regression on partially observed attributed graphs
CN112784920A (en) Cloud-side-end-coordinated dual-anti-domain self-adaptive fault diagnosis method for rotating part
Du et al. DCGAN based data generation for process monitoring
CN117708656B (en) Rolling bearing cross-domain fault diagnosis method for single source domain
CN114972904B (en) Zero sample knowledge distillation method and system based on fighting against triplet loss
Du et al. Convolutional neural network-based data anomaly detection considering class imbalance with limited data
Djaballah et al. Deep transfer learning for bearing fault diagnosis using CWT time–frequency images and convolutional neural networks
CN117113078A (en) Small sample bearing fault mode identification method and system based on multi-source data integration
CN116399592A (en) Bearing fault diagnosis method based on channel attention dual-path feature extraction
Long et al. A customized meta-learning framework for diagnosing new faults from unseen working conditions with few labeled data
CN115423045A (en) System log detection method and system based on GAN network and meta learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant