CN116610935A

CN116610935A - A Mechanical Fault Detection Method Based on Multimodal Analysis of Engine Vibration Signals

Info

Publication number: CN116610935A
Application number: CN202310554203.4A
Authority: CN
Inventors: 刘翔鹏; 李文杰; 袁非牛; 张相芬; 王心怡; 安康; 管西强; 张会
Original assignee: Shanghai Normal University
Current assignee: Shanghai Normal University
Priority date: 2023-05-17
Filing date: 2023-05-17
Publication date: 2023-08-18

Abstract

The invention relates to a mechanical fault detection method based on multimodal analysis of engine vibration signals. For an acquired diesel engine vibration signal, a multimodal feature extraction network first extracts image features related to abnormal signals from one-dimensional amplitude data and two-dimensional image data. A mixed-channel feature fusion detection network then splits the feature map into two groups and applies a spatial attention mechanism and a channel attention mechanism, respectively, to weight the feature maps in the spatial domain and the channel domain. The weighted feature maps are regrouped and merged to obtain multi-dimensionally weighted feature maps. Finally, a multi-scale detector examines the three feature maps simultaneously to determine whether the signal in the given period is abnormal. Compared with the prior art, the invention offers high accuracy and good noise resistance, among other advantages.

Description

A mechanical fault detection method based on multimodal analysis of engine vibration signals

Technical Field

The present invention relates to the technical field of machine learning, and in particular to a mechanical fault detection method based on multimodal analysis of engine vibration signals.

Background Art

Fault detection for diesel engines can extend their service life and improve operational safety, and therefore has significant economic and social value.

However, existing fault detection methods cannot cope with the strong noise encountered in practical applications and do not generalize across multiple operating conditions. Moreover, existing fault diagnosis models rely on a single data representation, for example using only one-dimensional data as input for anomaly analysis and detection. Such schemes mostly fail to consider the intrinsic correlations and distribution differences among different representations of the same data source, which limits their exploration of multi-source data.

Summary of the Invention

The purpose of the present invention is to overcome the above defects of the prior art by providing a mechanical fault detection method based on multimodal analysis of engine vibration signals.

The purpose of the present invention can be achieved by the following technical solution:

A mechanical fault detection method based on multimodal analysis of engine vibration signals, comprising the following steps:

collecting a vibration signal of the engine;

inputting the vibration signal into a multimodal feature extraction network to obtain a plurality of pieces of feature information;

selecting p pieces of feature information and inputting them into a mixed-channel feature fusion detection network for secondary feature processing and detection, and outputting a detection result;

wherein the multimodal feature extraction network comprises a shaping network, a convolution module, a fully connected module and a multimodal Transformer module;

the shaping network converts the input engine vibration signal into a two-dimensional image, and the convolution module extracts features from the two-dimensional image to obtain a two-dimensional feature image;

the fully connected module extracts a one-dimensional feature vector from the input engine vibration signal, and the multimodal Transformer module integrates the one-dimensional feature vector and the two-dimensional feature image;

the mixed-channel feature fusion detection network comprises a feature aggregation module, a feature mixing module and a multi-scale detection module;

the feature aggregation module aggregates the p pieces of feature information to obtain an aggregated feature map FM;

the feature mixing module performs feature mixing on the aggregated feature map FM to obtain a plurality of feature maps of different sizes;

the multi-scale detection module examines the mixed feature maps and determines whether the diesel engine cylinder produced an abnormal vibration signal during the current period by detecting whether the feature maps contain abnormal, irregular texture regions.

Further, the multimodal feature extraction network has q layers, each of which comprises a convolution module, a fully connected module and a multimodal Transformer module.

Further, the multimodal feature extraction network extracts features from the vibration signal to obtain the plurality of pieces of feature information through the following steps:

S1. The vibration signal is input into both the shaping network and the fully connected module; the shaping network outputs a two-dimensional image, which is input into the convolution module for feature extraction;

S2. The fully connected module outputs a one-dimensional feature vector;

S3. The convolution module outputs a two-dimensional feature image;

S4. The one-dimensional feature vector and the two-dimensional feature image are input into the multimodal Transformer module to obtain integrated feature information;

S5. The integrated feature information and the two-dimensional feature image are input into the convolution module of the next layer, and the one-dimensional feature vector is input into the fully connected module of the next layer;

S6. Steps S2 to S5 are repeated until the q-th layer of the multimodal feature extraction network is reached, so that features are extracted layer by layer and the plurality of pieces of feature information is obtained.
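The data flow of steps S1 to S6 can be sketched as a simple loop. This is a minimal illustration rather than the patented implementation: the function name `extract_features`, the `combine` callable and the decision to collect each layer's convolution output are assumptions made for illustration, and the source does not specify how the integrated feature information and the two-dimensional feature image are combined before the next convolution module.

```python
from typing import Any, Callable, List, Sequence


def extract_features(
    signal: Any,
    reshape: Callable[[Any], Any],
    conv_layers: Sequence[Callable[[Any], Any]],
    fc_layers: Sequence[Callable[[Any], Any]],
    mmt_layers: Sequence[Callable[[Any, Any], Any]],
    combine: Callable[[Any, Any], Any],
) -> List[Any]:
    """Run the q-layer flow of steps S1-S6 and return each layer's 2D feature image."""
    if not (len(conv_layers) == len(fc_layers) == len(mmt_layers)):
        raise ValueError("each layer needs one conv, one FC and one Transformer module")

    image_in = reshape(signal)  # S1: shaping network turns the signal into a 2D image
    vector_in = signal          # S1: the raw signal also feeds the fully connected branch
    feature_maps = []
    for conv, fc, mmt in zip(conv_layers, fc_layers, mmt_layers):
        vector = fc(vector_in)            # S2: one-dimensional feature vector
        feature_map = conv(image_in)      # S3: two-dimensional feature image
        fused = mmt(vector, feature_map)  # S4: integrated feature information
        feature_maps.append(feature_map)
        # S5: inputs of the next layer (the combination rule is an assumption)
        image_in = combine(fused, feature_map)
        vector_in = vector
    return feature_maps                   # S6: one feature map per layer
```

Passing the modules as callables keeps the sketch independent of any deep learning framework; in practice each callable would be a trained network module.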

Further, the multimodal Transformer module comprises two multi-head attention networks, which respectively model the long-range interactions between the one-dimensional feature vector and the two-dimensional feature image.

Further, the multimodal Transformer module integrates the one-dimensional feature vector and the two-dimensional feature image through the following steps:

dividing the one-dimensional feature vector into n token sub-vectors;

dividing the two-dimensional feature image equally into n feature blocks and flattening the feature blocks into n token sub-vectors;

in the first multi-head attention network, inputting the tokens of the two-dimensional feature image into the query projection matrix and the tokens of the one-dimensional feature vector into the key and value projection matrices; computing the matching degree between the resulting query matrix Q, which carries image feature information, and the key matrix K, which carries one-dimensional amplitude feature information; and applying the matching degree to the corresponding value matrix V, thereby mapping the image features onto the amplitude features;

in the second multi-head attention network, inputting the tokens of the one-dimensional feature vector into the query projection matrix and the tokens of the two-dimensional feature image into the key and value projection matrices; computing the matching degree between the resulting query matrix Q, which carries amplitude feature information, and the key matrix K, which carries image feature information; and applying the matching degree to the corresponding value matrix V, thereby mapping the amplitude features onto the image features;

wherein, in the two multi-head attention networks, the two sets of Q, K and V are computed by the respective query, key and value projection matrices of each network;

merging the one-dimensional feature vectors output by the two multi-head attention networks, applying a fully connected layer with a ReLU activation function to the merged features, and finally applying a shaping operation to the resulting 1×1×C vector to obtain a feature map of dimensions H×W×C.

Further, the feature aggregation module aggregates the feature maps FM₁, FM₂ and FM₃ output by the three bottom-most convolution modules of the multimodal feature extraction network through the following steps:

enhancing the feature information of each feature map;

applying a deconvolution to FM₃ that doubles the feature-map size and halves the number of channels, and assigning the result to FM₂ by add fusion to obtain FM′₂;

upsampling and channel-compressing the FM₂ feature map and fusing it with the features of the FM₁ feature map to obtain FM′₁;

merging the three feature maps FM′₁, FM′₂ and FM₃ with a concat layer, during which FM′₁ is downsampled and FM₃ is upsampled, to obtain the aggregated feature map FM.

Further, the feature mixing module performs feature mixing on the aggregated feature map FM through the following steps:

compressing the number of feature channels of FM with a 1x1 convolution layer to match that of FM′₂, and splitting FM evenly with a grouped convolution into two feature maps, FM_1 and FM_2, each with the same size as FM and half its number of channels;

performing feature extraction on the grouped feature maps FM_1 and FM_2 to obtain FM_1′ and FM_2′;

using identical parallel convolution modules to extract feature maps of different sizes and channel counts from FM_1′ and FM_2′, which form the pairing groups;

merging and mixing the feature maps in each pairing group with a fusion module consisting of a concat layer and a 1x1 convolution layer, to obtain three feature maps of different sizes, FM1, FM2 and FM3.

Further, spatial attention is applied to the feature map FM_1 to enhance the morphological feature information of the abnormal signal, giving FM_1′; channel attention is applied to FM_2 to raise the semantic feature weights of the abnormal signal, giving FM_2′;

the feature maps extracted from FM_1′ contribute spatial-attention-weighted feature values to the pairing groups, and the feature maps extracted from FM_2′ contribute channel-attention-weighted feature values.

Further, the multi-scale detection module examines the feature maps output by the feature mixing module; a cross-entropy classification loss is used to train the classification module of the detector, and the position loss is computed from the CIoU position evaluation metric.

Compared with the prior art, the present invention has the following beneficial effects:

The present invention takes the multimodal nature of the engine vibration signal into account. The multimodal feature extraction network extracts a one-dimensional amplitude vector and a two-dimensional feature vector from the engine vibration signal, and the mixed-channel feature fusion detection network performs secondary processing and detection on the extracted features. The mixed-channel feature fusion module contains a dual attention mechanism over the spatial and channel dimensions, which weights the feature maps in the spatial domain and the channel domain to suppress vibration noise. The network can therefore resist the influence of environmental noise and changing operating conditions on the final detection result, is suited to the strong noise found in practical applications, and generalizes across multiple operating conditions;

The accuracy and noise resistance of the present invention are clearly better than those of the prior art. On four data sets constructed under different operating conditions, the accuracy of the present invention is at least 99.008% even at a signal-to-noise ratio of -4 dB, far higher than that of the other methods.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of the structure of the multimodal feature extraction network of the present invention.

FIG. 2 shows two-dimensional images of the waveform signal of a cylinder under normal and abnormal operation in an embodiment of the present invention, where (2a) shows normal operation and (2b) shows abnormal operation.

FIG. 3 is a schematic diagram of the structure of the multimodal Transformer module in an embodiment of the present invention.

FIG. 4 is a schematic diagram of the structure of the mixed-channel feature fusion detection network in an embodiment of the present invention.

FIG. 5 is a schematic diagram of the MITDCNN network structure in an embodiment of the present invention.

FIG. 6 shows a typical time-domain signal of a single-cylinder misfire of a diesel engine running at 1800 rpm in an embodiment of the present invention.

FIG. 7 shows the time-domain signal of a single-cylinder misfire of a diesel engine running at 1800 rpm after conversion into a two-dimensional image in an embodiment of the present invention.

FIG. 8 is the loss iteration curve of network 3 (MITDCNN) in an embodiment of the present invention.

FIG. 9 is the AP iteration curve of network 3 (MITDCNN) in an embodiment of the present invention.

Detailed Description

The present invention is described in detail below with reference to the accompanying drawings and a specific embodiment. The embodiment is implemented on the basis of the technical solution of the present invention and gives a detailed implementation and specific operating procedure, but the scope of protection of the present invention is not limited to the embodiment described below.

To solve the problems of the prior art, the present invention proposes a convolutional neural network based on multimodal Transformer feature extraction (MITDCNN) for diesel engine misfire diagnosis under strong environmental noise and varying operating conditions.

In this embodiment, vibration signals of the engine cylinder head are collected experimentally at different speeds. One-dimensional amplitude vector features and two-dimensional image features are extracted from them and input into the multimodal feature extraction network, and the extracted features then undergo secondary processing and detection in the mixed-channel feature fusion detection network. The mixed-channel feature fusion module contains a dual attention mechanism over the spatial and channel dimensions that suppresses vibration noise, so that the network can resist the influence of environmental noise and changing operating conditions on the final detection result. The effectiveness of the proposed method is verified on the experimentally collected data sets and by comparison with existing representative algorithms. The results show that the accuracy and noise resistance of the proposed MITDCNN are clearly better than those of the existing algorithms. On the four data sets constructed under different operating conditions, the accuracy of the proposed method is at least 99.008% even at a signal-to-noise ratio of -4 dB, far higher than that of the other methods.

To make the network more sensitive to abnormal signals, the present invention designs a multimodal feature extraction network for detecting abnormal signals of diesel engines. Unlike a single-modal feature extraction network, the designed network extracts features from one-dimensional and two-dimensional data simultaneously, which enriches the extracted feature information. The overall network structure is shown in FIG. 1.

As shown in FIG. 1, the data acquired from the diesel engine cylinder is a waveform signal. First, a shaping network (ReshapeNet) converts the one-dimensional waveform signal into a two-dimensional image according to time and amplitude. Second, the amplitudes of the waveform signal are sampled per time unit to form a one-dimensional vector. This yields two different representations of the data for the same time period.
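The source does not specify how ReshapeNet maps time and amplitude to pixels. As a minimal sketch under stated assumptions, the example below uses one common choice: min-max normalize the amplitudes to 8-bit grey levels and fill the image row by row in time order. The function name `signal_to_image`, the row-major layout and the 0 to 255 scaling are all hypothetical and are not taken from the patent.

```python
from typing import List, Sequence


def signal_to_image(amplitudes: Sequence[float], width: int) -> List[List[int]]:
    """Reshape a 1D amplitude sequence into a 2D grey-level image (row-major)."""
    if width <= 0:
        raise ValueError("width must be positive")
    rows = len(amplitudes) // width
    if rows == 0:
        raise ValueError("signal is shorter than one image row")

    # Drop trailing samples that do not fill a complete row.
    samples = list(amplitudes[: rows * width])

    # Min-max normalize to 0..255; a constant signal maps to all zeros.
    low, high = min(samples), max(samples)
    span = (high - low) or 1.0
    pixels = [round(255 * (a - low) / span) for a in samples]

    # Consecutive time windows become consecutive image rows.
    return [pixels[r * width:(r + 1) * width] for r in range(rows)]


image = signal_to_image([0.0, 1.0, 2.0, 5.0, 9.9], width=2)
print(image)  # [[0, 51], [102, 255]]
```

With a layout of this kind, a periodic signal produces a repeating texture, and an irregular burst appears as a locally disturbed patch, which is the property the detection stage relies on.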

The waveform signal output by a cylinder in normal operation is regular, so the two-dimensional image converted from it has a regular texture. When the cylinder operates abnormally, irregular regions appear in the converted image. The two-dimensional images obtained under normal and abnormal operation are shown in FIG. 2.

Similarly, for the one-dimensional vector data, the values change irregularly when the cylinder is abnormal. As shown in FIG. 1, the network extracts and aggregates features along two directions, the x-axis and the y-axis. Along the x-axis, the waveform signal is reshaped into a two-dimensional image, from which the convolution module (ConvNet) extracts a two-dimensional feature image; this feature image is input into the multimodal Transformer module (Multi-Modal Transformer). The output features of the multimodal Transformer module are aggregated with the output of the convolution module, so that the multimodal features are mapped into the feature map. In the backbone, six such structures are stacked to form the multimodal feature extraction network. Along the y-axis, the extracted one-dimensional vector is processed by the fully connected module (FCNet) to obtain a one-dimensional feature vector that captures the relationship between time sequence and amplitude. This feature vector is input into the multimodal Transformer module, where it complements the feature information of the two-dimensional image so that the network learns the characteristic patterns of abnormal waveforms.

In summary, the multimodal feature extraction network primarily extracts the feature information of the two-dimensional image, supplemented by the extraction and fusion of one-dimensional temporal features, which enriches the network's features. The core unit of the network is the multimodal Transformer module, which builds the association between the one-dimensional and two-dimensional input features.

For one-dimensional temporal features, the multimodal Transformer module proposed here handles long-range dependencies better than a recurrent neural network and can capture feature information over longer sequences. When applied to image features, it can extract the correlations among the features of different image regions, that is, global features. To fuse the one-dimensional and two-dimensional inputs, the present invention designs a multimodal Transformer module that performs multi-head attention on both types of input simultaneously and fuses the extracted feature information. The structure of the module is shown in FIG. 3.

As shown in FIG. 3, the multimodal Transformer module contains two multi-head attention networks (Multi Head Attention), which respectively model the long-range interactions between the one-dimensional feature vector and the two-dimensional feature image.

The two types of data input into the multimodal Transformer module are preprocessed as follows. The one-dimensional amplitude vector is passed through the fully connected layer to obtain a one-dimensional feature vector, which is divided into n token sub-vectors. The two-dimensional image is passed through the convolution module to obtain a feature map, which is divided equally into n feature blocks; the feature blocks are then flattened into n one-dimensional token vectors.
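The two preprocessing steps can be illustrated with plain Python lists. This is a sketch under assumptions: the feature map is treated as single-channel, blocks are non-overlapping rectangles taken in row-major order, and the helper names `split_vector` and `split_feature_map` are hypothetical.

```python
from typing import List, Sequence


def split_vector(vector: Sequence[float], n: int) -> List[List[float]]:
    """Divide a 1D feature vector into n equal-length token sub-vectors."""
    if n <= 0 or len(vector) % n:
        raise ValueError("vector length must be a positive multiple of n")
    size = len(vector) // n
    return [list(vector[i * size:(i + 1) * size]) for i in range(n)]


def split_feature_map(
    feature_map: Sequence[Sequence[float]], block_h: int, block_w: int
) -> List[List[float]]:
    """Cut a 2D feature map into equal blocks and flatten each block into a token."""
    height, width = len(feature_map), len(feature_map[0])
    if height % block_h or width % block_w:
        raise ValueError("feature map must divide evenly into blocks")
    tokens = []
    for top in range(0, height, block_h):
        for left in range(0, width, block_w):
            tokens.append([
                feature_map[r][c]
                for r in range(top, top + block_h)
                for c in range(left, left + block_w)
            ])
    return tokens


vector_tokens = split_vector(list(range(8)), n=4)
image_tokens = split_feature_map(
    [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]], 2, 2
)
print(vector_tokens)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(image_tokens)   # [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

Both modalities end up with the same token count n, which is what allows the two attention networks to pair them.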

In the two Transformer multi-head attention networks, the two corresponding sets of Q, K and V are computed by six projection matrices, that is, a query, a key and a value projection matrix for each network.

First, in the first Transformer module (Transformer-1 Multi Head Attention), the tokens of the two-dimensional feature image obtained from the converted image are input into the query projection matrix, and the tokens of the one-dimensional amplitude vector are input into the key and value projection matrices. The matching degree between the resulting query matrix Q, which carries image feature information, and the key matrix K, which carries one-dimensional amplitude feature information, is computed and applied to the corresponding value matrix V, which completes the mapping of the image features onto the amplitude features;

Similarly, the tokens of the one-dimensional amplitude features are input into the query projection matrix of the second Transformer module (Transformer-2 Multi Head Attention), and the tokens of the image features are input into its key and value projection matrices, thereby mapping the amplitude features into the feature map.

The one-dimensional feature vectors output by the multi-head attention networks are then merged, and a fully connected layer (FCLayer) with a ReLU activation function is applied to the merged features to strengthen their nonlinearity. Finally, a shaping operation is applied to the resulting 1×1×C vector to obtain a feature map of dimensions H×W×C.
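The bidirectional cross-attention described above can be sketched in a few lines. The example is a deliberately reduced illustration, not the patented module: it uses a single head, omits the learned query, key and value projection matrices (so keys and values are the raw context tokens), assumes both modalities share the same token length, and merges the two outputs by concatenation. The function names are hypothetical, and the fully connected layer, ReLU and final shaping step are left out.

```python
import math
from typing import List

Matrix = List[List[float]]


def cross_attention(query_tokens: Matrix, context_tokens: Matrix) -> Matrix:
    """Scaled dot-product attention in which queries attend to another modality."""
    dim = len(query_tokens[0])
    output = []
    for q in query_tokens:
        # Matching degree between this query and every key (context token).
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in context_tokens
        ]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Apply the weights to the values (here: the context tokens themselves).
        output.append([
            sum(w * v[j] for w, v in zip(weights, context_tokens))
            for j in range(len(context_tokens[0]))
        ])
    return output


def fuse_modalities(image_tokens: Matrix, amplitude_tokens: Matrix) -> Matrix:
    """Run attention in both directions and concatenate the results per token."""
    image_to_amplitude = cross_attention(image_tokens, amplitude_tokens)
    amplitude_to_image = cross_attention(amplitude_tokens, image_tokens)
    return [a + b for a, b in zip(image_to_amplitude, amplitude_to_image)]
```

Each output row is a weighted average of the other modality's tokens, so information flows from amplitude to image in one network and from image to amplitude in the other.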

After the multimodal features have been extracted, a mixed-channel feature fusion network merges network features from different levels to make the features more comprehensive. Because abnormal signals last for different lengths of time, the abnormal image patches they produce in the converted image also differ in size. For detecting abnormal signal images, the present invention therefore adopts a multi-scale detection scheme that examines feature maps of different sizes. The resulting network structure is shown in FIG. 4.

The mixed-channel feature fusion network can be divided into three parts: a feature aggregation module, a feature mixing module and a multi-scale detection module. The workflow and role of each part are as follows:

First, the feature aggregation module aggregates the feature maps output by the three bottom-most convolution modules of the multimodal feature extraction network. As shown in FIG. 4, the inputs to the feature aggregation module are three feature maps (Feature Map, FM) of different sizes, named FM₁, FM₂ and FM₃. In terms of features, the semantic content strengthens step by step from FM₁ to FM₃, so the aggregation first enhances the feature information of each feature map by upsampling (UpSample). A deconvolution doubles the size of FM₃ and halves its number of channels, and the result is assigned to FM₂ by add fusion to obtain FM′₂. In the same way, the fused FM₂ feature map is upsampled and channel-compressed, then fused with the features of the FM₁ feature map to obtain FM′₁. A concat layer then merges the three feature maps FM′₁, FM′₂ and FM₃; during merging, FM′₁ is downsampled (DownSample) and FM₃ is upsampled. The aggregated feature map FM has the same size as FM′₂ and contains feature information from all three levels, so it is richer in feature information than any of FM₁, FM₂ and FM₃.
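The shape bookkeeping implied by this aggregation can be checked with a short calculation. The sketch below rests on assumptions that the source does not state explicitly: element-wise add fusion requires matching shapes, so FM₂ is taken to be twice the size of FM₃ with half its channels, and FM₁ twice the size of FM₂ with half its channels; the resampling applied before the concat layer is assumed to leave channel counts unchanged. The function name `aggregation_shapes` and the example input shape are hypothetical.

```python
from typing import Dict, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def aggregation_shapes(fm3: Shape) -> Dict[str, Shape]:
    """Derive the shapes of FM'2, FM'1 and FM from the shape of FM3."""
    h, w, c = fm3
    if c % 4:
        raise ValueError("channel count of FM3 must be divisible by 4")

    # Deconvolution on FM3: size x2, channels /2, then added onto FM2.
    fm2_prime = (2 * h, 2 * w, c // 2)
    # Same step applied to the fused FM2, then fused with FM1.
    fm1_prime = (2 * fm2_prime[0], 2 * fm2_prime[1], fm2_prime[2] // 2)

    # Concat at the resolution of FM'2: FM'1 is downsampled, FM3 is upsampled,
    # and the channel counts add up.
    fm = (fm2_prime[0], fm2_prime[1], fm1_prime[2] + fm2_prime[2] + c)
    return {"FM'2": fm2_prime, "FM'1": fm1_prime, "FM": fm}


print(aggregation_shapes((13, 13, 512)))
# {"FM'2": (26, 26, 256), "FM'1": (52, 52, 128), "FM": (26, 26, 896)}
```

Under these assumptions the concatenated map carries more channels than FM′₂, which is consistent with the subsequent 1x1 convolution that compresses FM to the channel count of FM′₂.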

Second, the feature mixing module performs feature mixing on the aggregated feature map FM. A 1x1 convolution layer compresses the number of feature channels of FM to match that of FM′₂, and a grouped convolution then splits FM evenly into two feature maps, FM_1 and FM_2, each with the same size as FM and half its number of channels. Feature extraction is then applied to the grouped feature maps to strengthen the weights of the effective features. In this weighting step, spatial attention (Spatial Attention model) is applied to FM_1 to enhance the morphological feature information of the abnormal signal, giving FM_1′; channel attention (Channel Attention model) is applied to FM_2 to raise the semantic feature weights of the abnormal signal, giving FM_2′. Identical parallel convolution modules then extract feature maps of different sizes and channel counts from FM_1′ and FM_2′. Taking FM_1′ as an example: a 3x3 convolution with stride 2 yields the feature map FM_1′₁, with the same size as FM₁ and half its number of channels; a 3x3 convolution with stride 1 yields FM_1′₂, with the same size as FM₂ and half its number of channels; and a 3x3 deconvolution yields FM_1′₃, with the same size as FM₃ and likewise half its number of channels. In the same way, the parallel convolution module applied to FM_2′ yields FM_2′₁, FM_2′₂ and FM_2′₃. The feature-map parameters of FM_1′₁ and FM_2′₁, of FM_1′₂ and FM_2′₂, and of FM_1′₃ and FM_2′₃ are identical; these three pairs are called pairing groups. Each pairing group contains both spatial-attention-weighted and channel-attention-weighted feature values. A fusion module consisting of a concat layer followed by a 1x1 convolution layer merges and mixes the two feature maps in each pairing group, yielding three feature maps of different sizes: FM1, FM2 and FM3.

Finally, the multi-scale detection module examines the three mixed feature maps. By checking whether abnormal, irregular texture regions are present in the feature maps, it determines whether the diesel engine cylinder produced an abnormal vibration signal during the current period. The classification loss and regression loss used by the detector are as follows. The detection task of the present invention only determines whether the converted image region contains an abnormal signal, so it is a binary classification problem; a cross-entropy classification loss is used to train the classification module of the detector, with the loss function:

$$\mathrm{Loss} = -y\log(p) - (1-y)\log(1-p)$$
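
As a minimal sketch of the cross-entropy loss above, the function below evaluates it for a single region. It assumes the usual reading of the symbols, which the text does not spell out: y is the ground-truth label (1 for abnormal, 0 for normal) and p is the predicted probability of the abnormal class. The clipping constant `eps` is an added safeguard against log(0), not part of the formula.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Loss = -y*log(p) - (1-y)*log(1-p) for one sample."""
    # Clip p away from 0 and 1 so the logarithms stay finite (assumption).
    p = min(max(p, eps), 1.0 - eps)
    return -y * math.log(p) - (1 - y) * math.log(1 - p)


# A confident correct prediction costs less than a confident wrong one.
loss_good = binary_cross_entropy(1, 0.9)
loss_bad = binary_cross_entropy(1, 0.1)
```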

In the position loss calculation, the image block region generated by an abnormal signal is a regular rectangle with no complex boundary information. The position loss is therefore based on the CIoU position evaluation relationship, and its loss formula can be expressed as:

$$\mathrm{Loss} = 1 - \mathrm{CIoU}$$
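
The text gives only Loss = 1 - CIoU. The sketch below fills in CIoU with its commonly used definition (IoU minus a normalized center-distance term and an aspect-ratio consistency term); that definition, the (x1, y1, x2, y2) box format, and the small `eps` stabilizer are assumptions made for illustration, not details taken from the patent.

```python
import math


def ciou_loss(pred, target, eps=1e-9):
    """1 - CIoU for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Intersection over union.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Squared distance between box centers.
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
        + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    ciou = iou - rho2 / c2 - alpha * v
    return 1 - ciou


loss_same = ciou_loss((0, 0, 10, 10), (0, 0, 10, 10))
loss_apart = ciou_loss((0, 0, 1, 1), (2, 2, 3, 3))
```

Unlike plain IoU, the center-distance term keeps the loss informative when the boxes do not overlap, so in this example non-overlapping boxes receive a loss above 1.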

In summary, to detect diesel engine misfire signals more accurately, the present invention constructs the MITDCNN network from a multimodal feature extraction network and a mixed-channel feature fusion detection network; the overall network structure is shown in Figure 5. The network works as follows. For the extracted diesel engine vibration signal, the multimodal feature extraction network first extracts image features related to abnormal signals from the one-dimensional amplitude data and the two-dimensional image data. The mixed-channel feature fusion detection network then splits the feature map into two groups, applies a spatial attention mechanism and a channel attention mechanism to weight the feature maps in the spatial domain and the channel domain respectively, regroups the resulting feature maps, and merges them to obtain multi-dimensionally weighted feature maps. Finally, a multi-scale detector examines the three feature maps simultaneously to determine whether the signal in the period is in an abnormal state. The details of the constructed network are shown in Table 1 below:

Table 1. MITDCNN network layer structure

Based on the above, this embodiment studies diesel engine misfire at three operating speeds: 1300rpm, 1800rpm and 2200rpm, which simulate low-speed, medium-speed and high-speed operating conditions respectively. Because a misfire in three or more cylinders causes severe engine vibration that the operator can observe with the naked eye without any fault detection, this embodiment focuses on single-cylinder misfire and double-cylinder (mixed-cylinder) misfire. As shown in Table 2, single-cylinder misfire detection is performed at 1300rpm, 1800rpm and 2200rpm, and double-cylinder misfire detection is performed at 1800rpm. In addition to the misfire faults, each group includes normal operation as a reference for comparison. Common engine misfire faults largely correspond to the misfire types listed in Table 2. Because the data associated with these faults were collected fairly comprehensively during the experiments, the dataset used in this embodiment is broadly representative.

Table 2. Diesel engine failures at different operating speeds

In this embodiment, the vibration signal is collected at a sampling frequency of 25.6kHz. The sampling time for each misfire type is 41s, which covers at least 900 working cycles and yields 1,049,600 vibration sequence points in total. Typical time-domain signals of single-cylinder misfire at different operating speeds are shown in Figure 6. Because the differences between the time-domain signals of different misfire types under the same operating condition are extremely small, it is difficult to diagnose the specific misfire type directly from the time-domain signal. Computer vision is therefore used to analyze the "texture structure" of the signal and determine whether it is abnormal. Samples of the vibration signal after conversion into two-dimensional images are shown in Figure 7.

The dataset division for the four operating conditions is shown in Table 3. Datasets A, B and C correspond to single-cylinder misfire at operating speeds of 1300rpm, 1800rpm and 2200rpm respectively, and dataset D corresponds to mixed-cylinder misfire at 1800rpm. Since 1,049,600 vibration sequence points were collected for each misfire fault at each speed, 511 samples are obtained per label. Each dataset has five labels (Ⅰ, Ⅱ, Ⅲ, Ⅳ and Ⅴ), giving 2555 samples per dataset. Finally, the training set, test set and validation set account for 80%, 10% and 10% of each dataset respectively.
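
The sample counts above can be reproduced with simple arithmetic. In the sketch below, the sampling rate, duration, label count and the 511 and 2555 figures come from the text; the segment length of 4096 points and the stride of 2048 points are hypothetical values chosen only because they are one combination that yields 511 segments, since this passage does not state how the sequence is segmented.

```python
SAMPLING_RATE_HZ = 25_600   # 25.6 kHz, from the text
DURATION_S = 41             # sampling time per misfire type, from the text
LABELS_PER_DATASET = 5      # labels I to V, from the text

WINDOW = 4096               # hypothetical segment length (assumption)
STRIDE = 2048               # hypothetical step between segments (assumption)


def count_windows(n_points, window, stride):
    """Number of full sliding windows that fit into n_points."""
    if n_points < window:
        return 0
    return (n_points - window) // stride + 1


total_points = SAMPLING_RATE_HZ * DURATION_S
samples_per_label = count_windows(total_points, WINDOW, STRIDE)
samples_per_dataset = samples_per_label * LABELS_PER_DATASET
```

With these values, `total_points` matches the 1,049,600 points stated in the text, and the assumed segmentation gives 511 samples per label and 2555 per dataset.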

Table 3. Dataset division under different operating conditions

Secondly, the experimental environment used in this embodiment is shown in Table 4.

Table 4. Experimental environment

In this embodiment, the hardware environment uses an Intel i7 12100 central processor and an Nvidia RTX 3080 graphics processor with 8704 CUDA cores, 184 Tensor cores and 10GB of video memory. On the software side, PyTorch 1.11.0 is used as the deep learning framework, with CUDA version 11.3 and the cuDNN acceleration library version 8.2.1.

This embodiment uses the following evaluation metrics to assess the performance of the network model:

(1) Precision: evaluates how accurately the model judges whether a signal is abnormal. The calculation formula is:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

(2) Recall: evaluates the sensitivity of the model to abnormal signals. The calculation formula is:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

(3) F1-Score: the harmonic mean of Precision and Recall. The calculation formula is:

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

(4) Average Precision (AP): evaluates the average detection precision of the model as the area under the Precision-Recall curve, reflecting the overall performance of the model. The calculation formula is:

$$\mathrm{AP} = \int_{0}^{1} P(R)\,dR$$

In the formulas above, TP, FP and FN are elements of the confusion matrix, with the following meanings:

TP: the number of correctly detected targets, that is, the number of abnormal-signal image regions that are correctly detected;

FP: the number of falsely detected targets, that is, the number of normal-signal image regions judged to be abnormal;

FN: the number of missed targets, that is, the number of abnormal-signal image regions that are not detected.
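
The four metrics can be computed from these counts as sketched below, using the standard definitions Precision = TP/(TP+FP), Recall = TP/(TP+FN), and F1 as their harmonic mean. The trapezoidal integration used for AP is an assumption: the text says only that AP is the area under the Precision-Recall curve and does not specify the interpolation scheme.

```python
def precision(tp, fp):
    """Fraction of detected regions that are truly abnormal."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of abnormal regions that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def average_precision(recalls, precisions):
    """Area under the precision-recall curve (trapezoidal rule, an assumption)."""
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


# Consistency check against the Network 1 figures reported for dataset A:
# Precision 83.128% and Recall 84.035% should give an F1-Score of about 83.579%.
f1_network1_a = f1_score(0.83128, 0.84035)
```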

In the ablation experiment, the MITDCNN network designed in the present invention is first decomposed into the following three networks:

Network 1: the feature extraction network uses only the convolution module to extract two-dimensional image features; the mixed-channel feature fusion network is not used, and only the detection network is attached;

Network 2: the feature extraction network uses the multimodal feature extraction network to extract one-dimensional and two-dimensional features; the mixed-channel feature fusion network is not used, and only the detection network is attached;

Network 3: the multimodal feature extraction network combined with the mixed-channel feature fusion detection network, that is, the full MITDCNN network.

The three networks were evaluated with the above metrics on the engine vibration signal datasets at different operating speeds. The comparative test results are shown in Tables 5-8.

Table 5. Comparison on the 1300rpm single-cylinder low-speed condition dataset

The comparative test results of the different networks on dataset A are shown in Table 5. The Precision, Recall, F1-Score and AP of Network 1 are 83.128%, 84.035%, 83.579% and 88.268% respectively, all within the range [80%, 90%]. The Precision, Recall, F1-Score and AP of Network 2 all exceed 90%, which is 9.229, 11.017, 10.106 and 6.589 percentage points higher than Network 1 respectively. The Precision, Recall, F1-Score and AP of Network 3 are 99.738%, 99.888%, 99.812% and 99.926% respectively, extremely close to 100%, with a standard deviation within 1%.

Table 6. Comparison on the 1800rpm single-cylinder medium-speed condition dataset

The comparative test results of the different networks on dataset B are shown in Table 6. Network 1 has a Precision of 83.571% and a Recall of 84.268%. The F1-Score and AP of Network 2 are 93.724% and 94.968%, and those of Network 3 are 99.735% and 99.853%; all exceed 90%.

Table 7. Comparison on the 2200rpm single-cylinder high-speed condition dataset

The comparative test results of the different networks on dataset C are shown in Table 7. The Precision and Recall of Network 1 are 84.287% and 85.937% respectively, higher than the corresponding values of Network 1 on datasets A and B. The Recall, F1-Score and AP of Network 2 are all above 95%, at 96.872%, 95.201% and 95.587% respectively. The Precision, Recall, F1-Score and AP of Network 3 are 99.184%, 99.179%, 99.181% and 99.217% respectively, which is 14.897, 13.242, 14.077 and 11.292 percentage points higher than Network 1.

Table 8. Comparison on the 1800rpm mixed-cylinder medium-speed condition dataset

The comparative test results of the different networks on dataset D are shown in Table 8. The Precision, Recall, F1-Score and AP of Network 1 are 81.918%, 84.268%, 83.076% and 86.687% respectively, with a standard deviation within 4%. The Precision, Recall, F1-Score and AP of Network 2 are 8.619, 11.225, 9.872 and 4.571 percentage points higher than those of Network 1. The Precision, Recall, F1-Score and AP of Network 3 are all above 99%, which is 8.473, 4.089, 6.347 and 8.481 percentage points higher than those of Network 2.
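
The dataset D figures for Networks 2 and 3 are given only as improvements, so the short calculation below derives the implied values. It assumes the improvements are absolute percentage-point differences, which is how the dataset C figures behave (99.184 - 84.287 = 14.897). The derived numbers are arithmetic consequences of that assumption and are not taken from Table 8.

```python
METRICS = ("Precision", "Recall", "F1-Score", "AP")

# Values stated in the text for dataset D (percent).
network1_d = dict(zip(METRICS, (81.918, 84.268, 83.076, 86.687)))
gain_2_over_1 = dict(zip(METRICS, (8.619, 11.225, 9.872, 4.571)))
gain_3_over_2 = dict(zip(METRICS, (8.473, 4.089, 6.347, 8.481)))


def add_gains(base, gains):
    """Add percentage-point gains to a set of baseline metrics."""
    return {name: round(base[name] + gains[name], 3) for name in base}


# Implied values under the percentage-point assumption.
network2_d = add_gains(network1_d, gain_2_over_1)
network3_d = add_gains(network2_d, gain_3_over_2)
```

Under this reading every implied Network 3 metric exceeds 99%, which agrees with the statement in the text.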

Taken together, Tables 5-8 show the following. Network 1 uses only the convolution module to extract features from the converted image, and the gray-point texture map obtained from signal conversion does not clearly show the difference between normal and abnormal signals. Its Precision and Recall are therefore low on all four datasets, which in turn lowers its F1-Score and AP. Network 2 differs from Network 1 in using multimodal feature extraction: fusing the one-dimensional amplitude features into the backbone greatly enriches the features, and the Transformer module extracts global features that complement the local features of the convolution module, making the feature representation more complete. As a result, all metrics of Network 2 improve considerably over Network 1 in the tables above. Finally, Network 3 differs from Network 2 mainly in adding the mixed-channel feature fusion detection network, which mixes and aggregates the features extracted at different levels and weights them in both the spatial and channel domains, increasing the saliency of abnormal signal features. In the test and comparison results, Network 3 achieves the best values on every metric and also improves substantially over Network 2, which indicates that the mixed-channel feature fusion detection network contributes considerably to network performance.

Secondly, to simulate misfire fault signal detection in different noise environments, noise at different signal-to-noise ratios is added to the original dataset. Let P_signal and P_noise be the signal and noise energy respectively; the signal-to-noise ratio SNR, in dB, is then defined as:

$$\mathrm{SNR} = 10\log_{10}\left(\frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}}\right)$$

In the noise ablation experiment, Gaussian white noise at different signal-to-noise ratios (-4, -2, 0, 2, 4, 6, 8 and 10dB) was added to the original signals for the different engine operating speeds and cylinder configurations, and the detection performance of each network on abnormal signals was compared across operating conditions and noise environments. The test results of the three networks are shown in Tables 9-11 respectively.
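
One common way to add Gaussian white noise at a prescribed signal-to-noise ratio is sketched below. It assumes the usual decibel definition SNR = 10·log10(P_signal / P_noise) with P taken as mean squared amplitude; the function names and the sine-wave test signal are hypothetical and stand in for a real vibration record.

```python
import math
import random

SNR_LEVELS_DB = (-4, -2, 0, 2, 4, 6, 8, 10)  # levels listed in the text


def signal_power(samples):
    """Mean squared amplitude of a sequence."""
    return sum(v * v for v in samples) / len(samples)


def add_white_gaussian_noise(signal, snr_db, rng):
    """Return signal plus zero-mean Gaussian noise at the requested SNR (dB)."""
    p_noise = signal_power(signal) / (10 ** (snr_db / 10))
    sigma = math.sqrt(p_noise)
    return [v + rng.gauss(0.0, sigma) for v in signal]


def measured_snr_db(clean, noisy):
    """SNR actually realized, computed from the added noise."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10 * math.log10(signal_power(clean) / signal_power(noise))


rng = random.Random(0)  # fixed seed for reproducibility
# Stand-in signal: a 50 Hz sine sampled at 25.6 kHz.
clean = [math.sin(2 * math.pi * 50 * t / 25_600) for t in range(20_000)]
noisy = add_white_gaussian_noise(clean, snr_db=-4, rng=rng)
realized = measured_snr_db(clean, noisy)
```

Because the noise is random, the realized SNR only approximates the target; it gets closer as the number of samples grows.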

Table 9. Detection performance of Network 1 in each operating environment under different signal-to-noise ratios

Table 9 shows the detection performance of Network 1 on the different datasets under noise of different intensities. On dataset A, its average Precision, Recall, F1-Score (which measures the model's ability to find positive examples) and AP are 76.636%, 78.058%, 77.340% and 82.038%. On datasets B and C its Precision values are very close, at 75.616% and 75.278% respectively. On dataset D its Precision is 73.350%, the lowest of the four datasets. Under -4dB noise, the averages of Network 1's Precision, Recall, F1-Score and AP over the four datasets lie in the range [70%, 80%], at 70.760%, 72.757%, 71.743% and 76.995% respectively.

Table 10. Detection performance of Network 2 in each operating environment under different signal-to-noise ratios

The detection performance of Network 2 on the different datasets under noise of different intensities is shown in Table 10. On dataset A its Recall is 91.786%, above 90% and 13.728 percentage points higher than the Recall of Network 1 on dataset A. On datasets B, C and D its average AP is 92.482%, 92.965% and 89.345% respectively, and the standard deviation on dataset C is 1.414%. Under -2dB noise, the averages of Network 2's Recall, F1-Score and AP over the four datasets are all above 90%, at 91.940%, 90.845% and 92.225% respectively, which is 1.934, 1.587 and 1.455 percentage points higher than the corresponding averages under 8dB noise.

Table 11. Detection performance of Network 3 in each operating environment under different signal-to-noise ratios

Table 11 shows the detection performance of Network 3 on the different datasets under noise of different intensities. On dataset A, its average Precision, Recall, F1-Score and AP are all above 99.7%, at 99.711%, 99.754%, 99.732% and 99.856% respectively. On datasets B and C, its average Precision, Recall, F1-Score and AP all lie in the range [99.1%, 99.9%], extremely close to 100%. On dataset D, its average Precision, Recall, F1-Score and AP are 98.995%, 99.304%, 99.149% and 99.699%, far higher than the corresponding averages of Networks 1 and 2 on dataset D.

Comparing the test results of the three networks under different operating conditions and noise environments leads to the following conclusions. Network 1 extracts features with the convolution module alone, and the image texture features obtained after noise is added are strongly disturbed, so its performance metrics fluctuate considerably in the test results. Network 2 adopts multimodal feature extraction, and the added one-dimensional features improve the noise resistance of the network to some extent. Network 3 adds the mixed-channel feature fusion module, which includes spatial attention and channel attention calculations and can suppress noise in both the spatial and channel domains; as a result, the noise-induced fluctuations in all metrics of Network 3 are small, demonstrating strong robustness. In addition, the curves of the loss function value and the precision value over training iterations for Network 3 (MITDCNN) are shown in Figures 8 and 9.

In this embodiment, the following optimal hyperparameter settings were obtained through repeated experiments: the number of epochs is 200; the initial learning rate is 0.001 and gradually decreases to 0.00001 during training; the momentum parameter is 0.954; and the weight decay rate is 0.0013. As Figures 8-9 show, both the loss and the model accuracy remain stable after 100 epochs. The loss function value drops and converges quickly in the early epochs while the AP value rises quickly, indicating that the network learns the target features effectively.
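
The text states the start and end learning rates but not the decay rule. The sketch below uses cosine annealing purely as an illustrative assumption for how the rate might move from 0.001 to 0.00001 over 200 epochs; the dictionary simply records the hyperparameters quoted above, and the function name is hypothetical.

```python
import math

# Hyperparameters quoted in the text.
HYPERPARAMETERS = {
    "epochs": 200,
    "lr_initial": 1e-3,
    "lr_final": 1e-5,
    "momentum": 0.954,
    "weight_decay": 0.0013,
}


def learning_rate(epoch, epochs=200, lr_start=1e-3, lr_end=1e-5):
    """Cosine-annealed learning rate for a 0-based epoch index (assumed schedule)."""
    progress = epoch / (epochs - 1)
    return lr_end + 0.5 * (lr_start - lr_end) * (1 + math.cos(math.pi * progress))


schedule = [
    learning_rate(
        epoch,
        HYPERPARAMETERS["epochs"],
        HYPERPARAMETERS["lr_initial"],
        HYPERPARAMETERS["lr_final"],
    )
    for epoch in range(HYPERPARAMETERS["epochs"])
]
```

Any monotone decay with the same endpoints, such as step or exponential decay, would be equally consistent with the description.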

The preferred specific embodiments of the present invention have been described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and changes based on the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain through logical analysis, reasoning or limited experiments on the basis of the prior art and in accordance with the concept of the present invention shall fall within the scope of protection defined by the claims.

Claims

1. A mechanical fault detection method based on multimodal analysis of engine vibration signals, characterized in that it comprises the following steps:

Collecting vibration signals of the engine;

Inputting the vibration signal into a multimodal feature extraction network to obtain multiple feature information;

Selecting p pieces of feature information and inputting them into the mixed-channel feature fusion detection network for secondary feature processing and detection, and outputting the detection result;

The multimodal feature extraction network includes a shaping network, a convolution module, a fully connected module and a multimodal Transformer module;

The shaping network is used to convert the input engine vibration signal into a two-dimensional image, and the convolution module performs feature extraction based on the two-dimensional image to obtain a two-dimensional feature image;

The fully connected module is used to extract a one-dimensional feature vector of the input engine vibration signal; the multimodal Transformer module is used to integrate the one-dimensional feature vector and the two-dimensional feature image;

The mixed-channel feature fusion detection network includes a feature aggregation module, a feature mixing module and a multi-scale detection module;

The feature aggregation module is used to aggregate the p feature information to obtain an aggregated feature map FM;

The feature mixing module is used to perform feature mixing on the aggregated feature map FM to obtain a plurality of feature maps of different sizes;

The multi-scale detection module is used to detect the mixed feature map, and to determine whether there is an abnormal vibration signal in the diesel engine cylinder during the current period by detecting whether there is an abnormal or irregular texture area in the feature map.

2. A mechanical fault detection method based on multimodal analysis of engine vibration signals according to claim 1, characterized in that the multimodal feature extraction network includes a q-layer structure, and each layer of the structure includes a convolution module, a fully connected module and a multimodal Transformer module.

3. A mechanical fault detection method based on multimodal analysis of engine vibration signals according to claim 2, characterized in that the multimodal feature extraction network extracts features from the vibration signal to obtain multiple feature information, comprising the following steps:

S1, the vibration signal is input into the shaping network and the fully connected module respectively, the shaping network outputs a two-dimensional image, and the two-dimensional image is input into the convolution module for feature extraction;

S2, the fully connected module outputs a one-dimensional feature vector;

S3, the convolution module outputs a two-dimensional feature image;

S4, inputting the one-dimensional feature vector and the two-dimensional feature image into a multimodal Transformer module to obtain integrated feature information;

S5, inputting the integrated feature information and the two-dimensional feature image into the convolution module of the next layer, and inputting the one-dimensional feature vector into the fully connected module of the next layer;

S6. Repeat steps S2-S5 until the qth layer structure of the multimodal feature extraction network is reached, and perform feature extraction layer by layer to obtain multiple feature information.

4. A mechanical fault detection method based on multimodal analysis of engine vibration signals according to claim 1, characterized in that the multimodal Transformer module includes two multi-head attention networks, which respectively handle the long-distance relationship interaction between the one-dimensional feature vector and the two-dimensional feature image.

5. A mechanical fault detection method based on multimodal analysis of engine vibration signals according to claim 4, characterized in that the multimodal Transformer module integrates the one-dimensional feature vector and the two-dimensional feature image, comprising the following steps:

Dividing the one-dimensional feature vector into n token sub-vectors;

Dividing the two-dimensional feature image into n feature blocks, and flattening the feature blocks into n token sub-vectors;

In the first multi-head attention network, the tokens corresponding to the two-dimensional feature image are input into the matrix In the example, the tokens corresponding to the one-dimensional feature vector are input into the matrix and In the process, the query matrix Q with the image feature information is calculated and the key matrix K with the one-dimensional amplitude feature n information is matched, and the obtained matching degree is assigned to the corresponding eigenvalue matrix V to complete the operation of mapping the image feature to the amplitude feature;

In the second multi-head attention network, the tokens corresponding to the two-dimensional feature image are input into the matrix In the example, the tokens corresponding to the one-dimensional feature vector are input into the matrix and In the process, the query matrix Q with the image feature information is calculated and the key matrix K with the one-dimensional amplitude feature n information is matched, and the obtained matching degree is assigned to the corresponding eigenvalue matrix V to complete the operation of mapping the image feature to the amplitude feature;

Among them, in the two multi-head attention networks, the corresponding two sets of Q, K, and V vectors are respectively and The matrix is calculated;

The one-dimensional feature vectors output by the two multi-head attention networks are merged, and a fully connected layer with a ReLU activation function is applied to the merged features. Finally, the resulting 1×1×C vector is reshaped to obtain a feature map of dimensions H×W×C.

6. A mechanical fault detection method based on multimodal analysis of engine vibration signals according to claim 1, characterized in that the feature aggregation module aggregates the feature maps FM₁, FM₂ and FM₃ output by the three convolution modules at the bottom layer of the multimodal feature extraction network, comprising the following steps:

Improve the feature information of each feature map;

FM₃ is doubled in size by deconvolution and its number of channels is compressed to half of the original; the features of FM₃ are then assigned to FM₂ by add fusion to obtain FM₂′;

The FM₂ feature map is upsampled and channel-compressed, and fused with the features of the FM₁ feature map to obtain FM₁′;

A concat layer merges the three feature maps FM₁′, FM₂′ and FM₃; during the merging operation, FM₁′ is downsampled and FM₃ is upsampled to obtain the feature map FM.

7. A mechanical fault detection method based on multimodal analysis of engine vibration signals according to claim 6, characterized in that the feature mixing module performs feature mixing on the aggregated feature map FM, comprising the following steps:

A 1x1 convolutional layer is used to compress the number of feature channels of FM to match that of FM₂′, and a grouped convolution is used to divide FM into two feature maps, FM_1 and FM_2, each with the same size as FM and half its number of channels;

Perform feature extraction operations on the grouped feature maps FM_1 and FM_2 to obtain FM_1′ and FM_2′;

The same parallel convolution module is used to extract feature maps of different sizes and number of channels from FM_1′ and FM_2′ to form two paired groups;

A fusion module composed of a concat layer and a 1x1 convolution layer is used to merge and mix the feature maps in the pairing groups to obtain three feature maps of different sizes, namely FM1, FM2 and FM3.

8. A mechanical fault detection method based on multimodal analysis of engine vibration signals according to claim 7, characterized in that spatial attention calculation is used for feature map FM_1 to enhance the morphological feature information FM_1′ corresponding to the abnormal signal; channel attention calculation is used for FM_2 to enhance the semantic feature weight FM_2′ of the abnormal signal;

The pairing group extracted from FM_1′ contains the spatial attention weighted feature value, and the pairing group extracted from FM_2′ contains the channel attention weighted feature value.

9. A mechanical fault detection method based on multimodal analysis of engine vibration signals according to claim 1, characterized in that the multi-scale detection module is used to detect the feature maps output by the feature mixing module, and a cross-entropy classification loss is used to train the classification module of the detector.

10. A mechanical fault detection method based on multimodal analysis of engine vibration signals according to claim 1, characterized in that the multi-scale detection module is used to detect the feature maps output by the feature mixing module, and the position loss is calculated based on the CIoU position evaluation relationship.