CN116612351A - Urban rail vehicle bottom anomaly detection method based on multi-scale mask feature self-encoder - Google Patents
- Publication number
- CN116612351A (application CN202310596242.0A)
- Authority
- CN
- China
- Prior art keywords
- encoder
- scale mask
- self
- feature
- mask feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V10/774 — Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V10/803 — Fusion of input or preprocessed data at the sensor, preprocessing, feature extraction or classification level
- G06V10/806 — Fusion of extracted features
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06N3/0455 — Auto-encoder networks; Encoder-decoder networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/084 — Backpropagation, e.g. using gradient descent
- G06N3/0895 — Weakly supervised learning, e.g. semi-supervised or self-supervised learning
- G06N3/09 — Supervised learning
- Y02T10/40 — Engine management systems
Abstract
The application discloses an urban rail vehicle bottom anomaly detection method based on a multi-scale mask feature self-encoder, and relates to the technical field of rail traffic. The method comprises: constructing a training set based on vehicle bottom monitoring data, the training set comprising a large-scale unlabeled training set and a small labeled training set; constructing a multi-scale mask feature self-encoder and performing self-supervised image-reconstruction training on it using the unlabeled training set to obtain model parameters; and embedding the acquired model parameters and corresponding network structure of the encoder part of the multi-scale mask feature self-encoder, as a backbone network, into the network structure of the downstream vehicle-bottom anomaly detection task, then performing supervised fine-tuning of the model parameters using the small labeled training set to obtain a vehicle-bottom anomaly recognition model. The application can effectively address the problems of the prior art, including difficult data annotation, low utilization of metadata information, imbalance between positive and negative samples, and interference from the optical imaging environment.
Description
Technical Field
The application relates to the technical field of train detection, and in particular to an urban rail vehicle bottom anomaly detection method based on a multi-scale mask feature self-encoder.
Background
With the growing mileage and ever higher running speeds of urban rail trains, the train body must be inspected more and more frequently to ensure safe operation. To keep urban rail trains running effectively at high speed, potential safety risks at the train bottom need to be detected in time and the maintenance frequency increased. Typical inspection items for urban rail train bottom anomalies include part loss, damage and foreign-matter adhesion on key components (axle box covers, shock absorber fixtures, air springs).
At present, the research on defect anomaly detection of key parts of the urban rail train bottom is mainly focused on the following three aspects:
First, in template matching, Zhang et al. detect loosening and loss faults of train parts with a contour-based template matching algorithm, comparing pictures at specific positions against a standard template library and classifying samples according to the differences. The method is flexible and works well on pictures with high pixel similarity. Lu et al. achieve defect detection of the cut-off cock handle by template matching on the shape descriptor of the target's circumscribed rectangle and contour; experimental results show that shape-descriptor matching outperforms gray-level template matching in both time and accuracy. However, template matching is easily affected by factors such as illumination and imaging view angle, fails badly when the target is deformed, has poor robustness, and is difficult to deploy at scale in an actual train detection system.
Second, in traditional machine learning, for brake shoe bolt fault detection in trains, Zhou et al. locate the part using Gradient Encoding Histogram (GEH) features and a Support Vector Machine (SVM), then judge the fault state with the SVM, achieving 99.2% accuracy at a processing speed of 5 fps; Qin et al. use an SVM to automatically detect missing-handle faults of the train angle cock, reaching good real-time performance and high detection accuracy. However, feature extraction with traditional machine-learning algorithms must be hand-designed for each specific scene, locates the target object inefficiently, and relies on traditional techniques such as gray projection, contour detection and exhaustive sliding windows; it suits only highly distinctive parts such as bolts and blocking keys, and still struggles to meet practical accuracy requirements.
Third, in deep learning, Sun et al. use coarse-to-fine convolutional neural networks to locate the side-frame key and axle-bolt areas of a train, then train a multi-class model to identify four typical loss and loosening faults, achieving high recognition capability and good robustness under low-quality imaging. Other work combines traditional methods with deep learning for train-part fault detection: a bolt area is located by traditional methods, the bolt state is then recognized by a convolutional neural network combined with an autoencoder (SAE), and high detection accuracy is finally achieved on center-plate bolt fault pictures. Deep-learning-based defect detection is more robust and can cope with complex field environments, imaging conditions, weather and so on; it performs well on small faults such as bolts, brake pads and gear keys, and can locate targets with high precision under complex conditions. However, owing to sample-acquisition difficulties, in particular the lack of negative samples for fault discrimination, the trained model tends to overfit and detection takes relatively long.
In summary, although the prior art can perform defect anomaly detection of key rail-train components by template matching, machine learning, conventional deep learning and other methods, the imaging environment is complex and changeable, with many interfering factors such as lens optical distortion and cluttered backgrounds; meanwhile there are unknown defect types, imprecise defect labeling, high sample-labeling costs, imbalance between positive and negative samples, and irregular defects whose shape and scale vary widely. As a result, the accuracy, reliability and stability of defect anomaly detection remain unsatisfactory for mass industrial deployment.
Disclosure of Invention
Aiming at the defects of the prior art, the application provides an urban rail vehicle bottom anomaly detection method based on a multi-scale mask feature self-encoder.
The technical scheme of the application is as follows: an urban rail vehicle bottom anomaly detection method based on a multi-scale mask feature self-encoder comprises the following steps:
constructing a training set based on vehicle bottom monitoring data, wherein the training set comprises a large-scale unlabeled training set and a small labeled training set, and the vehicle bottom monitoring data comprises point cloud data and image data;
constructing a multi-scale mask feature self-encoder, and performing self-supervised image-reconstruction training on it using the unlabeled training set to obtain model parameters;
and embedding the acquired model parameters and corresponding network structure of the encoder part of the multi-scale mask feature self-encoder, as a backbone network, into the network structure of the downstream vehicle-bottom anomaly detection task, and performing supervised fine-tuning of the model parameters using the small labeled training set to obtain a vehicle-bottom anomaly recognition model.
Preferably, constructing the training set based on the underbody monitoring data includes:
preprocessing the point cloud data and the image data respectively;
correlating the preprocessed point cloud data with the image data to obtain correlation data;
and carrying out weighted fusion on the point cloud data and the image data in the associated data to obtain a multi-mode fusion matrix.
Preferably, the preprocessing of the point cloud data comprises filtering, registering and resampling, and the preprocessing of the image data comprises bilateral filtering and histogram equalization.
Preferably, when the multi-scale mask feature self-encoder is trained by self-supervised image reconstruction on the unlabeled training set, minimizing the scaled cosine error between the mask feature map and the reconstructed image is taken as the reconstruction criterion.
Preferably, performing self-supervised reconstructed image training on the multi-scale mask feature self-encoder by using the unlabeled training set includes:
updating model parameters of the multi-scale mask feature self-encoder using back propagation;
the model parameters of the encoder in the multi-scale mask feature self-encoder are updated after each back-propagation using a momentum update mechanism.
Preferably, the encoder in the multi-scale mask feature self-encoder comprises a multi-scale mask feature extraction module and a mask feature fusion module.
Preferably, the multi-scale mask feature extraction module comprises a first feature extraction stage, a second feature extraction stage and a third feature extraction stage;
the first feature extraction stage comprises a patch embedding module and a Masked Convolutional Block module, and is used for acquiring first scale mask features; the second feature extraction stage comprises a patch embedding module and a Masked Convolutional Block module, and is used for acquiring second scale mask features; the third feature extraction stage comprises a patch mapping module and a transform module, and is used for acquiring a third scale mask feature;
a downsampling layer is arranged between the first feature extraction stage and the second feature extraction stage, and a downsampling layer is arranged between the second feature extraction stage and the third feature extraction stage.
Preferably, the mask feature fusion module is configured to downsample the first scale mask feature and the second scale mask feature, and fuse the downsampled first scale mask feature, second scale mask feature and third scale mask feature to obtain a fused feature map.
Preferably, downsampling the first and second scale mask features respectively includes: downsampling the first scale mask feature with stride = 4 and downsampling the second scale mask feature with stride = 2, so that both match the resolution of the third scale mask feature.
Preferably, the loss function of the multi-scale mask feature self-encoder is:
L = L_recon + L_cos + λ₁ · L_feat + λ₂ · L_fusion
wherein L_recon represents the reconstruction error, L_cos the scaled cosine error, L_feat the feature-map vector error and L_fusion the feature-map error, and λ₁ and λ₂ are hyperparameters; ‖·‖₁ denotes the L1 norm, with x_i and x̂_i the original image of the i-th patch and the corresponding reconstructed patch output by the decoder; S(A, B) denotes cosine similarity, with F_i and F̂_i the mask feature map of the i-th patch and the corresponding mask feature map output by the decoder; z and ẑ are the vector representation of the fused feature map and that output by the decoder, and ‖·‖₂ denotes the L2 norm; E_t and Ê_t denote the fused feature map and the fused feature map output by the decoder.
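As an illustrative sketch only (not the patent's implementation), the four-term loss above can be assembled from per-patch lists; the helper names, toy flat-vector inputs and default weights λ₁ = λ₂ = 0.1 are assumptions:

```python
def l1(a, b):
    # L1 distance between two flattened patches (reconstruction error term)
    return sum(abs(x - y) for x, y in zip(a, b))

def l2(a, b):
    # L2 distance between two flattened vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cos_sim(u, v):
    # Cosine similarity S(u, v)
    dot = sum(x * y for x, y in zip(u, v))
    nu = sum(x * x for x in u) ** 0.5
    nv = sum(x * x for x in v) ** 0.5
    return dot / (nu * nv)

def total_loss(patches, recon_patches, feats, recon_feats,
               z, z_hat, e_t, e_t_hat, lam1=0.1, lam2=0.1):
    # L = L_recon + L_cos + λ1·L_feat + λ2·L_fusion
    l_recon = sum(l1(x, x_hat) for x, x_hat in zip(patches, recon_patches))
    l_cos = sum(1.0 - cos_sim(f, f_hat) for f, f_hat in zip(feats, recon_feats))
    l_feat = l2(z, z_hat)
    l_fusion = l2(e_t, e_t_hat)
    return l_recon + l_cos + lam1 * l_feat + lam2 * l_fusion
```

With identical inputs and reconstructions every term vanishes and the loss is zero, which is the sanity check one would expect of such a composite objective.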
The beneficial effects of the application are as follows: the application provides an urban rail vehicle bottom anomaly detection method based on a multi-scale mask feature self-encoder. Point cloud data and image data are collected and fused in a multi-modal manner; the masked-autoencoder idea is transferred to images, and the multi-scale mask feature self-encoder reconstructs images, addressing the high-resolution computation cost faced by general masked autoencoders from the perspectives of both the loss function and the model structure. Meanwhile, a mask-feature reconstruction strategy with the scaled cosine error as the reconstruction criterion is designed, realizing adaptive sample weighting and alleviating the imbalance between positive and negative samples. Finally, the self-supervised representation is transferred to small-sample learning, making full use of the features or representations learned automatically in unsupervised learning, so the model becomes more general and robust and less dependent on labeled data. Applying the proposed vehicle-bottom anomaly recognition model to the detection of foreign matter at the bottom of urban rail vehicles effectively addresses the problems of difficult data labeling, low utilization of metadata information, imbalanced positive and negative samples, and interference from the optical imaging environment.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. Like elements or portions are generally identified by like reference numerals throughout the several figures. In the drawings, elements or portions thereof are not necessarily drawn to scale.
FIG. 1 is a flow chart of a metro rail car bottom anomaly detection method based on a multi-scale mask feature self-encoder provided by an embodiment of the present application;
FIG. 2 is a block diagram of a metro rail car bottom anomaly detection method based on a multi-scale mask feature self-encoder provided by an embodiment of the present application;
FIG. 3 is a block diagram of a multi-scale mask feature self-encoder provided by an embodiment of the present application;
fig. 4 is a task phase diagram of a metro rail vehicle bottom anomaly detection method based on a multi-scale mask feature self-encoder according to an embodiment of the present application.
Detailed Description
Embodiments of the technical scheme of the present application will be described in detail below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present application, and thus are merely examples, and are not intended to limit the scope of the present application.
It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs.
Referring to fig. 1 and 2, an embodiment of the present application provides a method for detecting urban rail vehicle bottom anomalies based on a multi-scale mask feature self-encoder, including:
step one, a training set is constructed based on vehicle bottom monitoring data, wherein the training set comprises a large-scale non-tag training set and a small number of tagged training sets, and the vehicle bottom monitoring data comprises point cloud data and image data.
In an embodiment of the present application, constructing a training set based on vehicle bottom monitoring data includes: preprocessing the point cloud data and the image data respectively; correlating the preprocessed point cloud data with the image data to obtain correlation data; and carrying out weighted fusion on the point cloud data and the image data in the associated data to obtain a multi-mode fusion matrix.
Specifically, the point cloud data is acquired through a radar, and the image data is acquired through a 2D linear array camera.
Specifically, preprocessing the point cloud data comprises filtering, registration and resampling: the point cloud is smoothed with a Gaussian filter to remove noise and enhance the signal, registered against a reference point cloud to eliminate the influence of differing positions and postures, and downsampled to reduce the data volume. Preprocessing the image data comprises bilateral filtering and histogram equalization: the input image is denoised with a bilateral filter to reduce noise and artifacts, and enhanced with adaptive histogram equalization to improve its contrast and robustness.
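A minimal sketch of the histogram-equalization step on the image side, assuming the image arrives as a flat list of 8-bit gray values (a production pipeline would instead use library routines such as a bilateral filter and adaptive equalization, which are not reproduced here):

```python
def equalize_histogram(pixels, levels=256):
    """Plain histogram equalization for a flat list of 8-bit gray values."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # cumulative distribution function of gray levels
    cdf, running = [], 0
    for h in hist:
        running += h
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    # look-up table: map each gray level through the normalized CDF
    scale = (levels - 1) / max(n - cdf_min, 1)
    lut = [round((c - cdf_min) * scale) for c in cdf]
    return [lut[p] for p in pixels]
```

A constant image maps to all zeros, while an image already spanning the full range keeps its extremes at 0 and 255.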
After preprocessing the point cloud data and the image data, the point cloud data and the image data are required to be in one-to-one correspondence through time stamp alignment and external calibration, and the pixel coordinates of the camera are converted into real world coordinates through internal calibration so as to facilitate subsequent weighted fusion.
In order to improve the utilization rate of data and the quality of the data, the point cloud data and the image data need to be fused, specifically, different weight coefficients are respectively configured for the point cloud data and the image data, and the weighted fusion is performed according to the weight coefficients, wherein the formula of the weighted fusion is as follows:
f_fusion(x_i) = w_radar · f_radar(x_i) + w_camera · f_camera(x_i)   (1)
wherein f_radar(x_i) denotes the feature vector of the point cloud data, f_camera(x_i) the feature vector of the image data, w_radar and w_camera the corresponding weight coefficients, and f_fusion(x_i) the multi-modal fusion result after weighted fusion.
The method carries out multi-mode fusion on the point cloud and the linear array camera image data, and improves the comprehensiveness and accuracy of the data.
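The weighted fusion of formula (1) amounts to an element-wise weighted sum of two aligned feature vectors; a small sketch, where the function name and example weights are assumptions:

```python
def weighted_fuse(f_radar, f_camera, w_radar=0.5, w_camera=0.5):
    """Element-wise weighted fusion of two aligned feature vectors, as in Eq. (1)."""
    assert len(f_radar) == len(f_camera), "modalities must be aligned first"
    return [w_radar * r + w_camera * c for r, c in zip(f_radar, f_camera)]
```

This presupposes the timestamp alignment and calibration steps described above, so that the i-th entries of both vectors refer to the same physical location.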
Step two, constructing a multi-scale mask feature self-encoder, and performing self-supervised image-reconstruction training on it using the unlabeled training set to obtain model parameters.
Referring to fig. 3, in an embodiment of the present application, when the multi-scale mask feature self-encoder is trained by self-supervised image reconstruction on the unlabeled training set, minimizing the scaled cosine error between the mask feature map and the reconstructed image is taken as the reconstruction criterion.
In an embodiment of the present application, performing self-supervised reconstructed image training on the multi-scale mask feature self-encoder using the unlabeled training set includes: updating model parameters of the multi-scale mask feature self-encoder using back propagation; the model parameters of the encoder in the multi-scale mask feature self-encoder are updated after each back-propagation using a momentum update mechanism.
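A momentum update of the kind referred to above is commonly an exponential moving average of the back-propagated parameters; the sketch below assumes parameters flattened to a list of floats and a typical momentum value, neither of which is specified in the source:

```python
def momentum_update(online_params, target_params, m=0.999):
    """Exponential-moving-average update, applied after each back-propagation
    step: target <- m * target + (1 - m) * online."""
    return [m * t + (1.0 - m) * o
            for o, t in zip(online_params, target_params)]
```

With m close to 1 the momentum-updated parameters change slowly, which is what makes such a scheme stable across training steps.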
For a better understanding of the scaled cosine error, its basic definition is elaborated below. For two vectors u and v, the cosine similarity is defined as
S(u, v) = ⟨u, v⟩ / (‖u‖₂ · ‖v‖₂)
where ⟨u, v⟩ denotes the dot product of u and v, and ‖u‖₂ the L2 norm of u. Assume the input fusion matrix is X ∈ R^(W×H×C), where C is the number of channels, and the decoder outputs a restored matrix X̂. Divide the original matrix into N equal-sized patches, and let a binary mask m_i indicate whether the i-th patch contains target information: m_i = 1 if so, otherwise m_i = 0. If the parameters of the encoder and decoder of the multi-scale mask feature self-encoder are θ_E and θ_D respectively, and the latent representation is z ∈ R^K with K ≪ W×H×C, then the reconstruction criterion is to minimize the scaled cosine error between the mask feature map F and the decoder output F̂, namely:
L_cos(θ_E, θ_D) = Σ_{i=1..N} m_i · (1 − S(F_i, F̂_i))
wherein F_i denotes the mask feature map of the i-th patch and F̂_i the corresponding mask feature map output by the decoder.
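A minimal sketch of this masked scaled-cosine criterion over per-patch feature vectors; the function names and toy inputs are assumptions:

```python
def cosine_similarity(u, v):
    # S(u, v) = <u, v> / (||u||_2 * ||v||_2)
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def scaled_cosine_error(feats, recon_feats, mask):
    """Sum of (1 - cos) over the patches flagged by the binary mask m_i."""
    return sum(m * (1.0 - cosine_similarity(f, g))
               for f, g, m in zip(feats, recon_feats, mask))
```

Patches with m_i = 0 contribute nothing, so only the masked (target-bearing) patches drive the reconstruction objective.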
Further, during training, the latent representation is computed by forward propagation, z = f(X; θ_E), and the reconstructed feature map is obtained from the decoder as F̂ = g(z; θ_D); finally, the gradients are computed by back propagation and the parameters are updated.
It should be noted that throughout the training process, the encoder network and decoder network together learn the representation and reconstruction of the mask feature map.
In an embodiment of the application, an encoder in the multi-scale mask feature self-encoder comprises a multi-scale mask feature extraction module and a mask feature fusion module. The multi-scale mask feature extraction module comprises a first feature extraction stage, a second feature extraction stage and a third feature extraction stage;
the first feature extraction stage comprises a patch embedding module and a Masked Convolutional Block module and is used for acquiring first-scale mask features; the second feature extraction stage comprises a patch embedding module and a Masked Convolutional Block module and is used for acquiring second-scale mask features; the third feature extraction stage comprises a patch mapping module and a Transformer module and is used for acquiring third-scale mask features;
a downsampling layer is arranged between the first feature extraction stage and the second feature extraction stage, and a downsampling layer is arranged between the second feature extraction stage and the third feature extraction stage.
In the embodiment of the application, the mask feature fusion module is used for respectively downsampling the first scale mask feature and the second scale mask feature, and fusing the downsampled first scale mask feature, second scale mask feature and third scale mask feature to obtain a fusion feature map.
The multi-scale mask feature self-encoder gradually abstracts the input multi-modal feature vector into multi-scale token embeddings, generating feature maps of different resolutions in a manner similar to an image pyramid; each feature map is divided into several patches, each corresponding to one token. For the early high-resolution token embeddings, a convolution block encodes local content, extracting local spatial features and converting them into lower-dimensional feature vectors as the token representations. For the later low-resolution token embeddings, a Transformer block fuses global context information, encoding the relationships among all tokens to obtain a more global feature representation. To avoid feature aliasing, the mask is gradually upsampled to higher resolution in the early convolution stages. Meanwhile, mask convolution is added in the early convolution stages so that the features processed in the convolution block are completely separated into mask tokens and visible tokens, avoiding confusion between the mask area and the visible area in later stages.
Specifically, the multi-scale mask feature self-encoder combines convolutional neural networks with a Transformer architecture. First, the input low-resolution image is subjected to a series of convolution and masked convolution operations to extract feature maps. The feature map of the mask region is then spliced with the feature map of the visible region through upsampling to obtain a high-resolution feature map. In the later Transformer stage, the high-resolution feature map is processed and the mask is used for self-encoding purposes.
The network combines a scene-specific coding structure with the proposed general network structure, which can serve as either an encoder or a decoder.
Specifically, the encoder part comprises 3 stages, where H and W denote the size of the input matrix; the output features of the three stages are the first, second, and third scale mask features E1, E2, and E3, respectively.
The first two stages are hybrid convolution modules that operate on the features using Masked Convolutional Block, whose structure is shown in the lower right corner of fig. 2 (where the dilated convolution uses a 3×3 convolution kernel); a stride-2 convolution between each pair of stages performs the downsampling operation.
The last stage is a generic encoder network using a Transformer fusion module, whose main role is to enlarge the receptive field and fuse the features of all patches. Depending on the downstream task, the network of this final stage can be replaced by a common ResNet residual network or a fully convolutional network (FCN) structure such as U-Net.
For better training, the acquired first scale mask feature E1, second scale mask feature E2, and third scale mask feature E3 also need to be fused. Specifically, E1 and E2 are downsampled with stride=4 and stride=2, respectively, then added to E3 to fuse the multi-scale features, and a linear transformation is applied to obtain the feature Et to be input to the decoder:
Et = Linear(StrideConv(E1, 4) + StrideConv(E2, 2) + E3)
where StrideConv(·, k) denotes a convolution with stride = k, and Et denotes the fused feature map of the three scale features.
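The fusion step Et = Linear(StrideConv(E1, 4) + StrideConv(E2, 2) + E3) can be sketched as below. The channel widths and the projection that matches E1/E2 channel counts to E3 are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MaskFeatureFusion(nn.Module):
    def __init__(self, dims=(64, 128, 256)):
        super().__init__()
        # StrideConv(., 4): bring E1 from the H/4 grid down to the H/16 grid of E3.
        self.down1 = nn.Conv2d(dims[0], dims[2], kernel_size=4, stride=4)
        # StrideConv(., 2): bring E2 from H/8 down to H/16.
        self.down2 = nn.Conv2d(dims[1], dims[2], kernel_size=2, stride=2)
        self.linear = nn.Linear(dims[2], dims[2])   # final linear transformation

    def forward(self, e1, e2, e3):
        # e1: (B, C1, H/4, W/4); e2: (B, C2, H/8, W/8); e3: (B, N, C3) tokens.
        s = self.down1(e1) + self.down2(e2)         # both now (B, C3, H/16, W/16)
        s = s.flatten(2).transpose(1, 2)            # to token layout (B, N, C3)
        return self.linear(s + e3)                  # fused feature Et
```

The larger stride is applied to E1 because it sits at the highest resolution; after both strided convolutions all three features share the stage-3 grid and can be summed elementwise.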
In an embodiment of the present application, the loss function of the multi-scale mask feature self-encoder is:
L = Lrecon + Lcos + λ1·Lfeat + λ2·Lfusion
wherein Lrecon represents the reconstruction error, Lcos represents the scaled cosine error, Lfeat represents the feature map vector error, Lfusion represents the feature map error, and λ1 and λ2 are hyperparameters; ‖·‖1 represents the L1 norm; xi and x̂i represent the original image of the i-th patch and the reconstructed image of the i-th patch output by the decoder, respectively; S(A, B) represents cosine similarity; Fi and F̂i represent the mask feature map of the i-th patch and the mask feature map of the i-th patch output by the decoder, respectively; z and ẑ represent the vector representation of the fused feature map and the vector representation of the fused feature map output by the decoder; ‖·‖2 represents the L2 norm; Et and Êt represent the fusion feature map and the fusion feature map output by the decoder, respectively.
The mask feature map in this embodiment includes a first scale mask feature, a second scale mask feature, and a third scale mask feature.
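A hedged sketch of the combined loss L = Lrecon + Lcos + λ1·Lfeat + λ2·Lfusion. The concrete norms per term follow the symbol definitions above (L1 for reconstruction, cosine for mask features, L2 for the fused vectors); the hyperparameter defaults are assumptions:

```python
import torch
import torch.nn.functional as F

def total_loss(x, x_hat, feats, feats_hat, z, z_hat, et, et_hat,
               lam1=0.5, lam2=0.5):
    # L_recon: L1 error between original and reconstructed patches.
    l_recon = (x - x_hat).abs().mean()
    # L_cos: scaled cosine error (1 - cosine similarity) between the
    # mask feature maps and their decoder reconstructions.
    l_cos = (1.0 - F.cosine_similarity(feats.flatten(1),
                                       feats_hat.flatten(1), dim=1)).mean()
    # L_feat: L2 error between fused-feature vector representations z, z_hat.
    l_feat = (z - z_hat).pow(2).sum(dim=-1).sqrt().mean()
    # L_fusion: error between the fusion feature maps Et, Et_hat.
    l_fusion = (et - et_hat).pow(2).mean()
    return l_recon + l_cos + lam1 * l_feat + lam2 * l_fusion
```

When every reconstruction matches its target exactly, all four terms vanish and the loss is zero, which is a quick sanity check for an implementation.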
And thirdly, embedding the acquired model parameters of the encoder part in the multi-scale mask characteristic self-encoder and the corresponding network structure into a network structure of a downstream vehicle bottom abnormality detection task as a skeleton network, and performing supervised fine tuning training on the model parameters by using a small number of labeled training sets to obtain a vehicle bottom abnormality recognition model.
Specifically, the three mask feature vectors obtained in the previous step are further input into a multi-scale cascade Transformer network (the backbone network can be chosen freely, such as a ResNet variant or a Mask R-CNN series network, as long as it handles the three-branch matrix input, and can be selected dynamically according to the downstream task). After the hidden representation is obtained through self-supervised training of the self-encoder, this representation is used to solve various downstream tasks, fully exploiting the advantage of automatically learned features in unsupervised learning, making the model more general and robust while reducing the dependence on labeled data.
Specifically, the parameters of the encoder network are fixed and its latent representation is taken as the output, which is combined with a few-shot learning model; the whole network is then fine-tuned with labeled data to obtain the final prediction model. This exploits the representation capability the pre-trained model learned on large-scale unlabeled data and improves model performance by fine-tuning on labeled data.
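The freeze-then-fine-tune step can be sketched as below. The encoder interface (an image mapped to token features of shape (B, N, C)), the head architecture, and the two-class normal/abnormal output are illustrative assumptions:

```python
import torch
import torch.nn as nn

def build_finetune_model(encoder: nn.Module, feat_dim=256, num_classes=2):
    """Freeze the pre-trained encoder and attach a small classification head."""
    for p in encoder.parameters():
        p.requires_grad = False            # fix encoder parameters
    head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                         nn.Linear(128, num_classes))
    return head

def finetune_step(encoder, head, optimizer, images, labels):
    # Latent representation from the frozen encoder; only the head is trained.
    with torch.no_grad():
        tokens = encoder(images)           # assumed shape (B, N, C)
    logits = head(tokens.mean(dim=1))      # pool tokens, predict normal/abnormal
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a second phase the encoder parameters could be unfrozen at a small learning rate to fine-tune the whole network, as the description suggests.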
In summary, in the urban rail vehicle bottom anomaly detection method based on the multi-scale mask feature self-encoder, point cloud data and image data are collected for multi-modal fusion, the masked-autoencoder idea is transferred to images, and images are reconstructed by the multi-scale mask feature self-encoder, addressing the high-resolution computation cost faced by general image autoencoders from the perspectives of both the loss function and the model structure. A mask feature reconstruction strategy with the scaled cosine error as the reconstruction criterion is designed, realizing adaptive sample weighting and alleviating the imbalance between positive and negative samples. Finally, the representation learned by self-supervision is transferred to few-shot learning, fully exploiting the advantages of automatically learned features in unsupervised learning, making the model more general and robust and reducing the dependence on labeled data. Applying the vehicle bottom anomaly recognition model provided by the application to anomaly detection of foreign matter under urban rail vehicles effectively addresses problems such as difficult data labeling, low metadata utilization, imbalanced positive and negative samples, and interference from the optical imaging environment.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application, and are intended to be included within the scope of the appended claims and description.
Claims (10)
1. The urban rail vehicle bottom anomaly detection method based on the multi-scale mask characteristic self-encoder is characterized by comprising the following steps of:
constructing a training set based on vehicle bottom monitoring data, wherein the training set comprises a large-scale unlabeled training set and a small number of labeled training sets, and the vehicle bottom monitoring data comprises point cloud data and image data;
constructing a multi-scale mask feature self-encoder, and performing self-supervised reconstructed-image training on the multi-scale mask feature self-encoder with the unlabeled training set to obtain model parameters;
and embedding the acquired model parameters of the encoder part in the multi-scale mask characteristic self-encoder and the corresponding network structure into the network structure of the downstream vehicle bottom abnormality detection task as a skeleton network, and performing supervised fine tuning training on the model parameters by using a small amount of labeled training sets to obtain a vehicle bottom abnormality recognition model.
2. The urban rail vehicle bottom anomaly detection method based on the multi-scale masking feature self-encoder of claim 1, wherein constructing the training set based on the vehicle bottom monitoring data comprises:
preprocessing the point cloud data and the image data respectively;
correlating the preprocessed point cloud data with the image data to obtain correlation data;
and carrying out weighted fusion on the point cloud data and the image data in the associated data to obtain a multi-mode fusion matrix.
3. The urban rail vehicle bottom anomaly detection method based on the multi-scale mask feature self-encoder of claim 2, wherein preprocessing the point cloud data comprises filtering, registration, and resampling, and preprocessing the image data comprises bilateral filtering and histogram equalization.
4. The urban rail vehicle bottom anomaly detection method based on the multi-scale mask feature self-encoder according to claim 3, wherein, when performing self-supervised reconstructed-image training of the multi-scale mask feature self-encoder with the unlabeled training set, minimizing the scaled cosine error of the mask feature map and of the reconstructed image is taken as the reconstruction criterion.
5. The urban rail vehicle bottom anomaly detection method based on the multi-scale masking feature self-encoder of claim 3, wherein self-supervised reconstructed image training of the multi-scale masking feature self-encoder with the unlabeled training set comprises:
updating model parameters of the multi-scale mask feature self-encoder using back propagation;
the model parameters of the encoder in the multi-scale mask feature self-encoder are updated after each back-propagation using a momentum update mechanism.
6. The urban rail vehicle bottom anomaly detection method based on the multi-scale mask feature self-encoder of claim 1, wherein the encoder in the multi-scale mask feature self-encoder comprises a multi-scale mask feature extraction module and a mask feature fusion module.
7. The urban rail vehicle bottom anomaly detection method based on the multi-scale mask feature self-encoder of claim 6, wherein the multi-scale mask feature extraction module comprises a first feature extraction stage, a second feature extraction stage, and a third feature extraction stage;
the first feature extraction stage comprises a patch embedding module and a Masked Convolutional Block module, and is used for acquiring first scale mask features; the second feature extraction stage comprises a patch embedding module and a Masked Convolutional Block module, and is used for acquiring second scale mask features; the third feature extraction stage comprises a patch mapping module and a Transformer module, and is used for acquiring third scale mask features;
a downsampling layer is arranged between the first feature extraction stage and the second feature extraction stage, and a downsampling layer is arranged between the second feature extraction stage and the third feature extraction stage.
8. The urban rail vehicle bottom anomaly detection method based on the multi-scale mask feature self-encoder of claim 6, wherein the mask feature fusion module is configured to downsample the first scale mask feature and the second scale mask feature, and fuse the downsampled first and second scale mask features with the third scale mask feature to obtain a fused feature map.
9. The urban rail vehicle bottom anomaly detection method based on the multi-scale mask feature self-encoder of claim 8, wherein downsampling the first and second scale mask features respectively comprises: downsampling the first scale mask feature with stride=4 and downsampling the second scale mask feature with stride=2.
10. The urban rail car bottom anomaly detection method based on the multi-scale mask feature self-encoder of claim 1, wherein the loss function of the multi-scale mask feature self-encoder is:
L = Lrecon + Lcos + λ1·Lfeat + λ2·Lfusion
wherein Lrecon represents the reconstruction error, Lcos represents the scaled cosine error, Lfeat represents the feature map vector error, Lfusion represents the feature map error, and λ1 and λ2 are hyperparameters; ‖·‖1 represents the L1 norm; xi and x̂i represent the original image of the i-th patch and the reconstructed image of the i-th patch output by the decoder, respectively; S(A, B) represents cosine similarity; Fi and F̂i represent the mask feature map of the i-th patch and the mask feature map of the i-th patch output by the decoder, respectively; z and ẑ represent the vector representation of the fused feature map and the vector representation of the fused feature map output by the decoder; ‖·‖2 represents the L2 norm; Et and Êt represent the fusion feature map and the fusion feature map output by the decoder, respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310596242.0A CN116612351A (en) | 2023-05-24 | 2023-05-24 | Urban rail vehicle bottom anomaly detection method based on multi-scale mask feature self-encoder |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116612351A true CN116612351A (en) | 2023-08-18 |
Family
ID=87681402
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117011718A (en) * | 2023-10-08 | 2023-11-07 | 之江实验室 | Plant leaf fine granularity identification method and system based on multiple loss fusion |
CN117372720A (en) * | 2023-10-12 | 2024-01-09 | 南京航空航天大学 | Unsupervised anomaly detection method based on multi-feature cross mask repair |
CN117496276A (en) * | 2023-12-29 | 2024-02-02 | 广州锟元方青医疗科技有限公司 | Lung cancer cell morphology analysis and identification method and computer readable storage medium |
CN117635451A (en) * | 2023-10-12 | 2024-03-01 | 中国石油大学(华东) | Multi-source multi-scale digital core image fusion method based on attention guidance |
CN118400195A (en) * | 2024-06-27 | 2024-07-26 | 合肥城市云数据中心股份有限公司 | Malicious traffic detection method based on mask automatic encoder pre-training |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |