CN116612351A - Urban rail vehicle bottom anomaly detection method based on multi-scale mask feature self-encoder - Google Patents
- Publication number
- CN116612351A (application CN202310596242.0A)
- Authority
- CN
- China
- Prior art keywords
- encoder
- scale mask
- self
- feature
- mask feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V10/774 — Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V10/803 — Fusion of input or preprocessed data at the sensor, preprocessing, feature extraction or classification level
- G06V10/806 — Fusion of extracted features
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06N3/0455 — Auto-encoder networks; Encoder-decoder networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/084 — Backpropagation, e.g. using gradient descent
- G06N3/0895 — Weakly supervised learning, e.g. semi-supervised or self-supervised learning
- G06N3/09 — Supervised learning
- Y02T10/40 — Engine management systems
Abstract
The application discloses an urban rail vehicle bottom anomaly detection method based on a multi-scale mask feature self-encoder, and relates to the technical field of rail traffic. The method comprises: constructing a training set based on vehicle bottom monitoring data, the training set comprising a large-scale unlabeled training set and a small labeled training set; constructing a multi-scale mask feature self-encoder and performing self-supervised image-reconstruction training on it using the unlabeled training set to obtain model parameters; and embedding the acquired model parameters and corresponding network structure of the encoder part of the multi-scale mask feature self-encoder, as a backbone network, into the network structure of the downstream vehicle-bottom anomaly detection task, then performing supervised fine-tuning of the model parameters using the small labeled training set to obtain a vehicle-bottom anomaly recognition model. The application can effectively address the problems of the prior art, including difficult data annotation, low utilization of metadata information, imbalance between positive and negative samples, and interference from the optical imaging environment.
Description
Technical Field
The application relates to the technical field of train detection, and in particular to an urban rail vehicle bottom anomaly detection method based on a multi-scale mask feature self-encoder.
Background
With the growing mileage and ever higher running speeds of urban rail trains, the train body must be inspected more and more frequently to ensure safe operation. To keep urban rail trains running effectively at high speed, potential safety risks at the train bottom need to be detected in time and the maintenance frequency increased. Typical inspection items for urban rail train bottom anomalies include part loss, damage and foreign-matter adhesion on key components (axle box covers, shock absorber fixtures, air springs).
At present, the research on defect anomaly detection of key parts of the urban rail train bottom is mainly focused on the following three aspects:
First, in template matching, Zhang et al. detect loosening and loss faults of train parts with a contour-based template matching algorithm, comparing pictures at specific positions against a standard template library and classifying samples according to the differences. The method is flexible and works well on pictures with high pixel similarity. Lu et al. achieve defect detection of the cut-off cock handle by template matching on the shape descriptor of the target's circumscribed rectangle and contour; experimental results show that shape-descriptor matching outperforms gray-level template matching in both time and accuracy. However, template matching is easily affected by factors such as illumination and imaging view angle, fails badly when the target is deformed, has poor robustness, and is difficult to deploy at scale in an actual train detection system.
Second, in traditional machine learning, for brake shoe bolt fault detection in trains, Zhou et al. locate the part using Gradient Encoding Histogram (GEH) features and a Support Vector Machine (SVM), then judge the fault state with the SVM, achieving 99.2% accuracy at a processing speed of 5 fps; Qin et al. use an SVM to automatically detect missing-handle faults of the train angle cock, reaching good real-time performance and high detection accuracy. However, feature extraction with traditional machine-learning algorithms must be hand-designed for each specific scene, locates the target object inefficiently, and relies on traditional techniques such as gray projection, contour detection and exhaustive sliding windows; it suits only highly distinctive parts such as bolts and blocking keys, and still struggles to meet practical accuracy requirements.
Third, in deep learning, Sun et al. use coarse-to-fine convolutional neural networks to locate the side-frame key and axle-bolt areas of a train, then train a multi-class model to identify four typical loss and loosening faults, achieving high recognition capability and good robustness under low-quality imaging. Other work combines traditional methods with deep learning for train-part fault detection: a bolt area is located by traditional methods, the bolt state is then recognized by a convolutional neural network combined with an autoencoder (SAE), and high detection accuracy is finally achieved on center-plate bolt fault pictures. Deep-learning-based defect detection is more robust and can cope with complex field environments, imaging conditions, weather and so on; it performs well on small faults such as bolts, brake pads and gear keys, and can locate targets with high precision under complex conditions. However, owing to sample-acquisition difficulties, in particular the lack of negative samples for fault discrimination, the trained model tends to overfit and detection takes relatively long.
In summary, although the prior art can perform defect anomaly detection of key rail-train components by template matching, machine learning, conventional deep learning and other methods, the imaging environment is complex and changeable, with many interfering factors such as lens optical distortion and cluttered backgrounds; meanwhile there are unknown defect types, imprecise defect labeling, high sample-labeling costs, imbalance between positive and negative samples, and irregular defects whose shape and scale vary widely. As a result, the accuracy, reliability and stability of defect anomaly detection remain unsatisfactory for mass industrial deployment.
Disclosure of Invention
Aiming at the defects of the prior art, the application provides an urban rail vehicle bottom anomaly detection method based on a multi-scale mask feature self-encoder.
The technical scheme of the application is as follows: an urban rail vehicle bottom anomaly detection method based on a multi-scale mask feature self-encoder comprises the following steps:
constructing a training set based on vehicle bottom monitoring data, wherein the training set comprises a large-scale unlabeled training set and a small labeled training set, and the vehicle bottom monitoring data comprises point cloud data and image data;
constructing a multi-scale mask feature self-encoder, and performing self-supervised image-reconstruction training on it using the unlabeled training set to obtain model parameters;
and embedding the acquired model parameters and corresponding network structure of the encoder part of the multi-scale mask feature self-encoder, as a backbone network, into the network structure of the downstream vehicle-bottom anomaly detection task, and performing supervised fine-tuning of the model parameters using the small labeled training set to obtain a vehicle-bottom anomaly recognition model.
Preferably, constructing the training set based on the underbody monitoring data includes:
preprocessing the point cloud data and the image data respectively;
correlating the preprocessed point cloud data with the image data to obtain correlation data;
and carrying out weighted fusion on the point cloud data and the image data in the associated data to obtain a multi-mode fusion matrix.
Preferably, the preprocessing of the point cloud data comprises filtering, registering and resampling, and the preprocessing of the image data comprises bilateral filtering and histogram equalization.
Preferably, when the multi-scale mask feature self-encoder is trained by self-supervised image reconstruction on the unlabeled training set, minimizing the scaled cosine error between the mask feature map and the reconstructed image is taken as the reconstruction criterion.
Preferably, performing self-supervised reconstructed image training on the multi-scale mask feature self-encoder by using the unlabeled training set includes:
updating model parameters of the multi-scale mask feature self-encoder using back propagation;
the model parameters of the encoder in the multi-scale mask feature self-encoder are updated after each back-propagation using a momentum update mechanism.
Preferably, the encoder in the multi-scale mask feature self-encoder comprises a multi-scale mask feature extraction module and a mask feature fusion module.
Preferably, the multi-scale mask feature extraction module comprises a first feature extraction stage, a second feature extraction stage and a third feature extraction stage;
the first feature extraction stage comprises a patch embedding module and a Masked Convolutional Block module, and is used for acquiring first scale mask features; the second feature extraction stage comprises a patch embedding module and a Masked Convolutional Block module, and is used for acquiring second scale mask features; the third feature extraction stage comprises a patch mapping module and a transform module, and is used for acquiring a third scale mask feature;
a downsampling layer is arranged between the first feature extraction stage and the second feature extraction stage, and a downsampling layer is arranged between the second feature extraction stage and the third feature extraction stage.
Preferably, the mask feature fusion module is configured to downsample the first scale mask feature and the second scale mask feature, and fuse the downsampled first scale mask feature, second scale mask feature and third scale mask feature to obtain a fused feature map.
Preferably, downsampling the first and second scale mask features respectively includes: downsampling the first scale mask feature with stride = 4 and downsampling the second scale mask feature with stride = 2, so that both match the resolution of the third scale mask feature.
Preferably, the loss function of the multi-scale mask feature self-encoder is:
L = L_recon + L_cos + λ₁ · L_feat + λ₂ · L_fusion
wherein L_recon represents the reconstruction error, L_cos the scaled cosine error, L_feat the feature-map vector error and L_fusion the feature-map error, and λ₁ and λ₂ are hyperparameters; ‖·‖₁ denotes the L1 norm, with x_i and x̂_i the original image of the i-th patch and the corresponding reconstructed patch output by the decoder; S(A, B) denotes cosine similarity, with F_i and F̂_i the mask feature map of the i-th patch and the corresponding mask feature map output by the decoder; z and ẑ are the vector representation of the fused feature map and that output by the decoder, and ‖·‖₂ denotes the L2 norm; E_t and Ê_t denote the fused feature map and the fused feature map output by the decoder.
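As an illustrative sketch only (not the patent's implementation), the four-term loss above can be assembled from per-patch lists; the helper names, toy flat-vector inputs and default weights λ₁ = λ₂ = 0.1 are assumptions:

```python
def l1(a, b):
    # L1 distance between two flattened patches (reconstruction error term)
    return sum(abs(x - y) for x, y in zip(a, b))

def l2(a, b):
    # L2 distance between two flattened vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cos_sim(u, v):
    # Cosine similarity S(u, v)
    dot = sum(x * y for x, y in zip(u, v))
    nu = sum(x * x for x in u) ** 0.5
    nv = sum(x * x for x in v) ** 0.5
    return dot / (nu * nv)

def total_loss(patches, recon_patches, feats, recon_feats,
               z, z_hat, e_t, e_t_hat, lam1=0.1, lam2=0.1):
    # L = L_recon + L_cos + λ1·L_feat + λ2·L_fusion
    l_recon = sum(l1(x, x_hat) for x, x_hat in zip(patches, recon_patches))
    l_cos = sum(1.0 - cos_sim(f, f_hat) for f, f_hat in zip(feats, recon_feats))
    l_feat = l2(z, z_hat)
    l_fusion = l2(e_t, e_t_hat)
    return l_recon + l_cos + lam1 * l_feat + lam2 * l_fusion
```

With identical inputs and reconstructions every term vanishes and the loss is zero, which is the sanity check one would expect of such a composite objective.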
The beneficial effects of the application are as follows: the application provides an urban rail vehicle bottom anomaly detection method based on a multi-scale mask feature self-encoder. Point cloud data and image data are collected and fused in a multi-modal manner; the masked-autoencoder idea is transferred to images, and the multi-scale mask feature self-encoder reconstructs images, addressing the high-resolution computation cost faced by general masked autoencoders from the perspectives of both the loss function and the model structure. Meanwhile, a mask-feature reconstruction strategy with the scaled cosine error as the reconstruction criterion is designed, realizing adaptive sample weighting and alleviating the imbalance between positive and negative samples. Finally, the self-supervised representation is transferred to small-sample learning, making full use of the features or representations learned automatically in unsupervised learning, so the model becomes more general and robust and less dependent on labeled data. Applying the proposed vehicle-bottom anomaly recognition model to the detection of foreign matter at the bottom of urban rail vehicles effectively addresses the problems of difficult data labeling, low utilization of metadata information, imbalanced positive and negative samples, and interference from the optical imaging environment.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. Like elements or portions are generally identified by like reference numerals throughout the several figures. In the drawings, elements or portions thereof are not necessarily drawn to scale.
FIG. 1 is a flow chart of a metro rail car bottom anomaly detection method based on a multi-scale mask feature self-encoder provided by an embodiment of the present application;
FIG. 2 is a block diagram of a metro rail car bottom anomaly detection method based on a multi-scale mask feature self-encoder provided by an embodiment of the present application;
FIG. 3 is a block diagram of a multi-scale mask feature self-encoder provided by an embodiment of the present application;
fig. 4 is a task phase diagram of a metro rail vehicle bottom anomaly detection method based on a multi-scale mask feature self-encoder according to an embodiment of the present application.
Detailed Description
Embodiments of the technical scheme of the present application will be described in detail below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present application, and thus are merely examples, and are not intended to limit the scope of the present application.
It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs.
Referring to fig. 1 and 2, an embodiment of the present application provides a method for detecting urban rail vehicle bottom anomalies based on a multi-scale mask feature self-encoder, including:
step one, a training set is constructed based on vehicle bottom monitoring data, wherein the training set comprises a large-scale non-tag training set and a small number of tagged training sets, and the vehicle bottom monitoring data comprises point cloud data and image data.
In an embodiment of the present application, constructing a training set based on vehicle bottom monitoring data includes: preprocessing the point cloud data and the image data respectively; correlating the preprocessed point cloud data with the image data to obtain correlation data; and carrying out weighted fusion on the point cloud data and the image data in the associated data to obtain a multi-mode fusion matrix.
Specifically, the point cloud data is acquired through a radar, and the image data is acquired through a 2D linear array camera.
Specifically, preprocessing the point cloud data comprises filtering, registration and resampling: the point cloud is smoothed with a Gaussian filter to remove noise and enhance the signal, registered against a reference point cloud to eliminate the influence of differing positions and postures, and downsampled to reduce the data volume. Preprocessing the image data comprises bilateral filtering and histogram equalization: the input image is denoised with a bilateral filter to reduce noise and artifacts, and enhanced with adaptive histogram equalization to improve its contrast and robustness.
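A minimal sketch of the histogram-equalization step on the image side, assuming the image arrives as a flat list of 8-bit gray values (a production pipeline would instead use library routines such as a bilateral filter and adaptive equalization, which are not reproduced here):

```python
def equalize_histogram(pixels, levels=256):
    """Plain histogram equalization for a flat list of 8-bit gray values."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # cumulative distribution function of gray levels
    cdf, running = [], 0
    for h in hist:
        running += h
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    # look-up table: map each gray level through the normalized CDF
    scale = (levels - 1) / max(n - cdf_min, 1)
    lut = [round((c - cdf_min) * scale) for c in cdf]
    return [lut[p] for p in pixels]
```

A constant image maps to all zeros, while an image already spanning the full range keeps its extremes at 0 and 255.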
After preprocessing the point cloud data and the image data, the point cloud data and the image data are required to be in one-to-one correspondence through time stamp alignment and external calibration, and the pixel coordinates of the camera are converted into real world coordinates through internal calibration so as to facilitate subsequent weighted fusion.
In order to improve the utilization rate of data and the quality of the data, the point cloud data and the image data need to be fused, specifically, different weight coefficients are respectively configured for the point cloud data and the image data, and the weighted fusion is performed according to the weight coefficients, wherein the formula of the weighted fusion is as follows:
f_fusion(x_i) = w_radar · f_radar(x_i) + w_camera · f_camera(x_i)   (1)
wherein f_radar(x_i) denotes the feature vector of the point cloud data, f_camera(x_i) the feature vector of the image data, w_radar and w_camera the corresponding weight coefficients, and f_fusion(x_i) the multi-modal fusion result after weighted fusion.
The method carries out multi-mode fusion on the point cloud and the linear array camera image data, and improves the comprehensiveness and accuracy of the data.
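The weighted fusion of formula (1) amounts to an element-wise weighted sum of two aligned feature vectors; a small sketch, where the function name and example weights are assumptions:

```python
def weighted_fuse(f_radar, f_camera, w_radar=0.5, w_camera=0.5):
    """Element-wise weighted fusion of two aligned feature vectors, as in Eq. (1)."""
    assert len(f_radar) == len(f_camera), "modalities must be aligned first"
    return [w_radar * r + w_camera * c for r, c in zip(f_radar, f_camera)]
```

This presupposes the timestamp alignment and calibration steps described above, so that the i-th entries of both vectors refer to the same physical location.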
Step two, constructing a multi-scale mask feature self-encoder, and performing self-supervised image-reconstruction training on it using the unlabeled training set to obtain model parameters.
Referring to fig. 3, in an embodiment of the present application, when the multi-scale mask feature self-encoder is trained by self-supervised image reconstruction on the unlabeled training set, minimizing the scaled cosine error between the mask feature map and the reconstructed image is taken as the reconstruction criterion.
In an embodiment of the present application, performing self-supervised reconstructed image training on the multi-scale mask feature self-encoder using the unlabeled training set includes: updating model parameters of the multi-scale mask feature self-encoder using back propagation; the model parameters of the encoder in the multi-scale mask feature self-encoder are updated after each back-propagation using a momentum update mechanism.
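A momentum update of the kind referred to above is commonly an exponential moving average of the back-propagated parameters; the sketch below assumes parameters flattened to a list of floats and a typical momentum value, neither of which is specified in the source:

```python
def momentum_update(online_params, target_params, m=0.999):
    """Exponential-moving-average update, applied after each back-propagation
    step: target <- m * target + (1 - m) * online."""
    return [m * t + (1.0 - m) * o
            for o, t in zip(online_params, target_params)]
```

With m close to 1 the momentum-updated parameters change slowly, which is what makes such a scheme stable across training steps.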
For a better understanding of the scaled cosine error, its basic definition is elaborated below. For two vectors u and v, the cosine similarity is defined as
S(u, v) = ⟨u, v⟩ / (‖u‖₂ · ‖v‖₂)
where ⟨u, v⟩ denotes the dot product of u and v, and ‖u‖₂ the L2 norm of u. Assume the input fusion matrix is X ∈ R^(W×H×C), where C is the number of channels, and the decoder outputs a restored matrix X̂. Divide the original matrix into N equal-sized patches, and let a binary mask m_i indicate whether the i-th patch contains target information: m_i = 1 if so, otherwise m_i = 0. If the parameters of the encoder and decoder of the multi-scale mask feature self-encoder are θ_E and θ_D respectively, and the latent representation is z ∈ R^K with K ≪ W×H×C, then the reconstruction criterion is to minimize the scaled cosine error between the mask feature map F and the decoder output F̂, namely:
L_cos(θ_E, θ_D) = Σ_{i=1..N} m_i · (1 − S(F_i, F̂_i))
wherein F_i denotes the mask feature map of the i-th patch and F̂_i the corresponding mask feature map output by the decoder.
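A minimal sketch of this masked scaled-cosine criterion over per-patch feature vectors; the function names and toy inputs are assumptions:

```python
def cosine_similarity(u, v):
    # S(u, v) = <u, v> / (||u||_2 * ||v||_2)
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def scaled_cosine_error(feats, recon_feats, mask):
    """Sum of (1 - cos) over the patches flagged by the binary mask m_i."""
    return sum(m * (1.0 - cosine_similarity(f, g))
               for f, g, m in zip(feats, recon_feats, mask))
```

Patches with m_i = 0 contribute nothing, so only the masked (target-bearing) patches drive the reconstruction objective.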
Further, during training, the latent representation is computed by forward propagation, z = f(X; θ_E), and the reconstructed feature map is obtained from the decoder as F̂ = g(z; θ_D); finally, the gradients are computed by back propagation and the parameters are updated.
It should be noted that throughout the training process, the encoder network and decoder network together learn the representation and reconstruction of the mask feature map.
In an embodiment of the application, an encoder in the multi-scale mask feature self-encoder comprises a multi-scale mask feature extraction module and a mask feature fusion module. The multi-scale mask feature extraction module comprises a first feature extraction stage, a second feature extraction stage and a third feature extraction stage;
the first feature extraction stage comprises a patch embedding module and a Masked Convolutional Block module and is used for acquiring first-scale mask features; the second feature extraction stage comprises a patch embedding module and a Masked Convolutional Block module and is used for acquiring second-scale mask features; the third feature extraction stage comprises a patch mapping module and a Transformer module and is used for acquiring third-scale mask features;
a downsampling layer is arranged between the first feature extraction stage and the second feature extraction stage, and a downsampling layer is arranged between the second feature extraction stage and the third feature extraction stage.
In the embodiment of the application, the mask feature fusion module is used for respectively downsampling the first scale mask feature and the second scale mask feature, and fusing the downsampled first scale mask feature, second scale mask feature and third scale mask feature to obtain a fusion feature map.
The multi-scale mask feature self-encoder gradually abstracts the input multi-modal feature vector into multi-scale token embeddings, generating feature maps of different resolutions in a manner similar to an image pyramid; each feature map is divided into several patches, each corresponding to one token. For the early high-resolution token embeddings, a convolution block encodes local content, extracting local spatial features and converting them into lower-dimensional feature vectors as the token representations. For the later low-resolution token embeddings, a Transformer block fuses global context information, encoding the relationships among all tokens to obtain a more global feature representation. To avoid feature aliasing, the mask is gradually upsampled to higher resolution in the early convolution stages. Meanwhile, mask convolution is added in the early convolution stages so that the features processed in the convolution block are completely separated into mask tokens and visible tokens, avoiding confusion between the mask area and the visible area in later stages.
Specifically, the multi-scale mask feature self-encoder combines convolutional neural networks with a Transformer architecture. First, the input low-resolution image is subjected to a series of convolution and masked convolution operations to extract feature maps. The feature map of the mask region is then spliced with the feature map of the visible region through upsampling to obtain a high-resolution feature map. In the later Transformer stage, the high-resolution feature map is processed and the mask is used for self-encoding purposes.
The network combines a scene-specific coding structure with the proposed general network structure, which can serve as either an encoder or a decoder.
Specifically, the encoder part comprises 3 stages, where H and W denote the size of the input matrix; the output features of the three stages are the first, second, and third scale mask features E1, E2, and E3, respectively.
The first two stages are hybrid convolution modules that operate on the features using Masked Convolutional Block, whose structure is shown in the lower right corner of fig. 2 (where the dilated convolution uses a 3×3 convolution kernel); a stride-2 convolution between each pair of stages performs the downsampling operation.
The last stage is a generic encoder network using a Transformer fusion module, whose main role is to enlarge the receptive field and fuse the features of all patches. Depending on the downstream task, the network of this final stage can be replaced by a common ResNet residual network or a fully convolutional network (FCN) structure such as U-Net.
For better training, the acquired first scale mask feature E1, second scale mask feature E2, and third scale mask feature E3 also need to be fused. Specifically, E1 and E2 are downsampled with stride=4 and stride=2, respectively, then added to E3 to fuse the multi-scale features, and a linear transformation is applied to obtain the feature Et to be input to the decoder:
Et = Linear(StrideConv(E1, 4) + StrideConv(E2, 2) + E3)
where StrideConv(·, k) denotes a convolution with stride = k, and Et denotes the fused feature map of the three scale features.
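The fusion step Et = Linear(StrideConv(E1, 4) + StrideConv(E2, 2) + E3) can be sketched as below. The channel widths and the projection that matches E1/E2 channel counts to E3 are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MaskFeatureFusion(nn.Module):
    def __init__(self, dims=(64, 128, 256)):
        super().__init__()
        # StrideConv(., 4): bring E1 from the H/4 grid down to the H/16 grid of E3.
        self.down1 = nn.Conv2d(dims[0], dims[2], kernel_size=4, stride=4)
        # StrideConv(., 2): bring E2 from H/8 down to H/16.
        self.down2 = nn.Conv2d(dims[1], dims[2], kernel_size=2, stride=2)
        self.linear = nn.Linear(dims[2], dims[2])   # final linear transformation

    def forward(self, e1, e2, e3):
        # e1: (B, C1, H/4, W/4); e2: (B, C2, H/8, W/8); e3: (B, N, C3) tokens.
        s = self.down1(e1) + self.down2(e2)         # both now (B, C3, H/16, W/16)
        s = s.flatten(2).transpose(1, 2)            # to token layout (B, N, C3)
        return self.linear(s + e3)                  # fused feature Et
```

The larger stride is applied to E1 because it sits at the highest resolution; after both strided convolutions all three features share the stage-3 grid and can be summed elementwise.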
In an embodiment of the present application, the loss function of the multi-scale mask feature self-encoder is:
L = Lrecon + Lcos + λ1·Lfeat + λ2·Lfusion
wherein Lrecon represents the reconstruction error, Lcos represents the scaled cosine error, Lfeat represents the feature map vector error, Lfusion represents the feature map error, and λ1 and λ2 are hyperparameters; ‖·‖1 represents the L1 norm; xi and x̂i represent the original image of the i-th patch and the reconstructed image of the i-th patch output by the decoder, respectively; S(A, B) represents cosine similarity; Fi and F̂i represent the mask feature map of the i-th patch and the mask feature map of the i-th patch output by the decoder, respectively; z and ẑ represent the vector representation of the fused feature map and the vector representation of the fused feature map output by the decoder; ‖·‖2 represents the L2 norm; Et and Êt represent the fusion feature map and the fusion feature map output by the decoder, respectively.
The mask feature map in this embodiment includes a first scale mask feature, a second scale mask feature, and a third scale mask feature.
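A hedged sketch of the combined loss L = Lrecon + Lcos + λ1·Lfeat + λ2·Lfusion. The concrete norms per term follow the symbol definitions above (L1 for reconstruction, cosine for mask features, L2 for the fused vectors); the hyperparameter defaults are assumptions:

```python
import torch
import torch.nn.functional as F

def total_loss(x, x_hat, feats, feats_hat, z, z_hat, et, et_hat,
               lam1=0.5, lam2=0.5):
    # L_recon: L1 error between original and reconstructed patches.
    l_recon = (x - x_hat).abs().mean()
    # L_cos: scaled cosine error (1 - cosine similarity) between the
    # mask feature maps and their decoder reconstructions.
    l_cos = (1.0 - F.cosine_similarity(feats.flatten(1),
                                       feats_hat.flatten(1), dim=1)).mean()
    # L_feat: L2 error between fused-feature vector representations z, z_hat.
    l_feat = (z - z_hat).pow(2).sum(dim=-1).sqrt().mean()
    # L_fusion: error between the fusion feature maps Et, Et_hat.
    l_fusion = (et - et_hat).pow(2).mean()
    return l_recon + l_cos + lam1 * l_feat + lam2 * l_fusion
```

When every reconstruction matches its target exactly, all four terms vanish and the loss is zero, which is a quick sanity check for an implementation.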
And thirdly, embedding the acquired model parameters of the encoder part in the multi-scale mask characteristic self-encoder and the corresponding network structure into a network structure of a downstream vehicle bottom abnormality detection task as a skeleton network, and performing supervised fine tuning training on the model parameters by using a small number of labeled training sets to obtain a vehicle bottom abnormality recognition model.
Specifically, the three mask feature vectors obtained in the previous step are further input into a multi-scale cascade Transformer network (the backbone network can be chosen freely, such as a ResNet variant or a Mask R-CNN series network, as long as it handles the three-branch matrix input, and can be selected dynamically according to the downstream task). After the hidden representation is obtained through self-supervised training of the self-encoder, this representation is used to solve various downstream tasks, fully exploiting the advantage of automatically learned features in unsupervised learning, making the model more general and robust while reducing the dependence on labeled data.
Specifically, the parameters of the encoder network are fixed and its latent representation is taken as the output, which is combined with a few-shot learning model; the whole network is then fine-tuned with labeled data to obtain the final prediction model. This exploits the representation capability the pre-trained model learned on large-scale unlabeled data and improves model performance by fine-tuning on labeled data.
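The freeze-then-fine-tune step can be sketched as below. The encoder interface (an image mapped to token features of shape (B, N, C)), the head architecture, and the two-class normal/abnormal output are illustrative assumptions:

```python
import torch
import torch.nn as nn

def build_finetune_model(encoder: nn.Module, feat_dim=256, num_classes=2):
    """Freeze the pre-trained encoder and attach a small classification head."""
    for p in encoder.parameters():
        p.requires_grad = False            # fix encoder parameters
    head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                         nn.Linear(128, num_classes))
    return head

def finetune_step(encoder, head, optimizer, images, labels):
    # Latent representation from the frozen encoder; only the head is trained.
    with torch.no_grad():
        tokens = encoder(images)           # assumed shape (B, N, C)
    logits = head(tokens.mean(dim=1))      # pool tokens, predict normal/abnormal
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a second phase the encoder parameters could be unfrozen at a small learning rate to fine-tune the whole network, as the description suggests.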
In summary, in the urban rail vehicle bottom anomaly detection method based on the multi-scale mask feature self-encoder, point cloud data and image data are collected for multi-modal fusion, the masked-autoencoder idea is transferred to images, and images are reconstructed by the multi-scale mask feature self-encoder, addressing the high-resolution computation cost faced by general image autoencoders from the perspectives of both the loss function and the model structure. A mask feature reconstruction strategy with the scaled cosine error as the reconstruction criterion is designed, realizing adaptive sample weighting and alleviating the imbalance between positive and negative samples. Finally, the representation learned by self-supervision is transferred to few-shot learning, fully exploiting the advantages of automatically learned features in unsupervised learning, making the model more general and robust and reducing the dependence on labeled data. Applying the vehicle bottom anomaly recognition model provided by the application to anomaly detection of foreign matter under urban rail vehicles effectively addresses problems such as difficult data labeling, low metadata utilization, imbalanced positive and negative samples, and interference from the optical imaging environment.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application, and are intended to be included within the scope of the appended claims and description.
Claims (10)
1. The urban rail vehicle bottom anomaly detection method based on the multi-scale mask characteristic self-encoder is characterized by comprising the following steps of:
constructing a training set based on vehicle bottom monitoring data, wherein the training set comprises a large-scale unlabeled training set and a small number of labeled training sets, and the vehicle bottom monitoring data comprises point cloud data and image data;
constructing a multi-scale mask feature self-encoder, and performing self-supervised reconstructed-image training on the multi-scale mask feature self-encoder with the unlabeled training set to obtain model parameters;
and embedding the acquired model parameters of the encoder part in the multi-scale mask characteristic self-encoder and the corresponding network structure into the network structure of the downstream vehicle bottom abnormality detection task as a skeleton network, and performing supervised fine tuning training on the model parameters by using a small amount of labeled training sets to obtain a vehicle bottom abnormality recognition model.
2. The urban rail vehicle bottom anomaly detection method based on the multi-scale masking feature self-encoder of claim 1, wherein constructing the training set based on the vehicle bottom monitoring data comprises:
preprocessing the point cloud data and the image data respectively;
correlating the preprocessed point cloud data with the image data to obtain correlation data;
and carrying out weighted fusion on the point cloud data and the image data in the associated data to obtain a multi-mode fusion matrix.
3. The urban rail vehicle bottom anomaly detection method based on the multi-scale mask feature self-encoder of claim 2, wherein preprocessing the point cloud data comprises filtering, registration, and resampling, and preprocessing the image data comprises bilateral filtering and histogram equalization.
4. The urban rail vehicle bottom anomaly detection method based on the multi-scale mask feature self-encoder according to claim 3, wherein, when performing self-supervised reconstructed-image training of the multi-scale mask feature self-encoder with the unlabeled training set, minimizing the scaled cosine error of the mask feature map and of the reconstructed image is taken as the reconstruction criterion.
5. The urban rail vehicle bottom anomaly detection method based on the multi-scale masking feature self-encoder of claim 3, wherein self-supervised reconstructed image training of the multi-scale masking feature self-encoder with the unlabeled training set comprises:
updating model parameters of the multi-scale mask feature self-encoder using back propagation;
the model parameters of the encoder in the multi-scale mask feature self-encoder are updated after each back-propagation using a momentum update mechanism.
6. The urban rail vehicle bottom anomaly detection method based on the multi-scale mask feature self-encoder of claim 1, wherein the encoder in the multi-scale mask feature self-encoder comprises a multi-scale mask feature extraction module and a mask feature fusion module.
7. The urban rail vehicle bottom anomaly detection method based on the multi-scale mask feature self-encoder of claim 6, wherein the multi-scale mask feature extraction module comprises a first feature extraction stage, a second feature extraction stage, and a third feature extraction stage;
the first feature extraction stage comprises a patch embedding module and a Masked Convolutional Block module, and is used for acquiring first scale mask features; the second feature extraction stage comprises a patch embedding module and a Masked Convolutional Block module, and is used for acquiring second scale mask features; the third feature extraction stage comprises a patch mapping module and a Transformer module, and is used for acquiring third scale mask features;
a downsampling layer is arranged between the first feature extraction stage and the second feature extraction stage, and a downsampling layer is arranged between the second feature extraction stage and the third feature extraction stage.
8. The urban rail vehicle bottom anomaly detection method based on the multi-scale mask feature self-encoder of claim 6, wherein the mask feature fusion module is configured to downsample the first scale mask feature and the second scale mask feature, and fuse the downsampled first and second scale mask features with the third scale mask feature to obtain a fused feature map.
9. The urban rail vehicle bottom anomaly detection method based on the multi-scale mask feature self-encoder of claim 8, wherein downsampling the first and second scale mask features respectively comprises: downsampling the first scale mask feature with stride=4 and downsampling the second scale mask feature with stride=2.
10. The urban rail car bottom anomaly detection method based on the multi-scale mask feature self-encoder of claim 1, wherein the loss function of the multi-scale mask feature self-encoder is:
L = Lrecon + Lcos + λ1·Lfeat + λ2·Lfusion
wherein Lrecon represents the reconstruction error, Lcos represents the scaled cosine error, Lfeat represents the feature map vector error, Lfusion represents the feature map error, and λ1 and λ2 are hyperparameters; ‖·‖1 represents the L1 norm; xi and x̂i represent the original image of the i-th patch and the reconstructed image of the i-th patch output by the decoder, respectively; S(A, B) represents cosine similarity; Fi and F̂i represent the mask feature map of the i-th patch and the mask feature map of the i-th patch output by the decoder, respectively; z and ẑ represent the vector representation of the fused feature map and the vector representation of the fused feature map output by the decoder; ‖·‖2 represents the L2 norm; Et and Êt represent the fusion feature map and the fusion feature map output by the decoder, respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310596242.0A CN116612351A (en) | 2023-05-24 | 2023-05-24 | Urban rail vehicle bottom anomaly detection method based on multi-scale mask feature self-encoder |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116612351A true CN116612351A (en) | 2023-08-18 |
Family
ID=87681402
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117011718A (en) * | 2023-10-08 | 2023-11-07 | 之江实验室 | Plant leaf fine granularity identification method and system based on multiple loss fusion |
CN117372720A (en) * | 2023-10-12 | 2024-01-09 | 南京航空航天大学 | Unsupervised anomaly detection method based on multi-feature cross mask repair |
CN117496276A (en) * | 2023-12-29 | 2024-02-02 | 广州锟元方青医疗科技有限公司 | Lung cancer cell morphology analysis and identification method and computer readable storage medium |
CN117635451A (en) * | 2023-10-12 | 2024-03-01 | 中国石油大学(华东) | Multi-source multi-scale digital core image fusion method based on attention guidance |
CN118400195A (en) * | 2024-06-27 | 2024-07-26 | 合肥城市云数据中心股份有限公司 | Malicious traffic detection method based on mask automatic encoder pre-training |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |