Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a deep-learning-based method for detecting and identifying weak and small container serial number targets, so as to improve the efficiency, robustness and accuracy of container serial number detection and identification at ports and docks under high load.
In order to achieve the above object, the invention provides a deep-learning-based method for detecting and identifying weak and small container serial number targets, which comprises the following steps:
(1) Collecting image samples of container serial numbers;
(2) Preprocessing and labeling the image;
(3) Training set preparation based on data augmentation;
(4) Adding a Swin-Transformer encoder to replace CSP modules in YOLOv5s to improve feature extraction: to address the insufficient global context captured by YOLOv5s during feature extraction, a Swin-Transformer encoder is added to improve the feature extraction module of the YOLOv5s network so that more sufficient global context features are extracted;
(5) Improving the feature fusion network of the neck network module in YOLOv5s: for the task scene of detecting and identifying weak and small container serial number targets, in which the target bounding boxes are small, the feature fusion network of the neck network module in YOLOv5s is improved so that its ability to fuse small-scale feature maps is strengthened;
(6) Adding multi-scale prediction with a weak and small target detection head: through a newly added micro-scale detection head TPH and improved CSP modules, a BiFormer attention mechanism layer is introduced to enhance the detection and identification of weak and small targets, and the prediction results of the multiple detection heads are then fused using DIOU-NMS non-maximum suppression;
(7) Multi-model fusion prediction based on loss function optimization;
The step of collecting image samples of container serial numbers refers to collecting image samples in different environments, at different angles and at different distances, so that the samples contain pictures with different illumination intensities, complex backgrounds, foreign-matter occlusion, multi-angle rotation and inclination, and different scales. The image preprocessing and labeling means that the size of each input image is adjusted to 608 × 608 through the OpenCV library, the acquired images are semi-automatically labeled with the PPOCRLabel semi-automatic labeling tool, detection feedback is obtained from an existing YOLOv5s model, and labeled samples with poor detection results are corrected with the LabelImg labeling tool by auxiliary manual labeling. The training set preparation based on data augmentation refers to preparing the training data set using Mosaic-6 data augmentation, color perturbation and brightness adjustment: 6 pictures are stitched together by random scaling, random cropping and random arrangement, the RGB channel values of the images are changed randomly, a random value is added to or subtracted from the image brightness to simulate variation under different illumination conditions, 6 new images are obtained, and random occlusion is applied to the new images. Adding a Swin-Transformer encoder to replace CSP modules in YOLOv5s to improve feature extraction means that a Swin-Transformer encoder is added and the feature extraction module of the YOLOv5s network is improved to extract more sufficient global context features. Improving the feature fusion network of the neck network module in YOLOv5s refers to improving that feature fusion network so that its ability to fuse micro-scale feature maps is strengthened. Adding multi-scale prediction with a weak and small target detection head refers to introducing a BiFormer attention mechanism layer through a newly added micro-scale detection head TPH and improved CSP modules, so that the detection and identification of weak and small targets are enhanced. The multi-model fusion prediction based on LOSS-function optimization and data-partitioned training means that the LOSS function is changed from CIOU LOSS to EIOU LOSS and a CTC LOSS function for adjusting the alignment of input and output sequences is introduced; the training set is divided by bootstrap sampling into n training sets, which are trained independently to obtain n homogeneous models; the prediction results of the n models are fused with WBF weighted bounding boxes to obtain the final prediction result; the prediction result is used to adjust the parameters of each model, and the procedure returns to the step of adding a Swin-Transformer encoder to replace CSP modules in YOLOv5s until the loss functions of all models converge consistently, at which point model training is complete, the integration of the n models is complete, the final strong learner is obtained, and joint decision-making by multiple models is realized. Threshold filtering and bounding-box correction are then applied to the output of the multi-model fusion prediction: a classification-confidence threshold is set, only prediction results above the threshold are retained, and the predicted bounding boxes are fine-tuned and corrected to reduce false detections and improve accuracy.
Further, adding a Swin-Transformer encoder to replace CSP modules in YOLOv5s to improve feature extraction means that the 2nd and 4th CSP modules in the backbone network of the original YOLOv5s are replaced by a cascade of a Swin-Transformer encoder, a depthwise separable convolution layer, a Concat layer and a CBL layer, which extracts more sufficient global context features, improves the modeling of detail information at different layers and enhances the high-level semantic modeling of small targets; the residual structure in the CSP modules is retained to prevent gradient vanishing and reduce computational cost, and replacing Conv layers with depthwise separable convolution layers reduces the computation of the model. The Swin-Transformer encoder consists of Patch Partition, Linear Embedding and Swin-Transformer Blocks, and the depthwise separable convolution layer consists of channel-by-channel (depthwise) convolution and point-by-point (pointwise) convolution.
Further, the improved feature fusion network of the neck network module in YOLOv5s refers to adding, on the basis of the original YOLOv5s feature fusion network, a cascade composed of a CSP module, a depthwise separable convolution layer, an Upsample layer, a Concat layer, a Swin-Transformer encoder, a depthwise separable convolution layer, a Concat layer, a CBL layer and a depthwise separable convolution layer for micro-scale feature mapping. The cascade enhances the detection capability of shallow feature maps while reducing network gradient vanishing and optimizing the structure as much as possible. While the cascade is constructed, the original YOLOv5s feature fusion network module is adjusted appropriately: the two input connection positions of the Concat corresponding to the small-scale feature map are changed, the upstream-downstream relation between the feature extraction module and the feature fusion module is changed reasonably, and the Upsample layer of the neck network is changed to CARAFE, so that the receptive range is enlarged and tiny details in the shallow feature maps are better retained.
Further, the multi-scale prediction with an added weak and small target detection head refers to adding a micro-scale detection head TPH on the basis of the original YOLOv5s detection heads, introducing a BiFormer attention mechanism layer into the micro-scale detection head TPH and the original small-scale detection head of YOLOv5s, and replacing the CSP modules corresponding to the original small-scale detection head of YOLOv5s and the micro-scale detection head TPH with a cascade of a Swin-Transformer encoder, a depthwise separable convolution layer, a Concat layer and a CBL layer; the residual structure in the CSP modules is retained to prevent gradient vanishing and reduce computational cost, and the introduced BiFormer attention mechanism layer enhances fine-grained details in a targeted way, thereby greatly improving the detection and identification capability for weak and small container serial number targets. Meanwhile, the added micro-scale detection head TPH allows a shallower, higher-resolution feature map to be used for computing the prediction results of small targets, and the prediction results of this detection head are fused with those of the other detection heads using DIOU-NMS non-maximum suppression to obtain a comprehensive prediction result.
Detailed Description
The technical route of the present invention will be further described in detail below by means of specific examples and accompanying drawings.
As shown in figs. 1 to 6, one embodiment of the deep-learning-based method for detecting and identifying weak and small container serial number targets of the invention comprises the following steps:
1. Collecting image samples of container serial numbers
Image samples are collected in different environments, at different angles and at different distances, so that they contain pictures with different illumination intensities, complex backgrounds, foreign-matter occlusion, multi-angle rotation and inclination, and different scales.
2. Preprocessing and labeling the image
The size of each input image is adjusted to 608 × 608 through the OpenCV library, the acquired images are semi-automatically labeled with PPOCRLabel, detection feedback is obtained from an existing YOLOv5s model, and labeled samples with poor detection results are corrected with LabelImg by manual labeling.
Combining semi-automatic labeling with manual correction after model detection feedback in the image labeling stage greatly improves both the accuracy and the speed of labeling. A minimal sketch of the resizing step follows.
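The following sketch illustrates the 608 × 608 normalization step with OpenCV, assuming images are read from disk; the directory names and helper are illustrative, not part of the described pipeline.

```python
# A minimal preprocessing sketch, assuming OpenCV (cv2) is available; the
# 608 x 608 target size follows the description above, the paths are assumed.
import os
import cv2

def preprocess_image(path, size=(608, 608)):
    """Load an image and resize it to the fixed network input size."""
    img = cv2.imread(path)                 # BGR image as a numpy array
    if img is None:
        raise FileNotFoundError(path)
    return cv2.resize(img, size, interpolation=cv2.INTER_LINEAR)

# Example: normalize every collected sample before labeling.
# for name in os.listdir("raw_images"):
#     img = preprocess_image(os.path.join("raw_images", name))
#     cv2.imwrite(os.path.join("resized", name), img)
```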
3. Training set preparation based on data augmentation
The data set for training is prepared using Mosaic-6 data augmentation, color perturbation and brightness adjustment.
Specifically, the acquired images are grouped 6 at a time and stitched together by random scaling, random cropping and random arrangement; the RGB channel values of the images are changed randomly; a random value is added to or subtracted from the image brightness to simulate variation under different illumination conditions, yielding 6 new images while keeping the total number of inputs unchanged; and random occlusion is applied to the new images to simulate the corrosion of container serial numbers by sea water, air and other factors in natural scenes. A sketch of these three operations is given below.
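The sketch below is a simplified illustration of the three augmentations; the exact Mosaic-6 tile layout is not specified above, so a 2 × 3 grid with per-tile random scaling and cropping is assumed, and all ranges are illustrative.

```python
# Simplified augmentation sketch (assumed 2x3 Mosaic-6 layout, assumed ranges).
import numpy as np
import cv2

def mosaic6(images, out_size=608):
    """Stitch 6 randomly scaled/cropped images into one 2x3 mosaic."""
    th, tw = out_size // 2, out_size // 3          # tile height / width
    canvas = np.zeros((out_size, tw * 3, 3), dtype=np.uint8)
    for idx, img in enumerate(images[:6]):
        s = np.random.uniform(0.6, 1.4)            # random scaling
        img = cv2.resize(img, None, fx=s, fy=s)
        y0 = np.random.randint(0, max(1, img.shape[0] - th + 1))
        x0 = np.random.randint(0, max(1, img.shape[1] - tw + 1))
        crop = cv2.resize(img[y0:y0 + th, x0:x0 + tw], (tw, th))  # random crop
        r, c = divmod(idx, 3)                      # random arrangement is
        canvas[r * th:(r + 1) * th, c * tw:(c + 1) * tw] = crop   # implied by input order
    return cv2.resize(canvas, (out_size, out_size))

def color_and_brightness(img):
    """Randomly perturb RGB channel values and shift global brightness."""
    shift = np.random.randint(-25, 26, size=3)     # per-channel perturbation
    bright = np.random.randint(-40, 41)            # brightness offset
    out = img.astype(np.int16) + shift + bright
    return np.clip(out, 0, 255).astype(np.uint8)

def random_occlusion(img, max_frac=0.15):
    """Erase a random rectangle to simulate corrosion / foreign-matter occlusion."""
    h, w = img.shape[:2]
    rh, rw = int(h * max_frac), int(w * max_frac)
    y, x = np.random.randint(0, h - rh), np.random.randint(0, w - rw)
    img = img.copy()
    img[y:y + rh, x:x + rw] = 0
    return img
```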
4. Improving feature extraction by adding a Swin-Transformer encoder that replaces CSP modules in YOLOv5s
To overcome the defect that the original YOLOv5s extracts insufficient contextual features and can hardly build long-distance modeling relations, this patent proposes adding a Swin-Transformer encoder to replace CSP modules in YOLOv5s to improve feature extraction. Specifically, the 2nd and 4th CSP modules in the backbone network of the original YOLOv5s are replaced by a cascade of a Swin-Transformer encoder, a depthwise separable convolution layer, a Concat layer and a CBL layer, which extracts more sufficient global context features, improves the modeling of detail information at different layers and enhances the high-level semantic modeling of small targets; the residual structure in the CSP modules is retained to prevent gradient vanishing and reduce computational cost, and replacing Conv layers with depthwise separable convolution layers reduces the computation of the model. The Swin-Transformer encoder consists of Patch Partition, Linear Embedding and Swin-Transformer Blocks, and the depthwise separable convolution layer consists of channel-by-channel (depthwise) convolution and point-by-point (pointwise) convolution.
The 2nd and 4th CSP modules in the backbone network of the original YOLOv5s are replaced by the cascade of a Swin-Transformer encoder, a depthwise separable convolution layer, a Concat layer and a CBL layer, and the residual structure in the CSP modules is retained, as shown in figs. 3 and 5. The backbone network takes an RGB image of size 608 × 608 × 3 as input and feeds it into a Focus structure, where a slicing operation turns it into a 304 × 304 × 12 feature map, after which a Conv operation with 32 convolution kernels turns it into a 304 × 304 × 32 feature map. The input then passes repeatedly through CBL, CSP, CBL, the Swin-Transformer encoder, the depthwise separable convolution layer, the Concat layer and the CBL layer to better extract feature maps; each CSP module and each convolution before the cascade uses a 3 × 3 kernel with stride 2, which achieves downsampling, reduces computational complexity, suppresses noise and improves robustness. The CSP module splits the feature map of the base layer into two parts: one part goes through CBL, Res Unit and Conv, while the other goes directly through Conv, and the two are merged by tensor concatenation (Concat); this greatly reduces the repetition rate of gradient information during network optimization, lowers the computation and is friendly to low-core CPUs. After several rounds of CBL, CSP, CBL, Swin-Transformer encoder, depthwise separable convolution layer, Concat layer and CBL layer, four feature maps of different sizes are obtained; the numbers of CBL convolution kernels are 64, 128, 256 and 512 in turn to strengthen the network's extraction capability, and with an initial input of 608 × 608 the resulting feature maps are 152 × 152, 76 × 76, 38 × 38 and 19 × 19, used in turn for the detection and identification of micro-scale, small-scale, medium-scale and large-scale objects. SPP pooling then applies max pooling at 1 × 1, 5 × 5, 9 × 9 and 13 × 13 for multi-scale fusion, retaining the original spatial dimension while enlarging the receptive range of the trunk features. The 5 modules CBL, Res Unit, CSP, Focus and SPP are shown in fig. 6.
The Swin-Transformer encoder consists of Patch Partition, Linear Embedding and Swin-Transformer Blocks, as shown in fig. 2. Patch Partition: the input image is first segmented into a set of small rectangular patches distributed in a regular grid over the input image, each patch being a 4 × 4 block of pixels. Linear Embedding: the pixel information in each patch is flattened into a vector, and the pixels are mapped into higher-dimensional feature vectors by a Conv operation; these feature vectors, each representing one image patch, become the input of the Swin-Transformer. Swin-Transformer Block: computing units such as window multi-head self-attention (W-MSA) layers, shifted-window multi-head self-attention (SW-MSA) layers, multi-layer perceptrons (MLP) and Layer Normalization (LN) are connected through residuals, so that information within a window and information across windows can be acquired respectively. This increases nonlinearity, introduces more parameter information, helps the model adapt to complex features and complex natural scenes, alleviates the gradient-vanishing problem and allows the model to be trained and optimized better.
A feature map of size [H, W, 3] processed by the Swin-Transformer encoder becomes [H/4, W/4, C] after the 1st module, [H/8, W/8, 2C] after the 2nd module, [H/16, W/16, 4C] after the 3rd module and [H/32, W/32, 8C] after the 4th module, where H is the height of the feature map, W its width and C the channel dimension; multi-scale feature maps are then extracted further. After sample rearrangement, a number of equally sized patch sequences are obtained through the multi-layer Swin-Transformer structure and recombined according to their arrangement in the feature map, forming a new feature map.
The Swin-Transformer encoder makes full use of the self-attention mechanism to establish long-distance modeling dependencies and global information of the image, further improving the precision of the model. The self-attention calculation formula is as follows:

$$\mathrm{Attention}(Q,K,V)=\mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}+B\right)V$$

where $Q = x W_Q$, $K = x W_K$, $V = x W_V$, $d_k$ is the feature dimension, $B$ is the bias matrix, $x$ is the input vector, and $W_Q$, $W_K$, $W_V$ are transformation matrices determined by the initial parameters of the model.
The depthwise separable convolution layer is composed of a channel-by-channel (depthwise) convolution and a point-by-point (pointwise) convolution, as shown in fig. 5.
assuming that the size of the input feature map is H×W, the number of channels is C, the convolution kernel size is K×K, the number of output channels (the number of convolution kernels) is M, and 1×1 is the size of the point-by-point convolution kernel; the parameter quantity of the depth separable convolution is KxKxC+1 x1xC xM, the number of the depth convolution kernels in the depth separable convolution is the same as the number of the input channels, the number of the point convolution kernels is the same as the number of the output channels, and the calculated quantity of the depth separable convolution is HxWx (KxKxC+1 x1xC x M); the parameter quantity and the calculated quantity are about 1/3 of those of the conventional convolution, and the comparison of the parameter quantity and the calculated quantity shows that the depth separable convolution has less parameter quantity and calculated quantity relative to the standard convolution, because the depth separable convolution splits the convolution operation into two smaller parts and the information fusion between channels is carried out through point-by-point convolution, thereby reducing the quantity of the parameter and the complexity of calculation; the unnecessary calculation cost caused by a large number of convolution operations in the feature extraction stage can be reduced to a large extent, and the calculation capability is greatly improved.
5. Improved feature fusion network for neck network modules in YOLOv5s
For the task scene of detecting and identifying weak and small container serial number targets, and considering that the target bounding boxes are small, the feature fusion network of the neck network module in YOLOv5s is improved so that its ability to fuse small-scale feature maps is strengthened.
Considering that the feature fusion network of the neck network module in the original YOLOv5s cannot adapt well to the feature-map fusion of weak and small container serial number targets, and in order to reduce computational cost and improve detection and identification capability, this patent proposes improving that feature fusion network. Specifically, on the basis of the original YOLOv5s feature fusion network, a cascade composed of a CSP module, a depthwise separable convolution layer, an Upsample layer, a Concat layer, a Swin-Transformer encoder, a depthwise separable convolution layer, a Concat layer, a CBL layer and a depthwise separable convolution layer is added for micro-scale feature mapping. The cascade enhances the detection capability of shallow feature maps while reducing network gradient vanishing and optimizing the structure as much as possible. While the cascade is constructed, the original YOLOv5s feature fusion network module is adjusted appropriately: the two input connection positions of the Concat corresponding to the small-scale feature map are changed, the upstream-downstream relation between the feature extraction module and the feature fusion module is changed reasonably, and the Upsample layer of the neck network is changed to CARAFE, so that tiny details in the shallow feature maps are better retained.
On the basis of the original YOLOv5s feature fusion network, the cascade composed of a CSP module, a depthwise separable convolution layer, an Upsample layer, a Concat layer, a Swin-Transformer encoder, a depthwise separable convolution layer, a Concat layer, a CBL layer and a depthwise separable convolution layer for micro-scale feature mapping is added, as shown in fig. 3. The cascade reduces network gradient vanishing and optimizes the structure as much as possible while enhancing the detection capability of shallow feature maps; while it is constructed, the original YOLOv5s feature fusion network module is adjusted appropriately, the two input connection positions of the Concat corresponding to the small-scale feature map are changed, and the upstream-downstream relation between feature extraction and feature fusion is matched reasonably.
The Upsample layer of the neck network is changed to CARAFE so that tiny details in the shallow feature map are better preserved. The Upsample layer of the original neck network uses simple nearest-neighbor interpolation, which does not consider the interrelation between pixels and merely copies the nearest pixel value, so image details and texture information can be lost during upsampling and obvious jagged artifacts can appear when the image is enlarged. CARAFE has a larger receptive field, upsamples based on the input content and is lightweight, which effectively avoids these problems. Specifically, CARAFE is divided into two main modules, an upsampling kernel prediction module and a feature reassembly module; with an upsampling ratio σ and an input feature map of size H × W × C, the upsampling kernel prediction module predicts the upsampling kernels and the feature reassembly module then completes the upsampling, producing an output feature map of size σH × σW × C.
Specifically, the upsampling kernel prediction module includes the following steps:
(1) Feature map channel compression: for an input feature map of size H × W × C, its channel number is compressed to C_m by a 1 × 1 convolution, in order to reduce the computation of the subsequent steps;
(2) Content encoding and upsampling kernel prediction: the compressed input feature map is processed by a convolution layer with kernel size k_encoder × k_encoder to predict the upsampling kernels; the number of input channels is C_m and the number of output channels is σ² × k_up²; the channel dimension is then expanded in the spatial dimension to obtain upsampling kernels of shape σH × σW × k_up²;
(3) Upsampling kernel normalization: the upsampling kernels obtained in (2) are normalized with Softmax so that the convolution kernel weights sum to 1, where Softmax is a classification function that normalizes an n-dimensional input vector into an n-dimensional probability distribution in which each element lies between 0 and 1 and all elements sum to 1; the specific formula is:

$$\mathrm{Softmax}(z_i)=\frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

where $z_i$ denotes the i-th element of the input vector and n denotes the dimension of the vector.
Specifically, the feature reassembly module works as follows: each position in the output feature map is mapped back to the input feature map, the k_up × k_up region centered on it is taken out, and its dot product with the upsampling kernel predicted for that point gives the output value; different channels at the same location share the same upsampling kernel. A sketch of both modules is given below.
6. Adding multi-scale prediction with weak and small target detection heads
Through the newly added micro-scale detection head TPH and the improved CSP modules, a BiFormer attention mechanism layer is introduced to enhance the detection and identification of weak and small targets; the prediction results of the multiple detection heads are then fused using DIOU-NMS non-maximum suppression.
Considering that the detection and identification of weak and small container serial number targets mostly takes place at the micro scale, this patent proposes adding multi-scale prediction with a weak and small target detection head. Specifically, on the basis of the existing YOLOv5s detection heads, a micro-scale detection head TPH is added; a BiFormer attention mechanism layer is introduced into the micro-scale detection head TPH and the original small-scale detection head of YOLOv5s; and the CSP modules corresponding to the original small-scale detection head of YOLOv5s and the micro-scale detection head TPH are replaced by a cascade of a Swin-Transformer encoder, a depthwise separable convolution layer, a Concat layer and a CBL layer, with the residual structure in the CSP modules retained to prevent gradient vanishing and reduce computational cost; the introduced BiFormer attention mechanism layer enhances fine-grained details in a targeted way, greatly improving the detection and identification capability for weak and small container serial number targets. Meanwhile, the added micro-scale detection head TPH allows a shallower, higher-resolution feature map to be used for computing the prediction results of small targets, and the prediction results of this detection head are fused with those of the other detection heads using DIOU-NMS non-maximum suppression to obtain a comprehensive prediction result.
On the basis of the existing YOLOv5s detection heads, the added micro-scale detection head TPH specifically strengthens, for the task of detecting and identifying weak and small container serial number targets, the network's perception of small targets and its identification accuracy, helps the network better understand the relation between targets and their surroundings, improves the flexibility and extensibility of the network, and greatly improves the detection and identification capability for container serial numbers.
The CSP modules corresponding to the original small-scale detection head of YOLOv5s and the micro-scale detection head TPH are replaced by a Swin-Transformer encoder, a depthwise separable convolution layer, a Concat layer and a CBL layer, with the residual structure in the CSP modules retained; the Swin-Transformer encoder consists of Patch Partition, Linear Embedding and Swin-Transformer Blocks, as shown in figs. 2 and 3. This improves the feature modeling capability, sequence-relation modeling capability and context awareness of the model, effectively improves the detection of weak and small container serial number targets, compensates for the CSP module's weak global feature perception, and improves the accuracy, robustness and efficiency of container serial number detection and identification.
A BiFormer attention mechanism layer is introduced into the micro-scale detection head TPH and the original small-scale detection head of YOLOv5s, specifically by inserting the BiFormer attention mechanism layer in the middle of the Conv layers connected to the two detection heads. BiFormer proposes a new dynamic sparse attention method, BRA (Bi-Level Routing Attention), that enables more flexible, content-aware allocation of computation. Specifically, for each query, irrelevant key-value pairs are first filtered out at the coarse region level, then fine-grained token-to-token attention is applied in the union of the remaining candidate regions; sparsity saves computation and memory, and only GPU-friendly dense matrix multiplications are involved.
BiFormer processes a small subset of relevant tokens in a query-adaptive manner without interference from other irrelevant tokens, so it offers good performance and high computational efficiency, especially in dense prediction tasks. Meanwhile, BRA saves computation through sparse sampling rather than downsampling, preserving fine-grained details and greatly improving small-scale and micro-scale object detection and identification in the container serial number task. A simplified sketch of the routing idea follows.
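The sketch below is a heavily simplified, single-head illustration of the bi-level routing idea (coarse region-to-region routing, then fine token-to-token attention in the selected regions); the region count, top-k value and pooling-by-mean are illustrative assumptions, not the BiFormer paper's exact design.

```python
# Simplified Bi-Level Routing Attention: coarse routing picks top-k regions
# per query region, fine attention runs only over tokens in those regions.
import torch
import torch.nn as nn

class SimplifiedBRA(nn.Module):
    def __init__(self, dim, n_regions=4, topk=2):
        super().__init__()
        self.nr, self.topk = n_regions, topk
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)

    def forward(self, x):                       # x: (B, N, dim), N divisible by n_regions
        b, n, d = x.shape
        m = n // self.nr                        # tokens per region
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        qr = q.view(b, self.nr, m, d).mean(2)   # region-level queries
        kr = k.view(b, self.nr, m, d).mean(2)   # region-level keys
        route = qr @ kr.transpose(-2, -1)       # (b, nr, nr) region affinity
        idx = route.topk(self.topk, dim=-1).indices   # coarse-level filtering
        k = k.view(b, self.nr, m, d)
        v = v.view(b, self.nr, m, d)
        out = torch.empty_like(q)
        for r in range(self.nr):                # fine token-to-token attention
            sel = idx[:, r]                     # (b, topk) selected regions
            kg = torch.stack([k[i, sel[i]].reshape(-1, d) for i in range(b)])
            vg = torch.stack([v[i, sel[i]].reshape(-1, d) for i in range(b)])
            qi = q[:, r * m:(r + 1) * m]
            attn = (qi @ kg.transpose(-2, -1)) * self.scale
            out[:, r * m:(r + 1) * m] = attn.softmax(-1) @ vg
        return out

# y = SimplifiedBRA(96)(torch.randn(2, 64, 96))   # same shape as the input
```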
Neck modules such as CBL, Concat and CSP are similar to those in the trunk and are used for feature fusion and processing of multi-scale information. The original input image, after processing by the trunk layers, yields feature maps at the four sizes 19 × 19, 38 × 38, 76 × 76 and 152 × 152; the final feature map is obtained through multi-layer CBL modules, an FPN (Feature Pyramid Network) feature pyramid structure and an SPP module with a 5 × 5 pooling kernel. Feature maps from different layers are fused together through upsampling and downsampling operations to generate a multi-scale feature pyramid: the top-down part fuses coarse-grained feature maps through upsampling to realize feature fusion across levels, and the bottom-up part fuses feature maps from different layers through convolution layers. Specifically, the top-down part fuses with the coarse-grained feature map through upsampling in the following three steps:
(1) Up-sampling the last layer of feature map to obtain a finer feature map;
(2) Fusing the feature map obtained after upsampling with the feature map of the upper layer to obtain richer feature expression;
(3) Repeating the above two steps until reaching the highest layer;
the feature graphs from different layers are fused through a convolution layer from the bottom to the top, and the method comprises the following three steps:
(1) Convolving the bottom-layer feature map to obtain a deeper feature expression;
(2) Fusing the convolved feature map with the feature map of the upper layer to obtain richer feature expression;
(3) Repeating the above two steps until reaching the highest layer;
Finally, the feature maps of the top-down part and the bottom-up part are fused to obtain the final feature map for target detection and identification; a schematic sketch of the two passes is given below.
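The following PyTorch sketch illustrates the two fusion passes just described: the top-down pass upsamples a deeper map and adds it to the shallower one, and the bottom-up pass downsamples with a strided convolution and fuses upward. The lateral 1 × 1 convolutions and channel counts are illustrative assumptions.

```python
# Schematic two-way pyramid fusion (top-down upsample pass, bottom-up conv pass).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoWayFusion(nn.Module):
    def __init__(self, channels=(64, 128, 256, 512), out_c=128):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_c, 1) for c in channels)
        self.down = nn.ModuleList(
            nn.Conv2d(out_c, out_c, 3, stride=2, padding=1) for _ in channels[:-1])

    def forward(self, feats):                   # feats ordered shallow -> deep
        feats = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down: upsample the deeper map, fuse with the shallower one above it
        for i in range(len(feats) - 2, -1, -1):
            feats[i] = feats[i] + F.interpolate(feats[i + 1], scale_factor=2)
        # bottom-up: convolve the shallower map, fuse with the deeper one
        for i in range(len(feats) - 1):
            feats[i + 1] = feats[i + 1] + self.down[i](feats[i])
        return feats

# p = [torch.randn(1, c, s, s)
#      for c, s in zip((64, 128, 256, 512), (152, 76, 38, 19))]
# fused = TwoWayFusion()(p)   # four maps at the same scales, 128 channels each
```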
The prediction results of the micro-scale detection head TPH are fused with those of the other detection heads using DIOU-NMS non-maximum suppression to obtain a comprehensive prediction result. DIOU-NMS considers not only the IOU but also the distance between the center points of the two boxes, greatly reducing the probability of missed detections.
The specific formula is as follows:

$$s_i=\begin{cases} s_i, & \mathrm{IoU}(M,B_i)-\dfrac{\rho^{2}(b,b^{gt})}{c^{2}}<\varepsilon \\[4pt] 0, & \mathrm{IoU}(M,B_i)-\dfrac{\rho^{2}(b,b^{gt})}{c^{2}}\ge\varepsilon \end{cases}$$

where $s_i$ is the classification confidence, ε is the NMS threshold, M is the box with the highest confidence, $B_i$ is the box to be processed, $\rho^{2}(b,b^{gt})$ is the squared center-point distance between the two boxes b and $b^{gt}$, and c is the diagonal length of the smallest box containing both.
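A sketch of DIoU-NMS following the formula above: a candidate box is suppressed only when its IoU with the highest-confidence box M, penalized by the normalized center distance, reaches the threshold ε.

```python
# DIoU-NMS: suppress B_i when IoU(M, B_i) - rho^2/c^2 >= eps.
import torch

def diou_nms(boxes, scores, eps=0.5):
    """boxes: (N, 4) float tensor as x1, y1, x2, y2; scores: (N,). Returns kept indices."""
    area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        m = order[0]
        keep.append(m.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        # IoU of M with the remaining boxes
        x1 = torch.maximum(boxes[m, 0], boxes[rest, 0])
        y1 = torch.maximum(boxes[m, 1], boxes[rest, 1])
        x2 = torch.minimum(boxes[m, 2], boxes[rest, 2])
        y2 = torch.minimum(boxes[m, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
        iou = inter / (area(boxes[m:m + 1]) + area(boxes[rest]) - inter)
        # squared center distance over squared enclosing-box diagonal
        cm = (boxes[m, :2] + boxes[m, 2:]) / 2
        cr = (boxes[rest, :2] + boxes[rest, 2:]) / 2
        ex1 = torch.minimum(boxes[m, 0], boxes[rest, 0])
        ey1 = torch.minimum(boxes[m, 1], boxes[rest, 1])
        ex2 = torch.maximum(boxes[m, 2], boxes[rest, 2])
        ey2 = torch.maximum(boxes[m, 3], boxes[rest, 3])
        diag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
        penalty = ((cm - cr) ** 2).sum(-1) / diag2
        order = rest[(iou - penalty) < eps]      # keep only unsuppressed boxes
    return keep
```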
7. Multi-model fusion prediction based on loss function optimization
Considering that the weak and small container serial number targets involved in this patent are mostly rectangular text-box targets, and in order to weaken the differences between data subsets during training, this patent optimizes multi-model fusion prediction based on the LOSS function. Specifically, the LOSS function is changed from CIOU LOSS to EIOU LOSS, and a CTC LOSS function for adjusting the alignment of input and output sequences is introduced. The training set is divided by bootstrap sampling into n training sets, which are trained independently to obtain n homogeneous models; the prediction results of the n models are fused with WBF weighted bounding boxes to obtain the final prediction result; this result is used to adjust the parameters of each model, and the procedure returns to the step of adding a Swin-Transformer encoder to replace CSP modules in YOLOv5s until the loss functions of all models converge consistently, at which point model training is complete, the integration of the n models is complete, the final strong learner is obtained, and joint decision-making by multiple models is realized. Threshold filtering and bounding-box correction are then applied to the output of the multi-model fusion prediction: a classification-confidence threshold is set, only prediction results above the threshold are retained, and the predicted bounding boxes are fine-tuned and corrected to reduce false detections and improve accuracy.
Considering that there is no strong dependency between the individual learners, a series of individual learners can be generated in parallel, and the Bagging (bootstrap aggregating) algorithm is adopted for homogeneous integration; its working mechanism is as follows:
(1) The training set is randomly sampled n times with the bootstrap method, each sampling producing a sampling set of m samples;
(2) For the n sampling sets, n base learners are trained independently;
(3) The n base learners are combined into the final strong learner through an ensemble strategy.
The bootstrap method is sampling with replacement; compared with dividing the data evenly, it better simulates real data conditions.
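A minimal sketch of the Bagging workflow above: bootstrap sampling with replacement builds n training subsets and one homogeneous model is trained per subset. The `train_model` call stands in for the actual training pipeline and is an assumption.

```python
# Bootstrap (with-replacement) splitting for Bagging.
import random

def bootstrap_split(samples, n_sets, m=None):
    """Draw n_sets bootstrap samples of size m, each drawn with replacement."""
    m = m or len(samples)
    return [[random.choice(samples) for _ in range(m)] for _ in range(n_sets)]

# subsets = bootstrap_split(train_samples, n_sets=5)
# models = [train_model(s) for s in subsets]   # n independently trained models
```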
The n homogeneous models obtained from data-partitioned training then have their n prediction results fused with WBF weighted bounding boxes. The n prediction results are $(S_i, x_i^{1}, y_i^{1}, x_i^{2}, y_i^{2})$ for i = 1, 2, ..., n, where $S_i$ is the confidence of the i-th prediction box, $(x_i^{1}, y_i^{1})$ are the upper-left corner coordinates of the bounding box predicted by the i-th model and $(x_i^{2}, y_i^{2})$ are its lower-right corner coordinates. The final fused prediction is:

$$X=\frac{\sum_{i=1}^{n} S_i\,x_i}{\sum_{i=1}^{n} S_i},\qquad Y=\frac{\sum_{i=1}^{n} S_i\,y_i}{\sum_{i=1}^{n} S_i},\qquad S=\frac{1}{n}\sum_{i=1}^{n} S_i$$

where the confidence-weighted average is taken for each corner coordinate and S is the confidence of the fusion result.
A deep-learning neural network is a nonlinear method that offers great flexibility and can divide a data set into different proportions for training; one drawback of this flexibility is that training uses a stochastic training algorithm that is sensitive to the details of the training data, so each training run may find a different weight set and produce different predictions, i.e. the neural network has high variance. Multi-model fusion prediction based on data partitioning avoids this problem well. Note that, because the model is a fusion of YOLOv5s and the Swin-Transformer, n should not be too large if the Swin-Transformer modules are to work well in the data-partitioned multi-model fusion prediction, since the self-attention mechanism in the Swin-Transformer generally requires longer training and larger data sets to show its advantages; in this embodiment, n = 5.
The LOSS function is changed from CIOU LOSS to EIOU LOSS, which splits the aspect-ratio loss term into separate differences between the predicted width and height and the width and height of the minimum enclosing box, accelerating convergence and improving regression precision.
the specific formula is as follows:
L EIOU =L IOU +L dic +L asp (6)
wherein ρ is 2 (b,b gt ) Representing the Euclidean distanceB, w and h respectively represent the central point and width and height of the prediction frame, b gt ,w gt ,h gt Respectively represent the center point and the width and the height of the truth box, c w ,c h Representing the width and height of the combined minimum bounding rectangle for the prediction and truth boxes, respectively.
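A sketch of the EIOU loss in equation (6): the IoU term, the center-distance term and the separate width/height terms normalized by the enclosing box.

```python
# EIOU loss: (1 - IoU) + center-distance term + width term + height term.
import torch

def eiou_loss(pred, gt):
    """pred, gt: (N, 4) boxes as x1, y1, x2, y2; returns per-box loss."""
    area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    x1 = torch.maximum(pred[:, 0], gt[:, 0]); y1 = torch.maximum(pred[:, 1], gt[:, 1])
    x2 = torch.minimum(pred[:, 2], gt[:, 2]); y2 = torch.minimum(pred[:, 3], gt[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    iou = inter / (area(pred) + area(gt) - inter)
    # minimum enclosing rectangle of the two boxes
    ex1 = torch.minimum(pred[:, 0], gt[:, 0]); ey1 = torch.minimum(pred[:, 1], gt[:, 1])
    ex2 = torch.maximum(pred[:, 2], gt[:, 2]); ey2 = torch.maximum(pred[:, 3], gt[:, 3])
    cw, ch = ex2 - ex1, ey2 - ey1
    cp = (pred[:, :2] + pred[:, 2:]) / 2; cg = (gt[:, :2] + gt[:, 2:]) / 2
    l_dis = ((cp - cg) ** 2).sum(-1) / (cw ** 2 + ch ** 2)   # rho^2(b, b_gt) / c^2
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wg, hg = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    l_asp = (wp - wg) ** 2 / cw ** 2 + (hp - hg) ** 2 / ch ** 2
    return (1 - iou) + l_dis + l_asp             # L_IOU + L_dis + L_asp
```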
A CTC loss function is introduced to adjust the alignment of the input and output sequences; the optimization goal of the CTC loss is to maximize the conditional probability P(z|x) of the target sequence, thereby finding the optimal alignment so that the predicted sequence is as close to the target sequence as possible.
the specific formula is as follows:
L(S)=-ln∏(x,z)∈sP(z|x)=-∑(x,z)∈slnP(z|x) (8)
where P (z|x) represents the probability of outputting the sequence z given the input x, S is the training set.
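PyTorch ships a CTC loss that maximizes P(z|x) by summing over all valid alignments; a minimal usage sketch for sequence-recognition training, with all sizes (time steps, batch, class count, label length) chosen purely for illustration.

```python
# CTC loss usage sketch with torch.nn.CTCLoss (class 0 reserved for "blank").
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)
T, B, C = 40, 8, 37                              # time steps, batch, classes
log_probs = torch.randn(T, B, C).log_softmax(2)  # network output, (T, B, C)
targets = torch.randint(1, C, (B, 12))           # serial-number label sequences
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```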
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention replaces the 2nd and 4th CSP modules in the backbone network of the original YOLOv5s with a Swin-Transformer encoder, a depthwise separable convolution layer, a Concat layer and a CBL layer, retaining the residual structure in the CSP modules; the self-attention mechanism of the Swin-Transformer encoder complements the local attention of the CNN (Convolutional Neural Network) in the original YOLOv5s network, strengthens global dependencies and long-distance modeling, extracts richer shallow image features, enhances the detection of weak and small targets, and, with the introduced depthwise separable convolution layers, reduces model complexity.
(2) On the basis of the original YOLOv5s feature fusion network, the invention adds a cascade composed of a CSP module, a depthwise separable convolution layer, an Upsample layer, a Concat layer, a Swin-Transformer encoder, a depthwise separable convolution layer, a Concat layer, a CBL layer and a depthwise separable convolution layer for micro-scale feature mapping, changes the Upsample layer of the neck network to CARAFE, and thus builds a feature fusion network that strengthens weak and small target detection and identification; the lightweight upsampling operator CARAFE further reduces the computation of the model.
(3) The invention adds a micro-scale detection head TPH, introduces a BiFormer attention mechanism layer into the TPH and the original small-scale detection head of YOLOv5s, replaces the CSP modules corresponding to the original small-scale detection head of YOLOv5s and the micro-scale detection head TPH with a Swin-Transformer encoder, a depthwise separable convolution layer, a Concat layer and a CBL layer, and retains the residual structure in the CSP modules.
(4) The invention adopts multi-model fusion prediction and fuses the prediction results of multiple models with WBF weighted bounding boxes to obtain the final prediction result; with huge data sets this effectively weakens the differences between data subsets, prevents overfitting and strengthens the generalization of the model. Meanwhile, changing the LOSS function from CIOU LOSS to EIOU LOSS accelerates convergence and improves regression precision, and introducing a CTC LOSS function that adjusts the alignment of input and output sequences makes the method better suited to the task of detecting and identifying weak and small container serial number targets.
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art will understand that modifications and substitutions may be made to specific embodiments of the present invention without departing from its spirit and scope, which is intended to be covered by the claims.