CN115527105A - Underwater target detection method based on multi-scale feature learning - Google Patents

Underwater target detection method based on multi-scale feature learning

Info

Publication number
CN115527105A
Authority
CN
China
Prior art keywords
feature
texture
image
gray level
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211190261.5A
Other languages
Chinese (zh)
Inventor
罗俊海
陈瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202211190261.5A priority Critical patent/CN115527105A/en
Publication of CN115527105A publication Critical patent/CN115527105A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/05 Underwater scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/52 Scale-space analysis, e.g. wavelet analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/54 Extraction of image or video features relating to texture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/763 Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an underwater target detection method based on multi-scale feature learning. The method increases the number of prediction heads to 4 so that small targets are detected better, and reduces the resource footprint of the network model so that it is suitable for embedded devices with limited hardware. Compared with other advanced target detectors, it further improves the detection accuracy of underwater targets, detects targets effectively, and offers good detection accuracy and real-time performance.

Description

Underwater target detection method based on multi-scale feature learning
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to an underwater target detection method based on multi-scale feature learning.
Background
Marine environments form a complex system. Because the underwater environment differs greatly from the land environment, remote sensing technologies including acoustics, magnetics and shallow three-dimensional seismic surveying have achieved good performance in the marine field. With the development of computer vision technology, exploring the ocean with computer vision has become a new approach. As a basic task of computer vision, target detection based on optical imaging has become a research hotspot in the marine field, and in recent years many researchers have begun to study underwater target detection based on optical imaging with great success. Underwater target detection currently has many applications in marine environments, including marine ecosystem research, marine population estimation, marine species protection, ocean fisheries, underwater unexploded ordnance detection, underwater archaeology and many other potential applications, providing an effective way to develop marine resources. Although target detectors based on deep convolutional neural networks (DCNN) have performed well on general-purpose datasets in recent years, they cannot be applied directly to underwater scenes because large networks are slow and their models are large. Underwater scenes are more complex than terrestrial scenes, and the image samples obtained by underwater cameras are few and generally of low quality; in particular, these images often suffer from high noise, low visibility, blurred edges, low contrast and color cast.
Furthermore, although object detection techniques achieve good performance on general datasets, image quality is generally poor in complex underwater environments because of low visibility and color deviation. The performance of the commonly used YOLOv5 is therefore not satisfactory on underwater images of complex scenes, and small objects in particular yield little extractable information, which makes satisfactory results difficult to obtain, so the YOLOv5 model still leaves much room for improvement. The original feature extraction network of YOLOv5 has 53 layers, with a huge number of parameters and a large amount of computation; in addition, because of the limitations of its convolutional sampling method, the hand-designed convolutional network used by CSPDarknet53 is not sensitive to targets of different scales. Its ability to handle geometric changes in features is relatively limited, so extensive image training is required to improve the generalization ability of the network. In practical use of YOLOv5, if the network encounters elements not present in the dataset, missed detections and false detections are likely to occur, which degrades the detection results. In underwater target detection research based on deep learning, most work focuses on using deeper networks, so the extracted features are not rich enough and the detection accuracy is not high enough, especially for small targets. Better detection performance is therefore needed before target detection techniques can be applied in marine environments.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an underwater target detection method based on multi-scale feature learning.
The technical scheme of the invention is as follows: an underwater target detection method based on multi-scale feature learning comprises the following specific steps:
Step one, preprocessing the image,
aiming at underwater target detection, a data-independent augmentation method, the Mixup method, is adopted to construct virtual samples, increase sample robustness and enhance the images; during training, the Mosaic method is used to process the dataset: four images are read at random, subjected to operations such as scaling, flipping and cropping, and stitched into one image to serve as training data, which increases the diversity of the data and enriches the background of the detected objects.
Step two, extracting the depth characteristics,
the CSPDarknet53 network structure is improved: the feature extraction network is replaced with ESNet, a dual-backbone network is adopted to extract richer features, and an SE module is added to the ES block.
When stride = 2, depthwise and pointwise convolutions are added to fuse information from different channels; when stride = 1, a Ghost module is introduced into the ES block to improve its performance.
The ES block in ESNet adopts depthwise separable convolution, splitting the convolution into depthwise (DW) and pointwise (PW) parts to extract feature maps, and the Ghost block is applied in ES blocks with a stride of 1.
Step three, extracting the texture characteristics,
aiming at the texture information of the target, a gray-level co-occurrence matrix (GLCM) is used to describe the degree of correlation between adjacent pixels in a local area, reflecting the combined information of the gray-level direction, interval and variation amplitude of the image.
The RGB image is divided into a 3 × 3 grid. For each grid region, the multi-channel image is converted into a gray image, the contrast is adjusted by histogram equalization, and the gray levels of the result are then compressed to reduce later computation. Statistical parameters derived from the GLCM are then used to analyze the texture classification features caused by the repeated appearance of specific gray-level structures, and the contrast, energy and inverse variance are selected to express the texture features.
The contrast expression is as follows:
Con = Σ_i Σ_j (i - j)^2 P(i, j)
where i denotes the i-th row of the gray-level matrix, j denotes the j-th column, P(i, j) denotes the probability that gray level j appears starting from gray level i, and Con denotes the resulting contrast.
The angular second moment (energy) expression is as follows:
Asm = Σ_i Σ_j P(i, j)^2
where P(i, j) denotes the probability that gray level j appears starting from gray level i, and Asm denotes the resulting angular second moment.
The inverse variance expression is as follows:
H(d, θ) = Σ_{i=1}^{N} Σ_{j=1}^{N} P(i, j | d, θ) / (1 + (i - j)^2)
where d denotes the spatial distance, θ denotes the direction, H denotes the inverse variance, P(i, j | d, θ) denotes the probability that gray level j appears starting from gray level i given the spatial distance d and direction θ, and N denotes the maximum number of rows (columns) of the gray-level matrix.
Different texture patterns are examined in the horizontal, vertical and diagonal directions, i.e. the gray-level co-occurrence matrix is computed for the four directions 0°, 45°, 90° and 135°. On the basis of the co-occurrence matrix in each direction, the three informative texture feature parameters contrast, angular second moment and inverse variance are extracted, and the feature mean over all images is calculated and normalized:
x* = (x - x_min) / (x_max - x_min)
where x is the feature value to be normalized, x_max is the largest feature value in the matrix, x_min is the smallest feature value, and x* is the normalized feature value.
The classifier compares texture features between grid regions while attending to the texture features of each local grid, and finally outputs a 1 × 108 feature vector representing the image texture features.
Step four, fusion processing of the depth features and texture features,
the extracted texture features and depth features are fused so that the network obtains richer feature information. After feature extraction, a coordinate attention mechanism is adopted: a new coordinate attention block is added at the last layer of the feature extraction network. Coordinate attention (CA) embeds position information into the channel attention; the coordinate attention block decomposes the channel attention into two one-dimensional feature-encoding processes that aggregate features along different directions, and the generated feature maps are then encoded separately to form a pair of direction-aware and position-sensitive feature maps.
Step five, prediction on the feature maps to obtain the result,
the prediction module transfers and fuses feature information into feature maps by up-sampling and performs prediction on the resulting feature maps to obtain the final result. The prediction module comprises 4 YOLO heads; each lower YOLO head uses the feature information of the YOLO head above it, which is up-sampled and concatenated with feature maps to obtain the feature map of that layer for prediction.
The beneficial effects of the invention are as follows: the invention discloses an underwater target detection method based on multi-scale feature learning. The method increases the number of prediction heads to 4 so that small targets are detected better, and reduces the resource footprint of the network model so that it is suitable for embedded devices with limited hardware. Compared with other advanced target detectors, it further improves the detection accuracy of underwater targets, detects targets effectively, and offers good detection accuracy and real-time performance.
Drawings
Fig. 1 is a flowchart of an underwater target detection method based on multi-scale feature learning according to the present invention.
Fig. 2 is an overall structure diagram of an underwater target detection method based on multi-scale feature learning according to the present invention.
Fig. 3 is a structural diagram of an ESNet according to an embodiment of the present invention.
Fig. 4 is an ES block structure diagram according to an embodiment of the present invention.
Fig. 5 is a structural diagram of a texture feature extraction module according to an embodiment of the present invention.
FIG. 6 is a diagram of a coordinate attention structure according to an embodiment of the present invention.
Detailed Description
The method of the present invention is further described with reference to the accompanying drawings and examples.
As shown in the flowchart of fig. 1, the underwater target detection method based on multi-scale feature learning according to the present invention includes the following specific steps:
as shown in fig. 2, the overall structure of the underwater target detection method based on multi-scale feature learning of the present invention is based on YOLOv5 detection algorithm, and is composed of a multi-scale feature extraction network, an attention mechanism based on coordinate attention block, and a YOLOv5 target detector, wherein a CBL module is composed of convolution operation, batch processing, and an activation function, the multi-scale feature extraction part includes a depth feature extraction part and a texture feature extraction part, and a schematic structural diagram in the depth feature extraction part is shown in fig. 3.
Step one, the image is preprocessed,
the quality of partial images of the data set is low due to the conditions of underwater depth, light and the like of the underwater images, and the problems of target blurring, low contrast or target overlapping and shielding are mainly shown, so that the detection precision is greatly influenced.
Aiming at underwater target detection, a data-independent augmentation method, the Mixup method, is adopted to construct virtual samples, increase sample robustness and enhance the images. During training, the Mosaic method is used to process the dataset: four images are read at random, subjected to operations such as scaling, flipping and cropping, and stitched into one image to serve as training data, which increases the diversity of the data and enriches the background of the detected objects.
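A simplified sketch of the two augmentations described above is given below; the Beta parameter for Mixup, the 416-pixel canvas and the even 2 × 2 split for Mosaic are assumptions, and the adjustment of bounding-box labels for Mosaic (shifting and rescaling the boxes of each tile) is omitted for brevity.

```python
import numpy as np
import cv2

def mixup(img1, img2, boxes1, boxes2, alpha=1.5):
    # Mixup: blend two same-sized images with a Beta-distributed ratio
    # and keep the box labels of both images.
    lam = np.random.beta(alpha, alpha)
    mixed = (lam * img1.astype(np.float32)
             + (1.0 - lam) * img2.astype(np.float32)).astype(np.uint8)
    return mixed, boxes1 + boxes2          # boxes are plain Python lists

def mosaic(images, out_size=416):
    # Simplified Mosaic: resize four images and stitch them onto one 2x2 canvas.
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    offsets = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(images, offsets):
        canvas[y:y + half, x:x + half] = cv2.resize(img, (half, half))
    return canvas
```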
Step two, extracting the depth characteristics,
the structure of the CSPDarknet53 network is improved: the feature extraction network is replaced with ESNet, and a dual-backbone network is adopted to extract richer features. The input image size is 416 × 416 × 3, and after feature extraction a single ESNet outputs three feature maps with sizes of 384, 192 and 96; the specific structure is shown in fig. 4. An SE module is added to the ES block, SE being a very effective operation for improving feature expression capability.
The design of the SE module follows MobileNetV3, i.e. its two activation functions are Sigmoid and H-Sigmoid. To address the loss of fused features caused by the channel shuffle that ShuffleNetV2 uses to exchange information between channels, depthwise and pointwise convolutions are added to fuse information from different channels when stride = 2. The Ghost module in GhostNet can generate more features with fewer parameters and improve the learning capability of the model, so when stride = 1 a Ghost module is introduced into the ES block to improve its performance.
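The description names Sigmoid and H-Sigmoid as the two activations of the SE module; the sketch below uses the common MobileNetV3 arrangement of a ReLU squeeze followed by a hard-sigmoid gate, which, together with the reduction ratio of 4, is an assumption.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block in the MobileNetV3 style:
    global average pooling, channel squeeze, then a hard-sigmoid gate."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Hardsigmoid(inplace=True),  # H-Sigmoid gate
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))
```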
At present, most convolution operations use point-by-point convolution to reduce dimensionality. The ES block in ESNet adopts depthwise separable convolution, splitting the convolution into depthwise (DW) and pointwise (PW) parts to extract feature maps, and a Ghost block is used in ES blocks with a stride of 1. The Ghost block combines linear operations with ordinary convolution: redundant feature maps are obtained by linear transformation of the feature maps produced by ordinary convolution, yielding similar feature maps and thus the effect of high-dimensional convolution while reducing model parameters and computation. Finally, the image depth features S1, S2 and S3 are output.
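A sketch of the two building blocks named above, depthwise separable convolution (DW + PW) and a Ghost module, is given below; kernel sizes, the 50/50 split between ordinary and "cheap" feature maps and the use of an even channel count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DepthwiseSeparable(nn.Module):
    # Depthwise (per-channel) convolution followed by a pointwise (1x1) convolution.
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn2(self.pw(self.act(self.bn1(self.dw(x))))))

class GhostModule(nn.Module):
    # Ghost module: half the output channels come from an ordinary 1x1 convolution,
    # the other half from a cheap depthwise "linear" transform of those maps.
    def __init__(self, in_ch, out_ch):   # assumes out_ch is even
        super().__init__()
        primary = out_ch // 2
        self.primary_conv = nn.Sequential(
            nn.Conv2d(in_ch, primary, 1, bias=False),
            nn.BatchNorm2d(primary), nn.ReLU(inplace=True))
        self.cheap_op = nn.Sequential(
            nn.Conv2d(primary, primary, 3, 1, 1, groups=primary, bias=False),
            nn.BatchNorm2d(primary), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary_conv(x)
        return torch.cat([y, self.cheap_op(y)], dim=1)
```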
Step three, extracting the texture characteristics,
as shown in fig. 5, the texture feature extraction module extracts the texture information of the target. A gray-level co-occurrence matrix (GLCM) is used to describe the degree of correlation between adjacent pixels in a local area, reflecting the combined information of the gray-level direction, interval and variation amplitude of the image.
The RGB image is divided into a 3 × 3 grid. For each grid region, the multi-channel image is converted into a gray image, the contrast is adjusted by histogram equalization, and the gray levels of the result are then compressed to reduce later computation. Because the feature dimensionality of the raw gray-level co-occurrence statistics is large and a later classifier would have difficulty distinguishing texture information from it directly, statistical parameters derived from the GLCM are used to analyze the texture classification features caused by the repeated appearance of specific gray-level structures, and the following parameters are selected to express the texture features:
Contrast: measures how the matrix values are distributed and how much local variation there is in the image, reflecting the clarity of the image and the depth of its texture grooves. The deeper the texture grooves, the greater the contrast and the clearer the visual effect; conversely, a small contrast value means shallow grooves and a blurred effect. The expression is as follows:
Con = Σ_i Σ_j (i - j)^2 P(i, j)
where i denotes the i-th row of the gray-level matrix, j denotes the j-th column, P(i, j) denotes the probability that gray level j appears starting from gray level i, and Con denotes the resulting contrast.
Angular second moment (energy): the angular second moment (ASM) is an index reflecting the uniformity of the gray-level distribution and the coarseness of the texture. The more uniform the gray-level distribution within a local area, the larger the ASM value. The expression is as follows:
Asm = Σ_i Σ_j P(i, j)^2
where P(i, j) denotes the probability that gray level j appears starting from gray level i, and Asm denotes the angular second moment.
Inverse variance: the inverse variance reflects the magnitude of local changes in the image texture. If the different regions of the texture are uniform and change slowly, the inverse variance is large; otherwise it is small.
H(d, θ) = Σ_{i=1}^{N} Σ_{j=1}^{N} P(i, j | d, θ) / (1 + (i - j)^2)
where d denotes the spatial distance, θ denotes the direction and H denotes the inverse variance; P(i, j | d, θ) denotes the probability that gray level j appears starting from gray level i for a given spatial distance d and direction θ, and N denotes the maximum number of rows (columns) of the gray-level matrix.
Different texture patterns are examined in the horizontal, vertical and diagonal directions, i.e. the gray-level co-occurrence matrix is computed for the four directions 0°, 45°, 90° and 135°. On the basis of the co-occurrence matrix in each direction, the three informative texture feature parameters contrast, angular second moment and inverse variance are extracted, and the feature mean over all images is calculated and normalized:
x* = (x - x_min) / (x_max - x_min)
where x is the feature value to be normalized, x_max is the largest feature value in the matrix, x_min is the smallest feature value, and x* is the normalized feature value.
The classifier compares texture features between grid regions while attending to the texture features of each local grid, and finally outputs a 1 × 108 feature vector T representing the image texture features.
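A sketch of this texture branch using scikit-image is given below (graycomatrix/graycoprops; older scikit-image releases spell these greycomatrix/greycoprops). The pixel distance of 1, the compression to 16 gray levels and the per-image min-max normalization are assumptions standing in for details the text leaves open; the output length is 3 × 3 cells × 3 statistics × 4 directions = 108, matching the 1 × 108 feature T.

```python
import numpy as np
import cv2
from skimage.feature import graycomatrix, graycoprops

def texture_features(rgb, levels=16, grid=3):
    """Return a 108-dimensional texture vector: 3x3 grid x 3 statistics x 4 directions."""
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    gray = cv2.equalizeHist(gray)                                   # histogram equalization
    gray = (gray.astype(np.float32) * (levels - 1) / 255).astype(np.uint8)  # compress gray levels
    h, w = gray.shape
    angles = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]               # 0, 45, 90, 135 degrees
    feats = []
    for r in range(grid):
        for c in range(grid):
            cell = gray[r * h // grid:(r + 1) * h // grid,
                        c * w // grid:(c + 1) * w // grid]
            glcm = graycomatrix(cell, distances=[1], angles=angles,
                                levels=levels, symmetric=True, normed=True)
            for prop in ("contrast", "ASM", "homogeneity"):         # Con, Asm, inverse variance
                feats.extend(graycoprops(glcm, prop).ravel())        # one value per direction
    feats = np.asarray(feats, dtype=np.float32)
    return (feats - feats.min()) / (feats.max() - feats.min() + 1e-8)  # min-max normalization
```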
Step four, depth feature and texture feature fusion processing,
and respectively extracting image texture information from the images I1 and I2, then fusing the image texture information with the extracted depth features respectively, so that the network obtains richer feature information, and finally extracting network output features C3, C4 and C5 by using the features. Expressed as:
C3=concat(S11+T1,S21+T2)
C4=concat(S12,S22)
C5=concat(S13,S23)
where T1 denotes the texture feature information extracted from the 1st image, T2 denotes the texture feature information extracted from the 2nd image, and Sij (i ∈ {1, 2}, j ∈ {1, 2, 3}) denotes the j-th depth feature extracted from the i-th image.
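The text does not specify how the 1 × 108 texture vector is combined with a spatial depth feature map before concatenation; the sketch below shows one plausible reading in which the texture vector is linearly projected to the channel dimension and broadcast over the spatial grid, so the projection layer and the shared channel width are assumptions.

```python
import torch
import torch.nn as nn

class DepthTextureFusion(nn.Module):
    """Computes C3 = concat(S11 + T1, S21 + T2) for two backbone branches.
    The linear projection of the 108-dim texture vector onto the channel
    dimension is an assumed design choice, not taken from the description."""
    def __init__(self, channels, texture_dim=108):
        super().__init__()
        self.proj = nn.Linear(texture_dim, channels)

    def forward(self, s1, s2, t1, t2):
        b = s1.size(0)
        t1 = self.proj(t1).view(b, -1, 1, 1)   # broadcast over the spatial grid
        t2 = self.proj(t2).view(b, -1, 1, 1)
        return torch.cat([s1 + t1, s2 + t2], dim=1)
```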
After the feature maps have been reduced in size, a coordinate attention mechanism is adopted. Although channel attention can significantly improve model performance, it usually ignores position information, which is very important for generating spatially selective attention. To enable the model to extract more useful features, a new coordinate attention block is added at the last layer of the feature extraction network, as shown in fig. 6. Coordinate attention (CA) embeds position information into the channel attention so that the network can focus on a larger area. Unlike channel attention, which converts the input into a single feature vector through two-dimensional global pooling, the coordinate attention block decomposes the channel attention into two one-dimensional feature-encoding processes that aggregate features along different directions; the resulting feature maps are then encoded separately to form a pair of direction-aware and position-sensitive feature maps, which are applied complementarily to the input feature map to enhance the representation of the object of interest.
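A sketch of a coordinate attention block of the kind described above (pooling along height and width separately, joint encoding, then two direction-wise gates applied to the input) is given below; the reduction ratio and the Hardswish activation follow common practice and are assumptions here.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    # Pool along height and width separately, encode the two 1-D descriptors jointly,
    # then split them into two direction-aware attention maps that re-weight the input.
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        xh = self.pool_h(x)                                      # (B, C, H, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)                  # (B, C, W, 1)
        y = self.act(self.bn(self.conv1(torch.cat([xh, xw], dim=2))))
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                      # attention along height
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))  # attention along width
        return x * ah * aw
```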
Step five, prediction on the feature maps to obtain the result,
the prediction module transfers and fuses feature information into feature maps by up-sampling and performs prediction on the resulting feature maps to obtain the final result. The prediction module comprises 4 YOLO heads; each lower YOLO head uses the feature information of the YOLO head above it, which is up-sampled and concatenated with feature maps to obtain the feature map of that layer for prediction. Compared with the original three-layer prediction of YOLOv5, the improvement in this embodiment is that a YOLO head dedicated to detecting small target samples is added to the prediction module, so that the YOLO heads now operate at 96 × 96, 48 × 48, 24 × 24 and 12 × 12, which benefits the detection of small underwater targets.
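How the extra, higher-resolution head can be wired is sketched below: the feature map feeding a deeper head is up-sampled, concatenated with a shallower backbone feature, and passed through a 1 × 1 convolution that produces the additional YOLO head output. The channel widths and the assumed 3-anchor, 4-class output size are illustrative.

```python
import torch
import torch.nn as nn

class ExtraHeadNeck(nn.Module):
    """Adds a higher-resolution detection head: up-sample the deeper feature,
    concatenate it with a shallower one (twice its spatial size), fuse, predict.
    Channel widths and the 3 * (5 + 4) output size are illustrative assumptions."""
    def __init__(self, deep_ch=128, shallow_ch=64, num_outputs=3 * (5 + 4)):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = nn.Conv2d(deep_ch + shallow_ch, shallow_ch, 1)
        self.head = nn.Conv2d(shallow_ch, num_outputs, 1)   # extra YOLO head

    def forward(self, deep_feat, shallow_feat):
        x = torch.cat([self.up(deep_feat), shallow_feat], dim=1)
        return self.head(self.fuse(x))
```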
During data annotation the sizes of the ground-truth boxes differ greatly from the default sizes of the original algorithm, and the prior box sizes affect both detection speed and accuracy. Choosing prior box sizes matched to the experimental model speeds up convergence during training and makes target localization more accurate, so the anchor box sizes need to be redesigned: after the network improvement, the number of prior box sizes is increased from the default 9 to 12, distributed over the 4 detection layers of different scales. The original YOLOv5 uses the K-Means clustering algorithm, but K-Means has the drawback that clustering results can differ considerably depending on the initial centroids chosen; to address this weakness in selecting initial centroids, K-Means++ is adopted instead of K-Means. The algorithm proceeds as follows:
first, one sample is selected at random from the data samples as the initial cluster centroid; then the shortest distance D(x) between every other sample in the dataset and the currently existing cluster centers is computed, and the probability P(x) of each sample being selected as the next cluster center is calculated, where the sample with the largest probability becomes the next cluster center. The formula for P(x) is:
P(x) = D(x)^2 / Σ_{x∈X} D(x)^2
where x represents a sample, X represents the set of all samples, and P(x) represents the probability that sample x is selected as the next cluster center.
The distance and probability calculations are then repeated each time a cluster center is selected, until K cluster centers have been chosen. Because a detection layer of an additional scale is added, the number of cluster centers K is set to 12, and the K-Means++ clustering algorithm then yields 12 groups of prior boxes of different sizes, i.e. the anchor box sizes. The smallest feature map has the largest receptive field and is therefore suited to detecting larger targets, while the largest feature map has a smaller receptive field, so the smallest anchors are applied to detect small targets.
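The seeding procedure can be sketched as follows, clustering the (width, height) pairs of the labelled boxes; plain Euclidean distance is used here for simplicity (anchor clustering is often done with an IoU-based distance instead), and the deterministic argmax step mirrors the description above rather than the randomized selection of canonical K-Means++.

```python
import numpy as np

def kmeans_pp_init(boxes, k=12, seed=0):
    """K-Means++ seeding over (width, height) pairs; boxes has shape (N, 2)."""
    rng = np.random.default_rng(seed)
    centers = [boxes[rng.integers(len(boxes))]]      # first centroid chosen at random
    for _ in range(k - 1):
        # D(x): shortest distance from each sample to the already chosen centers
        d2 = np.min([np.sum((boxes - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                        # P(x) = D(x)^2 / sum_X D(x)^2
        # The description picks the sample with the largest P(x) as the next center;
        # canonical K-Means++ would instead sample proportionally to P(x).
        centers.append(boxes[np.argmax(probs)])
    return np.array(centers)
```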
The above description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited thereto. Any modification or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention, based on the technical solutions and inventive concept of the present invention, shall fall within the scope of protection of the present invention.

Claims (1)

1. An underwater target detection method based on multi-scale feature learning comprises the following specific steps:
step one, the image is preprocessed,
aiming at underwater target detection, a data-independent augmentation method, the Mixup method, is adopted to construct virtual samples, increase sample robustness and enhance the images; during training, the Mosaic method is used to process the dataset: four images are read at random, subjected to operations such as scaling, flipping and cropping, and stitched into one image to serve as training data, which increases the diversity of the data and enriches the background of the detected objects;
step two, extracting the depth characteristics,
the CSPDarknet53 network structure is improved: the feature extraction network is replaced with ESNet, a dual-backbone network is adopted to extract richer features, and an SE module is added to the ES block;
to address the loss of fused features caused by the channel shuffle used in ShuffleNet V2 for information exchange between channels, depthwise and pointwise convolutions are added to fuse information from different channels when stride = 2, and a Ghost module is introduced into the ES block when stride = 1 to improve its performance;
the ES block in ESNet adopts depthwise separable convolution, splitting the convolution into depthwise (DW) and pointwise (PW) parts to extract feature maps, and a Ghost block is used in ES blocks with a stride of 1;
step three, extracting the texture characteristics,
aiming at the texture information of the target, a gray-level co-occurrence matrix (GLCM) is used to describe the degree of correlation between adjacent pixels in a local area, reflecting the combined information of the gray-level direction, interval and variation amplitude of the image;
the RGB image is divided into a 3 × 3 grid; for each grid region, the multi-channel image is converted into a gray image, the contrast is adjusted by histogram equalization, and the gray levels of the result are then compressed; statistical parameters derived from the GLCM are then used to analyze the texture classification features caused by the repeated appearance of specific gray-level structures, and contrast, energy and inverse variance are selected to express the texture features;
the contrast expression is as follows:
Con = Σ_i Σ_j (i - j)^2 P(i, j)
wherein i denotes the i-th row of the gray-level matrix, j denotes the j-th column, P(i, j) denotes the probability that gray level j appears starting from gray level i, and Con denotes the resulting contrast;
the angular second moment (energy) expression is as follows:
Asm = Σ_i Σ_j P(i, j)^2
wherein P(i, j) denotes the probability that gray level j appears starting from gray level i, and Asm denotes the resulting angular second moment;
the inverse variance expression is as follows:
H(d, θ) = Σ_{i=1}^{N} Σ_{j=1}^{N} P(i, j | d, θ) / (1 + (i - j)^2)
where d denotes the spatial distance, θ denotes the direction, H denotes the inverse variance, P(i, j | d, θ) denotes the probability that gray level j appears starting from gray level i given the spatial distance d and direction θ, and N denotes the maximum number of rows (columns) of the gray-level matrix;
different texture patterns are examined in the horizontal, vertical and diagonal directions, i.e. the gray-level co-occurrence matrix is computed for the four directions 0°, 45°, 90° and 135°; on the basis of the co-occurrence matrix in each direction, the three informative texture feature parameters contrast, angular second moment and inverse variance are extracted, and the feature mean over all images is calculated and normalized:
x* = (x - x_min) / (x_max - x_min)
wherein x is the feature value to be normalized, x_max is the largest feature value in the matrix, x_min is the smallest feature value, and x* is the normalized feature value;
the classifier compares texture features between grid regions while attending to the texture features of each local grid, and finally outputs a 1 × 108 feature vector representing the image texture features;
step four, fusion processing of the depth features and texture features,
the extracted texture features and depth features are fused; after feature extraction a coordinate attention mechanism is adopted, with a new coordinate attention block added at the last layer of the feature extraction network; coordinate attention (CA) embeds position information into the channel attention, and the coordinate attention block decomposes the channel attention into two one-dimensional feature-encoding processes that aggregate features along different directions, after which the generated feature maps are encoded separately to form a pair of direction-aware and position-sensitive feature maps;
step five, prediction on the feature maps to obtain the result,
the prediction module transfers and fuses feature information into feature maps by up-sampling and performs prediction on the resulting feature maps to obtain the final result; the prediction module comprises 4 YOLO heads, and each lower YOLO head uses the feature information of the YOLO head above it, which is up-sampled and concatenated with feature maps to obtain the feature map of that layer for prediction.
CN202211190261.5A 2022-09-28 2022-09-28 Underwater target detection method based on multi-scale feature learning Pending CN115527105A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211190261.5A CN115527105A (en) 2022-09-28 2022-09-28 Underwater target detection method based on multi-scale feature learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211190261.5A CN115527105A (en) 2022-09-28 2022-09-28 Underwater target detection method based on multi-scale feature learning

Publications (1)

Publication Number Publication Date
CN115527105A true CN115527105A (en) 2022-12-27

Family

ID=84699071

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211190261.5A Pending CN115527105A (en) 2022-09-28 2022-09-28 Underwater target detection method based on multi-scale feature learning

Country Status (1)

Country Link
CN (1) CN115527105A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115953408A (en) * 2023-03-15 2023-04-11 国网江西省电力有限公司电力科学研究院 YOLOv7-based lightning arrester surface defect detection method
CN115953408B (en) * 2023-03-15 2023-07-04 国网江西省电力有限公司电力科学研究院 YOLOv7-based lightning arrester surface defect detection method

Similar Documents

Publication Publication Date Title
CN110738697B (en) Monocular depth estimation method based on deep learning
CN110020606B (en) Crowd density estimation method based on multi-scale convolutional neural network
CN111723693B (en) Crowd counting method based on small sample learning
CN113011329B (en) Multi-scale feature pyramid network-based and dense crowd counting method
CN114241548A (en) Small target detection algorithm based on improved YOLOv5
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
CN111488827A (en) Crowd counting method and system based on multi-scale feature information
CN113139489B (en) Crowd counting method and system based on background extraction and multi-scale fusion network
CN113888547A (en) Non-supervision domain self-adaptive remote sensing road semantic segmentation method based on GAN network
CN112818969A (en) Knowledge distillation-based face pose estimation method and system
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN116052212A (en) Semi-supervised cross-mode pedestrian re-recognition method based on dual self-supervised learning
CN115527105A (en) Underwater target detection method based on multi-scale feature learning
Kavitha et al. Convolutional Neural Networks Based Video Reconstruction and Computation in Digital Twins.
CN113989718A (en) Human body target detection method facing radar signal heat map
CN117292324A (en) Crowd density estimation method and system
CN115953736A (en) Crowd density estimation method based on video monitoring and deep neural network
CN113780305B (en) Significance target detection method based on interaction of two clues
CN116403152A (en) Crowd density estimation method based on spatial context learning network
CN115797684A (en) Infrared small target detection method and system based on context information
CN115439926A (en) Small sample abnormal behavior identification method based on key region and scene depth
CN112488122B (en) Panoramic image visual saliency prediction method based on convolutional neural network
CN114693951A (en) RGB-D significance target detection method based on global context information exploration
CN114565764A (en) Port panorama sensing system based on ship instance segmentation
CN112925932A (en) High-definition underwater laser image processing system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination