CN117649544A - Light-weight aquatic target detection method, device and medium

Light-weight aquatic target detection method, device and medium

Info

Publication number
CN117649544A
Authority
CN
China
Prior art keywords
target
module
target detection
yolov7
detection model
Prior art date
Legal status
Pending
Application number
CN202311392497.1A
Other languages
Chinese (zh)
Inventor
张卫东
冯威翔
贾泽华
薛珊
张云飞
张义博
张安民
郭东生
曹刚
任佳
邹勇华
张文波
Current Assignee
Hainan University
Original Assignee
Hainan University
Priority date: 2023-10-24
Filing date: 2023-10-24
Publication date: 2024-03-05
2023-10-24 Application filed by Hainan University
2023-10-24 Priority to CN202311392497.1A
2024-03-05 Publication of CN117649544A
Status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to a light-weight aquatic target detection method, device and medium, wherein the method comprises the following steps: acquiring image data of aquatic targets and constructing an aquatic target data set; constructing a target detection model based on improved YOLOv7, in which the feature extraction module is reconstructed with lightweight linear-bottleneck inverted residual modules, a coordinate attention mechanism replaces the SE module, and SPD combined with non-strided convolution layers replaces the downsampling modules of YOLOv7; clustering the aquatic target data set with a clustering algorithm, assigning the resulting prior boxes to detection heads of different scales, and training the target detection model; applying structural re-parameterization to the trained target detection model; and inputting acquired images into the re-parameterized target detection model to obtain target position and class confidence information. Compared with the prior art, the invention achieves rapid and accurate detection of aquatic targets under limited computing resources.

Description

Light-weight aquatic target detection method, device and medium
Technical Field
The invention relates to the technical field of computer vision, in particular to a lightweight aquatic target detection method, device and medium based on improved YOLOv7.
Background
The ocean covers most of the earth's surface and is rich in resources. Unmanned equipment such as unmanned surface vessels and unmanned aerial vehicles can adapt to complex environments, offers high autonomy and mobility, and can carry out dangerous tasks, reducing the reliance on human labor and the safety risks to personnel. Target detection gives such unmanned equipment strong perception capability: by running a target detection algorithm, it can detect aquatic targets such as ships, buoys and floating objects in real time, provide position and class confidence information to decision makers, and support tasks such as maritime search and rescue, resource exploration, environmental monitoring, autonomous navigation and obstacle avoidance.
Many deep-learning aquatic target detection algorithms exist today, and their parameter counts keep growing. While large-scale models perform well in terms of accuracy, they typically must be deployed in the cloud or in a dedicated data center and accessed over the internet or through an application programming interface. Network communication in the marine environment suffers from latency and reliability problems, so the target detection model needs to be deployed at the edge for on-device inference to achieve real-time detection. However, hardware computing power has not kept pace with the growing parameter counts of target detection models, and unmanned equipment such as unmanned surface vessels and unmanned aerial vehicles is generally limited by its hardware, so it cannot meet the deployment requirements of large-scale aquatic target detection models.
Disclosure of Invention
The invention aims to provide a lightweight aquatic target detection method, device and medium based on improved YOLOv7 that achieve efficient target detection under limited computing resources and can be readily deployed on unmanned systems to detect aquatic targets rapidly and accurately.
The aim of the invention can be achieved by the following technical scheme:
a lightweight water target detection method based on improved YOLOv7 comprises the following steps:
s1, obtaining image data of an aquatic target, and performing data enhancement operation on the image to construct an aquatic target data set;
s2, constructing a target detection model based on improved YOLOv7, adopting a light linear bottleneck inverse residual error module to reconstruct a feature extraction module, introducing a coordinate attention mechanism to replace an SE module, and simultaneously, using an SPD combined non-stride convolution layer mode to replace a downsampling module in YOLOv 7;
s3, clustering the water target data sets by using a clustering algorithm, distributing the clustered data sets to detection heads with different scales, and training a target detection model;
s4, carrying out structural re-parameterization on the trained target detection model;
s5, inputting the acquired image into a target detection model after structural re-parameterization to obtain the position and class confidence information of the target.
The step S1 specifically comprises: collecting videos of the water surface captured by imaging equipment, extracting image data at a set frame interval, annotating the image data, and applying data enhancement operations to the images to expand the aquatic target data set; the data set is divided into a training set, a test set and a validation set in a certain ratio, and the annotations are converted into YOLO format.
The data enhancement operations include random rotation, random scaling, random translation, random flipping and hue transformation.
In the step S2, the 13 lightweight linear-bottleneck inverted residual modules forming layers 2-14 of MobileNetV3-large are adopted as layers 2-14 of the YOLOv7 feature extraction module, wherein layers 2-7 adopt the ReLU6 activation function and layers 8-14 adopt the h-swish activation function; layers 2-5 are defined as stage 1, layers 6-8 as stage 2 and layers 9-14 as stage 3, and the three stages respectively extract features at three scales, downsampled 8, 16 and 32 times: F1 (80×80×40), F2 (40×40×80) and F3 (20×20×160), which are input into the feature fusion module after channel adjustment;
the SE modules of 5-7 layers and 12-14 layers of the feature extraction module are respectively replaced by a coordinate attention mechanism module;
and replacing downsampling in a layer 1 of a feature extraction module and a feature fusion module in the YOLOv7 model by combining the SPD with a non-stride convolution layer.
The coordinate attention module applies global average pooling to the C×H×W input feature along the width and height directions to obtain two feature layers of C×H×1 and C×1×W, transposes them to the same dimension and splices them into a feature layer of C×1×(W+H), performs two convolution operations, and finally obtains the C×H×W full-map coordinate attention after a Sigmoid activation function.
The step S3 includes the steps of:
s31, clustering an overwater target data set by adopting a K-Means clustering algorithm, setting the number K of clusters, using 1-IOU (grouping box) as a distance d between samples, updating a cluster center, wherein the grouping box is a marked boundary frame, the anchor is an priori frame, and distributing the K priori frames obtained by clustering to three different scale detection heads from large to small in average;
s32, unifying the sizes of the input images, loading the reconstructed network structure for model training, randomly selecting training images with preset numbers each time during training, and randomly splicing the training images together to be used as the input images for training.
The step S4 specifically comprises: structurally re-parameterizing all 3×3 convolution operations in the trained target detection model with their BN layers, and reconstructing the three branches (3×3 convolution, 1×1 convolution and BN layer) in the REP module of the detection head into a single 3×3 convolution.
A lightweight aquatic target detection device based on improved YOLOv7, comprising:
the data acquisition module is used for acquiring image data of the water target, carrying out data enhancement operation on the image and constructing a water target data set;
the target detection model construction module is used for constructing a target detection model based on improved YOLOv7, in which the feature extraction module is reconstructed with lightweight linear-bottleneck inverted residual modules, a coordinate attention mechanism replaces the SE module, and SPD combined with non-strided convolution layers replaces the downsampling modules of YOLOv7;
the target detection model training module is used for clustering the aquatic target data set with a clustering algorithm, assigning the resulting prior boxes to detection heads of different scales, and training the target detection model;
the structure re-parameterization module is used for carrying out structure re-parameterization on the trained target detection model;
the detection module is used for inputting the acquired image into a target detection model after the structure is re-parameterized to obtain the position and class confidence information of the target.
A lightweight aquatic target detection device based on improved YOLOv7 comprises a memory, a processor and a program stored in the memory, wherein the processor implements the above method when executing the program.
A storage medium having stored thereon a program which when executed performs a method as described above.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention reconstructs the YOLOv7 feature extraction network with a lightweight linear-bottleneck inverted residual structure and applies structural re-parameterization to the trained network, greatly reducing the parameter count and computation of the model and accelerating its forward inference.
(2) The invention obtains anchors through the K-Means clustering algorithm, making them better suited to the aquatic target data set and improving detection precision.
(3) The invention introduces a coordinate attention mechanism to obtain attention weights over the whole feature map, locating targets more accurately.
(4) The invention introduces SPD combined with non-strided convolution layers to replace strided-convolution and max-pooling downsampling, reducing the loss of fine-grained features and improving the detection precision of small and weak targets.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is an overall block diagram of an object detection model in one embodiment;
FIG. 3 is a schematic diagram of a coordinate attention mechanism architecture in one embodiment;
FIG. 4 is a schematic diagram of an SPD structure in one embodiment;
FIG. 5 is a schematic diagram of a method of structure re-parameterization in one embodiment.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples. The present embodiment is implemented on the premise of the technical scheme of the present invention, and a detailed implementation manner and a specific operation process are given, but the protection scope of the present invention is not limited to the following examples.
The embodiment provides a lightweight water target detection method based on improved YOLOv7, as shown in fig. 1, comprising the following steps:
s1, obtaining image data of an aquatic target, and performing data enhancement operation on the image to construct an aquatic target data set.
In a specific implementation, videos of the water surface captured as required by equipment such as unmanned aerial vehicles, unmanned surface vessels and waterborne surveillance cameras are collected; image frames are extracted at a set frame interval and annotated, and data enhancement operations such as random rotation, random scaling, random translation, random flipping and hue transformation are applied to expand the aquatic target data set. The data set is divided into a training set, a test set and a validation set in a certain ratio (e.g., 7:2:1), and the annotations are converted into YOLO format.
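As an illustration only, the following Python sketch shows the kind of frame-extraction and data-split pipeline described above; the file paths, the frame interval of 30 and the function names are assumptions, not values fixed by this embodiment.

```python
# Minimal sketch of the S1 pipeline: frame extraction at a fixed interval
# and a 7:2:1 train/test/validation split. Paths and interval are illustrative.
import random
import cv2

def extract_frames(video_path, out_dir, interval=30):
    """Save every `interval`-th frame of the video as a JPEG image."""
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()

def split_dataset(image_paths, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle and split image paths into train/test/validation subsets."""
    random.Random(seed).shuffle(image_paths)
    n = len(image_paths)
    n_train, n_test = int(ratios[0] * n), int(ratios[1] * n)
    return (image_paths[:n_train],
            image_paths[n_train:n_train + n_test],
            image_paths[n_train + n_test:])
```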
S2, constructing a target detection model based on improved YOLOv7, as shown in FIG. 2: the feature extraction module is reconstructed with lightweight linear-bottleneck inverted residual modules, a coordinate attention mechanism replaces the SE module, and SPD combined with non-strided convolution layers replaces the downsampling modules of YOLOv7.
In this embodiment, the specific process of S2 includes:
s21, reconstructing a feature extraction module in the YOLOv7 network structure by adopting a light linear bottleneck inverse residual error module. The linear bottleneck inverse residual structure is firstly increased by 1X 1 convolution, then the features are extracted by depth separable convolution, and finally the features are fused by 1X 1 convolution.
In deep learning models, convolution is a time- and energy-consuming operation; depthwise separable convolution applies a separate convolution kernel to each input channel, reducing the computation.
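For concreteness, a minimal PyTorch sketch of such a linear-bottleneck inverted residual block is given below; the channel widths are illustrative (MobileNetV3-large fixes them per layer), and the SE or coordinate-attention stage inside the block is omitted here and covered in S22.

```python
# Sketch of a linear-bottleneck inverted residual block: 1x1 expansion,
# depthwise convolution, 1x1 linear projection, with an identity shortcut
# when stride is 1 and the channel counts match.
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, c_in, c_out, c_mid, kernel=3, stride=1, act=nn.ReLU6):
        super().__init__()
        self.use_shortcut = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),             # 1x1 expand
            nn.BatchNorm2d(c_mid), act(inplace=True),
            nn.Conv2d(c_mid, c_mid, kernel, stride,
                      kernel // 2, groups=c_mid, bias=False),  # depthwise conv
            nn.BatchNorm2d(c_mid), act(inplace=True),
            nn.Conv2d(c_mid, c_out, 1, bias=False),            # 1x1 project
            nn.BatchNorm2d(c_out),                             # linear bottleneck: no activation
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_shortcut else y
```

A label such as Block_5×5_S=2_HS would correspond roughly to InvertedResidual(..., kernel=5, stride=2, act=nn.Hardswish) in this sketch.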
The 13 lightweight linear-bottleneck inverted residual modules forming layers 2-14 of MobileNetV3-large are taken as layers 2-14 of the YOLOv7 feature extraction module, as shown in the feature extraction module of FIG. 2. A label such as Block_5×5_S=2_RE_CA is read as follows: 5×5 denotes the size of the depthwise convolution kernel in the linear-bottleneck inverted residual structure, S=2 denotes a depthwise convolution stride of 2, RE denotes the ReLU6 activation function, HS denotes the h-swish activation function, and CA denotes that the layer contains a coordinate attention module.
Layers 2-7 use the ReLU6 activation function and layers 8-14 use the h-swish activation function. Layers 2-5 are defined as stage 1, layers 6-8 as stage 2 and layers 9-14 as stage 3; the three stages extract features at three scales, downsampled 8, 16 and 32 times: F1 (80×80×40), F2 (40×40×80) and F3 (20×20×160), which are input into the feature fusion module after channel adjustment.
S22, introducing a coordinate attention mechanism to replace the SE module in MobileNetV3-large.
Traditional attention mechanisms such as SE and CBAM mainly model channel or spatial dependencies within feature maps, but they cannot explicitly capture positional relationships between different locations in the feature map. The coordinate attention mechanism locates targets more accurately by computing an attention weight for each location.
In this embodiment, the SE modules in layers 5-7 and 12-14 of the feature extraction module are each replaced with a coordinate attention module. Coordinate attention is an attention mechanism that focuses on global positional information; its structure is shown in FIG. 3. For an input C×H×W feature map X, global average pooling is performed along the width and height directions to obtain two feature layers of C×H×1 and C×1×W, where C is the number of channels and H and W are the height and width of the input feature map.
The output of the c-th channel at height h, z_c^h(h), and the output of the c-th channel at width w, z_c^w(w), can be expressed as:

z_c^h(h) = (1/W) Σ_{0 ≤ i < W} x_c(h, i),  z_c^w(w) = (1/H) Σ_{0 ≤ j < H} x_c(j, w)
w, H is transposed to the same latitude and spliced to obtain a characteristic layer of Cx1x (W+H), the attention about the horizontal and vertical directions is obtained after convolution operation, and then the characteristic diagram f is sliced to obtain f h (C×H×1),f w (Cx1 xW), the specific derivation is as follows:
f h ,f w =slice(f)
concat is splicing along the width direction, conv is convolution operation, and slice is cutting along the width direction.
The sliced feature maps f^h and f^w are each convolved again and passed through a Sigmoid activation function to obtain the full-map coordinate attention weights g^h and g^w. The specific derivation is as follows:

g^h = σ(Conv_h(f^h)),  g^w = σ(Conv_w(f^w))

where σ is the Sigmoid activation function.
Finally, the globally weighted feature map is obtained; the output y_c of the c-th channel can be expressed as:

y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j)
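A minimal PyTorch sketch of the coordinate attention module derived above is given below; the reduction ratio r and the intermediate activation are assumptions following common implementations, not values fixed by this embodiment.

```python
# Sketch of coordinate attention: directional average pooling, a shared 1x1
# convolution on the spliced tensor, a split back into the two directions,
# and per-direction sigmoid gates multiplied onto the input.
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, r=32):
        super().__init__()
        mid = max(8, channels // r)  # assumed reduction ratio
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.Hardswish())
        self.conv_h = nn.Conv2d(mid, channels, 1)  # attention along height
        self.conv_w = nn.Conv2d(mid, channels, 1)  # attention along width

    def forward(self, x):
        n, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                      # (N, C, H, 1)
        z_w = x.mean(dim=2, keepdim=True).transpose(2, 3)      # (N, C, W, 1)
        f = self.conv1(torch.cat([z_h, z_w], dim=2))           # (N, mid, H+W, 1)
        f_h, f_w = torch.split(f, [h, w], dim=2)               # slice back
        g_h = torch.sigmoid(self.conv_h(f_h))                  # (N, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(f_w.transpose(2, 3)))  # (N, C, 1, W)
        return x * g_h * g_w                                   # y_c(i,j) = x_c(i,j) * g_h * g_w
```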
S23, replacing the downsampling modules in the model with SPD combined with non-strided convolution layers.
In today's deep learning models, downsampling is usually achieved with strided convolution or pooling layers, which can lose fine-grained information and thus harm the detection of small targets. Water areas are wide, so aquatic target detection frequently faces small and weak targets; combining SPD with a non-strided convolution layer preserves fine-grained features and can improve the detection precision of such targets.
The SPD structure is shown in FIG. 4. For an input feature map X of size S×S×C, SPD slices out the sub-feature maps

f(x, y) = X[x : S : scale, y : S : scale],  x, y ∈ {0, …, scale − 1}

where scale is the required downsampling factor, X is the input feature map, f(x, y) is a sub-feature map, C is the number of channels of the feature map, and S is its height and width.

When scale = 2, four sub-feature maps f(0,0), f(0,1), f(1,0), f(1,1) are obtained, each of size (S/2) × (S/2) × C, realizing 2× downsampling.

The four sub-feature maps f(0,0), f(0,1), f(1,0), f(1,1) are spliced along the channel dimension to obtain a feature map X′ of size (S/2) × (S/2) × 4C.

A non-strided convolution layer then fuses the feature information and reduces the dimensionality, yielding a feature map X″ of size (S/2) × (S/2) × C′, where C′ is the number of output channels.
in this embodiment, the non-stride convolutional layer employs a 1×1 convolution with a step size of 1. The replacement of downsampling is specifically in layer 1 of the feature extraction module, and the downsampling portion of the feature fusion module.
S3, clustering the aquatic target data set with a clustering algorithm, assigning the resulting prior boxes to detection heads of different scales, and training the target detection model.
Many aquatic targets are long and narrow, so the default anchors are unsuitable. The constructed aquatic target data set is therefore clustered to obtain better-fitting anchors.
In specific implementation, step S3 includes the following steps:
s31, clustering an on-water target data set by adopting a K-Means clustering algorithm to redesign the aspect ratio of an Anchor, setting the number K=9 of clusters, and updating a cluster center by using 1-IOU (bounding box) as a distance d between samples, wherein the bounding box is a marked boundary box, the Anchor is an priori box, and the 9 priori boxes obtained by clustering are distributed to detection heads with three different scales of 20 multiplied by 20, 40 multiplied by 40 and 80 multiplied by 80 from large to small.
Specifically, the clustering process is as follows (a Python sketch of this loop follows the list):
1) Randomly select 9 bounding boxes from all bounding boxes of the data set as initial cluster centers;
2) Calculate the distance d between each bounding box and each cluster center;
3) Assign each bounding box to the cluster whose center is nearest;
4) Recompute each cluster center from the bounding boxes assigned to it;
5) Repeat 3)-4) until the members of each cluster no longer change.
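A minimal NumPy sketch of this clustering loop is given below, under the assumption that boxes are compared as co-centered width-height pairs; the iteration cap and the use of the mean as the cluster center are implementation choices, not fixed by this embodiment.

```python
# Sketch of K-Means anchor clustering with d = 1 - IoU as the distance.
import numpy as np

def wh_iou(boxes, centers):
    """IoU between width-height pairs: boxes (N, 2) vs centers (k, 2)."""
    inter = (np.minimum(boxes[:, None, 0], centers[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centers[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=300, seed=0):
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]  # step 1
    for _ in range(iters):
        assign = np.argmin(1 - wh_iou(boxes, centers), axis=1)  # steps 2-3
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centers[i] for i in range(k)])     # step 4
        if np.allclose(new, centers):                           # step 5
            break
        centers = new
    return centers[np.argsort(centers.prod(axis=1))]  # sorted by area
```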
S32, resetting the anchors to the clustering result, unifying the input image size to 640×640, and loading the reconstructed network structure for model training. During training, Mosaic enhancement is adopted when reading the data set: 4 training images are randomly selected each time and randomly spliced together as one training input. The batch size is 16, the number of epochs is 300, and the initial learning rate is 0.01. The weights with the minimum loss are taken as the final training result.
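As an illustration, a minimal sketch of the Mosaic splicing step is given below; real Mosaic implementations also jitter the join point and remap the box labels, which is omitted here.

```python
# Sketch of the Mosaic step: four randomly chosen training images are
# resized and tiled into one 640x640 training input.
import random
import cv2
import numpy as np

def mosaic(images, size=640):
    """Tile four random images from `images` into one size x size mosaic."""
    half = size // 2
    tiles = [cv2.resize(img, (half, half)) for img in random.sample(images, 4)]
    top = np.hstack(tiles[:2])
    bottom = np.hstack(tiles[2:])
    return np.vstack([top, bottom])
```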
S4, carrying out structural re-parameterization on the trained target detection model.
Training often uses multi-branch structures, which increase the representational capacity of the model. Structural re-parameterization means using a multi-branch structure in the training phase and equivalently converting its parameters into another set of parameters for a single-branch structure in the inference phase. Without affecting the result, the model becomes more efficient and easier to deploy.
In a specific implementation, all 3×3 convolution operations in the trained target detection model are structurally re-parameterized with their BN layers, and the three branches (3×3 convolution, 1×1 convolution and BN layer) in the REP module of the detection head are reconstructed into a single 3×3 convolution, where BN denotes the Batch Normalization operation.
In this embodiment, the specific process of S4 includes:
s41, a convolution and BN fusion part shown in FIG. 5. For 1×1 convolution, 0 is complemented around the convolution kernel weight, turning into 3×3 convolution; for BN-only layers, a 3 x 3 convolution with 1 in the center and 0 around is constructed; the 3 x 3 convolution was fused with the BN layer.
Taking one convolution operation as an example, for the trained network the weights ω and bias b of every convolution kernel are known. For an input X1, the output is Y1 = ωX1 + b.

The output Y1 then undergoes the BN operation:

Y2 = γ · (Y1 − μ) / √(σ² + ε) + β

where γ and β are learned parameters, μ is the sample mean, σ² is the sample variance, and ε is a small constant (preventing a zero denominator); all of these are known after training.

Substituting and simplifying gives:

Y2 = (γω / √(σ² + ε)) · X1 + (γ(b − μ) / √(σ² + ε) + β)

Let ω′ = γω / √(σ² + ε) and b′ = γ(b − μ) / √(σ² + ε) + β, so that Y2 = ω′X1 + b′.

Replacing the convolution kernel weights ω with ω′ and the bias b with b′ yields the parameters after equivalent conversion.
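A minimal PyTorch sketch of this convolution-BN fusion is given below; it folds the trained BN statistics into a new kernel ω′ and bias b′ exactly as derived above.

```python
# Sketch of conv-BN fusion: fold BN statistics and affine parameters into
# a single convolution with kernel w' = gamma*w/sqrt(var+eps) and
# bias b' = gamma*(b-mu)/sqrt(var+eps) + beta.
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # gamma / sqrt(var+eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    b = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((b - bn.running_mean) * scale + bn.bias)
    return fused
```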
S42, the multi-branch fusion part shown in FIG. 5. After S41, the new output O is expressed as follows:

O = ω′1 ⊛ X + b′1 + ω′2 ⊛ X + b′2 + ω′3 ⊛ X + b′3

where ω′i and b′i (i = 1, 2, 3) are the new weights and biases of the different convolution kernels after fusing BN, and ⊛ denotes the convolution operation.

Simplifying gives:

O = (ω′1 + ω′2 + ω′3) ⊛ X + (b′1 + b′2 + b′3)

Let W = ω′1 + ω′2 + ω′3 and B = b′1 + b′2 + b′3; W and B are the weight and bias after structural re-parameterization.
S5, inputting the acquired image into a target detection model after structural re-parameterization to obtain the position and class confidence information of the target.
In one embodiment, when the target detection model is deployed in a real environment, the deep learning environment on the hardware is configured consistently with training; the target detection model structure obtained in step S2 and the re-parameterized weight file obtained in step S4 are loaded, the image data acquired by the waterborne unmanned equipment is uniformly scaled to 640×640 and input into the target detection model, and the position and class confidence information of targets is obtained, realizing aquatic target detection.
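A minimal deployment-side sketch consistent with this step is given below; the weight-file name and the loading mechanism are assumptions, and any method that restores the re-parameterized single-branch network is equivalent.

```python
# Sketch of S5: scale the captured frame to 640x640, normalize, and run the
# re-parameterized model. The model file "reparam_model.pt" is hypothetical.
import cv2
import numpy as np
import torch

model = torch.load("reparam_model.pt", map_location="cpu", weights_only=False)
model.eval()

frame = cv2.imread("frame.jpg")                  # image from the unmanned equipment's camera
img = cv2.resize(frame, (640, 640))[:, :, ::-1]  # scale to 640x640, BGR -> RGB
x = torch.from_numpy(np.ascontiguousarray(img)).permute(2, 0, 1).float() / 255.0
with torch.no_grad():
    pred = model(x.unsqueeze(0))                 # target positions and class confidences
```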
Following the above description of the method embodiment, the solution of the invention is further described below by means of a device embodiment.
The present embodiment provides a lightweight aquatic target detection device based on improved YOLOv7, including:
the data acquisition module is used for acquiring image data of the water target, carrying out data enhancement operation on the image and constructing a water target data set;
the target detection model construction module is used for constructing a target detection model based on improved YOLOv7, in which the feature extraction module is reconstructed with lightweight linear-bottleneck inverted residual modules, a coordinate attention mechanism replaces the SE module, and SPD combined with non-strided convolution layers replaces the downsampling modules of YOLOv7;
the target detection model training module is used for clustering the aquatic target data set with a clustering algorithm, assigning the resulting prior boxes to detection heads of different scales, and training the target detection model;
the structure re-parameterization module is used for carrying out structure re-parameterization on the trained target detection model;
the detection module is used for inputting the acquired image into a target detection model after the structure is re-parameterized to obtain the position and class confidence information of the target.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the described modules may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
The above functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
The foregoing describes in detail preferred embodiments of the present invention. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the invention by one of ordinary skill in the art without undue burden. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by a person skilled in the art according to the inventive concept shall be within the scope of protection defined by the claims.

Claims (10)

1. A lightweight aquatic target detection method based on improved YOLOv7, characterized by comprising the following steps:
s1, obtaining image data of an aquatic target, and performing data enhancement operation on the image to construct an aquatic target data set;
s2, constructing a target detection model based on improved YOLOv7, adopting a light linear bottleneck inverse residual error module to reconstruct a feature extraction module, introducing a coordinate attention mechanism to replace an SE module, and simultaneously, using an SPD combined non-stride convolution layer mode to replace a downsampling module in YOLOv 7;
s3, clustering the water target data sets by using a clustering algorithm, distributing the clustered data sets to detection heads with different scales, and training a target detection model;
s4, carrying out structural re-parameterization on the trained target detection model;
s5, inputting the acquired image into a target detection model after structural re-parameterization to obtain the position and class confidence information of the target.
2. The lightweight aquatic target detection method based on improved YOLOv7 of claim 1, wherein the step S1 specifically comprises: collecting videos of the water surface captured by imaging equipment, extracting image data at a set frame interval, annotating the image data, and applying data enhancement operations to the images to expand the aquatic target data set; dividing the data set into a training set, a test set and a validation set in a certain ratio, and converting the annotations into YOLO format.
3. The lightweight aquatic target detection method based on improved YOLOv7 of claim 2, wherein the data enhancement operations include random rotation, random scaling, random translation, random flipping and hue transformation.
4. The lightweight aquatic target detection method based on improved YOLOv7 of claim 1, wherein in the step S2, the 13 lightweight linear-bottleneck inverted residual modules forming layers 2-14 of MobileNetV3-large are adopted as layers 2-14 of the YOLOv7 feature extraction module, wherein layers 2-7 adopt the ReLU6 activation function and layers 8-14 adopt the h-swish activation function; layers 2-5 are defined as stage 1, layers 6-8 as stage 2 and layers 9-14 as stage 3, and the three stages respectively extract features at three scales, downsampled 8, 16 and 32 times: F1 (80×80×40), F2 (40×40×80) and F3 (20×20×160), which are input into the feature fusion module after channel adjustment;
the SE modules of 5-7 layers and 12-14 layers of the feature extraction module are respectively replaced by a coordinate attention mechanism module;
and replacing downsampling in a layer 1 of a feature extraction module and a feature fusion module in the YOLOv7 model by combining the SPD with a non-stride convolution layer.
5. The lightweight aquatic target detection method based on improved YOLOv7 of claim 4, wherein the coordinate attention module applies global average pooling to the C×H×W input feature along the width and height directions to obtain two feature layers of C×H×1 and C×1×W, transposes them to the same dimension and splices them into a feature layer of C×1×(W+H), performs two convolution operations, and finally obtains the C×H×W full-map coordinate attention after a Sigmoid activation function.
6. The method for detecting a lightweight aquatic target based on improved YOLOv7 of claim 1, wherein the step S3 comprises the steps of:
s31, clustering an overwater target data set by adopting a K-Means clustering algorithm, setting the number K of clusters, using 1-IOU (grouping box) as a distance d between samples, updating a cluster center, wherein the grouping box is a marked boundary frame, the anchor is an priori frame, and distributing the K priori frames obtained by clustering to three different scale detection heads from large to small in average;
s32, unifying the sizes of the input images, loading the reconstructed network structure for model training, randomly selecting training images with preset numbers each time during training, and randomly splicing the training images together to be used as the input images for training.
7. The lightweight aquatic target detection method based on improved YOLOv7 of claim 1, wherein the step S4 specifically comprises: structurally re-parameterizing all 3×3 convolution operations in the trained target detection model with their BN layers, and reconstructing the three branches (3×3 convolution, 1×1 convolution and BN layer) in the REP module of the detection head into a single 3×3 convolution.
8. A lightweight aquatic target detection device based on improved YOLOv7, characterized by comprising:
the data acquisition module is used for acquiring image data of the water target, carrying out data enhancement operation on the image and constructing a water target data set;
the target detection model construction module is used for constructing a target detection model based on improved YOLOv7, in which the feature extraction module is reconstructed with lightweight linear-bottleneck inverted residual modules, a coordinate attention mechanism replaces the SE module, and SPD combined with non-strided convolution layers replaces the downsampling modules of YOLOv7;
the target detection model training module is used for clustering the aquatic target data set with a clustering algorithm, assigning the resulting prior boxes to detection heads of different scales, and training the target detection model;
the structure re-parameterization module is used for carrying out structure re-parameterization on the trained target detection model;
the detection module is used for inputting the acquired image into a target detection model after the structure is re-parameterized to obtain the position and class confidence information of the target.
9. A lightweight aquatic target detection device based on improved YOLOv7, comprising a memory, a processor, and a program stored in the memory, wherein the processor implements the method of any one of claims 1-7 when executing the program.
10. A storage medium having a program stored thereon, wherein the program, when executed, implements the method of any of claims 1-7.

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination