CN117237986A - Fish target individual position detection method based on improved YOLOv7 model
- Publication number: CN117237986A
- Application number: CN202311164828.6A
- Authority: CN (China)
- Prior art keywords: module, model, YOLOv7, improved, image
- Legal status: Pending
Classifications
- Y02A40/81 - Aquaculture, e.g. of fish (under Y02A40/80, adaptation technologies in fisheries management; Y02A, technologies for adaptation to climate change)
Abstract
The invention provides a fish target individual position detection method based on an improved YOLOv7 model. The method comprises: collecting an image of an aquaculture water area as the image to be detected, and preprocessing and enhancing it to obtain a data-enhanced image dataset; adding a BiFomer attention mechanism module after the SPPCSPC module of the YOLOv7 model and adding an NWD loss function module to the loss function module of the YOLOv7 model, to obtain an improved YOLOv7 model; inputting the preprocessed data-enhanced image dataset into the improved YOLOv7 model and training it on this dataset to obtain a trained improved YOLOv7 model; and inputting a newly acquired aquaculture water area image into the trained improved YOLOv7 model to obtain the position information of fish target individuals. The invention improves the accuracy and recall of detecting the positions of individual fish targets in high-density fish shoals in turbid water.
Description
Technical Field
The invention belongs to the technical field of intelligent recognition, and particularly discloses a fish target individual position detection method based on an improved YOLOv7 model.
Background Art
Automatic fish shoal detection technology helps realize the intelligent production and scientific management of precision aquaculture, reducing cost and improving farming returns. In an industrial aquaculture environment, however, large amounts of residual bait and excreta make the water turbid, so the visual features of dense fish shoals are not obvious and accurate counting and production management are difficult. To address these problems, target detection of fish shoals in industrial aquaculture environments using computer vision has become an important technology in aquaculture: it detects fish shoal targets without contact and avoids the interference with fish shoal behavior caused by traditional underwater equipment. In computer vision tasks, deep learning networks extract information layer by layer from raw pixel-level data and abstract semantic concepts, which gives them a prominent advantage in extracting global features and context information from images. However, general models perform poorly in complex underwater environments. Affected by the absorption and scattering of light in water, images captured by underwater optical imaging systems suffer from increased noise, weak texture features, low contrast and color distortion, so the fish shoal images in the collected data are blurred. Because of water turbidity, large numbers of interfering objects such as air bubbles, suspended matter and aquatic weeds are present, and foreground and background are easily confused. For fish shoals farmed at high density in particular, frequent occlusion and a dynamic background increase the detection difficulty and place higher demands on the robustness of the algorithm. In summary, fish target detection faces many challenges, and how to detect fish shoal targets accurately, rapidly and stably in complex farming scenes with poor image visibility has become an urgent problem to be solved.
In target detection for complex underwater environments, Fan et al. increased feature utilization and detection accuracy by increasing model complexity, but this increased the computational load and reduced detection speed. Chen et al. improved the feature fusion network, increasing detection accuracy without reducing model speed, but did not solve the difficulty of extracting fish features in turbid water, so detection accuracy remained low. Hao et al. combined the YOLOv4 detection network with PANet, whose additional bottom-up paths enhance feature fusion and strengthen learning in complex underwater environments; however, because simple bidirectional fusion lacks smoothness, the connectivity among multi-scale features is poor and the accuracy gain of the combined network is limited. Fan et al. combined ASFF with an attention mechanism to control the contributions of multi-scale features and improve their association; the algorithm improves the smoothness of feature fusion, but its fusion paths are complex, increasing computational cost and storage burden. Zhu et al. used a Transformer encoder in YOLOv5 to improve feature-region correlation and the detection accuracy of small targets, but the approach relies heavily on feature extraction by a complex backbone network, increasing model and computational complexity and making real-time target detection difficult. Zhao et al. used a lightweight YOLOv4 as the backbone network, but the lightweight model lacks robust feature extraction and learning ability, is prone to blurring, and its detection of high-density blurred fish shoals in turbid water needs improvement. Li Haiqing et al. introduced an adaptive threshold module into YOLOv5 to relax fixed thresholds and avoid missed detections in high-density scenes, but a fixed threshold is still required to reduce false detections. Li Haiqing et al. later fused prior knowledge with an improved YOLOv5, using the prior knowledge to enhance the features of blurred images; however, the model's predictions then depend excessively on the quality and quantity of the prior knowledge, which may be inaccurate or incomplete and bias the prediction results, and if the prior knowledge contains erroneous information the model may overfit it, increasing errors. Chen X et al. introduced a Conv2Former module into YOLOv7 to improve the network's feature extraction for underwater blurred images, but detection accuracy for high-density and occluded targets still needs improvement. In view of these problems, this research designs a novel fish target individual position detection method based on an improved YOLOv7 model, which is necessary to overcome the shortcomings of existing fish target individual position detection methods.
Disclosure of Invention
The invention provides a fish target individual position detection method based on an improved YOLOv7 model, aiming to solve the following problems of existing fish target individual position detection methods: occlusion caused by high-density aggregation of fish shoals; low recall caused by fish bodies in captured images being too small, deformed or occluded; and low accuracy caused by the high noise and poor quality of images captured in real environments with turbid, low-visibility water.
The invention provides a fish target individual position detection method based on an improved YOLOv7 model, which comprises the following steps:
s1, collecting an image in a culture water area as an image to be detected, and preprocessing and enhancing the image to be detected to obtain a data enhanced image dataset;
s2, adding a BiFomer attention mechanism module after an SPPCSPC module of the YOLOv7 model, and adding an NWD loss function module in a loss function module of the YOLOv7 model to obtain an improved YOLOv7 model;
s3, inputting the data enhancement image dataset obtained after preprocessing in the step S1 into the improved YOLOv7 model obtained in the step S2, and training the improved YOLOv7 model through the data enhancement image dataset to obtain a trained improved YOLOv7 model;
S4, inputting the acquired aquaculture water area image to be detected into the trained improved YOLOv7 model obtained in the step S3, and obtaining the position information of the fish target individual.
According to some embodiments of the application, the method for detecting the fish target individual position based on the improved YOLOv7 model, in the step S1, the preprocessing includes adjusting the size of the image and normalizing the pixel values.
According to some embodiments of the application, the method for detecting the individual position of the fish target based on the improved YOLOv7 model, in the step S1, the enhancement operation includes horizontal mirror-image flipping, vertical flipping, and horizontal vertical flipping.
According to some embodiments of the present application, in the step S2, the BiFomer attention mechanism module adopts a BiFomer attention mechanism that filters out irrelevant key tensors and value tensors at the coarse region level and retains only a small set of routing regions; the basic building block of the BiFomer attention mechanism is a bi-level routing attention, and the BiFomer attention mechanism comprises:
constructing a region-level affinity graph: for a given input feature map $X \in \mathbb{R}^{H \times W \times C}$, where W denotes the width of the feature map, H its height and C its number of channels, the feature map is divided into S×S non-overlapping regions each containing $HW/S^2$ feature vectors, giving the reshaped feature map $X^r \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$; the feature map is linearly projected into a query Q, a key K and a value V, as shown in formulas (1)-(3):

$Q = X^r W^q$ (1)

$K = X^r W^k$ (2)

$V = X^r W^v$ (3)

where $W^q \in \mathbb{R}^{C \times C}$ is the projection weight of the query Q, $W^k \in \mathbb{R}^{C \times C}$ the projection weight of the key K, and $W^v \in \mathbb{R}^{C \times C}$ the projection weight of the value V;

constructing a region-to-region routing with a directed graph: the query Q and the key K are averaged within each region to obtain the region-level query $Q^r$ and key $K^r$, where $Q^r, K^r \in \mathbb{R}^{S^2 \times C}$; matrix multiplication of $Q^r$ with the transpose of $K^r$ yields the adjacency matrix $A^r$ representing the degree of correlation between regions, whose elements measure how similar two regions of the data-enhanced image are in their feature information, as shown in formula (4):

$A^r = Q^r (K^r)^T$ (4)

each node retains only its top k connections, pruning the adjacency matrix $A^r$ into the routing index matrix $I^r \in \mathbb{N}^{S^2 \times k}$, as shown in formula (5):

$I^r = \mathrm{topkIndex}(A^r)$ (5)

performing fine-grained token-to-token self-attention within each region via the routing index matrix $I^r$: the key tensor and value tensor are first gathered, as shown in formulas (6)-(7):

$K^g = \mathrm{gather}(K, I^r)$ (6)

$V^g = \mathrm{gather}(V, I^r)$ (7)

where $K^g \in \mathbb{R}^{S^2 \times \frac{kHW}{S^2} \times C}$ is the gathered key tensor and $V^g \in \mathbb{R}^{S^2 \times \frac{kHW}{S^2} \times C}$ the gathered value tensor;

applying attention to the gathered key tensor and value tensor yields the output O of the BiFomer attention mechanism, as shown in formula (8):

$O = \mathrm{Attention}(Q, K^g, V^g) + \mathrm{LCE}(V)$ (8)

where LCE(V) is an introduced local context enhancement term, and the function LCE(·) is parameterized with a depth-wise convolution of kernel size 5.
According to some embodiments of the application, in the step S2, the NWD loss function module uses an NWD loss function whose expression is shown in formula (9):

$L_{NWD} = 1 - NWD(\mathcal{N}_p, \mathcal{N}_g)$ (9)

where $\mathcal{N}_p$ is the Gaussian distribution model of the prediction box P = (cx_p, cy_p, w_p, h_p), with (cx_p, cy_p) the center coordinates of the prediction box P, w_p its width and h_p its height; $\mathcal{N}_g$ is the Gaussian distribution model of the real box G = (cx_g, cy_g, w_g, h_g), with (cx_g, cy_g) the center coordinates of the real box G, w_g its width and h_g its height; and NWD is the normalized Wasserstein distance, as shown in formula (10):

$NWD(\mathcal{N}_p, \mathcal{N}_g) = \exp\left(-\frac{\sqrt{W_2^2(\mathcal{N}_p, \mathcal{N}_g)}}{C}\right)$ (10)

where C is a constant closely related to the dataset and $W_2^2(\mathcal{N}_p, \mathcal{N}_g)$ is the second-order Wasserstein distance between the Gaussian distribution model $\mathcal{N}_p$ of the prediction box P and the Gaussian distribution model $\mathcal{N}_g$ of the real box G, as shown in formula (11):

$W_2^2(\mathcal{N}_p, \mathcal{N}_g) = \left\| \left[ cx_p,\ cy_p,\ \tfrac{w_p}{2},\ \tfrac{h_p}{2} \right]^T - \left[ cx_g,\ cy_g,\ \tfrac{w_g}{2},\ \tfrac{h_g}{2} \right]^T \right\|_2^2$ (11)

where T denotes the transpose.
According to some embodiments of the application, in the fish target individual position detection method based on an improved YOLOv7 model, the improved YOLOv7 model comprises an input end, a backbone network and a detection head. The input end is used for inputting the image to be detected. The backbone network comprises CBS modules, an SPPCSPC module, a BiFomer attention mechanism module, Concat modules and ELAN-W modules: the CBS module is used for feature extraction and comprises a convolution layer module for convolution operations, a batch normalization module for normalization operations, and a function module comprising a SiLU activation function module and an NWD loss function module; the SPPCSPC module is a spatial pyramid structure that obtains different receptive fields through max pooling and is used for detecting targets of different sizes in the image to be detected; the BiFomer attention mechanism module enables the improved YOLOv7 model to achieve content-aware sparsity in a query-adaptive manner; the Concat module performs concatenation operations; and the ELAN-W module, composed of several CBS modules, is used for feature extraction. The detection head comprises ELAN-W modules, Concat modules, MP-2 modules, REP modules and Head modules: the MP-2 module is used for feature fusion and dimension reduction; the REP module is used for feature extraction, feature smoothing and feature transmission; and the Head module predicts bounding boxes from the features and outputs the prediction box with the highest confidence.
According to some embodiments of the application, in the step S3, the improved YOLOv7 model is trained based on the data-enhanced image dataset until training is completed when a set learning round is reached, and a trained improved YOLOv7 model is obtained.
In the fish target individual position detection method based on an improved YOLOv7 model according to the invention, the BiFomer attention mechanism module and the NWD loss function module are integrated into the YOLOv7 model. The BiFomer attention mechanism is an attention mechanism with dynamic, query-aware sparsity: while reducing computation it preserves fine-grained detail, efficiently locates valuable fish shoal information, adapts better to the scale and shape changes of different targets, and better separates targets from the background in dense scenes, improving the accuracy and recall of detecting individual fish target positions in high-density fish shoals. The NWD loss function module uses the Wasserstein distance for small-target detection to reduce the sensitivity of target detection to the position deviation of small target objects; on the basis of keeping the model lightweight, it strengthens the recognition of high-density small-target fish shoals, reduces sensitivity to input changes and improves the smoothness of position deviation, thereby improving the accuracy and recall of the detection method for individual fish target positions in high-density fish shoals in turbid water.
Drawings
FIG. 1 is a schematic flow chart of a fish target individual position detection method based on an improved YOLOv7 model;
FIG. 2 is a schematic diagram of the structure of the improved YOLOv7 model of the present invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are illustrative of the invention but are not intended to limit the scope of the invention.
Example 1
A fish target individual position detection method based on an improved YOLOv7 model is shown in fig. 1, and comprises the following steps:
s1, collecting an image in a culture water area as an image to be detected, and preprocessing and enhancing the image to be detected to obtain a data enhanced image dataset;
preprocessing comprises adjusting the size of the image and normalizing pixel values; enhancement operations comprise horizontal mirror flipping, vertical flipping, and combined horizontal-vertical flipping, as sketched below;
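As a concrete illustration, the following is a minimal Python sketch of the preprocessing and flip-based enhancement operations just described. The 640×640 target size is an assumption for illustration, since the embodiment does not specify the resized dimensions:

```python
import torchvision.transforms.functional as TF
from PIL import Image

def preprocess(img: Image.Image, size=(640, 640)):
    """Adjust the image size and normalize pixel values to [0, 1].
    The 640x640 size is assumed; the patent does not state it."""
    img = img.resize(size)
    return TF.to_tensor(img)  # HWC uint8 -> CHW float32 in [0, 1]

def enhance(img: Image.Image):
    """Produce the three flipped variants used for data enhancement."""
    h = TF.hflip(img)              # horizontal mirror flip
    v = TF.vflip(img)              # vertical flip
    hv = TF.vflip(TF.hflip(img))   # combined horizontal-vertical flip
    return [h, v, hv]
```

Bounding-box annotations must of course be flipped with the same transforms; that bookkeeping is omitted from the sketch.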
s2, adding a BiFomer attention mechanism module after the SPPCSPC module of the YOLOv7 model, enhancing the model's fine-grained learning capacity and improving its computational efficiency in dense detection tasks; and adding an NWD loss function module to the loss function module of the YOLOv7 model, improving the smoothness of the model with respect to position deviation and making it better suited to detecting similar, blurred small-target fish shoals, to obtain an improved YOLOv7 model;
The BiFomer attention mechanism module adopts a BiFomer attention mechanism that uses bi-level routing attention (Bi-Level Routing Attention) as its basic building block. The key idea of the BiFomer attention mechanism is to filter out most irrelevant key tensors and value tensors at the coarse region level and retain only a small set of routing regions, then apply fine-grained token-to-token attention within the union of the routed regions. This efficiently locates valuable key tensors and value tensors, strengthens the YOLOv7 model's ability to learn fish features while reducing computation, and gives the BiFomer attention mechanism dynamic, query-aware sparsity. The BiFomer attention mechanism comprises:
constructing a region-level affinity graph: for a given input feature map $X \in \mathbb{R}^{H \times W \times C}$, where W denotes the width of the feature map, H its height and C its number of channels, the feature map is divided into S×S non-overlapping regions each containing $HW/S^2$ feature vectors, giving the reshaped feature map $X^r \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$; the feature map is linearly projected into a query Q, a key K and a value V, as shown in formulas (1)-(3):

$Q = X^r W^q$ (1)

$K = X^r W^k$ (2)

$V = X^r W^v$ (3)

where $W^q \in \mathbb{R}^{C \times C}$ is the projection weight of the query Q, $W^k \in \mathbb{R}^{C \times C}$ the projection weight of the key K, and $W^v \in \mathbb{R}^{C \times C}$ the projection weight of the value V;

constructing a region-to-region routing with a directed graph: the query Q and the key K are averaged within each region to obtain the region-level query $Q^r$ and key $K^r$, where $Q^r, K^r \in \mathbb{R}^{S^2 \times C}$; matrix multiplication of $Q^r$ with the transpose of $K^r$ yields the adjacency matrix $A^r$ representing the degree of correlation between regions, whose elements measure how similar two regions of the data-enhanced image are in their feature information, as shown in formula (4):

$A^r = Q^r (K^r)^T$ (4)

each node retains only its top k connections, pruning the adjacency matrix $A^r$ into the routing index matrix $I^r \in \mathbb{N}^{S^2 \times k}$, as shown in formula (5):

$I^r = \mathrm{topkIndex}(A^r)$ (5)

performing fine-grained token-to-token self-attention within each region via the routing index matrix $I^r$: since the routed regions are expected to be scattered over the whole feature map while GPUs rely on coalesced access to contiguous blocks of memory, the key tensor and value tensor are first gathered, as shown in formulas (6)-(7):

$K^g = \mathrm{gather}(K, I^r)$ (6)

$V^g = \mathrm{gather}(V, I^r)$ (7)

where $K^g \in \mathbb{R}^{S^2 \times \frac{kHW}{S^2} \times C}$ is the gathered key tensor and $V^g \in \mathbb{R}^{S^2 \times \frac{kHW}{S^2} \times C}$ the gathered value tensor;

applying attention to the gathered key tensor and value tensor yields the output O of the BiFomer attention mechanism, as shown in formula (8):

$O = \mathrm{Attention}(Q, K^g, V^g) + \mathrm{LCE}(V)$ (8)

where LCE(V) is an introduced local context enhancement term, and the function LCE(·) is parameterized with a depth-wise convolution of kernel size 5;
Under a proper region partition factor S, the complexity of the BiFomer attention mechanism is $O((HW)^{4/3})$, lower than the $O((HW)^2)$ complexity of vanilla attention and the $O((HW)^{3/2})$ complexity of quasi-global axial attention, which effectively reduces the computation of the attention mechanism;
in each BiFomer attention mechanism module, relative position information is first implicitly encoded with a 3×3 depth-wise convolution, after which a Bi-Level Routing Attention module and a 2-layer MLP module with expansion ratio e are applied in turn for cross-position relation modeling and per-position embedding;
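To make the routing computation concrete, here is a compact PyTorch sketch of the bi-level routing attention of formulas (1)-(8). It is a simplified single-head reading under stated assumptions (square feature map with H and W divisible by S, fused W^q/W^k/W^v projection, LCE as a depth-wise 5×5 convolution) and is not the patent's reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLevelRoutingAttention(nn.Module):
    """Sketch of Bi-Level Routing Attention, formulas (1)-(8).
    Assumes H and W divisible by the region factor S; single head."""
    def __init__(self, dim, s=7, topk=4):
        super().__init__()
        self.s, self.topk, self.scale = s, topk, dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)                        # W_q, W_k, W_v fused
        self.lce = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)  # LCE: depth-wise 5x5 conv

    def forward(self, x):                                # x: (B, H, W, C)
        B, H, W, C = x.shape
        S = self.s
        # (1)-(3): partition into S*S regions of HW/S^2 tokens, project to Q, K, V
        xr = x.view(B, S, H // S, S, W // S, C).permute(0, 1, 3, 2, 4, 5)
        xr = xr.reshape(B, S * S, -1, C)                 # (B, S^2, HW/S^2, C)
        q, k, v = self.qkv(xr).chunk(3, dim=-1)
        # (4): region-level queries/keys by per-region mean, then affinity A^r
        qr, kr = q.mean(dim=2), k.mean(dim=2)            # (B, S^2, C)
        ar = qr @ kr.transpose(-2, -1)                   # (B, S^2, S^2)
        # (5): keep the top-k most related regions per region
        idx = ar.topk(self.topk, dim=-1).indices         # (B, S^2, k)
        # (6)-(7): gather the key/value tensors of the routed regions
        def gather(t):
            i = idx.view(B, S * S, self.topk, 1, 1).expand(-1, -1, -1, t.shape[2], C)
            return t.unsqueeze(1).expand(-1, S * S, -1, -1, -1).gather(2, i).flatten(2, 3)
        kg, vg = gather(k), gather(v)                    # (B, S^2, k*HW/S^2, C)
        # (8): fine-grained token-to-token attention within the routed regions
        attn = F.softmax(q @ kg.transpose(-2, -1) * self.scale, dim=-1)
        o = attn @ vg                                    # (B, S^2, HW/S^2, C)
        o = o.view(B, S, S, H // S, W // S, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        # local context enhancement LCE(V) on the unrouted value tensor
        vmap = v.reshape(B, S, S, H // S, W // S, C).permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return o + self.lce(vmap).permute(0, 2, 3, 1)
```

The explicit gather step trades memory for coalesced GPU access, matching the motivation given for formulas (6)-(7); published BiFormer implementations optimize this step further, while this version favors readability.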
suspended particles in turbid water degrade water quality; the particles scatter and absorb light, reducing the transparency of visible light and affecting image quality and target visibility. Considering that individual fish shoal targets in dense scenes in current aquaculture waters are small and often appear blurred or indistinct in turbid water, their details are hard to extract accurately. This easily causes: a deviation of a few pixels in the prediction box P resulting in no overlap between the prediction box P and the real box G; or an individual small target fish being mispredicted such that the intersection of the prediction box P with the real box G degenerates into the prediction bounding box or the real bounding box alone. Although the CIoU loss function of the YOLOv7 model can handle both cases, it is based on IoU and therefore sensitive to the position deviation of blurred small targets. The NWD loss function detects small targets using the Wasserstein distance, is insensitive to the scale of blurred targets, and is better suited to measuring similarity between blurred small targets. Therefore, the loss function module of the YOLOv7 model is improved by adding the NWD loss function module, strengthening the model's ability to learn the features of blurred small target objects;
The NWD loss function module uses an NWD loss function whose expression is shown in formula (9):

$L_{NWD} = 1 - NWD(\mathcal{N}_p, \mathcal{N}_g)$ (9)

where $\mathcal{N}_p$ is the Gaussian distribution model of the prediction box P = (cx_p, cy_p, w_p, h_p), with (cx_p, cy_p) the center coordinates of the prediction box P, w_p its width and h_p its height; $\mathcal{N}_g$ is the Gaussian distribution model of the real box G = (cx_g, cy_g, w_g, h_g), with (cx_g, cy_g) the center coordinates of the real box G, w_g its width and h_g its height; and NWD is the normalized Wasserstein distance, as shown in formula (10):

$NWD(\mathcal{N}_p, \mathcal{N}_g) = \exp\left(-\frac{\sqrt{W_2^2(\mathcal{N}_p, \mathcal{N}_g)}}{C}\right)$ (10)

where C is a constant closely related to the dataset and $W_2^2(\mathcal{N}_p, \mathcal{N}_g)$ is the second-order Wasserstein distance between the Gaussian distribution model $\mathcal{N}_p$ of the prediction box P and the Gaussian distribution model $\mathcal{N}_g$ of the real box G, as shown in formula (11):

$W_2^2(\mathcal{N}_p, \mathcal{N}_g) = \left\| \left[ cx_p,\ cy_p,\ \tfrac{w_p}{2},\ \tfrac{h_p}{2} \right]^T - \left[ cx_g,\ cy_g,\ \tfrac{w_g}{2},\ \tfrac{h_g}{2} \right]^T \right\|_2^2$ (11)

where T denotes the transpose;
taking the prediction box P as an example, the Gaussian distribution model $\mathcal{N}_p$ of the prediction box P is built as follows:

the weight of the center pixel of the prediction box P is highest, and the importance of pixels decreases from the center to the boundary. For the prediction box P = (cx_p, cy_p, w_p, h_p), where (cx_p, cy_p) are the center coordinates, w_p the width and h_p the height of the prediction box P, the standard equation of its inscribed ellipse is shown in formula (12):

$\frac{(x - \mu_x)^2}{\sigma_x^2} + \frac{(y - \mu_y)^2}{\sigma_y^2} = 1$, with $\mu_x = cx_p$, $\mu_y = cy_p$, $\sigma_x = \tfrac{w_p}{2}$, $\sigma_y = \tfrac{h_p}{2}$ (12)

where $(\mu_x, \mu_y)$ are the center coordinates of the ellipse, $\sigma_x$ is the semi-axis length along the x-axis, i.e. the distance the ellipse extends in the positive x direction, and $\sigma_y$ is the semi-axis length along the y-axis, i.e. the distance the ellipse extends in the positive y direction;

the probability density function of the two-dimensional Gaussian distribution is shown in formula (13):

$f(\mathbf{x} \mid \mu, \Sigma) = \frac{\exp\left(-\frac{1}{2}(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)\right)}{2\pi \lvert \Sigma \rvert^{1/2}}$ (13)

where $\mathbf{x} = (x, y)^T$, μ denotes the mean vector and Σ the covariance matrix of the Gaussian distribution. When $(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) = 1$, the ellipse of formula (12) is a density contour of the two-dimensional Gaussian distribution. Therefore, the prediction box P = (cx_p, cy_p, w_p, h_p) can be modeled as a two-dimensional Gaussian distribution $\mathcal{N}_p(\mu_p, \Sigma_p)$, where $\mu_p$ and $\Sigma_p$ are as shown in formulas (14)-(15):

$\mu_p = \begin{bmatrix} cx_p \\ cy_p \end{bmatrix}$ (14)

$\Sigma_p = \begin{bmatrix} \frac{w_p^2}{4} & 0 \\ 0 & \frac{h_p^2}{4} \end{bmatrix}$ (15)
in blurred small-target detection, the NWD loss function offers scale invariance, smoothness with respect to position deviation, and the ability to measure similarity between non-overlapping or mutually containing prediction boxes;
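For illustration, the following is a small PyTorch sketch of the NWD loss of formulas (9)-(15), using the Gaussian box model derived above, under which the second-order Wasserstein distance between two axis-aligned Gaussians reduces to a plain L2 distance. The value of the constant C is dataset dependent, so the default below is a placeholder:

```python
import torch

def nwd_loss(pred, target, c: float = 12.8):
    """Sketch of the NWD loss, formulas (9)-(15), for boxes given as
    (cx, cy, w, h) tensors of shape (N, 4). The constant c is dataset
    dependent; 12.8 is an arbitrary placeholder, not the patent's value."""
    # Model each box as a 2-D Gaussian with mu = (cx, cy) and
    # Sigma = diag(w^2/4, h^2/4); then W_2^2 (11) is the squared L2
    # distance between the vectors [cx, cy, w/2, h/2].
    p = torch.stack([pred[:, 0], pred[:, 1], pred[:, 2] / 2, pred[:, 3] / 2], dim=1)
    g = torch.stack([target[:, 0], target[:, 1], target[:, 2] / 2, target[:, 3] / 2], dim=1)
    w2_sq = ((p - g) ** 2).sum(dim=1)            # W_2^2, formula (11)
    nwd = torch.exp(-torch.sqrt(w2_sq) / c)      # formula (10)
    return (1.0 - nwd).mean()                    # formula (9), batch-averaged
```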
the improved YOLOv7 model is shown in fig. 2 and comprises an input end, a backbone network and a detection head. The input end is used for inputting the image to be detected. The backbone network comprises CBS modules, an SPPCSPC module, a BiFomer attention mechanism module, Concat modules and ELAN-W modules: the CBS module is used for feature extraction and comprises a convolution layer module for convolution operations, a batch normalization module for normalization operations, and a function module comprising a SiLU activation function module and an NWD loss function module; the SPPCSPC module is a spatial pyramid structure that obtains different receptive fields through max pooling and is used for detecting targets of different sizes in the image to be detected; the BiFomer attention mechanism module enables the improved YOLOv7 model to achieve content-aware sparsity in a query-adaptive manner; the Concat module performs concatenation operations; and the ELAN-W module, composed of several CBS modules, is used for feature extraction. The detection head comprises ELAN-W modules, Concat modules, MP-2 modules, REP modules and Head modules: the MP-2 module is used for feature fusion and dimension reduction; the REP module is used for feature extraction, feature smoothing and feature transmission; and the Head module predicts bounding boxes from the features and outputs the prediction box with the highest confidence;
More specifically, the ELAN-W module includes two branches, the first branch performs channel conversion through 1×1 convolution, the second branch performs channel conversion through 1×1 convolution first, then extracts features through four 3×3 convolution modules, and finally the output of the second branch and the output of the first branch are stacked through the Concat module; the MP-2 module comprises two branches, wherein the first branch consists of a MaxPool module and a convolution of 1×1, the second branch consists of a convolution of 1×1 and a convolution of 3×3, and finally, the output of the first branch and the output of the second branch are stacked through a Concat module; the REP module comprises a train module and a depth module, the train module is divided into three branches of feature extraction, smooth feature and feature transmission, the outputs of the three branches of feature extraction, smooth feature and feature transmission are stacked through the Concat module, the depth module comprises a convolution of 3 multiplied by 3, the gait is 1, and the depth module is converted by the training module in a re-parameterization mode;
s3, inputting the data enhancement image data set obtained after preprocessing in the step S1 into the improved YOLOv7 model obtained in the step S2, and training the improved YOLOv7 model through the data enhancement image data set to obtain a trained improved YOLOv7 model;
Training the improved YOLOv7 model based on the data-enhanced image data set until training is completed when the set learning round is reached, and obtaining a trained improved YOLOv7 model;
s4, inputting the acquired aquaculture water area image to be detected into the trained improved YOLOv7 model obtained in the step S3 to obtain the position information of the fish target individual, wherein the specific method comprises the following steps:
s401, inputting the acquired image into the trained improved YOLOv7 model obtained in the step S3;
s402, performing feature extraction on the input image through several CBS modules, each using its convolution layer module, batch normalization module and activation function module, compressing the width and height of the image while expanding its number of channels, to obtain feature maps of three sizes: (80, 80, 512), (40, 40, 1024) and (20, 20, 1024);
s403, performing feature extraction on the (20, 20, 1024) feature map with the SPPCSPC module, expanding the receptive field to obtain a small-size intermediate feature map;
s404, inputting the small-size intermediate feature map obtained in step S403 into the BiFomer attention mechanism module to strengthen the network's feature learning, then upsampling, stacking the resulting feature map with the (40, 40, 1024) feature map, and extracting features through an ELAN-W module to obtain a medium-size intermediate feature map;
s405, upsampling the medium-size intermediate feature map obtained in step S404, stacking the resulting feature map with the (80, 80, 512) feature map, and extracting features through an ELAN-W module to obtain a large-size feature map;
s406, downsampling the large-size feature map obtained in step S405 through an MP-2 module, stacking the resulting feature map with the medium-size intermediate feature map obtained in step S404, and extracting features through an ELAN-W module to obtain a medium-size feature map;
s407, downsampling the medium-size feature map obtained in step S406 through an MP-2 module, stacking the resulting feature map with the small-size intermediate feature map from step S403, and extracting features through an ELAN-W module to obtain a small-size feature map;
s408, passing the small-size, medium-size and large-size feature maps through three REP modules and three Head modules respectively for feature extraction, feature smoothing and feature transmission, outputting the prediction box with the highest confidence for each, to obtain the final prediction result and the position information of the fish target individuals.
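As a usage illustration, a hedged sketch of running the trained model on one frame follows; the file names, the 0.5 confidence threshold and the assumed (x1, y1, x2, y2, confidence, class) output layout are all hypothetical, and `preprocess` is the helper sketched earlier:

```python
import torch
from PIL import Image

# Hypothetical inference sketch: the checkpoint name, output layout and
# threshold below are illustrative assumptions, not the patent's code.
model = torch.load("bnyolov7_trained.pt", map_location="cuda")
model.eval()

with torch.no_grad():
    img = preprocess(Image.open("tank_frame.jpg")).unsqueeze(0).cuda()  # (1, 3, 640, 640)
    pred = model(img)[0]              # assumed (num_pred, 6): x1, y1, x2, y2, conf, cls
    keep = pred[:, 4] > 0.5           # keep high-confidence detections
    fish_positions = pred[keep, :4]   # individual fish target positions
```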
Example 2
The images of the aquaculture water area are image data of Takifugu rubripes, a commercial fish farmed at high density and of high research value. The Takifugu rubripes image data were collected from a Takifugu rubripes farming workshop and include images extracted from videos. The culture pool of the farming workshop is equipped with a 2-megapixel above-water camera and a 2-megapixel underwater camera for collecting video data: the above-water camera is mounted 1.5 m above the water surface to measure the fish distribution accurately, and the underwater camera is placed on the inner wall of the culture pool to provide another shooting angle for identifying fish features. The above-water camera captures high-density occlusion video data, and the underwater camera captures blurred occlusion video data in turbid water. Frames were extracted from the captured videos every 500 milliseconds, producing 1000 above-water images and 600 underwater images at 1920×1080 resolution, stored as JPG files.
Data enhancement is an important machine learning method that generates more training data from existing training samples, so that the augmented training data approximates the real distribution as closely as possible. After data enhancement, the above-water dataset contains 3000 images and the underwater dataset 1800 images; the two were combined into a data-enhanced image dataset of 4800 images, split into training, validation and test sets at a ratio of 6:2:2 with a 1:1 ratio of above-water to underwater images in each set, giving 2880 training images, 960 validation images and 960 test images.
In industrial aquaculture environments, problems such as high-density occlusion and water turbidity degrade the detection of dense fish shoals, so a target detection model for complex underwater environments must offer strong robustness, multi-scale feature fusion capability, and the capacity for large data volumes and high-complexity computation. Current mainstream deep-learning target detection models divide into two-stage and single-stage detectors. Faster RCNN is a typical two-stage model: it first generates candidate regions and then performs classification and regression on them, which improves detection but increases processing time correspondingly, making it unsuitable for detecting fast-swimming fish shoals. Single-stage algorithms such as YOLOv5 and YOLOv7 locate targets directly with end-to-end inference, suit the frequent detections required by swimming fish, improve system efficiency and reduce later maintenance costs. At the same model scale, YOLOv7 is more accurate than YOLOv5 and its speed (FPS) is about 120% of YOLOv5's. YOLOv7 introduces several architectural changes: it extends the efficient layer aggregation network with expand, shuffle and merge-cardinality operations, continuously strengthening the network's learning ability without destroying the original gradient path and guiding different groups of computational blocks to learn more diverse features; it adjusts specific attributes of the model with compound model scaling to generate models of different sizes for different inference-speed requirements; and it uses planned re-parameterized convolution together with a coarse-to-fine label assignment, in which hierarchical labels generated from the lead head's predictions assist the learning of the auxiliary head. Therefore, this embodiment adopts YOLOv7 as the base model for detecting the positions of individual Takifugu rubripes targets in the fish pond; among the YOLOv7 variants, YOLOv7x is selected as the base model for its higher accuracy, to meet the requirement of high detection accuracy for individual fish target positions in high-density fish shoals in turbid water.
After adding the BiFomer attention mechanism module after the SPPCSPC module of the YOLOv7 model and adding the NWD loss function module to the loss function module of the YOLOv7 model, the improved YOLOv7 model is obtained. Experimental hardware and software environment: Windows 10 operating system; Intel(R) Core(TM) i7-12700 CPU at 2.1 GHz; 64 GB RAM; GeForce RTX 3070 Ti; PyCharm with CUDA 11.1; machine learning framework PyTorch 1.9.1; Python 3.8 runtime. The improved YOLOv7 model was trained for 200 epochs with a batch size of 8 and an initial learning rate of 0.01. Training the improved YOLOv7 model on the data-enhanced image dataset yields the trained improved YOLOv7 model.
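A generic training-loop sketch using the hyperparameters reported above (200 epochs, batch size 8, initial learning rate 0.01); the optimizer choice, momentum value, dataset and loss objects are placeholders rather than the patent's actual code:

```python
import torch

# Placeholder objects: model, train_dataset and compute_loss stand in for
# the improved YOLOv7 model, the data-enhanced dataset and its loss.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937)  # momentum assumed
loader = torch.utils.data.DataLoader(train_dataset, batch_size=8, shuffle=True)

model.train()
for epoch in range(200):  # the set learning rounds
    for imgs, targets in loader:
        loss = compute_loss(model(imgs.cuda()), targets)  # CIoU + NWD terms (placeholder)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```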
In Takifugu rubripes detection, labeling a prediction box as a true or false target yields four possible outcomes: true positive TP, false positive FP, true negative TN and false negative FN. Model performance is evaluated with the precision P and recall R, calculated as shown in formulas (16) and (17):

$P = \frac{TP}{TP + FP}$ (16)

$R = \frac{TP}{TP + FN}$ (17)

where TP is the number of correctly identified fish targets, FP the number of incorrectly identified fish targets, and FN the number of undetected fish targets.
Precision P and recall R trade off against each other: if precision stays high while recall increases, the model performs better; a worse model may sacrifice a large amount of precision in exchange for higher recall. To combine the two metrics, the average precision AP is introduced to measure detection accuracy, as shown in formula (18):

$AP = \int_0^1 P(R)\, dR$ (18)
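A short Python sketch of formulas (16)-(18); since AP integrates precision over recall, the integral is approximated from a swept precision-recall curve (the sampling of that curve over confidence thresholds is an assumption of this sketch):

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Precision P (16) and recall R (17) from detection counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(precisions, recalls):
    """AP (18): integrate P over R in [0, 1] by the trapezoidal rule,
    given P/R pairs swept over confidence thresholds."""
    order = np.argsort(recalls)
    p = np.asarray(precisions)[order]
    r = np.asarray(recalls)[order]
    return float(np.trapz(p, r))
```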
Three groups of experiments were designed for this embodiment. Experiment 1: ablation experiments on the above-water dataset, the underwater dataset and the complete dataset, verifying the effectiveness of the BiFomer attention mechanism and the NWD loss function proposed by the detection method of this embodiment. Experiment 2: comparison experiments on the complete dataset against current underwater target detection models, verifying the effectiveness of the detection method of this embodiment. Experiment 3: experiments on different datasets, verifying the robustness and generalization ability of the detection method of this embodiment.
Experiment 1:
By adding the BiFomer attention mechanism module and the NWD loss function module to the YOLOv7 model separately and together, three variants were constructed: YOLOv7-BiFomer with the BiFomer attention mechanism module, YOLOv7-NWD with the NWD loss function module, and BNYOLOv7 with both modules. Ablation experiments were run on the above-water dataset, the underwater dataset and the data-enhanced image dataset, and the results were compared with those of the original YOLOv7 to verify the effectiveness of the proposed modules.
The results of the ablation experiments on the above-water dataset are shown in Table 1, on the underwater dataset in Table 2, and on the data-enhanced image dataset in Table 3.
Table 1 results of ablation experiments based on the above-water dataset
Table 2 results of ablation experiments based on underwater datasets
Table 3 ablation experiment results based on data enhanced image dataset
Tables 1-3 show that both the proposed BiFomer attention mechanism module and the NWD loss function module improve detection accuracy and recall, and each offers clear advantages for detection performance. Compared with the NWD loss function module, the BiFomer attention mechanism module brings a larger accuracy gain on the turbid, high-density above-water dataset. This is because BiFomer adopts Bi-Level Routing Attention (BRA) as its core building block, comprising a region routing step and a token-level attention step; by applying fine-grained token-to-token attention within the union of routed regions, it efficiently locates valuable key tensors and value tensors, letting the model selectively focus on the regions where targets are likely to appear and attend to occluded fish bodies, which improves the model's detection accuracy for individual fish target positions in high-density, occluded fish shoals. The improvement from the NWD loss function module is more evident for fish shoal detection in turbid, high-density water, with a particularly large effect on recall: Takifugu rubripes in turbid, high-density water show extremely limited appearance information, making discriminative feature learning harder, and the NWD loss function performs detection with the Wasserstein distance, which is scale invariant, better measures the similarity between blurred and small targets, and relieves the sensitivity of IoU to the position deviation of blurred objects, thereby reducing missed detections. In the ablation experiment on the data-enhanced image dataset, the precision P of BNYOLOv7 reaches 98.05%, 2.46% higher than that of YOLOv7; the recall R reaches 97.69%, 3.73% higher than that of YOLOv7; and the average precision AP reaches 99.1%, 1.62% higher than that of YOLOv7. This verifies the soundness of fusing the two modules and the effectiveness of adding the BiFomer attention mechanism module and the NWD loss function module to the YOLOv7 model, so the fish target individual position detection method of this embodiment achieves higher detection accuracy and a better effect.
Experiment 2:
The improved YOLOv7 model BNYOLOv7 of this embodiment was compared with underwater target detection models with good current detection performance. The models used in the experiments were: YOLOv7; SWIPENet, the underwater small-object detection model proposed by Chen; FERNet, the underwater blurred small-object detection model proposed by Fan; DCM-ATM-YOLOv5, which integrates the deformable convolution module DCM and the adaptive threshold module ATM proposed by Li Haiqing into YOLOv5; KAYOLO, based on prior-knowledge fusion; and the improved YOLOv7 model BNYOLOv7 of this embodiment. These models were run on the data-enhanced image dataset and their results compared, as shown in Table 4.
Table 4 results of comparative experiments
The experiments show that the improved YOLOv7 model BNYOLOv7 of this embodiment achieves the highest detection accuracy on the turbid, high-density Takifugu rubripes dataset. SWIPENet's improved feature fusion network and FERNet's increased network complexity both raise model performance by improving the utilization of existing features, and essentially do not solve the difficulty the target detection model has in extracting features. DCM-ATM-YOLOv5 adds a deformable convolution module to shift sampling positions toward foreground targets, but the deformable convolution lacks effective guided learning, so the model may focus on unnecessary features; moreover, the added adaptive threshold module (ATM) cannot fully replace a fixed threshold and still cannot adapt to differences between datasets, handles imbalanced datasets poorly, and is especially sensitive to noise in blurred images, easily causing false detections. The improved YOLOv7 model BNYOLOv7 of this embodiment uses the BiFomer attention mechanism to focus the network, in a query-adaptive manner, on the necessary features of the fish shoal without interference from irrelevant features, and uses the NWD loss function to improve the network's smoothness with respect to the position deviation of blurred, high-density fish shoals. KAYOLO's detection performance is strongly affected by the quality and quantity of its prior knowledge, whose reliability, updating and adaptability must all be considered: if the prior knowledge is outdated or not updated in time it cannot adapt to changes in fish shoal density, and it can only provide limited information, unable to fully describe the complex target features of turbid, high-density fish shoals. The improved YOLOv7 model BNYOLOv7 of this embodiment fundamentally improves the extraction of object features at different scales, thereby improving the ability of the fish target individual position detection method of this embodiment to detect individual fish target positions in turbid, high-density Takifugu rubripes shoals.
Experiment 3:
The improved YOLOv7 model BNYOLOv7 of this embodiment was applied to different datasets to analyze its generalization ability and adaptability to different detection scenarios. Two groups of experimental datasets were used. One is the large-scale Fish dataset from the China Agricultural Artificial Intelligence Innovation and Entrepreneurship Competition, which contains video data of the daily behavior of laboratory fish of different specifications, with many high-density occlusion and blurring cases similar to real environments; 8396 of its images were used as the training dataset. The other is the VOC2007+2012 dataset, a standardized dataset widely used in the target detection field that contains 20 categories; 21503 of its images for target detection tasks were used for training in this experiment to verify the robustness of the proposed model in different application scenarios. The improved YOLOv7 model of this embodiment was compared with YOLOv7, SWIPENet, FERNet, DCM-ATM-YOLOv5 and KAYOLO, and the experimental results are shown in Table 5.
Table 5 results of comparative experiments
The experimental results show that the improved YOLOv7 model BNYOLOv7 of this embodiment achieves better precision, recall and average precision on both datasets than the other comparative models. This verifies that the fish target individual position detection method of this embodiment performs well in turbid, high-density underwater environments, retains strong multi-scale feature fusion capability on public datasets with different scenes and data distributions, and has good robustness and generalization ability.
The fish target individual position detection method that fuses the BiFomer attention mechanism module and the NWD loss function module into the YOLOv7 model solves the problem of low accuracy in detecting high-density fish shoals in turbid water in industrial aquaculture environments. Introducing the BiFomer attention mechanism module into the YOLOv7 model efficiently locates valuable fish shoal information, makes the model adapt better to the scale and shape changes of different targets, and better separates targets from the background in dense scenes, improving detection accuracy in high-density fish shoals while balancing computational efficiency. To reduce the sensitivity of the YOLOv7 model to objects of different scales, the NWD loss function module is introduced, reducing sensitivity to input changes and improving the smoothness of position deviation, thereby improving the accuracy of detecting individual fish target positions in high-density fish shoals in turbid water.
The embodiments of the invention have been presented for purposes of illustration and description, and are not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
Claims (7)
1. The fish target individual position detection method based on the improved YOLOv7 model is characterized by comprising the following steps of:
s1, collecting an image in a culture water area as an image to be detected, and preprocessing and enhancing the image to be detected to obtain a data enhanced image dataset;
s2, adding a BiFomer attention mechanism module after an SPPCSPC module of the YOLOv7 model, and adding an NWD loss function module in a loss function module of the YOLOv7 model to obtain an improved YOLOv7 model;
s3, inputting the data enhancement image dataset obtained after preprocessing in the step S1 into the improved YOLOv7 model obtained in the step S2, and training the improved YOLOv7 model through the data enhancement image dataset to obtain a trained improved YOLOv7 model;
s4, inputting the acquired aquaculture water area image to be detected into the trained improved YOLOv7 model obtained in the step S3, and obtaining the position information of the fish target individual.
2. The method for detecting the position of an individual fish target based on an improved YOLOv7 model according to claim 1, wherein in the step S1, the preprocessing includes resizing the image and normalizing the pixel values.
3. The method for detecting the position of an individual fish target based on an improved YOLOv7 model according to claim 1, wherein in said step S1, said enhancing operation includes horizontal mirror flipping, vertical flipping, and combined horizontal-vertical flipping.
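As a minimal illustration of the operations recited in claims 2 and 3, the following Python sketch resizes and normalizes an image and then generates the three flipped variants; the 640×640 input size and the use of OpenCV are assumptions for illustration, not requirements of the claims.

```python
import cv2
import numpy as np

def preprocess(image: np.ndarray, size: int = 640) -> np.ndarray:
    # Claim 2: resize the image and normalize pixel values to [0, 1].
    image = cv2.resize(image, (size, size))
    return image.astype(np.float32) / 255.0

def augment(image: np.ndarray) -> list:
    # Claim 3: horizontal mirror flip, vertical flip, and combined
    # horizontal-vertical flip, each yielding one additional sample.
    return [
        cv2.flip(image, 1),   # horizontal mirror flip
        cv2.flip(image, 0),   # vertical flip
        cv2.flip(image, -1),  # horizontal + vertical flip
    ]
```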
4. The method according to claim 2, wherein in the step S2, the BiFomer attention mechanism module employs a BiFomer attention mechanism for filtering out irrelevant key tensors and value tensors at a coarse region level while retaining only a small set of routed regions, the basic building block of the BiFomer attention mechanism is bi-level routing attention, and the BiFomer attention mechanism comprises:
constructing a region-level affinity graph: for a given input feature map X ∈ R^(H×W×C), where W denotes the width of the feature map, H its height and C its number of channels, the feature map is divided into S×S non-overlapping regions each containing HW/S² feature vectors, yielding the reshaped map X^r ∈ R^(S²×(HW/S²)×C); X^r is linearly projected into a query Q, a key K and a value V, as shown in formulas (1)-(3):

Q = X^r W^q (1)

K = X^r W^k (2)

V = X^r W^v (3)

where W^q ∈ R^(C×C) is the projection weight of the query Q, W^k ∈ R^(C×C) that of the key K, and W^v ∈ R^(C×C) that of the value V;
constructing a region-to-region routing with a directed graph: the query Q and the key K are averaged within each region to obtain the region-level query Q^r ∈ R^(S²×C) and region-level key K^r ∈ R^(S²×C); the adjacency matrix A^r, which represents the degree of correlation between regions, is obtained by multiplying Q^r with the transpose of K^r, each element of A^r measuring how similar two regions of the data-enhanced image are in their feature information, as shown in formula (4):

A^r = Q^r (K^r)^T (4)

each node retains only its top k connections, pruning the adjacency matrix A^r into a routing index matrix I^r, as shown in formula (5):

I^r = topkIndex(A^r) (5)
performing self-attention calculation on fine-grained tokens within each region via the routing index matrix I^r: the key and value tensors of the routed regions are first gathered, as shown in formulas (6)-(7):

K^g = gather(K, I^r) (6)

V^g = gather(V, I^r) (7)

where K^g ∈ R^(S²×(kHW/S²)×C) is the gathered key tensor and V^g ∈ R^(S²×(kHW/S²)×C) is the gathered value tensor;
applying attention to the gathered key and value tensors to obtain the attention output O of the BiFomer attention mechanism, as shown in formula (8):

O = Attention(Q, K^g, V^g) + LCE(V) (8)

where LCE(V) is an introduced local context enhancement term; the function LCE(·) is parameterized by a depthwise convolution with kernel size 5.
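As a reading of formulas (1)-(8), a minimal single-head PyTorch sketch of bi-level routing attention is given below; the region count S, the number k of routed regions, the fused QKV projection and the omission of multi-head handling are all simplifying assumptions for illustration, not details fixed by the claim.

```python
import torch
import torch.nn as nn

class BiLevelRoutingAttention(nn.Module):
    # Single-head sketch of formulas (1)-(8); S and topk are assumptions.
    def __init__(self, dim: int, S: int = 8, topk: int = 4):
        super().__init__()
        self.S, self.topk = S, topk
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)  # W^q, W^k, W^v fused
        # LCE(.): depthwise convolution with kernel size 5, as in the claim.
        self.lce = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, H, W, C = x.shape          # H and W assumed divisible by S
        S, k = self.S, self.topk
        # Reshape to the region view X^r: (B, S^2, n, C) with n = HW/S^2.
        xr = x.view(B, S, H // S, S, W // S, C).permute(0, 1, 3, 2, 4, 5)
        xr = xr.reshape(B, S * S, -1, C)
        q, kk, v = self.qkv(xr).chunk(3, dim=-1)        # formulas (1)-(3)
        # Region-level query/key by per-region averaging; formula (4).
        Ar = q.mean(2) @ kk.mean(2).transpose(-2, -1)   # (B, S^2, S^2)
        Ir = Ar.topk(k, dim=-1).indices                 # formula (5)
        # Gather key/value tensors of the k routed regions; formulas (6)-(7).
        n, c = kk.shape[2], kk.shape[3]
        idx = Ir.view(B, S * S, k, 1, 1).expand(-1, -1, -1, n, c)
        Kg = torch.gather(kk.unsqueeze(1).expand(-1, S * S, -1, -1, -1), 2, idx).flatten(2, 3)
        Vg = torch.gather(v.unsqueeze(1).expand(-1, S * S, -1, -1, -1), 2, idx).flatten(2, 3)
        # Token-level attention within routed regions plus LCE(V); formula (8).
        attn = ((q @ Kg.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ Vg).view(B, S, S, H // S, W // S, C)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        v_map = v.view(B, S, S, H // S, W // S, C).permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return out + self.lce(v_map).permute(0, 2, 3, 1)
```

With S = 8 and k = 4, each of the 64 regions attends only to tokens from its four most related regions, which is how the mechanism keeps computation tractable while still separating densely packed fish from the background.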
5. The method for detecting the position of a fish target individual based on an improved YOLOv7 model according to claim 4, wherein in the step S2, the NWD loss function module uses an NWD loss function whose expression is shown in formula (9):

L_NWD = 1 − NWD(N_p, N_g) (9)

where N_p is the Gaussian distribution model of the prediction box P = (cx_p, cy_p, w_p, h_p), (cx_p, cy_p) are the center coordinates of the prediction box P, w_p is the width of the prediction box P, h_p is the height of the prediction box P; N_g is the Gaussian distribution model of the real box G = (cx_g, cy_g, w_g, h_g), (cx_g, cy_g) are the center coordinates of the real box G, w_g is the width of the real box G, h_g is the height of the real box G; and NWD is the normalized Wasserstein distance, as shown in formula (10):

NWD(N_p, N_g) = exp(−√(W₂²(N_p, N_g)) / C) (10)

where C is a constant closely related to the dataset and W₂²(N_p, N_g) is the second-order Wasserstein distance between the Gaussian distribution model N_p of the prediction box P and the Gaussian distribution model N_g of the real box G, as shown in formula (11):

W₂²(N_p, N_g) = ‖ [cx_p, cy_p, w_p/2, h_p/2]^T − [cx_g, cy_g, w_g/2, h_g/2]^T ‖₂² (11)

where T denotes the transpose.
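A minimal PyTorch sketch of formulas (9)-(11) follows; boxes are (cx, cy, w, h) tensors, and the value C = 12.8 is purely an illustrative assumption, since the claim states only that C is a dataset-dependent constant.

```python
import torch

def nwd_loss(pred: torch.Tensor, target: torch.Tensor, C: float = 12.8) -> torch.Tensor:
    # pred, target: (N, 4) boxes as (cx, cy, w, h).
    # Formula (11): second-order Wasserstein distance between the Gaussian
    # models, using the vectors [cx, cy, w/2, h/2] of each box.
    p = torch.cat([pred[:, :2], pred[:, 2:] / 2], dim=1)
    g = torch.cat([target[:, :2], target[:, 2:] / 2], dim=1)
    w2_sq = ((p - g) ** 2).sum(dim=1)
    # Formula (10): normalized Wasserstein distance.
    nwd = torch.exp(-torch.sqrt(w2_sq) / C)
    # Formula (9): the loss shrinks as predicted and real boxes align.
    return (1.0 - nwd).mean()
```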
6. The method for detecting the position of a fish target individual based on an improved YOLOv7 model according to claim 5, wherein the improved YOLOv7 model comprises an input end, a backbone network and a detection head; the input end is used for inputting an image to be detected; the backbone network comprises a CBS module, an SPPCSPC module, a BiFomer attention mechanism module, a Concat module and an ELAN-W module, wherein the CBS module is used for feature extraction and comprises a convolution layer module for performing the convolution operation, a batch normalization module for performing the normalization operation, and a loss function module comprising a SiLU loss function module and an NWD loss function module; the SPPCSPC module is a spatial pyramid structure that obtains different receptive fields through max pooling and is used for detecting targets of different sizes in the image to be detected; the BiFomer attention mechanism module enables the improved YOLOv7 model to achieve content-aware sparsity in a query-adaptive manner; the Concat module performs the concatenation operation; and the ELAN-W module, composed of a plurality of the CBS modules, is used for feature extraction; the detection Head comprises an ELAN-W module, a Concat module, an MP-2 module, a REP module and a Head module, wherein the MP-2 module is used for feature fusion and dimension reduction, the REP module is used for feature extraction, feature smoothing and feature transmission, and the Head module predicts bounding boxes from the features and outputs the prediction box with the highest confidence.
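As one concrete reading of the CBS module recited in claim 6 (convolution, batch normalization, SiLU), a minimal PyTorch sketch is shown below; the kernel size and stride are illustrative defaults rather than values fixed by the claim.

```python
import torch.nn as nn

class CBS(nn.Module):
    # Convolution + Batch normalization + SiLU activation, the basic
    # feature-extraction block that ELAN-W stacks several times.
    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```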
7. The method for detecting the position of a fish target individual based on an improved YOLOv7 model according to claim 1, wherein in the step S3, the improved YOLOv7 model is trained on the data-enhanced image dataset until the set number of training rounds is reached, thereby obtaining the trained improved YOLOv7 model.
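Claim 7 reduces to a plain epoch-bounded training loop, sketched below; the optimizer, learning rate and epoch count are assumptions for illustration, and loss_fn stands in for the model's combined loss, whose box-regression term per claim 5 would be the NWD loss.

```python
import torch

def train_model(model, loader, loss_fn, epochs: int = 300, lr: float = 1e-2):
    # Train on the data-enhanced dataset until the set number of
    # learning rounds (epochs) is reached, per claim 7.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.937)
    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            loss = loss_fn(model(images), targets)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```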
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202311164828.6A | 2023-09-08 | 2023-09-08 | Fish target individual position detection method based on improved YOLOv7 model |
Publications (1)

| Publication Number | Publication Date |
| --- | --- |
| CN117237986A | 2023-12-15 |
Family
ID=89097777
Cited By (4)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN118097498A | 2024-02-06 | 2024-05-28 | 北京科技大学 | Periodic signal detection method and device based on periodic sparse attention |
| CN118097498B | 2024-02-06 | 2024-10-08 | 北京科技大学 | Periodic signal detection method and device based on periodic sparse attention |
| CN118196396A | 2024-04-18 | 2024-06-14 | 广东工业大学 | Underwater target detection method based on deep learning |
| CN118230071A | 2024-05-22 | 2024-06-21 | 安徽大学 | Camera dirt detection method based on deep learning |
Legal Events

| Date | Code | Title |
| --- | --- | --- |
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |