CN113191222B

CN113191222B - Underwater fish target detection method and device

Info

Publication number: CN113191222B
Application number: CN202110406987.7A
Authority: CN
Inventors: 陈英义; 张倩; 李道亮; 秦瀚翔; 于辉辉; 孙博洋; 刘慧慧; 李少波; 魏晓华; 杨玲
Original assignee: China Agricultural University
Current assignee: China Agricultural University
Priority date: 2021-04-15
Filing date: 2021-04-15
Publication date: 2024-05-03
Anticipated expiration: 2041-04-15
Also published as: CN113191222A

Abstract

The invention provides a method and a device for detecting underwater fish targets, wherein the method comprises the following steps: inputting an image to be detected into a feature extraction model in a target detection model, and acquiring a plurality of feature images with different scales of the image to be detected from feature images output by each layer of the feature extraction model; inputting a plurality of feature images with different scales into the target detection model, and outputting a target detection result of the image to be detected; the image to be detected comprises a plurality of images of fish targets with different scales. On one hand, the invention realizes that the characteristics of fish targets with different scales in an image to be detected are completely represented by extracting a plurality of characteristic diagrams with different scales, so that the phenomenon that the fish targets with smaller sizes are lost in the target detection process and the fish targets with larger scales are difficult to detect because of incomplete characteristics is effectively relieved; on the other hand, the target detection is carried out on the image to be detected by combining a plurality of feature images with different scales, so that the target detection result is more accurate.

Description

Underwater fish target detection method and device

Technical Field

The invention relates to the technical field of target detection, in particular to a method and a device for detecting underwater fish targets.

Background

Currently, fish farming dominates aquaculture. In order to ensure the yield of fish, it is necessary to estimate the number of fish cultivated or monitor the growth state thereof. In addition, in order to prevent sudden death of fish from having a serious effect on the growth of other fish, it is necessary to individually track the target fish in which abnormal conditions occur.

As underwater robots and cameras are also evolving, research into underwater video or images is becoming a research hotspot in many other areas. At present, a machine learning algorithm is mainly adopted to detect the targets of fishes according to underwater videos or images so as to monitor and count the growth states of the underwater fishes and track individual fishes.

However, the fish culture form is highly intensive, so that the pixel area occupied by a single fish object in an image is small, and a plurality of small objects are generated. The size of the fish targets in the image is influenced by the distance between the fish targets and the underwater camera, namely the pixel area occupied by the fish targets close to the image is large, and the pixel area occupied by the fish targets far away from the camera is small, so that the size of the fish in the image is greatly different. Therefore, when the prior art is used for detecting the fish targets, the fish targets with smaller scale are easy to lose in the detection process, and the fish targets with larger scale can not be detected due to incomplete characteristic information, so that the detection accuracy is lower.

Disclosure of Invention

The invention provides an underwater fish target detection method and device, which are used for solving the defect that in the prior art, when the sizes of fish targets in images are greatly different, the target detection results of the fish targets are difficult to accurately obtain, and realizing that when the sizes of the fish targets in the images are greatly different, the target detection results of the fish targets are accurately obtained.

The invention provides a method for detecting underwater fish targets, which comprises the following steps:

Inputting an image to be detected into a feature extraction model in a target detection model, and acquiring a plurality of feature images with different scales of the image to be detected from feature images output by each layer of the feature extraction model;

inputting the feature images with different scales into the target detection model, and outputting a target detection result of the image to be detected;

The target detection model is obtained by training by taking a sample image as a sample and taking a target detection label of the sample image as a sample label; the image to be detected comprises a plurality of images of fish targets with different scales.

According to the underwater fish target detection method provided by the invention, the characteristic extraction model comprises a deconvolution module and a downsampling module;

Correspondingly, the inputting the image to be detected into the feature extraction model in the target detection model, and obtaining the feature graphs of a plurality of different scales of the image to be detected from the feature graphs output by each layer of the feature extraction model comprises the following steps:

Sequentially passing the image to be detected through all the downsampling modules to obtain a feature map output by the last downsampling module;

Sequentially passing the feature image output by the last downsampling module through all deconvolution modules to obtain the feature image output by all deconvolution modules;

taking the characteristic diagrams output by the deconvolution modules as characteristic diagrams of the images to be detected;

The feature graphs output by the deconvolution modules are feature graphs with different scales, the downsampling module is used for downsampling the input of the downsampling module, and the deconvolution module is used for deconvoluting the input of the deconvolution module.

According to the underwater fish target detection method provided by the invention, the feature map output by the last downsampling module sequentially passes through each deconvolution module to obtain the feature map output by each deconvolution module, and the method comprises the following steps:

fusing the feature image output by the last downsampling module with the feature image output by the downsampling module corresponding to each deconvolution module sequentially;

Inputting the fusion result into a deconvolution module next to each deconvolution module, and obtaining a characteristic diagram output by the deconvolution module at the back; wherein the deconvolution module is pre-associated with the downsampling module.

According to the underwater fish target detection method provided by the invention, the feature map output by the last downsampling module is fused with the feature map output by the downsampling module corresponding to each deconvolution module sequentially, and the method comprises the following steps:

For any deconvolution module, deconvoluting the feature images output by the deconvolution module immediately before the deconvolution module, and performing a first convolution operation on the deconvoluted feature images to obtain feature images after the first convolution operation;

Respectively performing a second convolution operation and a third convolution operation on the feature images output by the downsampling module corresponding to the deconvolution module, and obtaining feature images after the second convolution operation and feature images after the third convolution operation;

fusing the characteristic diagram after the first convolution operation, the characteristic diagram after the second convolution operation and the characteristic diagram after the third convolution operation;

The feature map after the first convolution operation, the feature map after the second convolution operation and the feature map after the third convolution operation have the same channel number, and the feature map after the first convolution operation, the feature map after the second convolution operation and the feature map after the third convolution operation have the same size.

According to the underwater fish target detection method provided by the invention, the feature map after the first convolution operation, the feature map after the second convolution operation and the feature map after the third convolution operation are fused, and the method comprises the following steps:

And performing dot product operation on the characteristic diagram after the first convolution operation and the characteristic diagram after the second convolution operation, and fusing the characteristic diagram after the dot product operation with the characteristic diagram after the third convolution operation.

According to the method for detecting the underwater fish target provided by the invention, before the image to be detected is input into the feature extraction model in the target detection model and the feature images with different scales of the image to be detected are obtained from the feature images output by each layer of the feature extraction model, the method further comprises the steps of:

Preprocessing the image to be detected;

wherein the preprocessing comprises image enhancement and/or geometric transformation of the image to be detected.

According to the underwater fish target detection method provided by the invention, the Loss function of the target detection model is a Focal Loss function.

The invention also provides an underwater fish target detection device, which comprises:

The acquisition module is used for inputting the image to be detected into a feature extraction model in a target detection model, and acquiring a plurality of feature images with different scales of the image to be detected from the feature images output by each layer of the feature extraction model;

The target detection module is used for inputting the feature images with the different scales into a target detection model and outputting a target detection result of the image to be detected;

The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor performs the steps of the method for detecting an underwater fish target as described above when the program is executed.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the underwater fish target detection method as described in any of the above.

According to the underwater fish target detection method and device, on one hand, the image to be detected is input into the feature extraction model in the target detection model, the multi-scale deconvolution operation is carried out on the image to be detected in the feature extraction model, and a plurality of feature images with different scales are obtained, so that the features of fish targets with different scales in the image to be detected are completely represented, the phenomenon that the fish targets with smaller sizes are lost in the target detection process and the fish targets with larger sizes are difficult to detect because of incomplete features is effectively relieved; on the other hand, the target detection is carried out on the image to be detected by combining a plurality of feature images with different scales, so that the target detection result is more accurate.

Drawings

In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of the underwater fish target detection method provided by the invention;

FIG. 2 is a schematic diagram of a target detection model in the underwater fish target detection method provided by the invention;

FIG. 3 is a second schematic diagram of the structure of the object detection model in the underwater fish object detection method according to the present invention;

FIG. 4 is a third schematic diagram of the structure of the object detection model in the underwater fish object detection method according to the present invention;

FIG. 5 is a schematic structural view of the underwater fish object detection apparatus provided by the present invention;

Fig. 6 is a schematic structural diagram of an electronic device provided by the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The underwater fish object detection method of the present invention is described below with reference to fig. 1, including: step 101, inputting an image to be detected into a feature extraction model in a target detection model, and acquiring a plurality of feature images with different scales of the image to be detected from feature images output by each layer of the feature extraction model;

the image to be detected is an image of a fish object with a plurality of different scales, which is required to be subjected to object detection. The source of the image to be detected can be an image of the underwater fish object acquired in real time, or can be an image of the underwater fish object acquired in advance. The source of the image to be detected is not particularly limited in this embodiment.

Optionally, the image to be detected in the embodiment is an image acquired by the underwater image acquisition device under the factory intensive cultivation scene. The underwater image capturing device may be a camera or a robot, which is not particularly limited in this embodiment.

Before inputting the image to be detected into the feature extraction model in the target detection model, the target detection model needs to be trained first. While training the object detection model requires the construction of a dataset.

The data construction method in this embodiment will be described below by taking an image construction dataset of underwater fish photographed by a camera as an example.

In the embodiment, an underwater video acquisition device is built in a field culture pond, and underwater fish images are shot in the culture pond in advance through the underwater video acquisition device so as to construct a data set.

Preferably, the underwater video capturing apparatus in this embodiment includes a stand and an underwater camera. Wherein, the bracket is used for adjusting the height and angle of the underwater camera in water so as to ensure that as many fish objects as possible are shot in the field of view of the underwater camera.

The shape of the size of the culture pond for collecting the data set can be selected or set according to actual requirements. For example, the pond is a cylindrical pond with a diameter of 1.8 meters and a height of 1 meter.

Further, the present embodiment is not limited to the number, variety and length of fish in the culture pond, for example, the number of fish in the culture pond is 300, the variety of fish is a spat of oplegnathus fasciatus, and the length of fish is 7-8 cm.

The period for collecting the sample data can be set according to actual requirements, such as one month or two months.

In the data acquisition stage, in order to acquire more diversified data and enrich a data set, the embodiment acquires images of underwater fish in various underwater environments. Wherein, the underwater environment includes daytime natural illumination, daytime artificial light source, night artificial light source and night camera light source illumination, which is not particularly limited in this example.

By the mode, a large amount of underwater fish videos are collected. Then, extracting image frames from the underwater fish video through video processing software, and marking the image frames by using an image marking tool so as to obtain target detection labels of the image frames. Wherein, the data marking tool is a tool LabelImg and the like.

In addition, the embodiment also eliminates the image frames with serious target double images and the image frames with fewer fish targets or difficult labeling caused by the high-speed movement of fish in water so as to obtain high-quality effective image frames and form a data set.

The fish target distribution density in each image frame can be medium density or high density. The number of data sets, medium density or high density image frames may be plural, and this embodiment is not particularly limited. For example, the dataset contains 725 image frames, 291 image frames of medium density, and 434 image frames of high density. The number of fish objects in the data set may be plural, for example 32499, which is not particularly limited in this embodiment.

The size of the image frame may also be set according to the actual implementation, such as 768×1024.

In addition, the number of fish in the image frames with medium density or high density can be set according to actual requirements, for example, the number of fish targets in each image frame with medium density is 30-60, and the number of fish targets in each image frame with high density is 60-90.

In the actual simulation process, 80% of image frames can be selected as training samples, namely sample images, 10% of image frames are selected as verification samples, and the availability and performance of the target detection model are verified. In addition, 10% of the image frames can be used as a test set, namely the image to be detected.

In summary, in the data set construction process, the underwater camera is firstly utilized to acquire underwater fish video, then video editing software and an image labeling tool are utilized to extract and label the pictures of fish targets, so as to construct a data set, and a data base is provided for training of a target detection model.

In the training process of the target detection model, a training strategy can adopt a random gradient descent (SGD) activation function and the like to optimize the target detection model. The super parameters in the target detection model can be set according to actual requirements. Such as learning rate, batch_size, epoch are set to 0.001, 2, and 30, respectively.

Then, inputting the image to be detected into a feature extraction model of the trained target detection model, and performing deconvolution operation for a plurality of times after performing convolution for the image to be detected in the feature extraction model for a plurality of times so as to obtain a plurality of feature images with different scales. Each characteristic map comprises characteristics of fish targets with different scales.

According to the embodiment, the multi-scale feature extraction is carried out on the image to be detected, so that a plurality of feature images with different scales can be obtained, the features of the fish targets with different scales in the image to be detected are reserved, and the defect that the fish targets with smaller sizes are lost in the target detection process and the fish targets with larger scales are difficult to detect due to incomplete features is effectively avoided.

102, Inputting the feature images with different scales into the target detection model, and outputting a target detection result of the image to be detected; the target detection model is obtained by training by taking a sample image as a sample and taking a target detection label of the sample image as a sample label; the image to be detected comprises a plurality of images of fish targets with different scales.

The target detection model comprises a feature extraction model, a classification model and a regression model. The classification model is used for identifying fish targets in the image to be detected, and the regression model is used for obtaining the boundary boxes of the fish targets in the image to be detected.

The feature images with different scales can be sequentially input into the target detection model, the target detection result of each feature image is output, and the target detection results of the feature images with different scales are overlapped to obtain the target detection result of the image to be detected.

The feature images with different scales can be input into the target detection model and the target detection result of the image to be detected can be output.

In this embodiment, the target detection algorithm is implemented using python or C.

Preferably, when the target detection algorithm is implemented using the Python language, the target detection algorithm may be implemented based on Pytorch framework. The hardware device may be set according to actual requirements, for example, 2 GPUs (Graphics Processing Unit, graphics processor) of 2080Ti are set.

In addition, the present embodiment adopts accuracy (Precision), recall (recall), and MAP (MEAN AVERAGE Precision ) as metrics to verify the effectiveness of the underwater fish target detection method in the present embodiment. The specific calculation formula is as follows:

the TP, the FP and the FN are respectively positive samples and positive samples of sample labels, negative samples and positive samples of classification results, negative samples of sample labels and negative samples of classification results, N is the number of images to be detected, p (k) is the Precision value of the identified kth image to be detected, and Deltar (k) is the change condition between the identified kth-1 image to be detected and the Recall value of the kth image to be detected.

In order to optimize the investment in the production links of aquaculture, intensive aquaculture implementing high-density aquaculture and mass production is becoming a standard model of aquaculture to obtain maximum economic and environmental benefits. Today, high-tech technologies such as big data, internet of things, cloud computing, artificial intelligence and the like and intelligent equipment are deeply fused with modern agriculture. The method has more and more application in various links such as production, processing, transportation, sales and the like in the fishery cultivation process, and the production and operation efficiency is greatly improved. Among them, the object detection method in computer vision technology is widely used in the aspects of automatic identification and classification, production state monitoring, etc. in aquaculture, because of its advantages of rapidness, objectivity and high precision.

The underwater fish target detection method in this embodiment is compared with the existing target detection methods, such as FASTER RCNN (Faster Regions with CNN, fast area convolutional neural network), RETINANET network, SSD (Single Shot MultiBox Detector, single-shot multi-box detection), DSSD (Deconvolutional Single Shot Detector, deconvolution single-shot detection), YOLOv3 (You only look once V3, you see third edition only once) and YOLOv5 (You only look once V, you see fifth edition only once), to verify the effectiveness of the underwater fish target detection method in this embodiment. As can be seen from table 1, the detection result of the image to be detected in this embodiment has the maximum accuracy, recall rate and average mean value accuracy, so that the target detection result is more accurate.

TABLE 1 experimental analysis results

Model	MAP	Recall	Precision
				Faster RCNN	88.9％	97.3％	61.2％
Retinanet	89.6％	94.6％	50.2％
				SSD	86.3％	92.7％	50.1％
DSSD	85.7％	84.2％	70.8％
				YOLOv3	90.4％	95.1％	61.3％
YOLOv5	94.7％	96.4％	73.1％
				This embodiment	95.2％	99.2％	78.1％

According to the embodiment, the target detection model is used for carrying out multi-scale feature extraction on the image to be detected, a plurality of scale feature images are obtained, and the target detection result of the image to be detected can be accurately obtained by combining the plurality of feature images with different scales.

On one hand, the image to be detected is input into the feature extraction model in the target detection model, multi-scale deconvolution operation is carried out on the image to be detected in the feature extraction model, and a plurality of feature images with different scales are obtained to completely represent the features of fish targets with different scales in the image to be detected, so that the phenomenon that the fish targets with smaller sizes are lost in the target detection process and the fish targets with larger scales are difficult to detect due to incomplete features is effectively relieved; on the other hand, the target detection is carried out on the image to be detected by combining a plurality of feature images with different scales, so that the target detection result is more accurate.

On the basis of the above embodiment, the feature extraction model in this embodiment includes a deconvolution module and a downsampling module; correspondingly, the inputting the image to be detected into the feature extraction model in the target detection model, and obtaining the feature graphs of a plurality of different scales of the image to be detected from the feature graphs output by each layer of the feature extraction model comprises the following steps: sequentially passing the image to be detected through all the downsampling modules to obtain a feature map output by the last downsampling module; sequentially passing the feature image output by the last downsampling module through all deconvolution modules to obtain the feature image output by all deconvolution modules; taking the characteristic diagrams output by the deconvolution modules as characteristic diagrams of the images to be detected; the feature graphs output by the deconvolution modules are feature graphs with different scales, the downsampling module is used for downsampling the input of the downsampling module, and the deconvolution module is used for deconvoluting the input of the deconvolution module.

The feature extraction model is a deconvolution neural network structure, and ResNet (residual network) with a plurality of convolution layers is used as a backbone network. ResNet improves the characterization capability of the shallow network through a residual block structure containing a multi-layer network, and effectively avoids the problem of network degradation.

Preferably, the convolution layer is a hole convolution. Because the size of the image to be detected is large, under the condition of limited resources, the adoption of cavity convolution can provide a larger receptive field, and the information loss caused by pooling operation can be effectively reduced. That is, a larger perceived field of view can be provided without losing image information. In addition, the cavity convolution can reduce parameters and accurately determine the position of the target.

Wherein the perceived field of view is the size of the area mapped onto the original image by the pixels in each layer of feature map. The bigger the receptive field is, the more global information it contains, which is a higher semantic feature; conversely, the smaller the receptive field, the more localized and detailed features are contained in the feature map.

It should be noted that the convolution layer may also be a normal convolution.

The feature extraction model may include a plurality of downsampling modules and a plurality of deconvolution modules, which are not limited in the structure of the feature extraction model in this embodiment.

For example, the number of downsampling modules in the feature extraction model is 6, conv3, conv5, conv6, conv7, conv8, and conv9 in the DSSD network, respectively.

Alternatively, one or more downsampling layers may be included in the downsampling module, and the present embodiment is not limited to the number of downsampling layers in the downsampling module. The downsampling layers in each downsampling module may be the same or different.

For the downsampling layer of any downsampling module, pooling operation can be carried out on the feature map input into the downsampling layer, so that the size of the input feature map is reduced. Wherein the pooling operation may maximize pooling or average pooling, etc.

The downsampling module further includes one or more convolution layers, and the present embodiment is not limited to the number of convolution layers in the downsampling module. The convolution kernel size of each convolution layer may be the same or different. The convolution operation can learn the input feature map and output the feature map with deep feature information.

As shown in fig. 2 and fig. 3, an image to be detected may sequentially pass through a plurality of downsampling modules, and feature extraction is performed on the image to be detected through each downsampling module until the image passes through the last downsampling module, so as to obtain a feature map output by the last downsampling module.

After the feature map output by the last downsampling module is obtained, the feature map output by the last downsampling module can be used as the input of the front deconvolution module arranged in the feature extraction model, and the feature map output by the deconvolution module is obtained. And then inputting the feature map output by the deconvolution module into the next deconvolution module immediately after the feature map is input, and acquiring the feature map output by the next deconvolution module immediately after the feature map is input until all the deconvolution modules are passed through, so as to acquire a plurality of feature maps with different scales.

In addition, the feature map output by the last downsampling module and the feature output by the downsampling module corresponding to the front deconvolution module arranged in the feature extraction model can be used as the input of the deconvolution module together, so that the feature map output by the deconvolution module can be obtained. And then, the characteristic diagram output by the deconvolution module and the characteristic diagram output by the downsampling module corresponding to the next deconvolution module immediately behind the characteristic diagram are input into the next deconvolution module together, and the characteristic diagram output by the next deconvolution module is obtained until all deconvolution modules pass through.

Wherein one or more deconvolution layers may be included in the deconvolution module, the present embodiment is not limited to the number of deconvolution layers in the deconvolution module. The deconvolution layers in each deconvolution module may be the same or different.

For any deconvolution layer, the feature map input in that deconvolution layer may be enlarged or restored using a deconvolution operation. The deconvolution layers in each deconvolution module may be the same or different.

In the embodiment, a plurality of deconvolution modules are adopted to carry out multi-scale feature extraction on the input feature map, so that the extracted feature map contains rich scale features of the feature map to be detected, and the accuracy of target detection is effectively improved.

On the basis of the foregoing embodiment, in this embodiment, the step of sequentially passing the feature map output by the last downsampling module through each deconvolution module to obtain the feature map output by each deconvolution module includes: fusing the feature image output by the last downsampling module with the feature image output by the downsampling module corresponding to each deconvolution module sequentially; inputting the fusion result into a deconvolution module next to each deconvolution module, and obtaining a characteristic diagram output by the deconvolution module at the back; wherein the deconvolution module is pre-associated with the downsampling module.

Specifically, it is assumed that the sum of the numbers of deconvolution modules and downsampling modules in the feature extraction model is N, where N is a positive integer. And numbering the downsampling module and the deconvolution module according to the mode that the downsampling module and the deconvolution module are sequentially added with 1 from front to back in the feature extraction, and the downsampling module corresponding to the deconvolution module with the number i is N-i.

For the next deconvolution module immediately adjacent to any deconvolution module, after the feature image output by the last downsampling module sequentially passes through the deconvolution module, the feature image output by the deconvolution module and the feature image output by the downsampling module corresponding to the deconvolution module are fused and then input into the next deconvolution module immediately adjacent to the next deconvolution module, so as to obtain the feature image output by the next deconvolution module immediately adjacent to the next deconvolution module.

According to the embodiment, the feature map output by the deconvolution module and the feature map output by the downsampling module corresponding to the deconvolution module are fused, the fused feature map is used as the input of the next deconvolution module immediately adjacent to the deconvolution module, the context information is fully utilized, the fused feature map contains abundant shallow features and deep features, the loss of the feature information in the image to be detected can be effectively reduced, and the accuracy of target detection is further improved.

On the basis of the foregoing embodiment, in this embodiment, the fusing the feature map output by the last downsampling module with the feature map output by the downsampling module corresponding to each deconvolution module sequentially includes: for any deconvolution module, deconvoluting the feature images output by the deconvolution module immediately before the deconvolution module, and performing a first convolution operation on the deconvoluted feature images to obtain feature images after the first convolution operation; respectively performing a second convolution operation and a third convolution operation on the feature images output by the downsampling module corresponding to the deconvolution module, and obtaining feature images after the second convolution operation and feature images after the third convolution operation; fusing the characteristic diagram after the first convolution operation, the characteristic diagram after the second convolution operation and the characteristic diagram after the third convolution operation; the feature map after the first convolution operation, the feature map after the second convolution operation and the feature map after the third convolution operation have the same channel number, and the feature map after the first convolution operation, the feature map after the second convolution operation and the feature map after the third convolution operation have the same size.

Specifically, since the mutual occlusion between the targets easily occurs in the high-density image, after the targets are occluded, the feature information which can be extracted is reduced and extraction of other features is interfered, so that frequent occurrence of missed detection or false detection is caused. When the image of the underwater fish obtained in the dense scene is analyzed by using a computer vision method, the complexity of the problem increases exponentially with the number of fish targets and the increase of the mutual influence among the fish targets, so that the detection result of the underwater fish targets is difficult to meet the actual requirement.

In order to solve the above problem, in this embodiment, on one hand, the feature map output by the deconvolution module and the feature map output by the downsampling module corresponding to the deconvolution module are fused, so that the low-level features and the high-level features are fully combined; on the other hand, features are enriched by expanding the dimension of the feature map, so that most of information is reserved, and no significant overhead is generated.

Optionally, for any deconvolution module, deconvoluting the feature map output by the deconvolution module to deconvolve the feature map output by the deconvolution module; and then performing a first convolution operation on the feature map after the deconvolution operation. The size of the feature map after the first convolution operation is a preset multiple, such as 2 times, of the feature map before the convolution operation.

And at the same time, performing a second convolution operation on the feature map output by the downsampling module corresponding to the deconvolution module, and further extracting the features in the feature map output by the downsampling module corresponding to the deconvolution module. The feature map size before and after the second convolution operation is unchanged.

And in order to obtain additional context information, performing a third convolution operation on the feature map output by the downsampling module corresponding to the deconvolution module. The size and the channel number of the feature map after the third convolution operation are the same as those of the feature map after the first convolution operation and the feature map after the second convolution operation.

The sizes of convolution kernels of the deconvolution operation, the first convolution operation, the second convolution operation and the third convolution operation can be set according to actual requirements. The number of times of the first convolution operation, the second convolution operation, and the third convolution operation may also be set according to actual requirements.

And then fusing the characteristic diagram after the first convolution operation, the characteristic diagram after the second convolution operation and the characteristic diagram after the third convolution operation.

The fusion mode comprises the steps of connecting the characteristic diagram after the first convolution operation, the characteristic diagram after the second convolution operation and the characteristic diagram after the third convolution operation in series, fusing the characteristic diagram after the first convolution operation and the characteristic diagram after the second convolution operation and then connecting the characteristic diagram after the third convolution operation in series or fusing the characteristic diagram after the first convolution operation and the characteristic diagram after the third convolution operation and then connecting the characteristic diagram after the second convolution operation in series so as to expand the dimension of the characteristic diagram and acquire the characteristics of the abundant images to be detected.

According to the embodiment, the low-level features and the high-level features are fully combined, the features are enriched by expanding the dimension, most of information is reserved, no obvious expenditure is generated, and the problems that in the prior art, due to the fact that the fish objects are seriously shielded and the small objects are numerous, the accuracy of the object detection result is low, and the omission and false detection are frequent in the dense scene are effectively solved.

As shown in fig. 4, an example in the present embodiment. The size of the feature map output by the deconvolution module is W×H×512, firstly, the feature map after the first convolution operation is obtained by performing deconvolution operation of 2×2×256 and then performing first convolution operation of 3×3×256, and the feature map after the first convolution operation is 2W×2H×256. The size of the feature map output by the downsampling module corresponding to the deconvolution module is 2W×2H2X512, and the feature map after the first convolution operation is 2W×2H2H×256 is obtained through the second convolution operation of 3×3×256 and 3×3×256 in sequence. The size of the feature map output by the downsampling module corresponding to the deconvolution module is subjected to a third convolution operation of 1×1×256, and the feature map obtained after the third convolution operation is 2w×2h×256.

On the basis of the foregoing embodiment, in this embodiment, fusing the feature map after the first convolution operation, the feature map after the second convolution operation, and the feature map after the third convolution operation includes: and performing dot product operation on the characteristic diagram after the first convolution operation and the characteristic diagram after the second convolution operation, and fusing the characteristic diagram after the dot product operation with the characteristic diagram after the third convolution operation.

Specifically, the feature map after the first convolution operation and the feature map after the second convolution operation are subjected to dot product operation (ELEMENTWISE), and the size and the channel number of the feature map after the dot product operation are the same as those of the feature map after the first convolution operation and the feature map after the second convolution operation.

Then, the feature map after the dot product operation and the feature map after the third convolution operation are fused by using a tandem operation (Concatenate). The number of channels of the fused feature map is twice the number of channels of the feature map after the dot-product operation or the feature map after the third convolution operation, and the size of the channel is the same as the size of the feature map after the dot-product operation or the feature map after the third convolution operation.

According to the embodiment, the low-level features and the high-level features are fully combined, more detail features are fused, so that the obtained feature map contains abundant feature information of the image to be detected, and the target detection result can be accurately obtained even under the conditions of serious target shielding or numerous small targets and the like in a dense scene.

On the basis of the foregoing embodiments, in this embodiment, before the inputting the image to be detected into the feature extraction model in the target detection model and obtaining the feature images of different scales of the image to be detected from the feature images output by each layer of the feature extraction model, the method further includes: preprocessing the image to be detected; wherein the preprocessing comprises image enhancement and/or geometric transformation of the image to be detected.

In particular, underwater fish target detection presents a significant challenge due to the relatively complex underwater environment. When an underwater image acquisition device is used for acquiring an image of an underwater fish object, color deviation caused by light absorption, detail blurring caused by forward scattering of light and contrast ratio caused by backward scattering of light in an underwater environment are low, even serious distortion is caused, so that the quality of an underwater image obtained by an image acquisition mode based on underwater optical imaging is poor, an expected effect is not achieved, and further an object detection result is inaccurate. Therefore, the image enhancement pretreatment of the image to be detected plays an important role in accurately detecting the underwater fish targets.

In order to alleviate the above problems, in this embodiment, before an image to be detected is input into a feature extraction model in a target detection model, an image enhancement pretreatment is performed on the image to be detected by using an defogging network (AOD-Net), so that the contrast of the image to be detected can be effectively improved, the phenomenon of image blurring caused by too many white masks during underwater imaging can be reduced, the quality of the image to be detected can be improved, and the influence on the fish target detection result caused by the blurring of the image to be detected can be reduced.

Alternatively, AOD-Net is a deep learning method based on an atmospheric scattering model. The AOD-Net directly calculates two parameters of global atmosphere light and medium transmissivity, and combines the two parameters into one parameter for estimation, so that error accumulation and even amplification are not caused. The outline of the image to be detected after the AOD-Net enhancement is clear, the color is rich, and the problem of excessive recovery does not exist.

In this embodiment, one or more of the mean value, variance, average gradient and image entropy of the image are used to compare the images to be detected before and after the AOD-Net is enhanced, so as to evaluate the quality of the images to be detected before and after the AOD-Net is enhanced.

In addition, before the image to be detected is input into the feature extraction model in the target detection model, geometric transformation can be performed on the image to be detected so as to enrich the features of the image to be detected.

In the training process of the target detection model, the image enhancement and the geometric transformation can be carried out on the sample image in the mode, and a rich sample data set can be obtained by carrying out the geometric transformation on the sample graph, so that the phenomenon of overfitting of the model is effectively avoided.

Optionally, the geometric transformation includes flipping, rotating, scaling, and the like, and the present embodiment is not limited to the manner in which the geometric transformation is performed.

According to the embodiment, the image enhancement is carried out on the image to be detected and the sample image, so that the contrast of the image to be detected can be effectively improved, the phenomenon of image blurring caused by too many white masks in underwater imaging is reduced, and the accuracy of a target detection result is further improved. In addition, the sample set can be expanded by performing geometric transformation on the sample image, so that the phenomenon of overfitting of the target detection model is effectively avoided, the performance of the target detection model is improved, and the accuracy of a target detection result is further improved.

Based on the above embodiments, the Loss function of the target detection model in this embodiment is a Focal Loss function.

Specifically, in highly intensive environments, a single fish object occupies a small pixel area in the image, creating numerous small objects, making the background redundant. That is, the number of positive samples is small and the number of negative samples is large, resulting in class imbalance.

As shown in fig. 2, in order to solve the above-described problem, the present embodiment uses a Focal Loss function as a Loss function to balance the problem of positive and negative sample imbalance.

In addition, the target detection model is optimized by adopting a Non-maximum suppression (Non-Maximum Suppression) method.

The underwater fish object detection device provided by the invention is described below, and the underwater fish object detection device described below and the underwater fish object detection method described above can be referred to correspondingly.

As shown in fig. 5, the embodiment provides an underwater fish target detection device, the zhuanghan son includes an acquisition module 501 and a target detection module 502, wherein:

The obtaining module 501 is configured to input an image to be detected into a feature extraction model in a target detection model, and obtain feature graphs of a plurality of different scales of the image to be detected from feature graphs output by each layer of the feature extraction model;

The shape of the size of the culture pond for collecting the data set can be selected or set according to actual requirements.

Further, the present embodiment is not limited to the number, variety, and length of fish in the culture pond.

The period for collecting the sample data can be set according to actual requirements.

The fish target distribution density in each image frame can be medium density or high density. The number of data sets, medium density or high density image frames may be plural, and this embodiment is not particularly limited. The number of fish objects in the data set may be plural, and this embodiment is not particularly limited.

The size of the image frame may also be set according to the actual implementation.

In addition, the number of fish in the image frame of medium density or high density can also be set according to actual requirements.

In the training process of the target detection model, a training strategy can adopt a random gradient descent (SGD) activation function and the like to optimize the target detection model. The super parameters in the target detection model can be set according to actual requirements.

The target detection module is used for inputting the feature images with the different scales into the target detection model and outputting a target detection result of the image to be detected; the target detection model is obtained by training by taking a sample image as a sample and taking a target detection label of the sample image as a sample label; the image to be detected comprises a plurality of images of fish targets with different scales.

Preferably, when the target detection algorithm is implemented using the Python language, the target detection algorithm may be implemented based on Pytorch framework. The hardware equipment can be set according to actual requirements.

In order to optimize the investment in the production links of aquaculture, intensive aquaculture implementing high-density aquaculture and mass production is becoming a standard model of aquaculture to obtain maximum economic and environmental benefits. Today, high-tech technologies such as big data, internet of things, cloud computing, artificial intelligence and the like and intelligent equipment are deeply fused with modern agriculture. The method has more and more application in various links such as production, processing, transportation, sales and the like in the fishery cultivation process, and the production and operation efficiency is greatly improved. Among them, computer vision technology is widely used in the aspects of automatic identification and classification in aquaculture, production state monitoring, etc. because of its advantages of rapidness, objectivity and high precision.

The underwater fish target detection method in the embodiment is compared with the existing computer vision technology to verify the effectiveness of the underwater fish target detection method in the embodiment. As can be seen from table 1, the detection result of the image to be detected in this embodiment has the maximum accuracy, recall rate and average mean value accuracy, so that the target detection result is more accurate.

On the basis of the above embodiment, the feature extraction model in this embodiment includes a deconvolution module and a downsampling module; correspondingly, the acquisition module is specifically configured to: sequentially passing the image to be detected through all the downsampling modules to obtain a feature map output by the last downsampling module; sequentially passing the feature image output by the last downsampling module through all deconvolution modules to obtain the feature image output by all deconvolution modules; taking the characteristic diagrams output by the deconvolution modules as characteristic diagrams of the images to be detected; the feature graphs output by the deconvolution modules are feature graphs with different scales, the downsampling module is used for downsampling the input of the downsampling module, and the deconvolution module is used for deconvoluting the input of the deconvolution module.

On the basis of the above embodiment, the obtaining module in this embodiment is further configured to fuse the feature map output by the last downsampling module with the feature map output by the downsampling module corresponding to each deconvolution module sequentially; inputting the fusion result into a deconvolution module next to each deconvolution module, and obtaining a characteristic diagram output by the deconvolution module at the back; wherein the deconvolution module is pre-associated with the downsampling module.

On the basis of the above embodiment, the present embodiment further includes a fusion module specifically configured to: for any deconvolution module, deconvoluting the feature images output by the deconvolution module immediately before the deconvolution module, and performing a first convolution operation on the deconvoluted feature images to obtain feature images after the first convolution operation; respectively performing a second convolution operation and a third convolution operation on the feature images output by the downsampling module corresponding to the deconvolution module, and obtaining feature images after the second convolution operation and feature images after the third convolution operation; fusing the characteristic diagram after the first convolution operation, the characteristic diagram after the second convolution operation and the characteristic diagram after the third convolution operation; the characteristic diagram after the first convolution operation, the characteristic diagram after the second convolution operation and the channel number of the characteristic diagram after the third convolution operation are the same; the feature map after the first convolution operation, the feature map after the second convolution operation and the feature map after the third convolution operation are the same in size.

On the basis of the foregoing embodiment, the fusion module in this embodiment is further configured to perform a dot product operation on the feature map after the first convolution operation and the feature map after the second convolution operation, and fuse the feature map after the dot product operation with the feature map after the third convolution operation.

On the basis of the above embodiments, the present embodiment further includes a preprocessing module specifically configured to: preprocessing the image to be detected; wherein the preprocessing comprises image enhancement and/or geometric transformation of the image to be detected.

Fig. 6 illustrates a physical schematic diagram of an electronic device, as shown in fig. 6, which may include: processor 601, communication interface (Communications Interface) 602, memory 603 and communication bus 604, wherein processor 601, communication interface 602, memory 603 complete communication with each other through communication bus 604. The processor 601 may call logic instructions in the memory 603 to perform an underwater fish object detection method comprising: inputting an image to be detected into a feature extraction model in a target detection model, and acquiring a plurality of feature images with different scales of the image to be detected from feature images output by each layer of the feature extraction model; inputting the feature images with different scales into the target detection model, and outputting a target detection result of the image to be detected; the target detection model is obtained by training by taking a sample image as a sample and taking a target detection label of the sample image as a sample label; the image to be detected comprises a plurality of images of fish targets with different scales.

Further, the logic instructions in the memory 603 described above may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method of underwater fish target detection provided by the above methods, the method comprising: inputting an image to be detected into a feature extraction model in a target detection model, and acquiring a plurality of feature images with different scales of the image to be detected from feature images output by each layer of the feature extraction model; inputting the feature images with different scales into the target detection model, and outputting a target detection result of the image to be detected; the target detection model is obtained by training by taking a sample image as a sample and taking a target detection label of the sample image as a sample label; the image to be detected comprises a plurality of images of fish targets with different scales.

In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the above-provided underwater fish target detection methods, the method comprising: inputting an image to be detected into a feature extraction model in a target detection model, and acquiring a plurality of feature images with different scales of the image to be detected from feature images output by each layer of the feature extraction model; inputting the feature images with different scales into the target detection model, and outputting a target detection result of the image to be detected; the target detection model is obtained by training by taking a sample image as a sample and taking a target detection label of the sample image as a sample label; the image to be detected comprises a plurality of images of fish targets with different scales.

The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for detecting an underwater fish target, comprising:

the target detection model is obtained by training by taking a sample image as a sample and taking a target detection label of the sample image as a sample label; the image to be detected comprises a plurality of images of fish targets with different scales;

the feature extraction model comprises a deconvolution module and a downsampling module;

The feature graphs output by the deconvolution modules are feature graphs with different scales, the downsampling module is used for downsampling the input of the downsampling module, and the deconvolution module is used for deconvoluting the input of the deconvolution module;

The step of sequentially passing the feature map output by the last downsampling module through all deconvolution modules to obtain the feature map output by all deconvolution modules comprises the following steps:

inputting the fusion result into a deconvolution module next to each deconvolution module, and obtaining a characteristic diagram output by the deconvolution module at the back; wherein the deconvolution module is pre-associated with the downsampling module;

The fusing the feature image output by the last downsampling module with the feature image output by the downsampling module corresponding to each deconvolution module sequentially comprises the following steps:

2. The underwater fish target detection method as claimed in claim 1, wherein the fusing the feature map after the first convolution operation, the feature map after the second convolution operation, and the feature map after the third convolution operation includes:

3. The underwater fish target detection method as claimed in any one of claims 1 to 2, wherein before the inputting of the image to be detected into the feature extraction model in the target detection model, the obtaining of the feature images of a plurality of different scales of the image to be detected from the feature images output from the respective layers of the feature extraction model, further comprises:

Preprocessing the image to be detected;

4. The method for detecting an underwater fish target as claimed in any one of claims 1 to 2, wherein the loss function of the target detection model is FocalLoss functions.

5. An underwater fish object detection apparatus, comprising:

Correspondingly, the acquisition module is specifically configured to:

the acquisition module is further configured to:

The device also comprises a fusion module, wherein the fusion module is specifically used for:

6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the underwater fish target detection method as claimed in any one of claims 1 to 4 when the program is executed.

7. A non-transitory computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of the underwater fish target detection method as claimed in any of claims 1 to 4.