CN117765421A - Coastline garbage identification method and system based on deep learning

Info

Publication number
CN117765421A
Authority
CN
China
Prior art keywords
image
detection target
module
garbage
feature
Prior art date
Legal status
Granted
Application number
CN202410197140.6A
Other languages
Chinese (zh)
Other versions
CN117765421B (en)
Inventor
Yu Xun (于迅)
Peng Shitao (彭士涛)
Hu Jianbo (胡健波)
Zhao Haoluan (赵浩栾)
Qi Zhaoyu (齐兆宇)
Xiao Ling (肖令)
Deng Mengtao (邓孟涛)
Ma Guoqiang (马国强)
Current Assignee
Tianjin Research Institute for Water Transport Engineering MOT
Original Assignee
Tianjin Research Institute for Water Transport Engineering MOT
Application filed by Tianjin Research Institute for Water Transport Engineering MOT
Priority to CN202410197140.6A
Publication of CN117765421A
Application granted
Publication of CN117765421B
Status: Active

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of image processing and discloses a coastline garbage identification method and system based on deep learning. The method comprises the following steps: acquiring an original image of a coastline; preprocessing the original image to construct a detection target image data set; training a YOLOv8n-IL network model with the detection target image data set to obtain an optimal weight file and a trained network model; and inputting the image to be detected into the trained YOLOv8n-IL network model to identify coastline garbage. The YOLOv8n-IL network model expands the receptive field and enhances the feature extraction capability. It can dynamically adjust the receptive field according to the different scales of the targets, process the background information differences required by targets of different scales more effectively, improve the detection precision of the network model on targets of different scales, and reduce the occurrence of missed detection and false detection.

Description

Coastline garbage identification method and system based on deep learning
Technical Field
The invention relates to the technical field of image processing, and in particular to a coastline garbage identification method and system based on deep learning.
Background
In recent years, coastal garbage, i.e., the portion of marine debris stranded on land, has increasingly polluted the environment of coastal cities. Coastal garbage comprises many waste types, among which plastic waste accounts for 60%-80% of the total. Plastic is durable and difficult to degrade completely, and since most plastics are less dense than seawater, they are easily carried back and forth between the ocean and the coastline by wind and tides, degrading into plastic fragments and microplastics in the process. Unlike large plastics, small-sized plastics often escape cleanup and can therefore persist and accumulate in coastal environments for many years. The continuously accumulating plastic waste not only poses a serious threat to the coastal ecological environment and human health, but also greatly affects the image and development of coastal cities. Therefore, in order to better explore the generation mechanism of plastic waste and evaluate the effect of waste cleanup activities, efficiently and accurately monitoring and quantifying plastic waste on artificial coastlines has become an important subject of study.
At present, the common method for monitoring plastic waste on coastlines is the on-site visual census. This method requires observers to perform a visual census on site and to evaluate the quantity, composition and distribution of plastic waste in the coastline environment from manually collected samples. Although highly accurate, the method is inefficient, has limited coverage, and requires significant human resources. Moreover, existing network models cannot distinguish brightly colored garbage from tourists when identifying garbage: in unmanned aerial vehicle (UAV) aerial images the features of people are weakened, and for tourists wearing clothing of similar color to the garbage or in a squatting posture, human features may not be observable at all. All these factors lead to serious missed detection and false detection in the garbage detection task.
Therefore, there is a need for a coastline garbage identification method based on deep learning that can enhance the feature extraction capability, dynamically adjust the receptive field according to the different scales of targets, process the background information differences required by targets of different scales more effectively, improve the detection precision of the network model on targets of different scales, and reduce the occurrence of missed detection and false detection.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a coastline garbage identification method and system based on deep learning, which can enhance the feature extraction capability, dynamically adjust the receptive field according to the different scales of targets, process the background information differences required by targets of different scales more effectively, improve the detection precision of the network model on targets of different scales, and reduce the occurrence of missed detection and false detection.
The invention provides a coastline garbage identification method based on deep learning, which comprises the following steps:
S1, acquiring an original image of a coastline; the original image comprises garbage, tourists and environment;
S2, preprocessing the original image to construct a detection target image data set; the detection targets include garbage and tourists, the garbage comprising garbage in a first size range and garbage in a second size range; wherein the lower limit of the second size range is greater than the upper limit of the first size range;
S3, training a YOLOv8n-IL network model with the detection target image data set to obtain an optimal weight file and a trained YOLOv8n-IL network model; the YOLOv8n-IL network model is based on the YOLOv8n network model, which comprises an input end, a backbone network, a Neck module and a detection head; the YOLOv8n-IL network model replaces the last two C2f modules in the backbone network with InceptionNeXt modules and embeds an LSK module in front of the detection head; the LSK module comprises a common convolution and an expansion convolution (dilated convolution), which have receptive fields of different sizes and are used for acquiring context information of different ranges; the optimal weight file comprises a plurality of groups of weight parameters, and the feature information of each detection target corresponds to one group of weight parameters;
S4, inputting the image to be detected into a trained YOLOv8n-IL network model to identify coastline garbage; the method specifically comprises the following steps:
S41, inputting an image to be detected into an input end of a trained YOLOv8n-IL network model;
S42, extracting features of each detection target in the image to be detected through a backbone network;
S43, carrying out feature fusion on the features of each detection target through a Neck module to obtain a fused feature image;
S44, the LSK module respectively performs feature extraction on all detection targets in the fused feature image through common convolution and expansion convolution, and selects corresponding feature extraction results to combine the fused feature image to obtain a final output feature image of the fused feature image;
S45, the detection head identifies each detection target according to the final output characteristic image to obtain the identification result of the image to be detected.
Further, in S3, the optimal weight file includes a plurality of sets of weight parameters, and the feature information of each detection target corresponds to a set of weight parameters including:
Taking a weight parameter corresponding to the characteristic information of the garbage with the detection target of the first size range as a first weight parameter;
Taking a weight parameter corresponding to the characteristic information of the garbage with the detection target being the second size range as a second weight parameter;
And taking the weight parameter corresponding to the characteristic information of which the detection target is the tourist as a third weight parameter.
Further, in S44, the step in which the LSK module performs feature extraction on all detection targets in the fused feature image through common convolution and expansion convolution respectively, and selects the corresponding feature extraction results to combine with the fused feature image to obtain the final output feature image of the fused feature image, includes:
S441, the LSK module performs feature extraction on all detection targets in the fused feature image through common convolution to obtain first feature images of all detection targets;
S442, the LSK module performs feature extraction on all detection targets in the fused feature image through expansion convolution to obtain second feature images of all detection targets;
S443, the LSK module selects the first feature image or the second feature image as the attention feature of the detection target according to the features of the detection target and the optimal weight file;
S444, the LSK module obtains a final output characteristic image of the fusion characteristic image according to the fusion characteristic image and the attention characteristics of all detection targets in the fusion characteristic image.
Further, in S443, the step in which the LSK module selects the first feature image or the second feature image as the attention feature of the detection target according to the features of the detection target and the optimal weight file includes:
S4431, the LSK module carries out weighted calculation on the first characteristic image and the optimal weight file, and obtains the characteristic information weights of all detection targets in the first characteristic image according to the characteristics of the detection targets;
S4432, the LSK module performs weighted calculation on the second feature image and the optimal weight file, and obtains the feature information weights of all detection targets in the second feature image according to the features of the detection targets;
S4433, determining a detection target whose feature information weight ratio in the first feature image is larger than a preset value, and selecting the first feature image as the attention feature of that detection target; wherein such detection targets comprise garbage in the first size range;
S4434, determining a detection target whose feature information weight ratio in the second feature image is larger than a preset value, and selecting the second feature image as the attention feature of that detection target; wherein such detection targets comprise garbage in the second size range and tourists.
Further, in S2, preprocessing the original image to construct a detection target image data set includes:
S21, performing image cutting processing on an original image to obtain a cut image containing a detection target and an environment image not containing the detection target;
S22, labeling the cut images, and marking the cut images containing detection targets as positive samples to obtain a detection target image data set;
S23, dividing the detection target image data set into a training set, a verification set and a test set according to a preset proportion, and carrying out data enhancement processing on cut images in the training set;
S24, marking the environment images that do not contain detection targets as negative samples, and adding the negative samples into the data-enhanced training set and verification set to obtain the final training set and final verification set.
Further, before S4, inputting the image to be detected into the trained YOLOv8n-IL network model to identify coastline garbage, the method further includes:
inputting the acquired images into a SAHI module for image cutting to obtain a plurality of images to be detected.
Further, after S4, inputting the image to be detected into the trained YOLOv8n-IL network model to identify coastline garbage, the method further includes:
S5, the SAHI module performs image stitching on the identification results of the images to be detected to obtain a coastline garbage identification result.
Further, after S45, the method further includes:
The trained YOLOv8n-IL network model scores and outputs the confidence of the identification frame of each detection target on each image to be detected according to the output identification result of each image to be detected.
Further, in S5, the step in which the SAHI module performs image stitching on the identification results of the images to be detected to obtain the coastline garbage identification result includes:
S51, the SAHI module performs image stitching on the identification result of each image to be detected to obtain a stitched identification result graph;
S52, the SAHI module performs non-maximum suppression post-processing on the stitched identification result graph, keeps only the identification frame with the highest confidence score among a plurality of identification frames on the same detection target, and removes the other identification frames on that detection target to obtain the coastline garbage identification result.
The invention also provides a coastline garbage identification system based on deep learning for executing the above coastline garbage identification method based on deep learning, the system comprising:
The image acquisition module is used for acquiring an original image of the coastline; the original image comprises garbage, tourists and environment;
The image processing module is connected with the image acquisition module and is used for preprocessing the original image to construct a detection target image data set; the detection targets include garbage and tourists, the garbage comprising garbage in a first size range and garbage in a second size range; wherein the lower limit of the second size range is greater than the upper limit of the first size range;
The network model building module is connected with the image processing module and is used for training the YOLOv8n-IL network model with the detection target image data set to obtain an optimal weight file and a trained YOLOv8n-IL network model; the YOLOv8n-IL network model is based on the YOLOv8n network model, which comprises an input end, a backbone network, a Neck module and a detection head; the YOLOv8n-IL network model replaces the last two C2f modules in the backbone network with InceptionNeXt modules and embeds an LSK module in front of the detection head; the LSK module comprises a common convolution and an expansion convolution, which have receptive fields of different sizes and are used for acquiring context information of different ranges; the optimal weight file comprises a plurality of groups of weight parameters, and the feature information of each detection target corresponds to one group of weight parameters;
The image recognition module is connected with the network model building module and is used for inputting the image to be detected into the trained YOLOv8n-IL network model to identify coastline garbage. Specifically: the image to be detected is input into the input end of the trained YOLOv8n-IL network model; features of each detection target in the image to be detected are extracted through the backbone network; feature fusion is then performed on the features of each detection target through the Neck module to obtain a fused feature image; the LSK module performs feature extraction on all detection targets in the fused feature image through common convolution and expansion convolution respectively, and selects the corresponding feature extraction results to combine with the fused feature image to obtain the final output characteristic image of the fused feature image; and the detection head identifies each detection target according to the final output characteristic image to obtain the identification result of the image to be detected.
The embodiment of the invention has the following technical effects:
The invention provides an improved network model based on the YOLOv8n network model: the last two C2f modules in the backbone network are replaced with InceptionNeXt modules, an LSK module is embedded in front of the detection head, and the YOLOv8n-IL network model is built. Tourists and garbage are both taken as detection targets. The LSK module dynamically adjusts the receptive field according to the different detection targets, processes the background information differences required by targets of different scales more effectively, and improves the detection precision of the network model on targets of different scales. Taking tourists as detection targets reduces the influence of tourists on garbage identification, thereby reducing the occurrence of missed detection and false detection.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a coastline garbage identification method based on deep learning provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of a YOLOv8n network model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a YOLOv8n-IL network model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an InceptionNeXt module in a YOLOv8n-IL network model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an LSK module in a YOLOv8n-IL network model according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a coastline garbage identification system based on deep learning according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the invention, are within the scope of the invention.
FIG. 1 is a flowchart of a coastline garbage identification method based on deep learning according to an embodiment of the present invention. Referring to FIG. 1, the method specifically includes:
S1, acquiring an original image of a coastline.
Specifically, the coastline is photographed and sampled by an unmanned aerial vehicle to obtain the original image of the coastline. The original image comprises garbage, tourists and the environment.
Illustratively, a drone equipped with a high-resolution (4K) camera and an integrated Global Positioning System (GPS), such as a DJI Phantom 4 Pro V2, may be selected. When photographing and sampling the coastline, the flying height of the drone may be set to 30 m and the gimbal angle to -90°, so that photos perpendicular to the ground are acquired.
S2, preprocessing the original image to construct a detection target image data set.
Specifically, the preprocessing comprises image cutting and labeling; the detection targets include garbage and tourists. The garbage may be further subdivided into garbage in a first size range and garbage in a second size range, wherein the lower limit of the second size range is greater than the upper limit of the first size range. The first and second size ranges may be set according to actual conditions. Illustratively, smaller-sized garbage such as plastic bottles and plastic bags may be classified as first-size-range garbage, and larger-sized garbage such as fishing nets and wood may be classified as second-size-range garbage.
S21, performing image cutting processing on the original image to obtain a cut image containing the detection target and an environment image not containing the detection target.
Specifically, the original image is subjected to image cutting, and the resulting cut images containing detection targets and environment images not containing detection targets may have a size of 640×640 pixels. The size of the cut image is kept consistent with the input size of the network model: the larger the input size, the more parameters must be computed and the slower the calculation, while an undersized input leads to insufficient recognition accuracy, so the input size must be chosen according to actual requirements. An input size of 640×640 pixels reduces the computational complexity as much as possible while ensuring recognition accuracy, improving the recognition performance of the network model. Different detection targets can be marked with identification frames of different colors; the garbage can be further divided into plastic garbage, wood, fishing nets and the like, wherein plastic garbage includes plastic products such as plastic bottles, plastic bags, plastic shoe covers and plastic fragments.
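As an illustration of the image cutting step, the following is a minimal Python sketch (file names and the output layout are hypothetical, not from the patent) that cuts an aerial photo into 640×640-pixel crops:

```python
from pathlib import Path
from PIL import Image

def cut_image(path: str, tile: int = 640, out_dir: str = "tiles") -> None:
    """Cut an aerial photo into tile x tile crops; PIL pads crops that
    extend past the image border with black pixels."""
    img = Image.open(path)
    w, h = img.size
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            crop = img.crop((left, top, left + tile, top + tile))
            crop.save(Path(out_dir) / f"{Path(path).stem}_{top}_{left}.png")

cut_image("DJI_0001.JPG")  # hypothetical input file name
```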
In some embodiments, since the detection targets include tourists, persons can be accurately identified, so the method can also be adapted to coastal search and rescue work and the like.
S22, labeling the cut images and marking those containing detection targets as positive samples to obtain the detection target image data set.
For example, a labelimg labeling tool may be used to label the detection target in the cut image.
S23, dividing the detection target image data set into a training set, a verification set and a test set according to a preset proportion, and carrying out data enhancement processing on the cut images in the training set.
Illustratively, the preset ratio may be set as desired, e.g., 8:1:1 or 7:1.5:1.5. Taking a preset ratio of 8:1:1 as an example, the detection target image data set is divided into a training set, a verification set and a test set in the ratio 8:1:1. In order to increase the number of pictures in the training set, data enhancement processing is performed on the cut images in the training set; the data enhancement processing may include horizontal and vertical flipping of the cut images.
S24, marking the environment images that do not contain detection targets as negative samples, and adding the negative samples to the data-enhanced training set and verification set to obtain the final training set and final verification set.
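The data-preparation steps S23-S24 could be sketched as follows; the 8:1:1 split and the flip-based enhancement follow the text above, while the function names and directory layout are illustrative assumptions:

```python
import random
from pathlib import Path
from PIL import Image, ImageOps

def split_dataset(images: list, ratios=(0.8, 0.1, 0.1), seed: int = 0):
    """Shuffle and divide the labelled crops at the preset 8:1:1 ratio."""
    imgs = list(images)
    random.Random(seed).shuffle(imgs)
    n = len(imgs)
    n_train, n_val = int(n * ratios[0]), int(n * ratios[1])
    return imgs[:n_train], imgs[n_train:n_train + n_val], imgs[n_train + n_val:]

def augment(path: Path) -> None:
    """Enlarge the training set with horizontal and vertical flips."""
    img = Image.open(path)
    ImageOps.mirror(img).save(path.with_stem(path.stem + "_hflip"))
    ImageOps.flip(img).save(path.with_stem(path.stem + "_vflip"))

train, val, test = split_dataset(sorted(Path("tiles").glob("*.png")))
for p in train:
    augment(p)
# Negative (background-only) samples are then appended to train and val.
```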
S3, training the YOLOv8n-IL network model with the detection target image data set to obtain an optimal weight file and a trained YOLOv8n-IL network model.
Specifically, FIG. 2 is a schematic diagram of the YOLOv8n network model provided by an embodiment of the present invention, FIG. 3 is a schematic diagram of the YOLOv8n-IL network model, FIG. 4 is a schematic diagram of the InceptionNeXt module in the YOLOv8n-IL network model, and FIG. 5 is a schematic diagram of the LSK module in the YOLOv8n-IL network model. Referring to FIGS. 2-5, the YOLOv8n-IL network model is based on the YOLOv8n network model, which comprises an input end, a backbone network, a Neck module and a detection head; the YOLOv8n-IL network model replaces the last two C2f modules in the backbone network with InceptionNeXt modules and embeds an LSK module in front of the detection head. The LSK module comprises a common convolution and an expansion convolution.
Specifically, the YOLOv8n-IL network model is trained on the training set; the hyperparameters of the YOLOv8n-IL network model are adjusted and the model performance is evaluated on the verification set; and the trained YOLOv8n-IL network model is tested on the test set to evaluate its generalization ability. The weight file is iteratively optimized through continuous training of the YOLOv8n-IL network model until the optimal weight file is obtained, at which point the training of the YOLOv8n-IL network model is complete.
Furthermore, after the YOLOv8n-IL network model is built on any device, the trained optimal weight file can be imported into it to obtain a trained YOLOv8n-IL network model without retraining.
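A hedged sketch of this training and weight-reuse workflow, assuming the Ultralytics YOLO API and hypothetical file names ("yolov8n-IL.yaml" describing the modified architecture, "coastline.yaml" registering the data splits):

```python
from ultralytics import YOLO

# Train the modified model; "yolov8n-IL.yaml" is a hypothetical model file
# describing the InceptionNeXt / LSK changes to the stock YOLOv8n architecture.
model = YOLO("yolov8n-IL.yaml")
model.train(data="coastline.yaml", imgsz=640, epochs=300, batch=16)

# On any other device, importing the optimal weight file yields a trained
# model without retraining (Ultralytics saves it as best.pt by default):
trained = YOLO("runs/detect/train/weights/best.pt")
metrics = trained.val()                       # evaluate generalization ability
preds = trained.predict("tile.png", conf=0.25)  # hypothetical test image
```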
Specifically, the optimal weight file includes a plurality of sets of weight parameters, and the feature information of each detection target corresponds to one set of weight parameters. Further, the weight parameters corresponding to the feature information of garbage in the first size range are taken as the first weight parameters; the weight parameters corresponding to the feature information of garbage in the second size range are taken as the second weight parameters; and the weight parameters corresponding to the feature information of tourists are taken as the third weight parameters.
The structure of the YOLOv8n network model is shown in FIG. 2. The input end mainly performs mosaic data enhancement, adaptive anchor frame calculation and adaptive gray filling. The backbone network consists of Conv modules, C2f modules and an SPPF module. The Conv module is a convolution layer used to extract different features in the image, such as edges, corner points and textures. The C2f module is the main module for learning residual features; through more cross-layer branch connections it enriches the gradient flow of the model and improves the feature representation capability. The SPPF module is a spatial pyramid pooling module that converts feature maps of different scales into feature vectors of a fixed size. The Neck module uses a PAN-FPN structure for feature fusion, which enhances the network's ability to fuse features of objects at different scales and strengthens the fusion and utilization of information from feature layers of different scales. The output end decouples the classification and detection processes and mainly performs loss calculation and target detection frame screening. The loss calculation process mainly comprises a positive and negative sample assignment strategy and the loss calculation itself, which covers two branches: classification and regression. The classification branch uses the BCE Loss, and the regression branch uses the Distribution Focal Loss and the CIoU (complete intersection over union) loss functions.
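As a hedged numerical illustration of the two loss branches named above (the DFL term is omitted for brevity), assuming torchvision's standard complete_box_iou_loss for the CIoU term; the tensors are illustrative, not from the patent:

```python
import torch
import torch.nn as nn
from torchvision.ops import complete_box_iou_loss

cls_loss_fn = nn.BCEWithLogitsLoss()  # classification branch (BCE Loss)

pred_logits = torch.tensor([[2.1, -1.0, 0.3]])   # raw scores for 3 classes
target = torch.tensor([[1.0, 0.0, 0.0]])         # one-hot class target
cls_loss = cls_loss_fn(pred_logits, target)

pred_boxes = torch.tensor([[10.0, 10.0, 50.0, 52.0]])    # xyxy prediction
target_boxes = torch.tensor([[12.0, 11.0, 50.0, 50.0]])  # xyxy ground truth
box_loss = complete_box_iou_loss(pred_boxes, target_boxes, reduction="mean")

print(cls_loss.item(), box_loss.item())
```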
Further, with continued reference to FIG. 4, the last two C2f modules in the backbone network of the YOLOv8n network model are replaced with InceptionNeXt modules to enhance the model's ability to extract high-level features. By decomposing the large convolution kernel, the module not only expands the receptive field and enhances the feature extraction capability, but also avoids the speed penalty incurred by a large convolution kernel. If the receptive field is too large, detail information of parts of the target is lost; conversely, if it is too small, the context information of the target is reduced.
The InceptionNeXt module fuses the ideas of ConvNeXt and Inception-style multi-scale feature extraction, decomposing the 7×7 DWConv in ConvNeXt into four efficient parallel branches along the channel dimension: a 3×3 small square kernel, two orthogonal band kernels (a 1×11 kernel and an 11×1 kernel), and an identity mapping.
For an input image X with C channels, it is first split into four parallel branches along the channel dimension:

$$X_{hw},\; X_w,\; X_h,\; X_{id} = \mathrm{Split}(X) = X_{:,:g},\; X_{:,g:2g},\; X_{:,2g:3g},\; X_{:,3g:}$$

where Split denotes the segmentation operation that divides the input image X into four groups X_hw, X_w, X_h and X_id, each containing a portion of the channel information; g = C/4 is the number of channels in each group; X_{:,:g} denotes the first g channels of the input image X, X_{:,g:2g} the g-th to 2g-th channels, X_{:,2g:3g} the 2g-th to 3g-th channels, and X_{:,3g:} the 3g-th to last channels.
Illustratively, assume the input image X has shape [1, 64, 224, 224], where 1 is the batch size (the number of samples used per parameter update when training the neural network), 64 is the number of channels, the first 224 is the height and the second 224 is the width. According to the Split function, X is divided into four groups X_hw, X_w, X_h and X_id; with four groups, each group has 64/4 = 16 channels. The specific grouping is: X_hw takes the first 16 channels of X, X_w the next 16 channels, X_h the next 16 channels, and X_id the remaining 16 channels, each with shape [1, 16, 224, 224].
These four features are then passed through four different operators in parallel:

$$X'_{hw} = \mathrm{DWConv}_{k_s \times k_s}(X_{hw}), \quad X'_w = \mathrm{DWConv}_{1 \times k_b}(X_w), \quad X'_h = \mathrm{DWConv}_{k_b \times 1}(X_h), \quad X'_{id} = X_{id}$$

where DWConv denotes depth-wise convolution, which is computationally more efficient than conventional convolution; X'_hw, X'_w, X'_h and X'_id are the results of the parallel branch processing of X_hw, X_w, X_h and X_id respectively; k_s is the size of the small square kernel (default 3) and k_b is the size of the orthogonal band kernels (default 11). Finally, the outputs of the branches are spliced together:

$$X' = \mathrm{Concat}(X'_{hw},\; X'_w,\; X'_h,\; X'_{id})$$

where X' is the output feature map obtained by splicing the four parts along the channel dimension.
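A minimal PyTorch sketch of such a four-branch token mixer, following the equal C/4 channel split described above (the patent's exact module may differ in details such as the branch ratio):

```python
import torch
import torch.nn as nn

class InceptionDWConv2d(nn.Module):
    """Split the input along the channel dimension into four equal groups:
    a small square depth-wise kernel, two orthogonal band kernels, and an
    identity branch, then concatenate the results back together."""

    def __init__(self, channels: int, square_kernel: int = 3, band_kernel: int = 11):
        super().__init__()
        g = channels // 4  # channels per branch (g = C/4 as in the text)
        self.g = g
        self.dwconv_hw = nn.Conv2d(g, g, square_kernel,
                                   padding=square_kernel // 2, groups=g)
        self.dwconv_w = nn.Conv2d(g, g, (1, band_kernel),
                                  padding=(0, band_kernel // 2), groups=g)
        self.dwconv_h = nn.Conv2d(g, g, (band_kernel, 1),
                                  padding=(band_kernel // 2, 0), groups=g)
        # The fourth branch is an identity mapping, so no layer is needed.

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.g
        x_hw, x_w, x_h, x_id = torch.split(x, [g, g, g, x.size(1) - 3 * g], dim=1)
        return torch.cat((self.dwconv_hw(x_hw), self.dwconv_w(x_w),
                          self.dwconv_h(x_h), x_id), dim=1)

# For X of shape [1, 64, 224, 224], each branch receives 16 channels.
x = torch.randn(1, 64, 224, 224)
print(InceptionDWConv2d(64)(x).shape)  # torch.Size([1, 64, 224, 224])
```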
Further, with continued reference to FIG. 5, an LSK module is embedded in front of the detection head. The LSK module can dynamically adjust the receptive field according to the different scales of the targets, so as to process the background information differences required by targets of different scales more effectively.
The structure of the LSK module is shown in FIG. 5. The common convolution and the expansion convolution have receptive fields of different sizes and can obtain context information of different ranges; combining them achieves the effect of a large-kernel convolution, so that the network obtains a very large receptive field during feature extraction and thus rich context information. The context information, i.e., the environmental information around a detection target, helps determine the category to which the target belongs. For example, when a piece of clothing is identified: if its context is the coast, the clothing is garbage; if its context is a tourist, the clothing is the tourist's wear, not garbage. A module selection network is built from ordinary two-dimensional convolution and a Sigmoid function, so that the network can dynamically select convolutions with different receptive fields for different detection targets.
To obtain a larger receptive field, the effect of a large kernel is decoupled into a series of depth-wise convolutions with growing dilation. Specifically, assuming the i-th depth-wise convolution kernel in the sequence has size k_i and dilation rate d_i, its receptive field RF_i satisfies:

$$RF_1 = k_1, \qquad RF_i = d_i\,(k_i - 1) + RF_{i-1}$$

where i indexes the convolution kernel, k_i is its size, d_i its dilation rate and RF_i its receptive field. For example, a 5×5 depth-wise convolution followed by a 7×7 depth-wise convolution with dilation 3 yields a receptive field of 3×(7-1)+5 = 23.
In order to obtain rich background information features from different areas of the input data, a series of decoupled depth-wise convolution kernels with different receptive fields is employed, where the features are expressed as:

$$U_0 = X'', \qquad U_{i+1} = \mathcal{F}_i^{dw}(U_i)$$

where X'' denotes the input, here the fused feature image produced by the Neck module; \mathcal{F}_i^{dw}(\cdot) denotes the (i+1)-th depth-wise convolution (DWConv) operation on the feature map; and U_{i+1} is the feature obtained after the (i+1)-th convolution kernel. Since there are two DWConv operations in total in this embodiment, there are three features in total: U_0, U_1, U_2.
Assuming N decoupled convolution kernels, each convolution operation is followed by a 1×1 convolution layer \mathcal{F}^{1\times1}(\cdot) that performs channel fusion of the spatial feature vectors:

$$\widetilde{U}_i = \mathcal{F}^{1\times1}(U_i)$$

where \mathcal{F}^{1\times1}(\cdot) denotes a convolution layer that performs a 1×1 convolution on the feature map, and \widetilde{U}_i is the feature resulting from the i-th convolution kernel. The sizes of the extracted features vary because the receptive fields vary from kernel to kernel.
In order to make the model pay more attention to the important background information of the detection targets in space, a spatial selection mechanism performs spatial selection on the feature maps from the large kernels of different scales. First, the features from the kernels with different receptive fields are concatenated:

$$\widetilde{U} = [\widetilde{U}_1; \cdots; \widetilde{U}_N]$$

where the concatenation fuses features with different receptive fields to obtain richer context information and multi-scale feature representations. Such fusion helps the network better understand the detection targets in the image and improves their detection and recognition. By connecting the features of different receptive fields, the network can attend to local details and the global context at the same time, improving its understanding and judgment of the detection targets.
Channel-level average pooling P_avg(·) and maximum pooling P_max(·) are then applied to extract the spatial relationship:

$$SA_{avg} = P_{avg}(\widetilde{U}), \qquad SA_{max} = P_{max}(\widetilde{U})$$

where SA_avg and SA_max are the spatial feature descriptors after mean pooling and maximum pooling. To achieve information interaction between the different spatial descriptors, a convolution layer \mathcal{F}^{2 \to N}(\cdot) splices the spatially pooled features and converts the 2-channel pooled features into N spatial attention feature maps, where N equals the number of decoupled convolution kernels (DWConv operations):

$$\widehat{SA} = \mathcal{F}^{2 \to N}([SA_{avg}; SA_{max}])$$

Here 2→N indicates that the function \mathcal{F} takes the two descriptors SA_avg and SA_max as input and outputs N spatial attention feature maps.
Thereafter, a Sigmoid activation function σ(·) is applied to each spatial attention feature map, giving an independent spatial selection mask for each decoupled large convolution kernel:

$$\widetilde{SA}_i = \sigma(\widehat{SA}_i)$$

where \widehat{SA}_i denotes the i-th spatial attention feature map and \widetilde{SA}_i the independent spatial selection mask corresponding to the i-th decoupled large convolution kernel.
Then, the features of the decoupled large-kernel sequence are weighted by their corresponding spatial selection masks and fused through a convolution layer \mathcal{F}(\cdot) to obtain the attention feature S:

$$S = \mathcal{F}\left(\sum_{i=1}^{N} \widetilde{SA}_i \cdot \widetilde{U}_i\right)$$
The final output is obtained as the element-wise product of the input feature X'' and the attention feature S:

$$Y = X'' \cdot S$$

where Y is the final output feature image of the LSK module. The element-wise product strengthens the correlation between features and highlights important feature information.
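A minimal PyTorch sketch of the LSK block, following the public LSKNet reference design with two decoupled depth-wise kernels: a 5×5 common convolution (receptive field 5) followed by a 7×7 convolution with dilation 3 (combined receptive field 3×(7-1)+5 = 23); the patent's exact kernel configuration is an assumption here:

```python
import torch
import torch.nn as nn

class LSKBlock(nn.Module):
    """Spatial selection over two decoupled depth-wise kernels."""

    def __init__(self, dim: int):
        super().__init__()
        self.conv0 = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)  # common DWConv
        self.conv_spatial = nn.Conv2d(dim, dim, 7, padding=9,
                                      groups=dim, dilation=3)       # dilated DWConv
        self.conv1 = nn.Conv2d(dim, dim // 2, 1)   # 1x1 channel fusion, branch 1
        self.conv2 = nn.Conv2d(dim, dim // 2, 1)   # 1x1 channel fusion, branch 2
        self.conv_squeeze = nn.Conv2d(2, 2, 7, padding=3)  # F^{2->N}, here N = 2
        self.conv = nn.Conv2d(dim // 2, dim, 1)    # final fusion layer F(.)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn1 = self.conv0(x)                  # U_1: common convolution features
        attn2 = self.conv_spatial(attn1)       # U_2: dilated convolution features
        attn1 = self.conv1(attn1)              # ~U_1
        attn2 = self.conv2(attn2)              # ~U_2
        attn = torch.cat([attn1, attn2], dim=1)            # [~U_1; ~U_2]
        avg_attn = torch.mean(attn, dim=1, keepdim=True)   # SA_avg
        max_attn, _ = torch.max(attn, dim=1, keepdim=True) # SA_max
        agg = torch.cat([avg_attn, max_attn], dim=1)
        sig = self.conv_squeeze(agg).sigmoid()             # spatial selection masks
        attn = attn1 * sig[:, 0:1] + attn2 * sig[:, 1:2]   # mask-weighted selection
        attn = self.conv(attn)                             # attention feature S
        return x * attn                                    # Y = X'' . S

x = torch.randn(1, 64, 80, 80)
print(LSKBlock(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```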
Further, before S4, the acquired images are input into a SAHI module for image cutting to obtain a plurality of images to be detected.
Specifically, the SAHI module slices each acquired image into a plurality of 640×640-pixel images to be detected. Since adjacent slices may overlap, the same detection target may appear in several images to be detected.
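A hedged sketch of this slicing step using the open-source SAHI library (assuming a SAHI version with YOLOv8 support; file names and thresholds are illustrative):

```python
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

# Wrap the trained weights; "best.pt" is the optimal weight file from S3.
detection_model = AutoDetectionModel.from_pretrained(
    model_type="yolov8",
    model_path="best.pt",
    confidence_threshold=0.25,
)

# Slice size matches the 640x640 network input; the overlap ratios produce
# the repeated cutting regions described above. SAHI stitches the per-slice
# results back into full-image coordinates and merges duplicates.
result = get_sliced_prediction(
    "coastline_scene.jpg",
    detection_model,
    slice_height=640,
    slice_width=640,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)
result.export_visuals(export_dir="output/")
```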
S4, inputting the image to be detected into a trained YOLOv8n-IL network model to identify coastline garbage.
S41, inputting the image to be detected into the input end of the trained YOLOv8n-IL network model.
S42, extracting features of each detection target in the image to be detected through the backbone network.
S43, carrying out feature fusion on the features of each detection target through the Neck module to obtain a fused feature image.
Specifically, feature fusion through the Neck module makes the feature information of each detection target richer.
S44, the LSK module performs feature extraction on all detection targets in the fused feature image through common convolution and expansion convolution respectively, and selects the corresponding feature extraction results to combine with the fused feature image to obtain the final output feature image of the fused feature image.
Specifically, the LSK module performs feature extraction on all detection targets in the fused feature image through common convolution and expansion convolution, processes the extraction results through a spatial attention mechanism, fusion operations and the like to emphasize the spatial position and spatial size of each detection target, and then selects the corresponding feature extraction results to combine with the fused feature image to obtain the final output feature image of the fused feature image.
S441, the LSK module performs feature extraction on all detection targets in the fused feature image through common convolution to obtain first feature images of all detection targets.
S442, the LSK module performs feature extraction on all detection targets in the fused feature images through expansion convolution to obtain second feature images of all detection targets.
S443, the LSK module selects the first characteristic image or the second characteristic image as the attention characteristic of the detection target according to the characteristic of the detection target and the optimal weight file.
Specifically, the LSK module performs weighted calculation on the first feature image and the second feature image with the first, second and third weight parameters in the optimal weight file respectively, to obtain the feature information weights of all detection targets in the first feature image and the feature information weights of all detection targets in the second feature image.
S4431, the LSK module performs weighted calculation on the first feature image and the optimal weight file, and obtains the feature information weights of all detection targets in the first feature image according to the features of the detection targets.
S4432, the LSK module performs weighted calculation on the second feature image and the optimal weight file, and obtains the feature information weights of all detection targets in the second feature image according to the features of the detection targets.
S4433, determining the detection targets whose feature information weight ratio in the first feature image is larger than a preset value, and selecting the first feature image as the attention feature of those detection targets; such detection targets comprise garbage in the first size range.
Specifically, since garbage in the first size range is small in size, common convolution can accurately extract both the detection target and its context information, so the first feature image is used as the attention feature of garbage in the first size range.
S4434, determining the detection targets whose feature information weight ratio in the second feature image is larger than a preset value, and selecting the second feature image as the attention feature of those detection targets; such detection targets comprise garbage in the second size range and tourists.
Specifically, since garbage in the second size range and tourists are large in size, expansion convolution is required to accurately extract the detection target and its context information, so the second feature image is used as the attention feature of garbage in the second size range and tourists.
S444, the LSK module obtains a final output characteristic image of the fusion characteristic image according to the fusion characteristic image and the attention characteristics of all detection targets in the fusion characteristic image.
S45, the detection head identifies each detection target according to the final output characteristic image to obtain the identification result of the image to be detected.
Further, the trained YOLOv8n-IL network model scores and outputs the confidence of the identification frame of each detection target on each image to be detected according to the identification result of each image to be detected.
Specifically, according to the identification result of each identification frame, the trained YOLOv8n-IL network model compares the pixels of the detection target in the identification frame with the pixels of a detection target whose real label matches the identification result, and scores the confidence according to the difference.
Illustratively, when the identification result of the identification frame of a detection target is garbage, the trained YOLOv8n-IL network model compares the pixels of the detection target in the identification frame with the pixels of a detection target labeled as garbage, and scores the confidence according to the difference.
S5, the SAHI module performs image stitching on the identification result of each image to be detected to obtain a coastline garbage identification result.
S51, the SAHI module performs image stitching on the identification result of each image to be detected to obtain a stitched identification result graph.
Specifically, because the images to be detected may have overlapping areas, that is, the same detection target appears in a plurality of images to be detected, the SAHI module performs image stitching on the recognition result of each image to be detected, and a plurality of recognition frames may be formed on the same detection target in the stitched recognition result diagram.
S52, the SAHI module performs non-maximum suppression post-processing on the stitched identification result graph, keeps only the identification frame with the highest confidence score among a plurality of identification frames on the same detection target, and removes the other identification frames on that detection target to obtain the coastline garbage identification result.
Specifically, the SAHI module performs Non-Max Suppression (NMS) post-processing on the stitched identification result graph, keeping only the identification frame with the highest confidence score among the several identification frames on the same detection target and removing the others, so as to obtain the final coastline garbage identification result.
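A minimal sketch of this NMS step on the stitched results, using torchvision's standard nms operator (coordinates and threshold are illustrative):

```python
import torch
from torchvision.ops import nms

def merge_detections(boxes: torch.Tensor, scores: torch.Tensor,
                     iou_thr: float = 0.5) -> torch.Tensor:
    """boxes: (N, 4) xyxy in stitched-image coordinates, scores: (N,).
    Keeps only the highest-confidence frame among overlapping duplicates
    produced by adjacent slices; returns the indices of the kept boxes."""
    return nms(boxes, scores, iou_thr)

boxes = torch.tensor([[10.0, 10.0, 50.0, 50.0],    # two detections of the same
                      [12.0, 11.0, 51.0, 52.0],    # target from overlapping slices
                      [200.0, 80.0, 260.0, 140.0]])
scores = torch.tensor([0.91, 0.78, 0.88])
print(merge_detections(boxes, scores))  # tensor([0, 2]): lower-scored duplicate removed
```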
In the embodiment of the invention, an improved network model based on the YOLOv8n network model is provided: the last two C2f modules in the backbone network are replaced with InceptionNeXt modules, an LSK module is embedded in front of the detection head, and the YOLOv8n-IL network model is built. Tourists and garbage are both taken as detection targets. The LSK module dynamically adjusts the receptive field according to the different detection targets and processes the background information differences required by targets of different scales more effectively, improving the detection precision of the network model on targets of different scales. Taking tourists as detection targets reduces the influence of tourists on garbage identification, thereby reducing missed detection and false detection.
FIG. 6 is a schematic structural diagram of a coastline garbage identification system based on deep learning according to an embodiment of the present invention. The system is used to execute the coastline garbage identification method based on deep learning of any one of the foregoing embodiments. As shown in FIG. 6, the system includes:
And the image acquisition module is used for acquiring an original image of the coastline.
Specifically, the image acquisition module may include an unmanned aerial vehicle, and is configured to take a photograph of the coastline and sample the same, so as to obtain an original image of the coastline. The original image comprises garbage, tourists and environment.
The image processing module is connected with the image acquisition module and is used for preprocessing the original image to construct a detection target image data set.
Specifically, the detection targets comprise garbage and tourists, and the garbage comprises garbage in a first size range and garbage in a second size range, wherein the lower limit of the second size range is greater than the upper limit of the first size range; different detection targets are marked with identification frames of different colors.
The network model building module is connected with the image processing module and is used for training the YOLOv8n-IL network model by utilizing the detection target image data set to obtain an optimal weight file and a trained YOLOv8n-IL network model.
Specifically, the YOLOv8n-IL network model is based on the YOLOv8n network model, which comprises an input end, a backbone network, a Neck module and a detection head; the YOLOv8n-IL network model replaces the last two C2f modules in the backbone network with InceptionNeXt modules and embeds an LSK module in front of the detection head. The LSK module comprises a common convolution and an expansion convolution, which have receptive fields of different sizes and are used for acquiring context information of different ranges. The optimal weight file comprises a plurality of groups of weight parameters, and the feature information of each detection target corresponds to one group of weight parameters.
The image recognition module is connected with the network model building module and is used for inputting an image to be detected into the trained YOLOv8n-IL network model to recognize coastline garbage.
Specifically: the image to be detected is input into the input end of the trained YOLOv8n-IL network model; features of each detection target in the image to be detected are extracted through the backbone network; feature fusion is then performed on the features of each detection target through the Neck module to obtain a fused feature image; the LSK module performs feature extraction on all detection targets in the fused feature image through common convolution and expansion convolution respectively, and selects the corresponding feature extraction results to combine with the fused feature image to obtain the final output characteristic image of the fused feature image; and the detection head identifies each detection target according to the final output characteristic image to obtain the identification result of the image to be detected.
In the embodiment of the invention, the network model building module replaces the last two C2f modules in the backbone network with InceptionNeXt modules on the basis of the YOLOv8n network model and embeds an LSK module in front of the detection head to build the YOLOv8n-IL network model, which enlarges the receptive field of the network model and enhances its feature extraction capability; the LSK module can dynamically adjust the receptive field according to the different scales of the targets, so that the background information differences required by targets of different scales are processed more effectively and the detection precision of the network model on targets of different scales is improved. The image processing module builds the training set with tourists and garbage both as detection targets; taking tourists as detection targets reduces the influence of tourists on garbage identification, thereby reducing the occurrence of missed detection and false detection.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of the present application. As used in this specification, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method or apparatus comprising that element.
It should also be noted that the positional or positional relationship indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the positional or positional relationship shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or element in question must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Unless specifically stated or limited otherwise, the terms "mounted," "connected," and the like are to be construed broadly and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the essence of the corresponding technical solutions from the technical solutions of the embodiments of the present invention.

Claims (10)

1. The coastline garbage identification method based on deep learning is characterized by comprising the following steps:
S1, acquiring an original image of a coastline; the original image comprises garbage, tourists and environment;
S2, preprocessing the original image to construct a detection target image data set; the detection targets include: garbage and tourists, the garbage comprising garbage in a first size range and garbage in a second size range; wherein the lower limit of the second size range is greater than the upper limit of the first size range;
S3, training the YOLOv8n-IL network model by utilizing the detection target image data set to obtain an optimal weight file and a trained YOLOv8n-IL network model; the YOLOv8n-IL network model is based on a YOLOv8n network model, the YOLOv8n network model comprises an input end, a backbone network, a Neck module and a detection head, the YOLOv8n-IL network model replaces the last two C2f modules in the backbone network with an InceptionNeXt module, and an LSK module is embedded in front of the detection head; the LSK module comprises a common convolution and an expansion convolution, wherein the common convolution and the expansion convolution have receptive fields of different sizes and are used for acquiring context information in different ranges; the optimal weight file comprises a plurality of groups of weight parameters, and the characteristic information of each detection target corresponds to one group of weight parameters;
S4, inputting the image to be detected into a trained YOLOv8n-IL network model to identify coastline garbage; the method specifically comprises the following steps:
S41, inputting an image to be detected into an input end of a trained YOLOv8n-IL network model;
S42, extracting features of each detection target in the image to be detected through a backbone network;
S43, carrying out feature fusion on the features of each detection target through a Neck module to obtain a fused feature image;
S44, the LSK module respectively performs feature extraction on all detection targets in the fused feature image through common convolution and expansion convolution, and selects corresponding feature extraction results to combine with the fused feature image to obtain a final output feature image of the fused feature image;
S45, the detection head identifies each detection target according to the final output characteristic image to obtain an identification result of the image to be detected.
2. The coastline garbage identification method based on deep learning according to claim 1, wherein in S3, that the optimal weight file comprises a plurality of groups of weight parameters, with the characteristic information of each detection target corresponding to one group of weight parameters, comprises:
taking the weight parameter corresponding to the characteristic information of a detection target that is garbage in the first size range as a first weight parameter;
taking the weight parameter corresponding to the characteristic information of a detection target that is garbage in the second size range as a second weight parameter;
and taking the weight parameter corresponding to the characteristic information of a detection target that is a tourist as a third weight parameter.
3. The coastline garbage identification method based on deep learning according to claim 2, wherein in S44, the LSK module performing feature extraction on all detection targets in the fused feature image through the ordinary convolution and the dilated convolution respectively, and selecting the corresponding feature extraction results to combine with the fused feature image to obtain a final output feature image of the fused feature image, comprises:
S441, the LSK module performs feature extraction on all detection targets in the fused feature image through the ordinary convolution to obtain a first feature image of all detection targets;
S442, the LSK module performs feature extraction on all detection targets in the fused feature image through the dilated convolution to obtain a second feature image of all detection targets;
S443, the LSK module selects the first feature image or the second feature image as the attention feature of each detection target according to the features of that detection target and the optimal weight file;
S444, the LSK module obtains the final output feature image of the fused feature image according to the fused feature image and the attention features of all detection targets in the fused feature image.
4. The coastline garbage identification method based on deep learning according to claim 3, wherein in S443, the LSK module selecting the first feature image or the second feature image as the attention feature of each detection target according to the features of the detection target and the optimal weight file comprises:
S4431, the LSK module performs a weighted calculation on the first feature image with the optimal weight file, and obtains the feature information weight of each detection target in the first feature image according to the features of the detection targets;
S4432, the LSK module performs a weighted calculation on the second feature image with the optimal weight file, and obtains the feature information weight of each detection target in the second feature image according to the features of the detection targets;
S4433, determining the detection targets whose feature information weight ratio in the first feature image is larger than a preset value, and selecting the first feature image as the attention feature of those detection targets; such detection targets include garbage in the first size range;
S4434, determining the detection targets whose feature information weight ratio in the second feature image is larger than a preset value, and selecting the second feature image as the attention feature of those detection targets; such detection targets include garbage in the second size range and tourists.
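
Claims 3 and 4 together describe a selective-kernel mechanism: two convolution branches with different receptive fields produce candidate feature images, and a weighting step decides, per detection target, which branch supplies the attention feature. A minimal PyTorch sketch of that mechanism follows; the kernel sizes, depth-wise grouping and sigmoid gating are illustrative choices modeled on published large-selective-kernel designs, and the claims' weighted calculation against the optimal weight file is approximated here by a learned per-pixel selection layer.

```python
import torch
import torch.nn as nn

class LSKSelectSketch(nn.Module):
    """Two receptive fields, then a learned per-pixel choice between them."""

    def __init__(self, channels: int):
        super().__init__()
        # S441: ordinary convolution -> smaller receptive field (small garbage).
        self.ordinary = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        # S442: dilated convolution -> larger receptive field (large garbage, tourists).
        self.dilated = nn.Conv2d(channels, channels, 7, padding=9, dilation=3,
                                 groups=channels)
        # S443: per-pixel weights choosing between the two branches.
        self.select = nn.Conv2d(2, 2, 7, padding=3)
        # S444: 1x1 mixing of the selected features back into the fused image.
        self.mix = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        first = self.ordinary(x)                        # first feature image
        second = self.dilated(x)                        # second feature image
        maps = torch.cat([first.mean(1, keepdim=True),
                          second.mean(1, keepdim=True)], dim=1)
        w = torch.sigmoid(self.select(maps))            # selection weights (B,2,H,W)
        attn = first * w[:, 0:1] + second * w[:, 1:2]   # chosen attention features
        return x * torch.sigmoid(self.mix(attn))        # final output feature image

# Shape check on a dummy fused feature image.
feat = torch.randn(1, 256, 40, 40)
assert LSKSelectSketch(256)(feat).shape == feat.shape
```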
5. The coastline garbage identification method based on deep learning according to claim 1, wherein S2, preprocessing the original image to construct a detection target image data set, comprises:
S21, performing image cutting on the original image to obtain cut images containing detection targets and environment images containing no detection target;
S22, labeling the cut images, marking the cut images containing detection targets as positive samples to obtain the detection target image data set;
S23, dividing the detection target image data set into a training set, a verification set and a test set according to a preset ratio, and performing data enhancement on the cut images in the training set;
S24, marking the environment images containing no detection target as negative samples, and adding the negative samples into the data-enhanced training set and verification set to obtain a final training set and a final verification set.
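
As a concrete illustration of S23-S24: the positive crops are split first, and the background-only crops are then appended to the training and verification sets only. In the sketch below, the 8:1:1 ratio, the directory names and the even split of negatives between the two sets are assumptions; the claim leaves the preset ratio open, and the data-enhancement step of S23 is omitted for brevity.

```python
import random
from pathlib import Path

def build_splits(positives, negatives, ratios=(0.8, 0.1, 0.1), seed=0):
    """S23/S24 sketch: split positive crops, then add negatives to train/val."""
    rng = random.Random(seed)
    rng.shuffle(positives)
    n_train = int(len(positives) * ratios[0])
    n_val = int(len(positives) * ratios[1])
    train = positives[:n_train]
    val = positives[n_train:n_train + n_val]
    test = positives[n_train + n_val:]   # test set keeps positive samples only
    rng.shuffle(negatives)
    half = len(negatives) // 2
    train += negatives[:half]            # S24: background-only crops join the
    val += negatives[half:]              # training and verification sets
    return train, val, test

train, val, test = build_splits(
    sorted(Path("crops/positive").glob("*.jpg")),     # assumed directory layout
    sorted(Path("crops/background").glob("*.jpg")),
)
```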
6. The coastline garbage identification method based on deep learning according to claim 1, wherein S4, before inputting the image to be detected into the trained YOLOv8n-IL network model to identify coastline garbage, further comprises:
inputting the acquired image into a SAHI module for image cutting to obtain a plurality of images to be detected.
7. The coastline garbage identification method based on deep learning according to claim 6, wherein S4, after inputting the images to be detected into the trained YOLOv8n-IL network model to identify coastline garbage, further comprises:
S5, the SAHI module performs image stitching on the identification results of the images to be detected to obtain a coastline garbage identification result.
8. The coastline garbage identification method based on deep learning according to claim 7, further comprising, after S45:
the trained YOLOv8n-IL network model scores and outputs the confidence of the identification box of each detection target on each image to be detected according to the identification result of that image to be detected.
9. The coastline garbage identification method based on deep learning according to claim 8, wherein S5, the SAHI module performing image stitching on the identification results of the images to be detected to obtain the coastline garbage identification result, comprises:
S51, the SAHI module performs image stitching on the identification results of the images to be detected to obtain a stitched identification result map;
S52, the SAHI module performs non-maximum suppression post-processing on the stitched identification result map, keeping only the identification box with the highest confidence score among the identification boxes on the same detection target and removing the other identification boxes on that detection target, to obtain the coastline garbage identification result.
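
Claims 6-9 correspond closely to the workflow of the open-source SAHI (Slicing Aided Hyper Inference) library: slice a large image, detect per slice, stitch the results back to full-image coordinates and suppress duplicate boxes. A hedged sketch follows; the slice size, overlap ratios and file names are illustrative assumptions rather than parameters disclosed by the patent.

```python
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

# Wrap the trained detector; "best.pt" is a hypothetical weight-file path.
detection_model = AutoDetectionModel.from_pretrained(
    model_type="yolov8",
    model_path="best.pt",
    confidence_threshold=0.25,
)

# Cut the full coastline image into overlapping slices (claim 6), detect per
# slice, stitch the results back (S51), and apply NMS so that only the
# highest-scoring box per detection target survives (S52).
result = get_sliced_prediction(
    "coastline_full.jpg",
    detection_model,
    slice_height=640,
    slice_width=640,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
    postprocess_type="NMS",
)
for pred in result.object_prediction_list:
    print(pred.category.name, pred.score.value, pred.bbox.to_xyxy())
```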
10. A coastline garbage identification system based on deep learning for performing the coastline garbage identification method based on deep learning of any one of claims 1-9, characterized in that the system comprises:
an image acquisition module for acquiring an original image of a coastline; the original image comprises garbage, tourists and the environment;
an image processing module, connected with the image acquisition module, for preprocessing the original image to construct a detection target image data set; the detection targets include garbage and tourists, the garbage comprising garbage in a first size range and garbage in a second size range, wherein the lower limit of the second size range is greater than the upper limit of the first size range;
a network model building module, connected with the image processing module, for training the YOLOv8n-IL network model with the detection target image data set to obtain an optimal weight file and a trained YOLOv8n-IL network model; the YOLOv8n-IL network model is based on the YOLOv8n network model, which comprises an input end, a backbone network, a Neck module and a detection head; the YOLOv8n-IL network model replaces the last two C2f modules in the backbone network with an InceptionNeXt module and embeds an LSK module in front of the detection head; the LSK module comprises an ordinary convolution and a dilated convolution, which have receptive fields of different sizes and are used for acquiring context information over different ranges; the optimal weight file comprises a plurality of groups of weight parameters, the characteristic information of each detection target corresponding to one group of weight parameters;
an image recognition module, connected with the network model building module, for inputting an image to be detected into the trained YOLOv8n-IL network model to identify coastline garbage, specifically: inputting the image to be detected into the input end of the trained YOLOv8n-IL network model; extracting the features of each detection target in the image to be detected through the backbone network; performing feature fusion on the features of each detection target through the Neck module to obtain a fused feature image; performing, by the LSK module, feature extraction on all detection targets in the fused feature image through the ordinary convolution and the dilated convolution respectively, and selecting the corresponding feature extraction results to combine with the fused feature image to obtain a final output feature image; and identifying, by the detection head, each detection target according to the final output feature image to obtain an identification result of the image to be detected.
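
For the network model building module, the training step can be sketched with the Ultralytics training API as below. "yolov8n-il.yaml" stands for a hypothetical model definition carrying the InceptionNeXt and LSK modifications, and "coastline.yaml" for the dataset description; the epoch count, image size and batch size are illustrative.

```python
from ultralytics import YOLO

model = YOLO("yolov8n-il.yaml")   # assumed custom model definition file
model.train(data="coastline.yaml", epochs=200, imgsz=640, batch=16)
# Ultralytics keeps the best-performing checkpoint (the "optimal weight
# file") at runs/detect/train/weights/best.pt by default.
```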
CN202410197140.6A 2024-02-22 2024-02-22 Coastline garbage identification method and system based on deep learning Active CN117765421B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410197140.6A CN117765421B (en) 2024-02-22 2024-02-22 Coastline garbage identification method and system based on deep learning


Publications (2)

Publication Number Publication Date
CN117765421A true CN117765421A (en) 2024-03-26
CN117765421B CN117765421B (en) 2024-04-26

Family

ID=90326124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410197140.6A Active CN117765421B (en) 2024-02-22 2024-02-22 Coastline garbage identification method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN117765421B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118470712A (en) * 2024-07-11 2024-08-09 吉林农业大学 Corn plant heart identification method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230123322A1 (en) * 2021-04-16 2023-04-20 Strong Force Vcn Portfolio 2019, Llc Predictive Model Data Stream Prioritization
CN116994114A (en) * 2023-06-30 2023-11-03 广东技术师范大学 Lightweight household small target detection model construction method based on improved YOLOv8
CN117058232A (en) * 2023-07-27 2023-11-14 大连海洋大学 Position detection method for fish target individuals in cultured fish shoal by improving YOLOv8 model
CN117557922A (en) * 2023-10-19 2024-02-13 河北翔拓航空科技有限公司 Unmanned aerial vehicle aerial photographing target detection method for improving YOLOv8



Similar Documents

Publication Publication Date Title
CN109766830B (en) Ship target identification system and method based on artificial intelligence image processing
CN110263705A (en) Towards two phase of remote sensing technology field high-resolution remote sensing image change detecting method
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN113160062B (en) Infrared image target detection method, device, equipment and storage medium
CN117765421B (en) Coastline garbage identification method and system based on deep learning
CN110310264A (en) A kind of large scale object detection method, device based on DCNN
CN109800631A (en) Fluorescence-encoded micro-beads image detecting method based on masked areas convolutional neural networks
CN110276269A (en) A kind of Remote Sensing Target detection method based on attention mechanism
CN114049477B (en) Fish passing fishway system and dynamic identification and tracking method for fish quantity and fish type
CN108446617A (en) The human face quick detection method of anti-side face interference
CN111046880A (en) Infrared target image segmentation method and system, electronic device and storage medium
CN106446930A (en) Deep convolutional neural network-based robot working scene identification method
Sun et al. Global Mask R-CNN for marine ship instance segmentation
CN111611895B (en) OpenPose-based multi-view human skeleton automatic labeling method
CN109242826B (en) Mobile equipment end stick-shaped object root counting method and system based on target detection
CN113033315A (en) Rare earth mining high-resolution image identification and positioning method
CN107967687A (en) A kind of method and system for obtaining object walking posture
CN109508756B (en) Foundation cloud classification method based on multi-cue multi-mode fusion depth network
CN114140665A (en) Dense small target detection method based on improved YOLOv5
CN109800756A (en) A kind of text detection recognition methods for the intensive text of Chinese historical document
CN113706579A (en) Prawn multi-target tracking system and method based on industrial culture
CN108734200A (en) Human body target visible detection method and device based on BING features
CN111626241B (en) Face detection method and device
CN114863263A (en) Snakehead detection method for intra-class shielding based on cross-scale hierarchical feature fusion
CN115661932A (en) Fishing behavior detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant