CN116977880A - Grassland rat hole detection method based on unmanned aerial vehicle image - Google Patents

Grassland rat hole detection method based on unmanned aerial vehicle image

Info

Publication number
CN116977880A
Authority
CN
China
Prior art keywords
yolov5n
grassland
model
unmanned aerial vehicle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311087548.XA
Other languages
Chinese (zh)
Inventor
罗小玲
李朝
郜晓晶
白戈力
李慧旻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia Agricultural University
Original Assignee
Inner Mongolia Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia Agricultural University filed Critical Inner Mongolia Agricultural University
Priority to CN202311087548.XA priority Critical patent/CN116977880A/en
Publication of CN116977880A publication Critical patent/CN116977880A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/10 - Terrestrial scenes
    • G06V 20/17 - Terrestrial scenes taken from planes or by drones
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715 - Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a grassland rat hole detection method based on unmanned aerial vehicle images, which comprises the following steps: obtaining a grassland rat hole image; improving the original YOLOv5n model based on a context enhancement module, a TSCODE decoupling head and full-dimensional dynamic convolution to obtain an improved CTO-YOLOv5n model; and processing the grassland rat hole image with the CTO-YOLOv5n model to obtain the position information of the rat holes. The detection method provided by the application can accurately identify grassland rat holes, including holes sheltered by weeds, and can distinguish rat holes from the shadows of stones, thereby achieving accurate identification of rat holes.

Description

Grassland rat hole detection method based on unmanned aerial vehicle image
Technical Field
The application belongs to the technical field of target detection, and particularly relates to a grassland rat hole detection method based on an unmanned aerial vehicle image.
Background
To replace manual inspection of rat holes, a number of digital detection methods have been proposed. Cui Bo et al. proposed a method that combines machine vision with unmanned aerial vehicle remote sensing images to identify and locate rat holes in rodent-damaged areas, using a YOLOv3-tiny network to recognize the holes. That study effectively monitored the distribution of rat holes, made up for the shortcomings of traditional rodent-damage monitoring, and improved its real-time performance and flexibility.
Sun Di et al. used a small unmanned aerial vehicle as a low-altitude remote sensing platform to acquire visible-light aerial images of a study area. Rat holes were extracted with a maximum likelihood method and an object-oriented classification method, and the accuracy of the classification results was evaluated with a confusion matrix and ground sample plots. The results show that UAV low-altitude remote sensing performs well in surveys of damage caused by the yellow rabbit-tailed mouse and has good application and popularization value.
Zhou Xiaolin et al. used visible-band unmanned aerial vehicle images to establish automatic grassland rat hole recognition methods based on an object-oriented template matching method and a support vector machine method. Accuracy evaluation and analysis of the recognition results of the two methods show that both achieve high overall accuracy and are suitable for accurate recognition of grassland rat holes in the Sanjiang Source region.
Wen Amin et al. used unmanned aerial vehicle remote sensing to monitor the density of great gerbil burrows on the southern edge of a desert area, explored the best interpretation method for aerial images of the burrows, and provided a solution for rapidly interpreting rodent-damage data from low-altitude remote sensing photography.
However, most existing digital detection methods rely on machine learning or deep learning models that are large in size, which makes them difficult to deploy on unmanned aerial vehicle equipment and too slow for real-time detection. Their detection of rat holes against a complex weedy background is also poor: the shadows of stones may be misidentified as rat holes, and holes sheltered by weeds may be missed. A grassland rat hole detection method based on unmanned aerial vehicle images is therefore highly desirable.
Disclosure of Invention
The application aims to provide a grassland rat hole detection method based on unmanned aerial vehicle images which can accurately identify grassland rat holes, including holes sheltered by weeds, and can distinguish rat holes from the shadows of stones, thereby achieving accurate identification. In addition, the detection model designed by the application is small and suitable for deployment on unmanned aerial vehicle equipment, so as to solve the problems in the prior art.
In order to achieve the above purpose, the application provides a grassland rat hole detection method based on unmanned aerial vehicle images, which comprises the following steps:
obtaining a grassland rat hole image;
improving the original YOLOv5n model based on a context enhancement module, a TSCODE decoupling head and full-dimensional dynamic convolution to obtain an improved CTO-YOLOv5n model;
and processing the grassland rat hole image based on the CTO-YOLOv5n model to obtain the position information of the rat hole.
Optionally, the process of acquiring the grassland rat hole image comprises: photographing rat holes on the grassland with the unmanned aerial vehicle to obtain a grassland rat hole image.
Optionally, the process of improving the YOLOv5n model based on the context enhancement module includes: in the feature enhancement layer of the original YOLOv5n model, a context enhancement module is introduced.
Optionally, the process of improving the YOLOv5n model based on the TSCODE decoupling head includes: replacing the coupled head in the original YOLOv5n model with a TSCODE decoupling head.
Optionally, the process of improving the YOLOv5n model based on the full-dimensional dynamic convolution includes: the conventional convolution of the C3 block in the original YOLOv5n model is replaced with a full-dimensional dynamic convolution.
Optionally, processing the grassland rat hole image based on the CTO-YOLOv5n model includes: inputting the grassland rat hole image into the CTO-YOLOv5n model, extracting features, and then enhancing and fusing the extracted features to obtain a plurality of feature maps; and carrying out direct prediction and fused prediction on the plurality of feature maps to obtain the rat hole positions in the grassland rat hole image.
Optionally, the process of extracting features from the grassland rat hole image comprises: extracting features from the grassland rat hole image with the full-dimensional dynamic convolution, which dynamically adjusts to different features.
Optionally, the convolution kernel dimensions of the full-dimensional dynamic convolution include: the spatial dimension of the convolution kernel, the input channel dimension, the output channel dimension, and the kernel dimension of the convolution kernel.
Optionally, the process of direct prediction and fused prediction on the plurality of feature maps includes: predicting on each of the plurality of feature maps with the TSCODE decoupling heads, and then carrying out fused prediction on feature maps obtained by fusing adjacent feature maps.
Optionally, the process of obtaining the fused feature map includes: assigning a learnable weight parameter to each feature map of a different scale in the context enhancement module, multiplying each feature map by its weight, and summing the resulting products to obtain the fused feature map.
The application has the technical effects that:
the application introduces a context enhancement module into the feature enhancement layer to enhance feature extraction, thereby improving the understanding capability of the model to different sizes of mouse holes. Meanwhile, a TScode decoupling head is used for replacing a coupling head in the original YOLOv5n model so as to solve the mutual influence of classification and positioning information. And finally, replacing the traditional convolution in the C3 module in the original YOLOv5n model by using the full-dimensional dynamic convolution, and improving the adaptability and the expression capability of the model to different characteristics of the image.
The detection method provided by the application can accurately identify grassland rat holes, including holes sheltered by weeds, and can distinguish rat holes from the shadows of stones, thereby achieving accurate identification of rat holes.
The improved YOLOv5n model is small in size and suitable for being deployed on unmanned aerial vehicle equipment.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
FIG. 1 is a diagram of the original YOLOv5n network structure according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a full-dimensional dynamic convolution structure in an embodiment of the present application;
FIG. 3 is a schematic diagram of an improved CTO-YOLOv5n model network structure in an embodiment of the application;
FIG. 4 is a schematic structural diagram of each module in the improved CTO-YOLOv5n model according to an embodiment of the present application;
FIG. 5 is a comparison of detection results for easily confused, hard-to-detect targets against a non-complex background in an embodiment of the present application; wherein (a) and (c) are detection results of YOLOv5n against a non-complex background, and (b) and (d) are detection results of CTO-YOLOv5n against a non-complex background;
FIG. 6 is a comparison of detection results for hard-to-detect targets against a complex background in an embodiment of the present application; wherein (a) and (c) are detection results of YOLOv5n against a complex background, and (b) and (d) are detection results of CTO-YOLOv5n against a complex background.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
Example 1
To achieve accurate recognition of grassland rat holes and applicability in real environments, the model's size, accuracy and real-time detection capability must all be considered. This embodiment therefore selects YOLOv5n, which has a smaller weight file and a faster inference speed, as the base network. Its main characteristic is that inference speed is significantly improved while detection accuracy is maintained, making it better suited to deployment in resource-constrained scenarios such as edge devices and mobile devices.
The conventional YOLOv5n employs a single-stage detection method, dividing the input image into a grid of cells, each of which predicts a certain number of bounding boxes and object probabilities. YOLOv5n adopts CSPDarknet53 as its backbone network, introduces FPN, PAN and other structures for multi-scale feature fusion, and uses the SiLU activation function and the SPPF module. YOLOv5n mainly comprises three parts, Backbone, Neck and Head, and its network structure is shown in fig. 1.
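As an illustration of the SPPF module mentioned above, the following PyTorch-style sketch applies three successive 5×5 max-pooling operations, concatenates their outputs with the input, and fuses them with a 1×1 convolution; the channel counts and the omission of batch normalization are simplifying assumptions, not the exact YOLOv5n implementation.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Simplified spatial pyramid pooling - fast (SPPF) block."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Sequential(nn.Conv2d(c_in, c_hidden, 1), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c_hidden * 4, c_out, 1), nn.SiLU())
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)          # successive poolings emulate growing kernel sizes
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

print(SPPF(256, 256)(torch.randn(1, 256, 10, 10)).shape)  # torch.Size([1, 256, 10, 10])
```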
(1) CSP structure of backbone feature extraction network:
the backbox part is to extract the characteristics of the input picture, the backbone characteristic extraction network of YOLOv5n adopts a CSP structure, the outer layer of the residual structure is connected with a Skip Connection channel as a figure, the structure forms a C3 module in the backbone characteristic extraction network, and the backbox of YOLOv5 is the stack of the C3 modules.
(2) Path Aggregation Network (PAN) of feature enhancement layer:
The Path Aggregation Network (PAN) is mainly characterized by improving feature expressiveness and model accuracy through path aggregation. The PAN architecture mainly comprises two modules: a Feature Pyramid Network (FPN) and a feature fusion module (PANet).
The FPN is primarily responsible for constructing a feature pyramid with multi-scale features so that the model can detect objects of different scales. It is composed of several levels, each consisting of a feature extraction module and an upsampling module: the feature extraction module extracts features from the input image, and the upsampling module upsamples the lower-resolution feature map to the same size as the higher-resolution features so that they can be fused. In YOLOv5, the FPN takes the feature maps of layers 5, 7 and 10 of the backbone network; the SPPF output is passed through a CBS module and upsampled, then fused with the output of layer 7, and the output of layer 16 is fused with that of layer 5.
The bottom layer of the FPN structure then serves as the topmost layer of a new feature pyramid: downsampling and feature fusion are carried out again, and features from the different branches are merged, which improves feature expressiveness. The outputs of layers 19 and 22 are combined with the outputs of layers 15 and 11 of the FPN, respectively.
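The following PyTorch-style sketch illustrates this top-down (FPN) and bottom-up (PAN) fusion pattern over three scales; the module names, channel counts and two-scale output are illustrative assumptions rather than the exact layer indices listed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FpnPanSketch(nn.Module):
    """Top-down (FPN) then bottom-up (PAN) fusion over three scales."""
    def __init__(self, c3=64, c4=128, c5=256):
        super().__init__()
        self.lateral5 = nn.Conv2d(c5, c4, 1)                     # 1x1 before upsampling
        self.lateral4 = nn.Conv2d(c4 + c4, c3, 1)                # fuse upsampled P5 with P4
        self.out3 = nn.Conv2d(c3 + c3, c3, 3, padding=1)         # fuse upsampled result with P3
        self.down3 = nn.Conv2d(c3, c3, 3, stride=2, padding=1)   # PAN downsample
        self.out4 = nn.Conv2d(c3 + c3, c4, 3, padding=1)

    def forward(self, p3, p4, p5):
        # Top-down path: upsample deeper, low-resolution maps and concatenate
        # them with shallower, high-resolution maps.
        t5 = self.lateral5(p5)
        t4 = self.lateral4(torch.cat([F.interpolate(t5, scale_factor=2), p4], 1))
        n3 = self.out3(torch.cat([F.interpolate(t4, scale_factor=2), p3], 1))
        # Bottom-up path: downsample again and fuse, strengthening localization cues.
        n4 = self.out4(torch.cat([self.down3(n3), t4], 1))
        return n3, n4

p3, p4, p5 = (torch.randn(1, 64, 80, 80), torch.randn(1, 128, 40, 40),
              torch.randn(1, 256, 20, 20))
n3, n4 = FpnPanSketch()(p3, p4, p5)
print(n3.shape, n4.shape)   # torch.Size([1, 64, 80, 80]) torch.Size([1, 128, 40, 40])
```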
The PAN architecture is an efficient and accurate object detection network structure that achieves good performance in many object detection tasks. It is also highly scalable and transferable and can easily be applied to different tasks and data sets.
To adapt to the conditions of grassland rat hole recognition, this embodiment improves the original YOLOv5n model and proposes the grassland rat hole detection model CTO-YOLOv5n.
Improvement of YOLOv5n based on context enhancement module (CAM):
The context enhancement module (CAM) enhances a neural network's perception of context information; it integrates context based on dilated convolution and adaptive fusion. In a neural network, the receptive field of a single convolutional layer covers only the pixels of a local region and therefore cannot make full use of the context information in the image.
Dilated convolution (also called hole convolution or atrous convolution) is a convolution operation that enlarges the receptive field of the convolution kernel.
An ordinary convolution slides over the input feature map with a given stride and kernel size and performs the convolution at each position. In dilated convolution, gaps (i.e. holes) are introduced into the convolution kernel by inserting zero-valued elements or by sampling at equal intervals, and the convolution is then applied to the input feature map in the same manner.
By increasing the dilation rate of the convolution kernel, dilated convolution enlarges the receptive field of the kernel without increasing the number of convolution parameters. Compared with ordinary convolution, dilated convolution can therefore capture wider context information and improve the expressiveness of the features.
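The following minimal PyTorch example illustrates this point under assumed channel counts: with dilation 2, a 3×3 kernel covers a 5×5 area while keeping the same parameter count and output size as an ordinary 3×3 convolution.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)

normal = nn.Conv2d(16, 16, kernel_size=3, padding=1)                # receptive field 3x3
dilated = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)   # receptive field 5x5

print(normal(x).shape, dilated(x).shape)             # both torch.Size([1, 16, 32, 32])
print(sum(p.numel() for p in normal.parameters()),
      sum(p.numel() for p in dilated.parameters()))  # identical parameter counts: 2320 2320
```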
In the CAM, dilated convolution enlarges the receptive field of the convolution kernel and enhances the network's perception of context information. At the same time, the number of convolution parameters and the amount of computation are not increased, so no extra computational burden or overfitting risk is introduced.
The CAM fuses feature maps of different scales into a global feature map through adaptive fusion in order to obtain richer context information. Adaptive fusion means learning a set of weight parameters to dynamically adjust the contributions of the different-scale feature maps obtained by dilated convolution. Feature maps of different scales typically contain different levels of semantic information, so fusing them captures the target's features and context information more effectively.
In adaptive fusion, each feature map is multiplied by a learnable weight parameter, and the weighted maps are then summed to obtain the fused feature map. These weight parameters are learned by the back-propagation algorithm, so the model can adaptively adjust the contributions of the different-scale feature maps and thus better adapt to different tasks and scenes.
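A minimal PyTorch-style sketch of this adaptive fusion is shown below; the dilation rates, the softmax normalization of the weights and the per-branch convolutions are illustrative assumptions rather than the exact CAM configuration.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Weighted sum of dilated-convolution branches with learnable weights."""
    def __init__(self, channels, rates=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates
        )
        self.weights = nn.Parameter(torch.ones(len(rates)))   # learnable fusion weights

    def forward(self, x):
        w = torch.softmax(self.weights, dim=0)                 # normalized contributions
        feats = [branch(x) for branch in self.branches]
        # Multiply each branch by its weight and sum to obtain the fused feature map.
        return sum(wi * fi for wi, fi in zip(w, feats))

print(AdaptiveFusion(32)(torch.randn(1, 32, 40, 40)).shape)    # torch.Size([1, 32, 40, 40])
```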
Improvement of YOLOv5n based on the TSCODE decoupling head:
In object detection, classification and localization are two key tasks, and in YOLOv5n both are performed on the same feature map. However, because the two tasks differ in nature, they place different demands on position and spatial information in the feature map, which can lead to spatial misalignment.
The classification task is mainly concerned with whether a region of the feature map contains an object and which class it belongs to, and relies on the texture information of the image. The localization task focuses on the precise position of the object on the feature map and relies more on the edge information of the image in order to accurately regress the bounding box parameters. If the two tasks share the same feature map, their predictions at a given location may be inconsistent, leading to localization errors and degraded performance.
In the YOLOv5n detection head, classification and localization are carried out on the same feature layer: a convolution with 18 kernels of size 1×1, stride 1 and padding 1 is applied. The 18 predicted outputs correspond to 3×(4+1+1), where 4 is the bounding box offset, the first 1 is the confidence, i.e. the probability that the box contains an object, and the second 1 is the classification, giving the probability of being a rat hole. During back-propagation, classification and localization are combined on one feature layer and the parameters are updated together; since the two subtasks attend to different information yet are coupled to each other, spatial misalignment arises and performance decreases.
A conventional decoupled head instead constructs two branches for each output layer of YOLOv5n, extracting the target position and category information separately; the two branches are trained separately and their outputs are finally concatenated. For example, an 80×80×64 feature map output by the backbone feature extraction network and the feature enhancement layer passes through two separate 3×3 convolutions, producing two feature maps: the first is used for category prediction, while the second is split into two paths, one for localization prediction and the other for confidence prediction of whether a target is contained.
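The following PyTorch-style sketch contrasts the coupled head described above (a single 1×1 convolution producing 3×(4+1+1)=18 channels) with a conventional decoupled head; module names and channel counts are illustrative assumptions. The TSCODE head described next goes further by drawing the two branches from feature maps of different scales.

```python
import torch
import torch.nn as nn

class CoupledHead(nn.Module):
    """Single 1x1 convolution: boxes, objectness and class share one output map."""
    def __init__(self, in_ch=64, anchors=3, num_classes=1):
        super().__init__()
        self.pred = nn.Conv2d(in_ch, anchors * (4 + 1 + num_classes), 1)

    def forward(self, x):
        return self.pred(x)                              # (B, 18, H, W) for one class

class DecoupledHead(nn.Module):
    """Separate branches for classification and for box/objectness regression."""
    def __init__(self, in_ch=64, anchors=3, num_classes=1):
        super().__init__()
        self.cls_branch = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
                                        nn.Conv2d(in_ch, anchors * num_classes, 1))
        self.reg_branch = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU())
        self.box = nn.Conv2d(in_ch, anchors * 4, 1)      # localization offsets
        self.obj = nn.Conv2d(in_ch, anchors * 1, 1)      # confidence (objectness)

    def forward(self, x):
        reg = self.reg_branch(x)
        return torch.cat([self.box(reg), self.obj(reg), self.cls_branch(x)], dim=1)

x = torch.randn(1, 64, 80, 80)                           # an 80x80x64 feature map
print(CoupledHead()(x).shape, DecoupledHead()(x).shape)  # both (1, 18, 80, 80)
```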
The TSCODE decoupling head learns the classification task from feature layers with rich semantic information, while the localization task is provided with a higher-resolution feature map containing more edge information so that object boundaries can be regressed more accurately.
Improvement of YOLOv5n based on full-dimensional dynamic convolution (ODConv):
a given convolution kernel in the dynamic convolution has four dimensions, namely the spatial dimension of the convolution kernel, the input channel dimension, the output channel dimension, and the kernel dimension of the convolution kernel. In conventional dynamic convolution, although the convolution kernel weights may be dynamically adjusted, only a single attention scalar pi is assigned i (the kernel dimension of the convolution kernels), which means that for each convolution kernel, all its output channels use the same weight assignment. That is, conventional dynamic convolution ignores differences in the spatial dimension, input channel dimension, and output channel dimension of the convolution kernel, and thus lacks flexibility in the assignment of weights to different channels and different convolution kernels, and cannot be dynamically optimized for specific tasks and data. Whereas ODConv has, relative to conventional dynamic convolutionMore comprehensive dynamics, the traditional dynamic convolution only considers the dynamics of the kernel dimension, while ODConv considers the dynamics of the space, the input channel and the output channel dimensions, and as shown in fig. 2, different input features can be dynamically adjusted more comprehensively. The flexibility of ODConv is higher, each convolution kernel in the ODConv can have own attention distribution, the attention distribution can be dynamically adjusted according to different input samples, and the flexibility can improve the adaptability and generalization capability of the network. In fig. 2, the s-parameter assigns different attention scalar values to the convolution parameters (each convolution kernel) at k x k spatial locations; the c parameter gives different attention scalar to the input channel of each convolution kernel; the o parameter allocates different attention scalar for the output channel of the convolution kernel; the pi parameter assigns a focus scalar to the entire convolution kernel. These four types of attention are then multiplied to the process of k convolution kernels. The convolution operation is different in all spatial positions, all input channels, all output channels and all kernels of the convolution, so that the feature extraction capability of the model on different input pictures is greatly improved.
CTO-YOLOv5n model
Integrating the three improvements above, the improved YOLOv5n is shown in figs. 3 and 4. A context enhancement module (CAM) is introduced, which strengthens feature fusion by integrating context information with adaptive fusion and improves the model's understanding of the rat holes in the image. The conventional convolution in the C3 module of the YOLOv5n model is changed to ODConv full-dimensional dynamic convolution, so that the convolution kernel can be dynamically adjusted in all dimensions when the model detects rat holes of different sizes and shapes, improving the model's adaptability to the rat holes. The TSCODE decoupling head replaces the coupled head of the original model, so that the detection subtasks can obtain different information from the feature maps, reducing the interaction between the classification and regression tasks.
As shown in table 1, the modified YOLOv5n model has the following characteristics:
TABLE 1
Even against an uncomplicated background there may be interference factors that affect detection. As shown in fig. 5, the left column shows the test results of the original model and the right column the test results of the improved model. Because a rat hole is characterized by a dark region in its middle, the original model may misdetect the shadow of a cola bottle and the shadow of a stone as rat holes, whereas the improved model does not. There are also irregularly shaped rat holes that the original model fails to detect but the improved model detects.
Against a complex background, i.e. when the grassland contains stones, dung from cattle, sheep and horses, weed clusters and other objects, interference with rat hole detection is severe, as shown in fig. 6. In the comparison of (a) and (b), animal dung and stones are misdetected as rat holes by the original model; in the comparison of (c) and (d), grass shelters the rat holes heavily and the original model cannot detect the two sheltered holes.
A context enhancement module (CAM), which integrates context information with adaptive fusion, is therefore introduced into the feature enhancement layer to strengthen feature extraction and improve the model's understanding of different rat holes. Meanwhile, the task-specific context decoupling head (TSCODE) replaces the coupled head of YOLOv5n to resolve the mutual interference between classification and localization information. Finally, the C3 module of YOLOv5n is replaced with an OD_C3 module in which full-dimensional dynamic convolution (ODConv) replaces the conventional convolution, improving the model's adaptability and expressive power with respect to different image features.
When the improved YOLOv5n model is used to identify grassland rat holes, the unmanned aerial vehicle photographs the rat holes on the grassland and the photographs are input into the improved YOLOv5n model. The backbone feature extraction network retains 5 effective feature layers with sizes (160×160×32), (80×80×64), (40×40×128), (20×20×256) and (10×10×256). The extracted (80×80×64), (40×40×128) and (20×20×256) features are passed to the feature enhancement layer, where they are enhanced and fused, finally yielding 5 feature maps with sizes (160×160×32), (80×80×64), (40×40×128), (20×20×256) and (10×10×256). These feature maps are input into the TSCODE decoupling heads for prediction, and each head is also combined with adjacent feature maps for prediction after fusion, finally yielding accurate localization of the rat holes. In this way the classification and localization tasks become more accurate.
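As a purely illustrative usage sketch, assuming the improved network has been trained and exported as a standard YOLOv5-family checkpoint (the weight file name, image path and the availability of the custom modules to the loader are all hypothetical), detection on a UAV photograph could look as follows.

```python
import cv2
import torch

# Assumed checkpoint and image paths; the custom CTO-YOLOv5n modules must be
# importable by the loader for this to work.
model = torch.hub.load('ultralytics/yolov5', 'custom', path='cto_yolov5n.pt')
model.conf = 0.25                                    # confidence threshold

img = cv2.cvtColor(cv2.imread('uav_grassland.jpg'), cv2.COLOR_BGR2RGB)
results = model(img)
for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
    print(f'rat hole at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}), confidence {conf:.2f}')
```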
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims (10)

1. A grassland rat hole detection method based on unmanned aerial vehicle images, characterized by comprising the following steps:
obtaining a grassland rat hole image;
improving the original YOLOv5n model based on a context enhancement module, a TSCODE decoupling head and full-dimensional dynamic convolution to obtain an improved CTO-YOLOv5n model;
and processing the grassland rat hole image based on the CTO-YOLOv5n model to obtain the position information of the rat hole.
2. The grassland rat hole detection method based on the unmanned aerial vehicle image according to claim 1, wherein
the process of obtaining the grassland rat hole image comprises: photographing rat holes on the grassland with the unmanned aerial vehicle to obtain a grassland rat hole image.
3. The grassland rat hole detection method based on the unmanned aerial vehicle image according to claim 1, wherein
the process of improving the YOLOv5n model based on the context enhancement module comprises the following steps: in the feature enhancement layer of the original YOLOv5n model, a context enhancement module is introduced.
4. The grassland rat hole detection method based on the unmanned aerial vehicle image according to claim 1, wherein
the process for improving the YOLOv5n model based on the Tscode decoupling head comprises the following steps: the decoupling header in the original YOLOv5n model is replaced with a Tscode decoupling header.
5. The grassland rat hole detection method based on the unmanned aerial vehicle image according to claim 1, wherein
the process of improving the YOLOv5n model based on full-dimensional dynamic convolution includes: the conventional convolution of the C3 block in the original YOLOv5n model is replaced with a full-dimensional dynamic convolution.
6. The grassland rat hole detection method based on the unmanned aerial vehicle image according to claim 1, wherein
the processing of the grassland rat hole image based on the CTO-YOLOv5n model comprises the following steps: inputting the grassland rat hole image into the CTO-YOLOv5n model, extracting features, and then carrying out reinforcement fusion on the extracted features to obtain a plurality of feature images; and carrying out direct prediction and fusion prediction on the plurality of feature images to obtain the rat hole positions in the grassland rat hole images.
7. The grassland rat hole detection method based on the unmanned aerial vehicle image according to claim 6, wherein
the process for extracting the characteristics of the grassland rat hole image comprises the following steps: and extracting features of the grassland mousehole image based on the full-dimensional dynamic convolution, and dynamically adjusting different features.
8. The grassland rat hole detection method based on the unmanned aerial vehicle image according to claim 7, wherein
the convolution kernel dimensions of the full-dimensional dynamic convolution include: the spatial dimension of the convolution kernel, the input channel dimension, the output channel dimension, and the kernel dimension of the convolution kernel.
9. The grassland rat hole detection method based on the unmanned aerial vehicle image according to claim 6, wherein
the process of directly predicting and fusing the plurality of feature maps comprises the following steps: and respectively predicting a plurality of feature graphs based on the TScode decoupling heads, and then carrying out fusion prediction on the feature graphs obtained by fusing the feature graphs.
10. The grassland rat hole detection method based on the unmanned aerial vehicle image according to claim 9, wherein
the process for obtaining the fused feature map comprises the following steps: and distributing a leachable weight parameter for each feature map with different scales based on the context enhancement module, respectively solving products, and carrying out summation processing on a plurality of obtained products to obtain a fused feature map.
CN202311087548.XA 2023-08-25 2023-08-25 Grassland rat hole detection method based on unmanned aerial vehicle image Pending CN116977880A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311087548.XA CN116977880A (en) 2023-08-25 2023-08-25 Grassland rat hole detection method based on unmanned aerial vehicle image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311087548.XA CN116977880A (en) 2023-08-25 2023-08-25 Grassland rat hole detection method based on unmanned aerial vehicle image

Publications (1)

Publication Number Publication Date
CN116977880A true CN116977880A (en) 2023-10-31

Family

ID=88475050

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311087548.XA Pending CN116977880A (en) 2023-08-25 2023-08-25 Grassland rat hole detection method based on unmanned aerial vehicle image

Country Status (1)

Country Link
CN (1) CN116977880A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117422715A (en) * 2023-12-18 2024-01-19 华侨大学 Global information-based breast ultrasonic tumor lesion area detection method

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220232162A1 (en) * 2021-01-19 2022-07-21 Adobe Inc. Providing contextual augmented reality photo pose guidance
CN115170990A (en) * 2022-09-06 2022-10-11 江苏及象生态环境研究院有限公司 Artificial intelligent edge computing system and method for unmanned aerial vehicle airborne pod
CN116129260A (en) * 2022-12-21 2023-05-16 内蒙古农业大学 Forage grass image recognition method based on deep learning
CN116363535A (en) * 2023-05-10 2023-06-30 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Ship detection method in unmanned aerial vehicle aerial image based on convolutional neural network
CN116452937A (en) * 2023-04-25 2023-07-18 重庆邮电大学 Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism
CN116452950A (en) * 2023-04-18 2023-07-18 浙江农林大学 Multi-target garbage detection method based on improved YOLOv5 model
CN116485734A (en) * 2023-04-10 2023-07-25 西安理工大学 Transformer substation equipment defect detection and tracking method based on improved yolov5+bytetrack
CN116503366A (en) * 2023-04-27 2023-07-28 山东省科学院自动化研究所 Concrete crack detection method and system based on dynamic coordinate convolution
CN116503725A (en) * 2023-02-10 2023-07-28 中国人民解放军陆军工程大学 Real-time detection method and device for infrared weak and small target

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220232162A1 (en) * 2021-01-19 2022-07-21 Adobe Inc. Providing contextual augmented reality photo pose guidance
CN115170990A (en) * 2022-09-06 2022-10-11 江苏及象生态环境研究院有限公司 Artificial intelligent edge computing system and method for unmanned aerial vehicle airborne pod
CN116129260A (en) * 2022-12-21 2023-05-16 内蒙古农业大学 Forage grass image recognition method based on deep learning
CN116503725A (en) * 2023-02-10 2023-07-28 中国人民解放军陆军工程大学 Real-time detection method and device for infrared weak and small target
CN116485734A (en) * 2023-04-10 2023-07-25 西安理工大学 Transformer substation equipment defect detection and tracking method based on improved yolov5+bytetrack
CN116452950A (en) * 2023-04-18 2023-07-18 浙江农林大学 Multi-target garbage detection method based on improved YOLOv5 model
CN116452937A (en) * 2023-04-25 2023-07-18 重庆邮电大学 Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism
CN116503366A (en) * 2023-04-27 2023-07-28 山东省科学院自动化研究所 Concrete crack detection method and system based on dynamic coordinate convolution
CN116363535A (en) * 2023-05-10 2023-06-30 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Ship detection method in unmanned aerial vehicle aerial image based on convolutional neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHAO LI ET AL.: "CGT-YOLOv5n: A Precision Model for Detecting Mouse Holes Amid Complex Grassland Terrains", 《APPLIED SCIENCES》, 28 December 2023 (2023-12-28), pages 1 - 16 *
QIANG WEI ET AL.: "Dynamic-YOLOv5: An improved aerial small object detector based on YOLOv5", 《2023 3RD INTERNATIONAL CONFERENCE ON NEURAL NETWORKS, INFORMATION AND COMMUNICATION ENGINEERING》, 25 April 2023 (2023-04-25), pages 679 - 683 *
LUO XIAOLING ET AL.: "UAV rat hole target detection based on improved YOLOv5s" (in Chinese), 《Journal of Shanxi Agricultural University (Natural Science Edition)》, 31 July 2023 (2023-07-31), pages 1 - 12 *
DONG ZHAOPENG ET AL.: "A YOLO-based mussel detection algorithm" (in Chinese), 《Journal of Shanghai Ocean University》, 24 August 2023 (2023-08-24), pages 1 - 13 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117422715A (en) * 2023-12-18 2024-01-19 华侨大学 Global information-based breast ultrasonic tumor lesion area detection method
CN117422715B (en) * 2023-12-18 2024-03-12 华侨大学 Global information-based breast ultrasonic tumor lesion area detection method

Similar Documents

Publication Publication Date Title
Li et al. Deep learning based cloud detection for medium and high resolution remote sensing images of different sensors
Duporge et al. Using very‐high‐resolution satellite imagery and deep learning to detect and count African elephants in heterogeneous landscapes
CN110378381B (en) Object detection method, device and computer storage medium
US10015360B1 (en) Image-based field boundary detection and identification
Combinido et al. A convolutional neural network approach for estimating tropical cyclone intensity using satellite-based infrared images
CN110059558A (en) A kind of orchard barrier real-time detection method based on improvement SSD network
Xu et al. Object‐based mapping of karst rocky desertification using a support vector machine
Pacifici et al. Automatic change detection in very high resolution images with pulse-coupled neural networks
US20200250427A1 (en) Shadow and cloud masking for agriculture applications using convolutional neural networks
CN113159300B (en) Image detection neural network model, training method thereof and image detection method
Shen et al. Biomimetic vision for zoom object detection based on improved vertical grid number YOLO algorithm
CN110929944A (en) Wheat scab disease severity prediction method based on hyperspectral image and spectral feature fusion technology
CN116977880A (en) Grassland rat hole detection method based on unmanned aerial vehicle image
CN113378897A (en) Neural network-based remote sensing image classification method, computing device and storage medium
Meena et al. Smart animal detection and counting framework for monitoring livestock in an autonomous unmanned ground vehicle using restricted supervised learning and image fusion
Gonçalves et al. Automatic detection of Acacia longifolia invasive species based on UAV-acquired aerial imagery
Xie et al. Recognition of big mammal species in airborne thermal imaging based on YOLO V5 algorithm
Zhang et al. Hawk‐eye‐inspired perception algorithm of stereo vision for obtaining orchard 3D point cloud navigation map
CN116895036A (en) Deep learning-based farmland protection early warning method and device
Passos et al. Toward improved surveillance of Aedes aegypti breeding grounds through artificially augmented data
Ji et al. STAE‐YOLO: Intelligent detection algorithm for risk management of construction machinery intrusion on transmission lines based on visual perception
CN116310323A (en) Aircraft target instance segmentation method, system and readable storage medium
CN115810123A (en) Small target pest detection method based on attention mechanism and improved feature fusion
Zheng et al. Object-Detection from Multi-View remote sensing Images: A case study of fruit and flower detection and counting on a central Florida strawberry farm
Wu et al. Research on asphalt pavement disease detection based on improved YOLOv5s

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination