CN115393618A - Improved YOLOv5-based small-sample wild animal detection method

Improved YOLOv5-based small-sample wild animal detection method

Info

Publication number
CN115393618A
CN115393618A
Authority
CN
China
Prior art keywords
layer
data set
yolov5
network model
matrices
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211018085.7A
Other languages
Chinese (zh)
Inventor
程志友
刘思乾
汪传建
罗荣昊
程灿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2022-08-24
Publication date: 2022-11-25
2022-08-24 Application filed by Anhui University
2022-08-24 Priority to CN202211018085.7A
2022-11-25 Publication of CN115393618A
Legal status: Pending


Classifications

    • G (Physics) / G06 (Computing; calculating or counting) / G06V (Image or video recognition or understanding) / G06V 10/00, G06V 10/70 (using pattern recognition or machine learning)
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G06V 10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/806 Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82 Image or video recognition or understanding using neural networks
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a small-sample wild animal detection method based on improved YOLOv5, comprising the following steps: screening the AwA2 animal data set against the real wild animal data set collected in the area to be monitored, and taking the screened images as an experimental data set; screening the collected real wild animal data set of the area to be monitored to obtain a small-sample experimental data set; labeling the experimental data set and the small-sample experimental data set; adding a coordinate attention module CA to the YOLOv5 network model to obtain a YOLOv5-CA network model; and obtaining a YOLOv5-CA-TL network model through a two-stage training method, detecting the real wild animals of the monitored area, and verifying feasibility. By taking the collected real wild animal data set of the monitored area as the research object and using the coordinate attention module CA, the method effectively addresses the low detection accuracy of other algorithms and greatly reduces labor cost compared with traditional methods.

Description

Improved YOLOv5-based small-sample wild animal detection method
Technical Field
The invention relates to the technical field of deep learning and target detection and classification, and in particular to a small-sample wild animal detection method based on improved YOLOv5.
Background
Biological resources are the natural basis for the sustainable development of humanity and a strong guarantee of ecosystem balance and stability, so wild animals must be continuously monitored and protected. The wild environment is complex and changeable and holds many unknown risks, which makes wildlife protection work difficult to carry out. As technology has developed, various modern techniques have been applied to wildlife monitoring, including radio tracking, wireless sensor network tracking, satellite and GPS tracking, and monitoring with motion-sensitive cameras. With the advance of digital technology, infrared cameras are now widely used for wildlife detection, as they photograph animals without disturbing their normal patterns of activity and rest.
Infrared cameras automatically capture images and videos in the field over long periods, generating a large amount of image and video data that is time-consuming to process manually. In the field of target detection, the data set directly affects the detection result, and a data set with rich, balanced samples helps detection and classification tasks succeed. However, some wildlife nature reserves cover vast regions, and many wild animals are rare species already on the edge of extinction: rare animals are few in number and move erratically, so infrared cameras rarely capture clear pictures of them, and the number of infrared-camera photographs of wild animals actually usable for research is very small.
At present, existing wild animal detection and classification methods, at home and abroad, are built on the assumption of sufficient samples. When wild animal samples are insufficient, these methods struggle to complete detection and classification tasks well, and how to improve wild animal detection accuracy under insufficient samples is a technical problem urgently needing a solution in the field of wild animal detection.
Disclosure of Invention
The invention aims to provide a small-sample wild animal detection method based on improved YOLOv5 that solves the problem of insufficient wild animal samples, improves the overall performance of target detection, and effectively improves the target detection accuracy for small-sample wild animals.
In order to realize this purpose, the invention adopts the following technical scheme: a small-sample wild animal detection method based on improved YOLOv5, comprising the following sequential steps:
(1) Downloading the AwA2 animal data set, screening it against the real wild animal data set collected in the area to be monitored, selecting images similar to that real data set as the experimental data set, and dividing the experimental data set into a first training set and a validation set;
(2) Screening the collected real wild animal data set of the area to be monitored, selecting pictures with clear animal subjects and high pixel quality as the small-sample experimental data set, and dividing it into a second training set and a test set;
(3) Labeling the experimental data set and the small-sample experimental data set;
(4) Constructing a YOLOv5 network model, adding a coordinate attention module CA to obtain a YOLOv5-CA network model, and inputting the first training set into the YOLOv5-CA network model for training;
(5) Obtaining a YOLOv5-CA-TL network model through a two-stage training method, inputting the labeled test set into the YOLOv5-CA-TL network model, detecting the wild animals in the pictures, and verifying feasibility.
In step (3), labeling refers to annotating the animals in the pictures of the experimental data set and the small-sample experimental data set with the LabelImg tool; the label format is the YOLO format.
Step (4) specifically comprises the following steps:
(4a) Replacing the first and the last C3 module of the backbone of the YOLOv5 network model with the coordinate attention module CA, which then encodes the feature map into two direction-aware, position-sensitive feature maps. The module takes any intermediate tensor $X = [x_1, x_2, x_3, \ldots, x_C] \in \mathbb{R}^{C \times H \times W}$ as input and outputs a tensor $Y = [y_1, y_2, y_3, \ldots, y_C]$ of the same size. Each channel of $X$ is encoded along the horizontal and the vertical direction with pooling kernels of size $(H, 1)$ and $(1, W)$; the output of the $c$-th channel at height $h$ is

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \tag{1}$$

and the output of the $c$-th channel at width $w$ is

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{2}$$

where $H$ is the height of the pooling kernel, $W$ is the width of the pooling kernel, $x_c$ is the tensor of the $c$-th channel, and $h$ is a height coordinate of $X$.

Formulas (1) and (2) are two feature-aggregation transforms that aggregate the features along the two spatial directions and return a pair of direction-aware attention maps. The coordinate attention module CA concatenates the two generated feature layers and passes them through a shared 1 × 1 convolution transform, as in equation (3):

$$f = \delta(F_1([z^h, z^w])) \tag{3}$$

where $\delta$ is a nonlinear activation function, $F_1$ is a 1 × 1 convolution transform, and $f$ is the intermediate feature map obtained by feature-encoding the spatial information in the horizontal and vertical directions. $f$ is then split along the spatial dimension into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$, and two further 1 × 1 convolution transforms $F_h$ and $F_w$ turn $f^h$ and $f^w$ into tensors with the same number of channels as $X$:

$$g^h = \sigma(F_h(f^h)) \tag{4}$$

$$g^w = \sigma(F_w(f^w)) \tag{5}$$

where $F_h$ and $F_w$ are 1 × 1 convolution transforms and $\sigma$ is the sigmoid activation function; during the transform, a reduction ratio $r$ is used to reduce the number of channels of $f$. The outputs $g^h$ and $g^w$ are then used as attention weights, and the final output of the coordinate attention module CA is given by equation (6):

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \tag{6}$$

where $y_c(i, j)$ is the value at position $(i, j)$ in the $c$-th output channel, $x_c(i, j)$ is the value at position $(i, j)$ in the $c$-th input channel, $g_c^h(i)$ is the $i$-th value along the height in the $c$-th channel, and $g_c^w(j)$ is the $j$-th value along the width in the $c$-th channel;
(4b) Adjusting the first training set to 640 × 640 images and inputting them into the YOLOv5-CA network model to obtain 1024 output matrices of size 1 × 1.
The YOLOv5-CA network model in step (4) specifically comprises:
A first layer: this layer is the Focus layer of the deep neural network. The 640 × 640 image is split into its three RGB channels; on each channel the Focus layer takes a value every other pixel of the feature map, producing 4 independent feature layers. The 4 feature layers are stacked, so the information in the width and height dimensions is moved into the channel dimension and the input channels are expanded fourfold; feature extraction then yields 64 output matrices of size 320 × 320 as the input matrices of the second layer;
A second layer: this layer defines 128 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction in the deep neural network. The layer uses BN normalization and the SiLU activation function and yields 128 output matrices of size 160 × 160 as the input matrices of the third layer;
A third layer: this layer is the coordinate attention module CA and yields 128 output matrices of size 160 × 160 as the input matrices of the fourth layer;
A fourth layer: this layer is a convolutional layer defining 256 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction. The layer uses BN normalization and the SiLU activation function and yields 256 output matrices of size 80 × 80 as the input matrices of the fifth layer;
A fifth layer: this layer is a C3 module comprising 3 standard convolution layers and several Bottleneck modules. The C3 module is the main module for learning residual features and splits into two branches: one branch passes through the stacked Bottleneck modules and the 3 standard convolution layers, the other only through 1 basic convolution layer; the two branches are finally concatenated along the channel dimension, yielding 256 output matrices of size 80 × 80 as the input matrices of the sixth layer;
A sixth layer: this layer is a convolutional layer defining 512 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction. The layer uses BN normalization and the SiLU activation function and yields 512 output matrices of size 40 × 40 as the input matrices of the seventh layer;
A seventh layer: this layer is a C3 module; the input matrix is split into two branches along the channel dimension, one passing through several stacked Bottleneck modules and 3 standard convolution layers, the other only through 1 basic convolution layer; the two branches are finally concatenated along the channel dimension, yielding 512 output matrices of size 40 × 40 as the input matrices of the eighth layer;
An eighth layer: this layer is a convolutional layer defining 1024 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction. The layer uses BN normalization and the SiLU activation function and yields 1024 output matrices of size 20 × 20 as the input matrices of the ninth layer;
A ninth layer: this layer is an SPP module; the input matrix passes serially through several MaxPool layers of size 5 × 5, yielding 1024 output matrices of size 20 × 20 as the input matrices of the tenth layer;
A tenth layer: this layer is the coordinate attention module CA, yielding 1024 output matrices of size 1 × 1.
Step (5) specifically comprises the following steps:
(5a) In the first stage, inputting the first training set into the YOLOv5-CA network model for training to obtain the optimal weights;
(5b) In the second stage, using the optimal weights obtained in the first stage as the pre-training weights of the YOLOv5-CA network model and inputting the second training set into the YOLOv5-CA network model for training to obtain the YOLOv5-CA-TL network model;
(5c) Inputting the labeled test set into the YOLOv5-CA-TL network model for detection, obtaining the wild animals in the pictures, and verifying feasibility.
According to the above technical scheme, the beneficial effects of the invention are as follows. First, the collected real wild animal data set of the area to be monitored is taken as the research object, and the coordinate attention module CA effectively addresses the low detection accuracy of other algorithms, so that labor cost is greatly reduced compared with traditional methods. Second, the two-stage training method improves the generalization ability of the network and solves the problem of insufficient wild animal samples. Third, compared with algorithms such as Faster R-CNN, SSD, YOLOv3-spp, YOLOX, and PP-YOLOE, the method improves the overall performance of target detection and effectively improves the target detection accuracy for small-sample wild animals.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a structural diagram of an improved YOLOv5-CA network model;
FIG. 3 is a schematic diagram of a two-stage training method.
Detailed Description
As shown in FIG. 1, a small-sample wild animal detection method based on improved YOLOv5 comprises the following steps in order:
(1) Downloading the AwA2 animal data set, screening it against the real wild animal data set collected in the area to be monitored, selecting images similar to that real data set as the experimental data set, and dividing the experimental data set into a first training set and a validation set. Specifically, the AwA2 animal data set is downloaded from the Internet and screened against the real wild animal data set of the area to be monitored, namely the real Qilian Mountains wild animal data set used in the experiments; 2398 images of 5 animal classes similar to the Qilian Mountains data set are selected as the experimental data set, which is classified and then randomly split 8:2 into a first training set and a validation set;
(2) Screening the collected real wild animal data set of the area to be monitored, selecting pictures with clear animal subjects and high pixel quality as the small-sample experimental data set, and dividing it into a second training set and a test set. Specifically, the photos returned by the Qilian Mountains monitoring cameras are screened, and 282 photos of the 5 animal classes with clear subjects and high pixel quality are selected as the small-sample experimental data set, which is randomly split 8:2 into a second training set and a test set (see the data-split sketch after these steps);
(3) Labeling the experimental data set and the small-sample experimental data set;
(4) Constructing a YOLOv5 network model, adding a coordinate attention module CA to obtain a YOLOv5-CA network model, and inputting the first training set into the YOLOv5-CA network model for training;
(5) Obtaining a YOLOv5-CA-TL network model through a two-stage training method, inputting the labeled test set into the YOLOv5-CA-TL network model, detecting the wild animals in the pictures, and verifying feasibility.
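As a concrete illustration of the 8:2 random splits described in steps (1) and (2), the following is a minimal Python sketch; the directory names, file extension, and fixed random seed are illustrative assumptions, not details taken from the patent.

```python
import random
from pathlib import Path

def split_dataset(image_dir: str, train_ratio: float = 0.8, seed: int = 0):
    """Randomly split the images under image_dir into two lists
    at the given ratio (8:2 by default)."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)  # fixed seed makes the split reproducible
    cut = int(len(images) * train_ratio)
    return images[:cut], images[cut:]

# Hypothetical usage for the two data sets described above:
# first_train, val_set = split_dataset("awa2_screened")       # 2398 AwA2 images
# second_train, test_set = split_dataset("qilian_small_set")  # 282 field photos
```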
In step (3), labeling means annotating the animals in the pictures of the experimental data set and the small-sample experimental data set with the LabelImg tool; the label format is the YOLO format, illustrated below.
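For reference, LabelImg saves YOLO-format labels as one .txt file per image: each line holds a class index followed by the box center coordinates and box size, all normalized to [0, 1]. The class index and coordinates below are a hypothetical example, not actual patent data:

```
# YOLO format: class_id x_center y_center width height (normalized to [0, 1])
2 0.512 0.437 0.286 0.334
```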
Step (4) specifically comprises the following steps:
(4a) Replacing the first and the last C3 module of the backbone of the YOLOv5 network model with the coordinate attention module CA, which then encodes the feature map into two direction-aware, position-sensitive feature maps. The module takes any intermediate tensor $X = [x_1, x_2, x_3, \ldots, x_C] \in \mathbb{R}^{C \times H \times W}$ as input and outputs a tensor $Y = [y_1, y_2, y_3, \ldots, y_C]$ of the same size. Each channel of $X$ is encoded along the horizontal and the vertical direction with pooling kernels of size $(H, 1)$ and $(1, W)$; the output of the $c$-th channel at height $h$ is

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \tag{1}$$

and the output of the $c$-th channel at width $w$ is

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{2}$$

where $H$ is the height of the pooling kernel, $W$ is the width of the pooling kernel, $x_c$ is the tensor of the $c$-th channel, and $h$ is a height coordinate of $X$.

Formulas (1) and (2) are two feature-aggregation transforms that aggregate the features along the two spatial directions and return a pair of direction-aware attention maps. The coordinate attention module CA concatenates the two generated feature layers and passes them through a shared 1 × 1 convolution transform, as in equation (3):

$$f = \delta(F_1([z^h, z^w])) \tag{3}$$

where $\delta$ is a nonlinear activation function, $F_1$ is a 1 × 1 convolution transform, and $f$ is the intermediate feature map obtained by feature-encoding the spatial information in the horizontal and vertical directions. $f$ is then split along the spatial dimension into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$, and two further 1 × 1 convolution transforms $F_h$ and $F_w$ turn $f^h$ and $f^w$ into tensors with the same number of channels as $X$:

$$g^h = \sigma(F_h(f^h)) \tag{4}$$

$$g^w = \sigma(F_w(f^w)) \tag{5}$$

where $F_h$ and $F_w$ are 1 × 1 convolution transforms and $\sigma$ is the sigmoid activation function; during the transform, a reduction ratio $r$ is used to reduce the number of channels of $f$. The outputs $g^h$ and $g^w$ are then used as attention weights, and the final output of the coordinate attention module CA is given by equation (6):

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \tag{6}$$

where $y_c(i, j)$ is the value at position $(i, j)$ in the $c$-th output channel, $x_c(i, j)$ is the value at position $(i, j)$ in the $c$-th input channel, $g_c^h(i)$ is the $i$-th value along the height in the $c$-th channel, and $g_c^w(j)$ is the $j$-th value along the width in the $c$-th channel;
(4b) Adjusting the first training set to 640 × 640 images and inputting them into the YOLOv5-CA network model to obtain 1024 output matrices of size 1 × 1.
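A minimal PyTorch sketch of equations (1) to (6) is given below. It follows the published coordinate attention design; the reduction ratio r, the choice of the nonlinearity δ, and the placement of batch normalization are assumptions, since the patent does not fix them.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate attention (CA): a minimal sketch of equations (1)-(6).
    The reduction ratio r and the Hardswish nonlinearity are assumptions."""
    def __init__(self, channels: int, r: int = 32):
        super().__init__()
        mid = max(8, channels // r)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)   # shared 1x1 conv F_1
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()                              # nonlinear activation delta
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # F_h
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # F_w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                       # eq. (1): pool over width  -> (n, c, h, 1)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # eq. (2): pool over height -> (n, c, w, 1)
        f = self.act(self.bn1(self.conv1(torch.cat([z_h, z_w], dim=2))))  # eq. (3)
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(f_h))                       # eq. (4)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))   # eq. (5)
        return x * g_h * g_w                                        # eq. (6): broadcast multiply
```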
As shown in FIG. 2, the YOLOv5-CA network model in step (4) specifically comprises:
A first layer: this layer is the Focus layer of the deep neural network. The 640 × 640 image is split into its three RGB channels; on each channel the Focus layer takes a value every other pixel of the feature map, producing 4 independent feature layers. The 4 feature layers are stacked, so the information in the width and height dimensions is moved into the channel dimension and the input channels are expanded fourfold; feature extraction then yields 64 output matrices of size 320 × 320 as the input matrices of the second layer (a sketch of this slicing follows the list);
A second layer: this layer defines 128 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction in the deep neural network; 128 kernels help the system extract enough features. The layer uses BN normalization and the SiLU activation function and yields 128 output matrices of size 160 × 160 as the input matrices of the third layer;
A third layer: this layer is the coordinate attention module CA and yields 128 output matrices of size 160 × 160 as the input matrices of the fourth layer;
A fourth layer: this layer is a convolutional layer defining 256 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction; 256 kernels help the system extract enough features. The layer uses BN normalization and the SiLU activation function and yields 256 output matrices of size 80 × 80 as the input matrices of the fifth layer;
A fifth layer: this layer is a C3 module comprising 3 standard convolution layers and several Bottleneck modules. The C3 module is the main module for learning residual features and splits into two branches: one branch passes through the stacked Bottleneck modules and the 3 standard convolution layers, the other only through 1 basic convolution layer; the two branches are finally concatenated along the channel dimension, yielding 256 output matrices of size 80 × 80 as the input matrices of the sixth layer;
A sixth layer: this layer is a convolutional layer defining 512 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction; 512 kernels help the system extract enough features. The layer uses BN normalization and the SiLU activation function and yields 512 output matrices of size 40 × 40 as the input matrices of the seventh layer;
A seventh layer: this layer is a C3 module; the input matrix is split into two branches along the channel dimension, one passing through several stacked Bottleneck modules and 3 standard convolution layers, the other only through 1 basic convolution layer; the two branches are finally concatenated along the channel dimension, yielding 512 output matrices of size 40 × 40 as the input matrices of the eighth layer;
An eighth layer: this layer is a convolutional layer defining 1024 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction; 1024 kernels help the system extract enough features. The layer uses BN normalization and the SiLU activation function and yields 1024 output matrices of size 20 × 20 as the input matrices of the ninth layer;
A ninth layer: this layer is an SPP module; the input matrix passes serially through several MaxPool layers of size 5 × 5, where two serial 5 × 5 MaxPool layers compute the same result as one 9 × 9 MaxPool layer and three serial 5 × 5 MaxPool layers the same as one 13 × 13 MaxPool layer, yielding 1024 output matrices of size 20 × 20 as the input matrices of the tenth layer (see the sketch after this list);
A tenth layer: this layer is the coordinate attention module CA, yielding 1024 output matrices of size 1 × 1.
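To make the Focus slicing of the first layer and the serial pooling of the ninth layer concrete, here is a minimal PyTorch sketch. It mirrors the standard YOLOv5 implementations of these blocks; the channel counts come from the layer description above, while the 3 × 3 kernel of the Focus convolution is an assumption.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """First layer: take every other pixel on each channel to form 4 sub-maps,
    stack them on the channel axis (3 -> 12 channels), then convolve."""
    def __init__(self, c_in: int = 3, c_out: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(4 * c_in, c_out, kernel_size=3, padding=1)

    def forward(self, x):
        # 640x640x3 -> 320x320x12 -> 320x320x64
        patches = [x[..., ::2, ::2], x[..., 1::2, ::2],
                   x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        return self.conv(torch.cat(patches, dim=1))

class SPP(nn.Module):
    """Ninth layer: serial 5x5 max-pooling; two in series match one 9x9
    pooling, three in series match one 13x13 pooling."""
    def __init__(self):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)

    def forward(self, x):
        p1 = self.pool(x)        # receptive field 5x5
        p2 = self.pool(p1)       # receptive field 9x9
        p3 = self.pool(p2)       # receptive field 13x13
        return torch.cat([x, p1, p2, p3], dim=1)  # fuse multi-scale context
```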
As shown in FIG. 3, step (5) specifically comprises the following steps:
(5a) In the first stage, inputting the first training set into the YOLOv5-CA network model for training to obtain the optimal weights;
(5b) In the second stage, using the optimal weights obtained in the first stage as the pre-training weights of the YOLOv5-CA network model and inputting the second training set into the YOLOv5-CA network model for training to obtain the YOLOv5-CA-TL network model;
(5c) Inputting the labeled test set into the YOLOv5-CA-TL network model for detection, obtaining the wild animals in the pictures, and verifying feasibility. Feasibility is verified according to the detection accuracy: high accuracy indicates the method is feasible, otherwise it is not. For example, if the wild animal in a picture is a tiger and the final detection result is a tiger, the detection is correct; the detection accuracy is then computed from many detection results over many pictures, as in the sketch below.
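A minimal sketch of the two-stage training and the accuracy-based feasibility check described above; the train() routine is a stand-in for the YOLOv5-CA training loop, and its signature, the data-config names, and the epoch counts are illustrative assumptions.

```python
def two_stage_training(train):
    """Two-stage transfer learning: train() is assumed to run one full
    training job and return the path of its best validation weights."""
    # Stage 1: train YOLOv5-CA on the first training set (AwA2-derived)
    # and keep the best validation weights.
    best_w = train(data="awa2_experiment.yaml", weights=None, epochs=300)
    # Stage 2: use the stage-1 weights as pre-training and fine-tune on the
    # small second training set, yielding the YOLOv5-CA-TL model.
    return train(data="qilian_small_set.yaml", weights=best_w, epochs=100)

def detection_accuracy(results):
    """Feasibility check: fraction of test pictures whose predicted class
    matches the ground truth (e.g. a tiger detected as a tiger)."""
    correct = sum(1 for pred, truth in results if pred == truth)
    return correct / len(results)
```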
In conclusion, the invention takes the collected real wild animal data set of the area to be monitored as the research object and uses the coordinate attention module CA to effectively address the low detection accuracy of other algorithms, greatly reducing labor cost compared with traditional methods; it adopts a two-stage training method to improve the generalization ability of the network and solve the problem of insufficient wild animal samples; and compared with algorithms such as Faster R-CNN, SSD, YOLOv3-spp, YOLOX, and PP-YOLOE, it improves the overall performance of target detection and effectively improves the target detection accuracy for small-sample wild animals.

Claims (5)

1. A small-sample wild animal detection method based on improved YOLOv5, characterized in that the method comprises the following sequential steps:
(1) Downloading the AwA2 animal data set, screening it against the real wild animal data set collected in the area to be monitored, selecting images similar to that real data set as the experimental data set, and dividing the experimental data set into a first training set and a validation set;
(2) Screening the collected real wild animal data set of the area to be monitored, selecting pictures with clear animal subjects and high pixel quality as the small-sample experimental data set, and dividing it into a second training set and a test set;
(3) Labeling the experimental data set and the small-sample experimental data set;
(4) Constructing a YOLOv5 network model, adding a coordinate attention module CA to obtain a YOLOv5-CA network model, and inputting the labeled first training set into the YOLOv5-CA network model for training;
(5) Obtaining a YOLOv5-CA-TL network model through a two-stage training method, inputting the labeled test set into the YOLOv5-CA-TL network model, detecting the wild animals in the pictures, and verifying feasibility.
2. The improved YOLOv5-based small-sample wild animal detection method of claim 1, characterized in that: in step (3), the labeling refers to annotating the animals in the pictures of the experimental data set and the small-sample experimental data set with the LabelImg tool, the label format being the YOLO format.
3. The improved YOLOv5-based small-sample wild animal detection method of claim 1, characterized in that step (4) specifically comprises the following steps:
(4a) Replacing the first and the last C3 module of the backbone of the YOLOv5 network model with the coordinate attention module CA, which then encodes the feature map into two direction-aware, position-sensitive feature maps. The module takes any intermediate tensor $X = [x_1, x_2, x_3, \ldots, x_C] \in \mathbb{R}^{C \times H \times W}$ as input and outputs a tensor $Y = [y_1, y_2, y_3, \ldots, y_C]$ of the same size. Each channel of $X$ is encoded along the horizontal and the vertical direction with pooling kernels of size $(H, 1)$ and $(1, W)$; the output of the $c$-th channel at height $h$ is

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \tag{1}$$

and the output of the $c$-th channel at width $w$ is

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{2}$$

where $H$ is the height of the pooling kernel, $W$ is the width of the pooling kernel, $x_c$ is the tensor of the $c$-th channel, and $h$ is a height coordinate of $X$;

formulas (1) and (2) are two feature-aggregation transforms that aggregate the features along the two spatial directions and return a pair of direction-aware attention maps; the coordinate attention module CA concatenates the two generated feature layers and passes them through a shared 1 × 1 convolution transform, as in equation (3):

$$f = \delta(F_1([z^h, z^w])) \tag{3}$$

where $\delta$ is a nonlinear activation function, $F_1$ is a 1 × 1 convolution transform, and $f$ is the intermediate feature map obtained by feature-encoding the spatial information in the horizontal and vertical directions; $f$ is then split along the spatial dimension into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$, and two further 1 × 1 convolution transforms $F_h$ and $F_w$ turn $f^h$ and $f^w$ into tensors with the same number of channels as $X$:

$$g^h = \sigma(F_h(f^h)) \tag{4}$$

$$g^w = \sigma(F_w(f^w)) \tag{5}$$

where $F_h$ and $F_w$ are 1 × 1 convolution transforms and $\sigma$ is the sigmoid activation function; during the transform, a reduction ratio $r$ is used to reduce the number of channels of $f$; the outputs $g^h$ and $g^w$ are then used as attention weights, and the final output of the coordinate attention module CA is given by equation (6):

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \tag{6}$$

where $y_c(i, j)$ is the value at position $(i, j)$ in the $c$-th output channel, $x_c(i, j)$ is the value at position $(i, j)$ in the $c$-th input channel, $g_c^h(i)$ is the $i$-th value along the height in the $c$-th channel, and $g_c^w(j)$ is the $j$-th value along the width in the $c$-th channel;
(4b) Adjusting the labeled first training set to 640 × 640 images and inputting them into the YOLOv5-CA network model to obtain 1024 output matrices of size 1 × 1.
4. The improved YOLOv5-based small-sample wild animal detection method of claim 1, characterized in that the YOLOv5-CA network model in step (4) specifically comprises:
A first layer: this layer is the Focus layer of the deep neural network. The 640 × 640 image is split into its three RGB channels; on each channel the Focus layer takes a value every other pixel of the feature map, producing 4 independent feature layers. The 4 feature layers are stacked, so the information in the width and height dimensions is moved into the channel dimension and the input channels are expanded fourfold; feature extraction then yields 64 output matrices of size 320 × 320 as the input matrices of the second layer;
A second layer: this layer defines 128 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction in the deep neural network. The layer uses BN normalization and the SiLU activation function and yields 128 output matrices of size 160 × 160 as the input matrices of the third layer;
A third layer: this layer is the coordinate attention module CA and yields 128 output matrices of size 160 × 160 as the input matrices of the fourth layer;
A fourth layer: this layer is a convolutional layer defining 256 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction. The layer uses BN normalization and the SiLU activation function and yields 256 output matrices of size 80 × 80 as the input matrices of the fifth layer;
A fifth layer: this layer is a C3 module comprising 3 standard convolution layers and several Bottleneck modules. The C3 module is the main module for learning residual features and splits into two branches: one branch passes through the stacked Bottleneck modules and the 3 standard convolution layers, the other only through 1 basic convolution layer; the two branches are finally concatenated along the channel dimension, yielding 256 output matrices of size 80 × 80 as the input matrices of the sixth layer;
A sixth layer: this layer is a convolutional layer defining 512 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction. The layer uses BN normalization and the SiLU activation function and yields 512 output matrices of size 40 × 40 as the input matrices of the seventh layer;
A seventh layer: this layer is a C3 module; the input matrix is split into two branches along the channel dimension, one passing through several stacked Bottleneck modules and 3 standard convolution layers, the other only through 1 basic convolution layer; the two branches are finally concatenated along the channel dimension, yielding 512 output matrices of size 40 × 40 as the input matrices of the eighth layer;
An eighth layer: this layer is a convolutional layer defining 1024 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction. The layer uses BN normalization and the SiLU activation function and yields 1024 output matrices of size 20 × 20 as the input matrices of the ninth layer;
A ninth layer: this layer is an SPP module; the input matrix passes serially through several MaxPool layers of size 5 × 5, yielding 1024 output matrices of size 20 × 20 as the input matrices of the tenth layer;
A tenth layer: this layer is the coordinate attention module CA, yielding 1024 output matrices of size 1 × 1.
5. The improved YOLOv5-based small-sample wild animal detection method of claim 1, characterized in that step (5) specifically comprises the following steps:
(5a) In the first stage, inputting the labeled first training set into the YOLOv5-CA network model for training to obtain the optimal weights;
(5b) In the second stage, using the optimal weights obtained in the first stage as the pre-training weights of the YOLOv5-CA network model and inputting the labeled second training set into the YOLOv5-CA network model for training to obtain the YOLOv5-CA-TL network model;
(5c) Inputting the labeled test set into the YOLOv5-CA-TL network model for detection, obtaining the wild animals in the pictures, and verifying feasibility.
CN202211018085.7A 2022-08-24 2022-08-24 Improved YOLOv5-based small-sample wild animal detection method Pending CN115393618A (en)

Priority Applications (1)

Application number: CN202211018085.7A; priority date: 2022-08-24; filing date: 2022-08-24; title: Improved YOLOv5-based small-sample wild animal detection method

Applications Claiming Priority (1)

Application number: CN202211018085.7A; priority date: 2022-08-24; filing date: 2022-08-24; title: Improved YOLOv5-based small-sample wild animal detection method

Publications (1)

Publication number: CN115393618A; publication date: 2022-11-25

Family

ID=84121016

Family Applications (1)

Application number: CN202211018085.7A; priority date: 2022-08-24; filing date: 2022-08-24; title: Improved YOLOv5-based small-sample wild animal detection method

Country Status (1)

Country Link
CN (1) CN115393618A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116310894A (en) * 2023-02-22 2023-06-23 中交第二公路勘察设计研究院有限公司 Unmanned aerial vehicle remote sensing-based intelligent recognition method for small-sample and small-target Tibetan antelope
CN116310894B (en) * 2023-02-22 2024-04-16 中交第二公路勘察设计研究院有限公司 Unmanned aerial vehicle remote sensing-based intelligent recognition method for small-sample and small-target Tibetan antelope


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination