CN115393618A - Improved YOLOv5-based small-sample wild animal detection method

Improved YOLOv5-based small-sample wild animal detection method

Info

Publication number
CN115393618A
CN115393618A
Authority
CN
China
Prior art keywords
layer
data set
yolov5
network model
matrices
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211018085.7A
Other languages
Chinese (zh)
Inventor
程志友
刘思乾
汪传建
罗荣昊
程灿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2022-08-24
Publication date: 2022-11-25
2022-08-24 Application filed by Anhui University
2022-08-24 Priority to CN202211018085.7A
2022-11-25 Publication of CN115393618A
Legal status: Pending


Classifications

    • G (Physics) / G06 (Computing; calculating or counting) / G06V (Image or video recognition or understanding) / G06V 10/00, G06V 10/70 (using pattern recognition or machine learning)
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G06V 10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/806 Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82 Image or video recognition or understanding using neural networks
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a small-sample wild animal detection method based on improved YOLOv5, comprising the following steps: screening the AwA2 animal data set against the real wild animal data set collected in the area to be monitored, and taking the screened images as an experimental data set; screening the collected real wild animal data set of the area to be monitored to obtain a small-sample experimental data set; labeling the experimental data set and the small-sample experimental data set; adding a coordinate attention module CA to the YOLOv5 network model to obtain a YOLOv5-CA network model; and obtaining a YOLOv5-CA-TL network model through a two-stage training method, detecting the real wild animals of the monitored area, and verifying feasibility. By taking the collected real wild animal data set of the monitored area as the research object and using the coordinate attention module CA, the method effectively addresses the low detection accuracy of other algorithms and greatly reduces labor cost compared with traditional methods.

Description

Improved YOLOv5-based small-sample wild animal detection method
Technical Field
The invention relates to the technical field of deep learning and target detection and classification, and in particular to a small-sample wild animal detection method based on improved YOLOv5.
Background
Biological resources are the natural basis for the sustainable development of humanity and a strong guarantee of ecosystem balance and stability, so wild animals must be continuously monitored and protected. The wild environment is complex and changeable and holds many unknown risks, which makes wildlife protection work difficult to carry out. As technology has developed, various modern techniques have been applied to wildlife monitoring, including radio tracking, wireless sensor network tracking, satellite and GPS tracking, and monitoring with motion-sensitive cameras. With the advance of digital technology, infrared cameras are now widely used for wildlife detection, as they photograph animals without disturbing their normal patterns of activity and rest.
Infrared cameras automatically capture images and videos in the field over long periods, generating a large amount of image and video data that is time-consuming to process manually. In the field of target detection, the data set directly affects the detection result, and a data set with rich, balanced samples helps detection and classification tasks succeed. However, some wildlife nature reserves cover vast regions, and many wild animals are rare species already on the edge of extinction: rare animals are few in number and move erratically, so infrared cameras rarely capture clear pictures of them, and the number of infrared-camera photographs of wild animals actually usable for research is very small.
At present, existing wild animal detection and classification methods, at home and abroad, are built on the assumption of sufficient samples. When wild animal samples are insufficient, these methods struggle to complete detection and classification tasks well, and how to improve wild animal detection accuracy under insufficient samples is a technical problem urgently needing a solution in the field of wild animal detection.
Disclosure of Invention
The invention aims to provide a small-sample wild animal detection method based on improved YOLOv5 that solves the problem of insufficient wild animal samples, improves the overall performance of target detection, and effectively improves the target detection accuracy for small-sample wild animals.
In order to realize this purpose, the invention adopts the following technical scheme: a small-sample wild animal detection method based on improved YOLOv5, comprising the following sequential steps:
(1) Downloading the AwA2 animal data set, screening it against the real wild animal data set collected in the area to be monitored, selecting images similar to that real data set as the experimental data set, and dividing the experimental data set into a first training set and a validation set;
(2) Screening the collected real wild animal data set of the area to be monitored, selecting pictures with clear animal subjects and high pixel quality as the small-sample experimental data set, and dividing it into a second training set and a test set;
(3) Labeling the experimental data set and the small-sample experimental data set;
(4) Constructing a YOLOv5 network model, adding a coordinate attention module CA to obtain a YOLOv5-CA network model, and inputting the first training set into the YOLOv5-CA network model for training;
(5) Obtaining a YOLOv5-CA-TL network model through a two-stage training method, inputting the labeled test set into the YOLOv5-CA-TL network model, detecting the wild animals in the pictures, and verifying feasibility.
In step (3), labeling refers to annotating the animals in the pictures of the experimental data set and the small-sample experimental data set with the LabelImg tool; the label format is the YOLO format.
Step (4) specifically comprises the following steps:
(4a) Replacing the first and the last C3 module of the backbone of the YOLOv5 network model with the coordinate attention module CA, which then encodes the feature map into two direction-aware, position-sensitive feature maps. The module takes any intermediate tensor $X = [x_1, x_2, x_3, \ldots, x_C] \in \mathbb{R}^{C \times H \times W}$ as input and outputs a tensor $Y = [y_1, y_2, y_3, \ldots, y_C]$ of the same size. Each channel of $X$ is encoded along the horizontal and the vertical direction with pooling kernels of size $(H, 1)$ and $(1, W)$; the output of the $c$-th channel at height $h$ is

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \tag{1}$$

and the output of the $c$-th channel at width $w$ is

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{2}$$

where $H$ is the height of the pooling kernel, $W$ is the width of the pooling kernel, $x_c$ is the tensor of the $c$-th channel, and $h$ is a height coordinate of $X$.

Formulas (1) and (2) are two feature-aggregation transforms that aggregate the features along the two spatial directions and return a pair of direction-aware attention maps. The coordinate attention module CA concatenates the two generated feature layers and passes them through a shared 1 × 1 convolution transform, as in equation (3):

$$f = \delta(F_1([z^h, z^w])) \tag{3}$$

where $\delta$ is a nonlinear activation function, $F_1$ is a 1 × 1 convolution transform, and $f$ is the intermediate feature map obtained by feature-encoding the spatial information in the horizontal and vertical directions. $f$ is then split along the spatial dimension into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$, and two further 1 × 1 convolution transforms $F_h$ and $F_w$ turn $f^h$ and $f^w$ into tensors with the same number of channels as $X$:

$$g^h = \sigma(F_h(f^h)) \tag{4}$$

$$g^w = \sigma(F_w(f^w)) \tag{5}$$

where $F_h$ and $F_w$ are 1 × 1 convolution transforms and $\sigma$ is the sigmoid activation function; during the transform, a reduction ratio $r$ is used to reduce the number of channels of $f$. The outputs $g^h$ and $g^w$ are then used as attention weights, and the final output of the coordinate attention module CA is given by equation (6):

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \tag{6}$$

where $y_c(i, j)$ is the value at position $(i, j)$ in the $c$-th output channel, $x_c(i, j)$ is the value at position $(i, j)$ in the $c$-th input channel, $g_c^h(i)$ is the $i$-th value along the height in the $c$-th channel, and $g_c^w(j)$ is the $j$-th value along the width in the $c$-th channel;
(4b) Adjusting the first training set to 640 × 640 images and inputting them into the YOLOv5-CA network model to obtain 1024 output matrices of size 1 × 1.
The YOLOv5-CA network model in step (4) specifically comprises:
A first layer: this layer is the Focus layer of the deep neural network. The 640 × 640 image is split into its three RGB channels; on each channel the Focus layer takes a value every other pixel of the feature map, producing 4 independent feature layers. The 4 feature layers are stacked, so the information in the width and height dimensions is moved into the channel dimension and the input channels are expanded fourfold; feature extraction then yields 64 output matrices of size 320 × 320 as the input matrices of the second layer;
A second layer: this layer defines 128 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction in the deep neural network. The layer uses BN normalization and the SiLU activation function and yields 128 output matrices of size 160 × 160 as the input matrices of the third layer;
A third layer: this layer is the coordinate attention module CA and yields 128 output matrices of size 160 × 160 as the input matrices of the fourth layer;
A fourth layer: this layer is a convolutional layer defining 256 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction. The layer uses BN normalization and the SiLU activation function and yields 256 output matrices of size 80 × 80 as the input matrices of the fifth layer;
A fifth layer: this layer is a C3 module comprising 3 standard convolution layers and several Bottleneck modules. The C3 module is the main module for learning residual features and splits into two branches: one branch passes through the stacked Bottleneck modules and the 3 standard convolution layers, the other only through 1 basic convolution layer; the two branches are finally concatenated along the channel dimension, yielding 256 output matrices of size 80 × 80 as the input matrices of the sixth layer;
A sixth layer: this layer is a convolutional layer defining 512 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction. The layer uses BN normalization and the SiLU activation function and yields 512 output matrices of size 40 × 40 as the input matrices of the seventh layer;
A seventh layer: this layer is a C3 module; the input matrix is split into two branches along the channel dimension, one passing through several stacked Bottleneck modules and 3 standard convolution layers, the other only through 1 basic convolution layer; the two branches are finally concatenated along the channel dimension, yielding 512 output matrices of size 40 × 40 as the input matrices of the eighth layer;
An eighth layer: this layer is a convolutional layer defining 1024 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction. The layer uses BN normalization and the SiLU activation function and yields 1024 output matrices of size 20 × 20 as the input matrices of the ninth layer;
A ninth layer: this layer is an SPP module; the input matrix passes serially through several MaxPool layers of size 5 × 5, yielding 1024 output matrices of size 20 × 20 as the input matrices of the tenth layer;
A tenth layer: this layer is the coordinate attention module CA, yielding 1024 output matrices of size 1 × 1.
Step (5) specifically comprises the following steps:
(5a) In the first stage, inputting the first training set into the YOLOv5-CA network model for training to obtain the optimal weights;
(5b) In the second stage, using the optimal weights obtained in the first stage as the pre-training weights of the YOLOv5-CA network model and inputting the second training set into the YOLOv5-CA network model for training to obtain the YOLOv5-CA-TL network model;
(5c) Inputting the labeled test set into the YOLOv5-CA-TL network model for detection, obtaining the wild animals in the pictures, and verifying feasibility.
According to the above technical scheme, the beneficial effects of the invention are as follows. First, the collected real wild animal data set of the area to be monitored is taken as the research object, and the coordinate attention module CA effectively addresses the low detection accuracy of other algorithms, so that labor cost is greatly reduced compared with traditional methods. Second, the two-stage training method improves the generalization ability of the network and solves the problem of insufficient wild animal samples. Third, compared with algorithms such as Faster R-CNN, SSD, YOLOv3-spp, YOLOX, and PP-YOLOE, the method improves the overall performance of target detection and effectively improves the target detection accuracy for small-sample wild animals.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a structural diagram of an improved YOLOv5-CA network model;
FIG. 3 is a schematic diagram of a two-stage training method.
Detailed Description
As shown in FIG. 1, a small-sample wild animal detection method based on improved YOLOv5 comprises the following steps in order:
(1) Downloading the AwA2 animal data set, screening it against the real wild animal data set collected in the area to be monitored, selecting images similar to that real data set as the experimental data set, and dividing the experimental data set into a first training set and a validation set. Specifically, the AwA2 animal data set is downloaded from the Internet and screened against the real wild animal data set of the area to be monitored, namely the real Qilian Mountains wild animal data set used in the experiments; 2398 images of 5 animal classes similar to the Qilian Mountains data set are selected as the experimental data set, which is classified and then randomly split 8:2 into a first training set and a validation set;
(2) Screening the collected real wild animal data set of the area to be monitored, selecting pictures with clear animal subjects and high pixel quality as the small-sample experimental data set, and dividing it into a second training set and a test set. Specifically, the photos returned by the Qilian Mountains monitoring cameras are screened, and 282 photos of the 5 animal classes with clear subjects and high pixel quality are selected as the small-sample experimental data set, which is randomly split 8:2 into a second training set and a test set (see the data-split sketch after these steps);
(3) Labeling the experimental data set and the small-sample experimental data set;
(4) Constructing a YOLOv5 network model, adding a coordinate attention module CA to obtain a YOLOv5-CA network model, and inputting the first training set into the YOLOv5-CA network model for training;
(5) Obtaining a YOLOv5-CA-TL network model through a two-stage training method, inputting the labeled test set into the YOLOv5-CA-TL network model, detecting the wild animals in the pictures, and verifying feasibility.
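As a concrete illustration of the 8:2 random splits described in steps (1) and (2), the following is a minimal Python sketch; the directory names, file extension, and fixed random seed are illustrative assumptions, not details taken from the patent.

```python
import random
from pathlib import Path

def split_dataset(image_dir: str, train_ratio: float = 0.8, seed: int = 0):
    """Randomly split the images under image_dir into two lists
    at the given ratio (8:2 by default)."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)  # fixed seed makes the split reproducible
    cut = int(len(images) * train_ratio)
    return images[:cut], images[cut:]

# Hypothetical usage for the two data sets described above:
# first_train, val_set = split_dataset("awa2_screened")       # 2398 AwA2 images
# second_train, test_set = split_dataset("qilian_small_set")  # 282 field photos
```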
In step (3), labeling means annotating the animals in the pictures of the experimental data set and the small-sample experimental data set with the LabelImg tool; the label format is the YOLO format, illustrated below.
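For reference, LabelImg saves YOLO-format labels as one .txt file per image: each line holds a class index followed by the box center coordinates and box size, all normalized to [0, 1]. The class index and coordinates below are a hypothetical example, not actual patent data:

```
# YOLO format: class_id x_center y_center width height (normalized to [0, 1])
2 0.512 0.437 0.286 0.334
```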
Step (4) specifically comprises the following steps:
(4a) Replacing the first and the last C3 module of the backbone of the YOLOv5 network model with the coordinate attention module CA, which then encodes the feature map into two direction-aware, position-sensitive feature maps. The module takes any intermediate tensor $X = [x_1, x_2, x_3, \ldots, x_C] \in \mathbb{R}^{C \times H \times W}$ as input and outputs a tensor $Y = [y_1, y_2, y_3, \ldots, y_C]$ of the same size. Each channel of $X$ is encoded along the horizontal and the vertical direction with pooling kernels of size $(H, 1)$ and $(1, W)$; the output of the $c$-th channel at height $h$ is

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \tag{1}$$

and the output of the $c$-th channel at width $w$ is

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{2}$$

where $H$ is the height of the pooling kernel, $W$ is the width of the pooling kernel, $x_c$ is the tensor of the $c$-th channel, and $h$ is a height coordinate of $X$.

Formulas (1) and (2) are two feature-aggregation transforms that aggregate the features along the two spatial directions and return a pair of direction-aware attention maps. The coordinate attention module CA concatenates the two generated feature layers and passes them through a shared 1 × 1 convolution transform, as in equation (3):

$$f = \delta(F_1([z^h, z^w])) \tag{3}$$

where $\delta$ is a nonlinear activation function, $F_1$ is a 1 × 1 convolution transform, and $f$ is the intermediate feature map obtained by feature-encoding the spatial information in the horizontal and vertical directions. $f$ is then split along the spatial dimension into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$, and two further 1 × 1 convolution transforms $F_h$ and $F_w$ turn $f^h$ and $f^w$ into tensors with the same number of channels as $X$:

$$g^h = \sigma(F_h(f^h)) \tag{4}$$

$$g^w = \sigma(F_w(f^w)) \tag{5}$$

where $F_h$ and $F_w$ are 1 × 1 convolution transforms and $\sigma$ is the sigmoid activation function; during the transform, a reduction ratio $r$ is used to reduce the number of channels of $f$. The outputs $g^h$ and $g^w$ are then used as attention weights, and the final output of the coordinate attention module CA is given by equation (6):

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \tag{6}$$

where $y_c(i, j)$ is the value at position $(i, j)$ in the $c$-th output channel, $x_c(i, j)$ is the value at position $(i, j)$ in the $c$-th input channel, $g_c^h(i)$ is the $i$-th value along the height in the $c$-th channel, and $g_c^w(j)$ is the $j$-th value along the width in the $c$-th channel;
(4b) Adjusting the first training set to 640 × 640 images and inputting them into the YOLOv5-CA network model to obtain 1024 output matrices of size 1 × 1.
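A minimal PyTorch sketch of equations (1) to (6) is given below. It follows the published coordinate attention design; the reduction ratio r, the choice of the nonlinearity δ, and the placement of batch normalization are assumptions, since the patent does not fix them.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate attention (CA): a minimal sketch of equations (1)-(6).
    The reduction ratio r and the Hardswish nonlinearity are assumptions."""
    def __init__(self, channels: int, r: int = 32):
        super().__init__()
        mid = max(8, channels // r)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)   # shared 1x1 conv F_1
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()                              # nonlinear activation delta
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # F_h
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # F_w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                       # eq. (1): pool over width  -> (n, c, h, 1)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # eq. (2): pool over height -> (n, c, w, 1)
        f = self.act(self.bn1(self.conv1(torch.cat([z_h, z_w], dim=2))))  # eq. (3)
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(f_h))                       # eq. (4)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))   # eq. (5)
        return x * g_h * g_w                                        # eq. (6): broadcast multiply
```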
As shown in FIG. 2, the YOLOv5-CA network model in step (4) specifically comprises:
A first layer: this layer is the Focus layer of the deep neural network. The 640 × 640 image is split into its three RGB channels; on each channel the Focus layer takes a value every other pixel of the feature map, producing 4 independent feature layers. The 4 feature layers are stacked, so the information in the width and height dimensions is moved into the channel dimension and the input channels are expanded fourfold; feature extraction then yields 64 output matrices of size 320 × 320 as the input matrices of the second layer (a sketch of this slicing follows the list);
A second layer: this layer defines 128 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction in the deep neural network; 128 kernels help the system extract enough features. The layer uses BN normalization and the SiLU activation function and yields 128 output matrices of size 160 × 160 as the input matrices of the third layer;
A third layer: this layer is the coordinate attention module CA and yields 128 output matrices of size 160 × 160 as the input matrices of the fourth layer;
A fourth layer: this layer is a convolutional layer defining 256 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction; 256 kernels help the system extract enough features. The layer uses BN normalization and the SiLU activation function and yields 256 output matrices of size 80 × 80 as the input matrices of the fifth layer;
A fifth layer: this layer is a C3 module comprising 3 standard convolution layers and several Bottleneck modules. The C3 module is the main module for learning residual features and splits into two branches: one branch passes through the stacked Bottleneck modules and the 3 standard convolution layers, the other only through 1 basic convolution layer; the two branches are finally concatenated along the channel dimension, yielding 256 output matrices of size 80 × 80 as the input matrices of the sixth layer;
A sixth layer: this layer is a convolutional layer defining 512 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction; 512 kernels help the system extract enough features. The layer uses BN normalization and the SiLU activation function and yields 512 output matrices of size 40 × 40 as the input matrices of the seventh layer;
A seventh layer: this layer is a C3 module; the input matrix is split into two branches along the channel dimension, one passing through several stacked Bottleneck modules and 3 standard convolution layers, the other only through 1 basic convolution layer; the two branches are finally concatenated along the channel dimension, yielding 512 output matrices of size 40 × 40 as the input matrices of the eighth layer;
An eighth layer: this layer is a convolutional layer defining 1024 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction; 1024 kernels help the system extract enough features. The layer uses BN normalization and the SiLU activation function and yields 1024 output matrices of size 20 × 20 as the input matrices of the ninth layer;
A ninth layer: this layer is an SPP module; the input matrix passes serially through several MaxPool layers of size 5 × 5, where two serial 5 × 5 MaxPool layers compute the same result as one 9 × 9 MaxPool layer and three serial 5 × 5 MaxPool layers the same as one 13 × 13 MaxPool layer, yielding 1024 output matrices of size 20 × 20 as the input matrices of the tenth layer (see the sketch after this list);
A tenth layer: this layer is the coordinate attention module CA, yielding 1024 output matrices of size 1 × 1.
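To make the Focus slicing of the first layer and the serial pooling of the ninth layer concrete, here is a minimal PyTorch sketch. It mirrors the standard YOLOv5 implementations of these blocks; the channel counts come from the layer description above, while the 3 × 3 kernel of the Focus convolution is an assumption.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """First layer: take every other pixel on each channel to form 4 sub-maps,
    stack them on the channel axis (3 -> 12 channels), then convolve."""
    def __init__(self, c_in: int = 3, c_out: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(4 * c_in, c_out, kernel_size=3, padding=1)

    def forward(self, x):
        # 640x640x3 -> 320x320x12 -> 320x320x64
        patches = [x[..., ::2, ::2], x[..., 1::2, ::2],
                   x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        return self.conv(torch.cat(patches, dim=1))

class SPP(nn.Module):
    """Ninth layer: serial 5x5 max-pooling; two in series match one 9x9
    pooling, three in series match one 13x13 pooling."""
    def __init__(self):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)

    def forward(self, x):
        p1 = self.pool(x)        # receptive field 5x5
        p2 = self.pool(p1)       # receptive field 9x9
        p3 = self.pool(p2)       # receptive field 13x13
        return torch.cat([x, p1, p2, p3], dim=1)  # fuse multi-scale context
```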
As shown in FIG. 3, step (5) specifically comprises the following steps:
(5a) In the first stage, inputting the first training set into the YOLOv5-CA network model for training to obtain the optimal weights;
(5b) In the second stage, using the optimal weights obtained in the first stage as the pre-training weights of the YOLOv5-CA network model and inputting the second training set into the YOLOv5-CA network model for training to obtain the YOLOv5-CA-TL network model;
(5c) Inputting the labeled test set into the YOLOv5-CA-TL network model for detection, obtaining the wild animals in the pictures, and verifying feasibility. Feasibility is verified according to the detection accuracy: high accuracy indicates the method is feasible, otherwise it is not. For example, if the wild animal in a picture is a tiger and the final detection result is a tiger, the detection is correct; the detection accuracy is then computed from many detection results over many pictures, as in the sketch below.
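A minimal sketch of the two-stage training and the accuracy-based feasibility check described above; the train() routine is a stand-in for the YOLOv5-CA training loop, and its signature, the data-config names, and the epoch counts are illustrative assumptions.

```python
def two_stage_training(train):
    """Two-stage transfer learning: train() is assumed to run one full
    training job and return the path of its best validation weights."""
    # Stage 1: train YOLOv5-CA on the first training set (AwA2-derived)
    # and keep the best validation weights.
    best_w = train(data="awa2_experiment.yaml", weights=None, epochs=300)
    # Stage 2: use the stage-1 weights as pre-training and fine-tune on the
    # small second training set, yielding the YOLOv5-CA-TL model.
    return train(data="qilian_small_set.yaml", weights=best_w, epochs=100)

def detection_accuracy(results):
    """Feasibility check: fraction of test pictures whose predicted class
    matches the ground truth (e.g. a tiger detected as a tiger)."""
    correct = sum(1 for pred, truth in results if pred == truth)
    return correct / len(results)
```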
In conclusion, the invention takes the collected real wild animal data set of the area to be monitored as the research object and uses the coordinate attention module CA to effectively address the low detection accuracy of other algorithms, greatly reducing labor cost compared with traditional methods; it adopts a two-stage training method to improve the generalization ability of the network and solve the problem of insufficient wild animal samples; and compared with algorithms such as Faster R-CNN, SSD, YOLOv3-spp, YOLOX, and PP-YOLOE, it improves the overall performance of target detection and effectively improves the target detection accuracy for small-sample wild animals.

Claims (5)

1. A small-sample wild animal detection method based on improved YOLOv5, characterized in that the method comprises the following sequential steps:
(1) Downloading the AwA2 animal data set, screening it against the real wild animal data set collected in the area to be monitored, selecting images similar to that real data set as the experimental data set, and dividing the experimental data set into a first training set and a validation set;
(2) Screening the collected real wild animal data set of the area to be monitored, selecting pictures with clear animal subjects and high pixel quality as the small-sample experimental data set, and dividing it into a second training set and a test set;
(3) Labeling the experimental data set and the small-sample experimental data set;
(4) Constructing a YOLOv5 network model, adding a coordinate attention module CA to obtain a YOLOv5-CA network model, and inputting the labeled first training set into the YOLOv5-CA network model for training;
(5) Obtaining a YOLOv5-CA-TL network model through a two-stage training method, inputting the labeled test set into the YOLOv5-CA-TL network model, detecting the wild animals in the pictures, and verifying feasibility.
2. The improved YOLOv5-based small-sample wild animal detection method of claim 1, characterized in that: in step (3), the labeling refers to annotating the animals in the pictures of the experimental data set and the small-sample experimental data set with the LabelImg tool, the label format being the YOLO format.
3. The improved YOLOv5-based small-sample wild animal detection method of claim 1, characterized in that step (4) specifically comprises the following steps:
(4a) Replacing the first and the last C3 module of the backbone of the YOLOv5 network model with the coordinate attention module CA, which then encodes the feature map into two direction-aware, position-sensitive feature maps. The module takes any intermediate tensor $X = [x_1, x_2, x_3, \ldots, x_C] \in \mathbb{R}^{C \times H \times W}$ as input and outputs a tensor $Y = [y_1, y_2, y_3, \ldots, y_C]$ of the same size. Each channel of $X$ is encoded along the horizontal and the vertical direction with pooling kernels of size $(H, 1)$ and $(1, W)$; the output of the $c$-th channel at height $h$ is

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \tag{1}$$

and the output of the $c$-th channel at width $w$ is

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{2}$$

where $H$ is the height of the pooling kernel, $W$ is the width of the pooling kernel, $x_c$ is the tensor of the $c$-th channel, and $h$ is a height coordinate of $X$;

formulas (1) and (2) are two feature-aggregation transforms that aggregate the features along the two spatial directions and return a pair of direction-aware attention maps; the coordinate attention module CA concatenates the two generated feature layers and passes them through a shared 1 × 1 convolution transform, as in equation (3):

$$f = \delta(F_1([z^h, z^w])) \tag{3}$$

where $\delta$ is a nonlinear activation function, $F_1$ is a 1 × 1 convolution transform, and $f$ is the intermediate feature map obtained by feature-encoding the spatial information in the horizontal and vertical directions; $f$ is then split along the spatial dimension into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$, and two further 1 × 1 convolution transforms $F_h$ and $F_w$ turn $f^h$ and $f^w$ into tensors with the same number of channels as $X$:

$$g^h = \sigma(F_h(f^h)) \tag{4}$$

$$g^w = \sigma(F_w(f^w)) \tag{5}$$

where $F_h$ and $F_w$ are 1 × 1 convolution transforms and $\sigma$ is the sigmoid activation function; during the transform, a reduction ratio $r$ is used to reduce the number of channels of $f$; the outputs $g^h$ and $g^w$ are then used as attention weights, and the final output of the coordinate attention module CA is given by equation (6):

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \tag{6}$$

where $y_c(i, j)$ is the value at position $(i, j)$ in the $c$-th output channel, $x_c(i, j)$ is the value at position $(i, j)$ in the $c$-th input channel, $g_c^h(i)$ is the $i$-th value along the height in the $c$-th channel, and $g_c^w(j)$ is the $j$-th value along the width in the $c$-th channel;
(4b) Adjusting the labeled first training set to 640 × 640 images and inputting them into the YOLOv5-CA network model to obtain 1024 output matrices of size 1 × 1.
4. The improved YOLOv5-based small-sample wild animal detection method of claim 1, characterized in that the YOLOv5-CA network model in step (4) specifically comprises:
A first layer: this layer is the Focus layer of the deep neural network. The 640 × 640 image is split into its three RGB channels; on each channel the Focus layer takes a value every other pixel of the feature map, producing 4 independent feature layers. The 4 feature layers are stacked, so the information in the width and height dimensions is moved into the channel dimension and the input channels are expanded fourfold; feature extraction then yields 64 output matrices of size 320 × 320 as the input matrices of the second layer;
A second layer: this layer defines 128 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction in the deep neural network. The layer uses BN normalization and the SiLU activation function and yields 128 output matrices of size 160 × 160 as the input matrices of the third layer;
A third layer: this layer is the coordinate attention module CA and yields 128 output matrices of size 160 × 160 as the input matrices of the fourth layer;
A fourth layer: this layer is a convolutional layer defining 256 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction. The layer uses BN normalization and the SiLU activation function and yields 256 output matrices of size 80 × 80 as the input matrices of the fifth layer;
A fifth layer: this layer is a C3 module comprising 3 standard convolution layers and several Bottleneck modules. The C3 module is the main module for learning residual features and splits into two branches: one branch passes through the stacked Bottleneck modules and the 3 standard convolution layers, the other only through 1 basic convolution layer; the two branches are finally concatenated along the channel dimension, yielding 256 output matrices of size 80 × 80 as the input matrices of the sixth layer;
A sixth layer: this layer is a convolutional layer defining 512 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction. The layer uses BN normalization and the SiLU activation function and yields 512 output matrices of size 40 × 40 as the input matrices of the seventh layer;
A seventh layer: this layer is a C3 module; the input matrix is split into two branches along the channel dimension, one passing through several stacked Bottleneck modules and 3 standard convolution layers, the other only through 1 basic convolution layer; the two branches are finally concatenated along the channel dimension, yielding 512 output matrices of size 40 × 40 as the input matrices of the eighth layer;
An eighth layer: this layer is a convolutional layer defining 1024 convolution kernels of size 2 × 2, each acting as a filter for learning and feature extraction. The layer uses BN normalization and the SiLU activation function and yields 1024 output matrices of size 20 × 20 as the input matrices of the ninth layer;
A ninth layer: this layer is an SPP module; the input matrix passes serially through several MaxPool layers of size 5 × 5, yielding 1024 output matrices of size 20 × 20 as the input matrices of the tenth layer;
A tenth layer: this layer is the coordinate attention module CA, yielding 1024 output matrices of size 1 × 1.
5. The improved YOLOv5-based small-sample wild animal detection method of claim 1, characterized in that step (5) specifically comprises the following steps:
(5a) In the first stage, inputting the labeled first training set into the YOLOv5-CA network model for training to obtain the optimal weights;
(5b) In the second stage, using the optimal weights obtained in the first stage as the pre-training weights of the YOLOv5-CA network model and inputting the labeled second training set into the YOLOv5-CA network model for training to obtain the YOLOv5-CA-TL network model;
(5c) Inputting the labeled test set into the YOLOv5-CA-TL network model for detection, obtaining the wild animals in the pictures, and verifying feasibility.
CN202211018085.7A 2022-08-24 2022-08-24 Improved YOLOv5-based small-sample wild animal detection method Pending CN115393618A (en)

Priority Applications (1)

Application number: CN202211018085.7A; priority date: 2022-08-24; filing date: 2022-08-24; title: Improved YOLOv5-based small-sample wild animal detection method

Applications Claiming Priority (1)

Application number: CN202211018085.7A; priority date: 2022-08-24; filing date: 2022-08-24; title: Improved YOLOv5-based small-sample wild animal detection method

Publications (1)

Publication number: CN115393618A; publication date: 2022-11-25

Family

ID=84121016

Family Applications (1)

Application number: CN202211018085.7A; priority date: 2022-08-24; filing date: 2022-08-24; title: Improved YOLOv5-based small-sample wild animal detection method

Country Status (1)

Country Link
CN (1) CN115393618A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116310894A (en) * 2023-02-22 2023-06-23 中交第二公路勘察设计研究院有限公司 Unmanned aerial vehicle remote sensing-based intelligent recognition method for small-sample and small-target Tibetan antelope
CN116310894B (en) * 2023-02-22 2024-04-16 中交第二公路勘察设计研究院有限公司 Unmanned aerial vehicle remote sensing-based intelligent recognition method for small-sample and small-target Tibetan antelope


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination