CN115393618A - Improved YOLOv5-based small-sample wild animal detection method - Google Patents
Improved YOLOv5-based small-sample wild animal detection method
- Publication number
- CN115393618A CN115393618A CN202211018085.7A CN202211018085A CN115393618A CN 115393618 A CN115393618 A CN 115393618A CN 202211018085 A CN202211018085 A CN 202211018085A CN 115393618 A CN115393618 A CN 115393618A
- Authority
- CN
- China
- Prior art keywords
- layer
- data set
- yolov5
- network model
- matrixes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention relates to a small-sample wild animal detection method based on improved YOLOv5, comprising the following steps: screening the AwA2 animal data set against the real wild animal data set collected from the survey area to be monitored, and taking the screened images as an experimental data set; screening the collected real wild animal data set of the survey area to obtain a small-sample experimental data set; labeling the experimental data set and the small-sample experimental data set; adding a coordinate attention module CA to the YOLOv5 network model to obtain a YOLOv5-CA network model; and obtaining a YOLOv5-CA-TL network model through a two-stage training method, detecting the real wild animals of the survey area, and verifying feasibility. By taking the collected real wild animal data set of the survey area as the research object and using the coordinate attention module CA to address the low detection accuracy of other algorithms, the method greatly reduces labor cost compared with traditional approaches.
Description
Technical Field
The invention relates to the technical field of deep learning and target detection and classification, and in particular to a small-sample wild animal detection method based on improved YOLOv5.
Background
Biological resources are the natural basis for the sustainable development of humanity and a strong guarantee of the balance and stability of ecosystems, so wild animals need continuous monitoring and protection. The wild environment is complex and changeable and holds many unknown risks, which makes wildlife protection work difficult to carry out. As technology has evolved, various modern techniques have been applied to wildlife monitoring, including radio tracking, wireless sensor network tracking, satellite and GPS tracking, and monitoring with motion-sensitive cameras. With the advancement of digital technology, infrared cameras are now widely used for wildlife detection, allowing photography without disturbing the animals' normal activity.
Infrared cameras automatically capture images and videos in the field over long periods, generating large amounts of image and video data that are time-consuming to process manually. In the field of target detection, the data set directly influences detection performance, and a rich, balanced data set helps complete detection and classification tasks. However, many nature reserves cover vast areas, and many wild animals are rare species already on the edge of extinction; rare animals are few in number and their movements are erratic, so infrared cameras seldom capture clear animal pictures. As a result, the number of infrared-camera images of wild animals actually usable for research is very small.
At present, existing wild animal detection and classification methods at home and abroad are built on the premise of sufficient samples. When wild animal samples are insufficient, these methods struggle to complete detection and classification tasks well, and improving wild animal detection accuracy under insufficient samples is an urgent technical problem in the field of wild animal detection.
Disclosure of Invention
The invention aims to provide a small-sample wild animal detection method based on improved YOLOv5, which can alleviate the problem of insufficient wild animal samples, improve the overall performance of target detection, and effectively improve detection accuracy for small-sample wild animals.
To achieve this purpose, the invention adopts the following technical scheme: a small-sample wild animal detection method based on improved YOLOv5, comprising the following sequential steps:
(1) Downloading an AwA2 animal data set, screening the AwA2 animal data set against the collected real wild animal data set of the survey area to be monitored, selecting images similar to that data set as the experimental data set, and dividing the experimental data set into a first training set and a validation set;
(2) Screening the collected real wild animal data set of the survey area to be monitored, selecting pictures with clear animal subjects and high pixel quality as the small-sample experimental data set, and dividing it into a second training set and a test set;
(3) Labeling the experimental data set and the small-sample experimental data set;
(4) Constructing a YOLOv5 network model, adding the coordinate attention module CA to the YOLOv5 network model to obtain a YOLOv5-CA network model, and inputting the first training set into the YOLOv5-CA network model for training;
(5) Obtaining a YOLOv5-CA-TL network model by a two-stage training method, inputting the labeled test set into the YOLOv5-CA-TL network model, detecting the wild animals in the pictures, and verifying feasibility.
In step (3), the labeling refers to annotating the animals in the pictures of the experimental data set and the small-sample experimental data set with the LabelImg tool, using the YOLO label format.
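For reference, each YOLO-format annotation produced by LabelImg is one text line per object: the class index followed by the box centre and size, all normalised to [0, 1]. A minimal sketch of the conversion from pixel coordinates (the class index and box values below are illustrative, not from the patent):

```python
def to_yolo_line(class_id, xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a pixel-space bounding box to a YOLO-format label line:
    'class x_center y_center width height', all normalised to [0, 1]."""
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Hypothetical example: a 320x240 box centred in a 640x480 image, class 0
print(to_yolo_line(0, 160, 120, 480, 360, 640, 480))
# -> "0 0.500000 0.500000 0.500000 0.500000"
```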
The step (4) specifically comprises the following steps:
(4a) Replacing the first C3 module and the last C3 module of the backbone network of the YOLOv5 network model with the coordinate attention module CA, which encodes the feature map along two spatial directions to form two feature maps that are respectively direction-aware and position-sensitive; any intermediate tensor $X = [x_1, x_2, x_3, \ldots, x_C] \in \mathbb{R}^{C \times H \times W}$ is taken as input, and a tensor $Y = [y_1, y_2, y_3, \ldots, y_C]$ of the same size is output; each channel of $X$ is encoded along the horizontal and vertical directions using pooling kernels of sizes $(H, 1)$ and $(1, W)$, and the output of the $c$-th channel at height $h$ is as follows:

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \tag{1}$$

The output of the $c$-th channel at width $w$ is as follows:

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{2}$$

wherein $H$ is the height of the pooling kernel, $W$ is the width of the pooling kernel, $x_c$ is the tensor of the $c$-th channel, and $h$ is a height index of $X$;
Formulas (1) and (2) are two feature-aggregation transformations that aggregate along the two spatial directions and return a pair of direction-aware attention maps; the coordinate attention module CA concatenates the two generated feature layers and then applies a shared $1 \times 1$ convolution transformation, as shown in equation (3):

$$f = \delta(F_1([z^h, z^w])) \tag{3}$$

where $\delta$ is a nonlinear activation function, $F_1$ is a $1 \times 1$ convolution transformation, and $f$ is the intermediate feature map obtained by feature-encoding the spatial information in the horizontal and vertical directions; $f$ is then decomposed along the spatial dimension into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$, and two further $1 \times 1$ convolution transformations $F_h$ and $F_w$ respectively transform $f^h$ and $f^w$ into tensors with the same number of channels as $X$, yielding:

$$g^h = \sigma(F_h(f^h)) \tag{4}$$

$$g^w = \sigma(F_w(f^w)) \tag{5}$$

In the formulas, $F_h$ and $F_w$ are $1 \times 1$ convolution transformations and $\sigma$ is the sigmoid activation function; during the transformation, a reduction ratio $r$ is used to reduce the number of channels of $f$; the outputs $g^h$ and $g^w$ are then used as attention weights, and the final output of the coordinate attention module CA is shown in equation (6):

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \tag{6}$$

In the formula, $y_c(i, j)$ denotes the value at $(i, j)$ in the $c$-th channel of the output, $x_c(i, j)$ denotes the value at $(i, j)$ in the $c$-th channel of the input, $g_c^h(i)$ denotes the $i$-th value along the height in the $c$-th channel, and $g_c^w(j)$ denotes the $j$-th value along the width in the $c$-th channel;
(4b) The first training set is adjusted to be 640 × 640 images, and the images are input into a YOLOv5-CA network model to obtain 1024 output matrices of 1 × 1.
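The coordinate attention computation of formulas (1) through (6) can be sketched in NumPy. The $1 \times 1$ convolutions $F_1$, $F_h$, $F_w$ are stood in for by random weight matrices acting on the channel dimension; the tensor sizes, reduction ratio, and initialisation below are illustrative assumptions, not values from the patent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(X, r=8, seed=0):
    """Sketch of the CA module on a (C, H, W) tensor following eqs. (1)-(6):
    pool along each spatial direction, mix channels with shared 1x1 transforms,
    then reweight the input by the two direction-aware attention maps."""
    C, H, W = X.shape
    rng = np.random.default_rng(seed)
    # Eqs. (1)/(2): average-pool each channel along width and along height.
    z_h = X.mean(axis=2)            # (C, H)
    z_w = X.mean(axis=1)            # (C, W)
    # Eq. (3): concatenate spatially, shared 1x1 transform F1 + nonlinearity.
    F1 = rng.standard_normal((C // r, C))
    f = np.maximum(F1 @ np.concatenate([z_h, z_w], axis=1), 0.0)  # (C/r, H+W)
    f_h, f_w = f[:, :H], f[:, H:]
    # Eqs. (4)/(5): restore the channel count with Fh, Fw, apply sigmoid.
    Fh = rng.standard_normal((C, C // r))
    Fw = rng.standard_normal((C, C // r))
    g_h = sigmoid(Fh @ f_h)         # (C, H), values in (0, 1)
    g_w = sigmoid(Fw @ f_w)         # (C, W), values in (0, 1)
    # Eq. (6): y_c(i, j) = x_c(i, j) * g_h_c(i) * g_w_c(j).
    return X * g_h[:, :, None] * g_w[:, None, :]

X = np.random.default_rng(1).standard_normal((16, 8, 8))
Y = coordinate_attention(X)
print(Y.shape)  # same shape as the input: (16, 8, 8)
```

Because the attention weights lie in (0, 1), every output value has magnitude no larger than the corresponding input value, which is the reweighting behaviour the module is designed for.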
The YOLOv5-CA network model in the step (4) specifically includes:
a first layer: the layer is a Focus layer of a deep neural network, a 640 x 640 image is converted into three RGB channels, the Focus layer performs operation of taking a value every other pixel on a feature map on each channel to obtain 4 independent feature layers, the 4 feature layers are stacked, information on the width and height dimensions is converted into channel dimensions, the input channels are expanded by four times, and then the feature extraction is performed to obtain 64 output matrixes with the size of 320 x 320 as input matrixes of a second layer;
a second layer: the layer defines 128 convolution kernels of 2 × 2, each convolution kernel has the function of a filter and is used for learning and extracting features of the deep neural network, the layer uses BN normalization and a SiLU activation function, and 128 output matrixes with the size of 160 × 160 are obtained and serve as input matrixes of the third layer;
and a third layer: the layer is a coordinate attention module CA, and 128 output matrixes with the size of 160 multiplied by 160 are obtained as input matrixes of the fourth layer;
a fourth layer: the layer is a convolutional layer, 256 convolutional kernels of 2 × 2 are defined, each convolutional kernel has the function of a filter and is used for learning and extracting characteristics of the deep neural network, BN normalization and a SiLU activation function are used for the layer, and 256 output matrixes of which the sizes are 80 × 80 are obtained and serve as input matrixes of a fifth layer;
and a fifth layer: the layer is a C3 module and comprises 3 standard convolution layers and a plurality of Bottleneck modules, the C3 module is a main module for learning residual error characteristics and is divided into two branches, one branch uses a plurality of Bottleneck modules to stack and 3 standard convolution layers, the other branch only passes through 1 basic convolution layer, and finally the two branches are subjected to channel splicing operation to obtain 256 output matrixes with the size of 80 x 80 as input matrixes of a sixth layer;
a sixth layer: the layer is a convolutional layer, 512 2 × 2 convolutional kernels are defined, each convolutional kernel has the function of a filter and is used for learning and extracting characteristics of the deep neural network, and 512 output matrixes with the size of 40 × 40 are obtained by using BN normalization and a SiLU activation function as input matrixes of a seventh layer;
a seventh layer: the layer is a C3 module, an input matrix is divided into two branches on the channel dimension, one branch passes through a plurality of Bottleneck stacking layers and 3 standard convolution layers, the other branch only passes through 1 basic convolution layer, and finally the two branches are spliced on the channel to obtain 512 output matrixes with the size of 40 multiplied by 40 as input matrixes of an eighth layer;
an eighth layer: the layer is a convolutional layer, 1024 convolution kernels of 2 × 2 are defined, each convolution kernel has the function of a filter and is used for learning and extracting characteristics of the deep neural network, and the layer uses BN normalization and a SiLU activation function to obtain 1024 output matrixes of which the sizes are 20 × 20 and serve as input matrixes of a ninth layer;
a ninth layer: the layer is an SPP module, an input matrix is serially passed through a plurality of MaxPool layers with the size of 5 multiplied by 5, and 1024 output matrixes with the size of 20 multiplied by 20 are obtained and serve as input matrixes of a tenth layer;
a tenth layer: this layer is the coordinate attention module CA, resulting in 1024 output matrices of size 1 × 1.
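As a sanity check on the layer list above, each downsampling stage (the Focus layer and the four convolutional layers) halves the spatial size while doubling the channel count, taking a 640 × 640 input down to the 20 × 20 × 1024 maps entering the SPP module. A small sketch of that progression (the stride-2 bookkeeping is an assumption inferred from the listed output sizes):

```python
def backbone_shapes(size=640, channels=3, width=64, stages=5):
    """Trace (channels, height, width) through `stages` downsampling stages,
    mirroring the Focus layer plus four stride-2 convolutions listed above."""
    shapes = [(channels, size, size)]
    c = width
    for _ in range(stages):
        size //= 2          # each stage halves the spatial resolution
        shapes.append((c, size, size))
        c *= 2              # and the next stage doubles the channel count
    return shapes

for shape in backbone_shapes():
    print(shape)
# Ends at (1024, 20, 20), matching the input of the SPP module.
```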
The step (5) specifically comprises the following steps:
(5a) In the first stage, inputting a first training set into a YOLOv5-CA network model for training to obtain an optimal weight;
(5b) In the second stage, the optimal weight obtained in the first stage is used as a pre-training weight of a YOLOv5-CA network model, and the second training set is input into the YOLOv5-CA network model for training to obtain the YOLOv5-CA-TL network model;
(5c) And inputting the marked test set into a YOLOv5-CA-TL network model for detection, detecting to obtain wild animals in the picture, and performing feasibility verification.
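In YOLOv5 terms, the two-stage scheme amounts to two successive runs of the repository's train.py, with the best weights of the first run passed as the pre-training weights of the second. A hedged sketch (the dataset YAML names, run names, and epoch counts are illustrative assumptions, not values from the patent):

```shell
# Stage 1: train YOLOv5-CA on the AwA2-derived first training set.
python train.py --img 640 --epochs 100 --data awa2_subset.yaml \
    --weights yolov5s.pt --name stage1

# Stage 2: fine-tune on the small-sample second training set,
# initialised from the stage-1 best weights (transfer learning, "TL").
python train.py --img 640 --epochs 100 --data qilian_small.yaml \
    --weights runs/train/stage1/weights/best.pt --name stage2
```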
According to the above technical scheme, the beneficial effects of the invention are as follows: first, the collected real wild animal data set of the survey area to be monitored is taken as the research object, and the coordinate attention module CA effectively addresses the low detection accuracy of other algorithms, greatly reducing labor cost compared with traditional methods; second, the two-stage training method improves the generalization ability of the network and alleviates the problem of insufficient wild animal samples; third, compared with algorithms such as Faster R-CNN, SSD, YOLOv3-SPP, YOLOX, and PP-YOLOE, the method improves the overall performance of target detection and effectively improves detection accuracy for small-sample wild animals.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a structural diagram of an improved YOLOv5-CA network model;
FIG. 3 is a schematic diagram of a two-stage training method.
Detailed Description
As shown in FIG. 1, a small-sample wild animal detection method based on improved YOLOv5 comprises the following steps in order:
(1) Downloading an AwA2 animal data set, screening the AwA2 animal data set against the collected real wild animal data set of the survey area to be monitored, selecting images similar to that data set as the experimental data set, and dividing the experimental data set into a first training set and a validation set; specifically, the AwA2 animal data set is downloaded from the Internet and screened against the real wild animal data set of the survey area, namely the real wild animal data set of the Qilian Mountains used in the experiments; 2398 images of 5 classes of animals similar to the Qilian Mountains data set are selected as the experimental data set, which is classified and randomly divided in an 8:2 ratio into the first training set and the validation set;
(2) Screening the collected real wild animal data set of the survey area to be monitored, selecting pictures with clear animal subjects and high pixel quality as the small-sample experimental data set, and dividing it into a second training set and a test set; specifically, the photos returned by the Qilian Mountains monitoring cameras are screened, and 282 photos of the 5 animal classes with clear subjects and high pixel quality are selected as the small-sample experimental data set, which is randomly divided in an 8:2 ratio into the second training set and the test set;
(3) Labeling the experimental data set and the small-sample experimental data set;
(4) Constructing a YOLOv5 network model, adding the coordinate attention module CA to the YOLOv5 network model to obtain a YOLOv5-CA network model, and inputting the first training set into the YOLOv5-CA network model for training;
(5) Obtaining a YOLOv5-CA-TL network model by a two-stage training method, inputting the labeled test set into the YOLOv5-CA-TL network model, detecting the wild animals in the pictures, and verifying feasibility.
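The 8:2 random splits used in steps (1) and (2) can be sketched as follows; the file-list names and the seed are illustrative assumptions:

```python
import random

def split_dataset(items, train_ratio=0.8, seed=42):
    """Randomly shuffle a list of samples and split it into two parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]

# 2398 AwA2 images -> first training set and validation set
train1, val = split_dataset([f"awa2_{i:04d}.jpg" for i in range(2398)])
# 282 Qilian Mountains photos -> second training set and test set
train2, test = split_dataset([f"qilian_{i:03d}.jpg" for i in range(282)])
print(len(train1), len(val), len(train2), len(test))  # 1918 480 225 57
```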
In step (3), the labeling refers to annotating the animals in the pictures of the experimental data set and the small-sample experimental data set with the LabelImg tool, using the YOLO label format.
The step (4) specifically comprises the following steps:
(4a) Replacing the first C3 module and the last C3 module of the backbone network of the YOLOv5 network model with the coordinate attention module CA, which encodes the feature map along two spatial directions to form two feature maps that are respectively direction-aware and position-sensitive; any intermediate tensor $X = [x_1, x_2, x_3, \ldots, x_C] \in \mathbb{R}^{C \times H \times W}$ is taken as input, and a tensor $Y = [y_1, y_2, y_3, \ldots, y_C]$ of the same size is output; each channel of $X$ is encoded along the horizontal and vertical directions using pooling kernels of sizes $(H, 1)$ and $(1, W)$, and the output of the $c$-th channel at height $h$ is as follows:

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \tag{1}$$

The output of the $c$-th channel at width $w$ is as follows:

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{2}$$

wherein $H$ is the height of the pooling kernel, $W$ is the width of the pooling kernel, $x_c$ is the tensor of the $c$-th channel, and $h$ is a height index of $X$;
Formulas (1) and (2) are two feature-aggregation transformations that aggregate along the two spatial directions and return a pair of direction-aware attention maps; the coordinate attention module CA concatenates the two generated feature layers and then applies a shared $1 \times 1$ convolution transformation, as shown in equation (3):

$$f = \delta(F_1([z^h, z^w])) \tag{3}$$

where $\delta$ is a nonlinear activation function, $F_1$ is a $1 \times 1$ convolution transformation, and $f$ is the intermediate feature map obtained by feature-encoding the spatial information in the horizontal and vertical directions; $f$ is then decomposed along the spatial dimension into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$, and two further $1 \times 1$ convolution transformations $F_h$ and $F_w$ respectively transform $f^h$ and $f^w$ into tensors with the same number of channels as $X$, yielding:

$$g^h = \sigma(F_h(f^h)) \tag{4}$$

$$g^w = \sigma(F_w(f^w)) \tag{5}$$

In the formulas, $F_h$ and $F_w$ are $1 \times 1$ convolution transformations and $\sigma$ is the sigmoid activation function; during the transformation, a reduction ratio $r$ is used to reduce the number of channels of $f$; the outputs $g^h$ and $g^w$ are then used as attention weights, and the final output of the coordinate attention module CA is shown in equation (6):

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \tag{6}$$

In the formula, $y_c(i, j)$ denotes the value at $(i, j)$ in the $c$-th channel of the output, $x_c(i, j)$ denotes the value at $(i, j)$ in the $c$-th channel of the input, $g_c^h(i)$ denotes the $i$-th value along the height in the $c$-th channel, and $g_c^w(j)$ denotes the $j$-th value along the width in the $c$-th channel;
(4b) Adjusting the first training set to 640 × 640 images and inputting them into the YOLOv5-CA network model to obtain 1024 output matrices of size 1 × 1.
As shown in fig. 2, the YOLOv5-CA network model in step (4) specifically includes:
a first layer: the layer is a Focus layer of a deep neural network, a 640 x 640 image is converted into three RGB channels, the Focus layer performs operation of taking a value every other pixel on a feature map on each channel to obtain 4 independent feature layers, the 4 feature layers are stacked, information on the width and height dimensions is converted into channel dimensions, the input channels are expanded by four times, and then the feature extraction is performed to obtain 64 output matrixes with the size of 320 x 320 as input matrixes of a second layer;
a second layer: the layer defines 128 convolution kernels of 2 × 2, each convolution kernel has the function of a filter and is used for learning and feature extraction of the deep neural network, the 128 convolution kernels can help the system to extract enough features, the layer uses BN normalization and a SiLU activation function, and 128 output matrixes with the size of 160 × 160 are obtained and serve as input matrixes of a third layer;
and a third layer: the layer is a coordinate attention module CA, and 128 output matrixes with the size of 160 multiplied by 160 are obtained as input matrixes of the fourth layer;
a fourth layer: the layer is a convolutional layer, and 256 2 × 2 convolutional kernels are defined, each convolutional kernel has the function of a filter and is used for learning and feature extraction of the deep neural network, the 256 convolutional kernels can help the system to extract enough features, and the layer uses BN normalization and a SiLU activation function to obtain 256 output matrixes with the size of 80 × 80 as input matrixes of the fifth layer;
and a fifth layer: the layer is a C3 module and comprises 3 standard convolution layers and a plurality of Bottleneck modules, the C3 module is a main module for learning residual error characteristics and is divided into two branches, one branch uses the plurality of Bottleneck modules to stack and the 3 standard convolution layers, the other branch only passes through 1 basic convolution layer, and finally the two branches are spliced on a channel to obtain 256 output matrixes with the size of 80 multiplied by 80 as input matrixes of a sixth layer;
a sixth layer: the layer is a convolutional layer defining 512 convolution kernels of 2 × 2, each acting as a filter for learning and feature extraction in the deep neural network; the 512 convolution kernels help the system extract sufficient features, and the layer uses BN normalization and the SiLU activation function to obtain 512 output matrices of size 40 × 40 as the input matrices of the seventh layer;
a seventh layer: the layer is a C3 module, an input matrix is divided into two branches in the channel dimension, one branch passes through a plurality of Bottleneck stacking layers and 3 standard convolution layers, the other branch only passes through 1 basic convolution layer, and finally the two branches are spliced on the channel to obtain 512 output matrixes with the size of 40 multiplied by 40 as input matrixes of an eighth layer;
an eighth layer: the layer is a convolutional layer defining 1024 convolution kernels of 2 × 2, each acting as a filter for learning and feature extraction in the deep neural network; the 1024 convolution kernels help the system extract sufficient features, and the layer uses BN normalization and the SiLU activation function to obtain 1024 output matrices of size 20 × 20 as the input matrices of the ninth layer;
a ninth layer: the layer is an SPP module, an input matrix is serially passed through a plurality of MaxPool layers with the size of 5 multiplied by 5, wherein the calculation results of serially connecting two MaxPool layers with the size of 5 multiplied by 5 are the same as the calculation results of one MaxPool layer with the size of 9 multiplied by 9, and the calculation results of serially connecting three MaxPool layers with the size of 5 multiplied by 5 are the same as the calculation results of one MaxPool layer with the size of 13 multiplied by 13, and 1024 output matrixes with the size of 20 multiplied by 20 are obtained as the input matrix of the tenth layer;
a tenth layer: this layer is the coordinate attention module CA, resulting in 1024 output matrices of size 1 × 1.
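The SPP equivalence noted for the ninth layer, namely that two serial 5 × 5 max-pools (stride 1, padded) compute the same result as one 9 × 9 max-pool, and three serial 5 × 5 pools the same as one 13 × 13, can be checked in one dimension with a plain sliding-window maximum:

```python
def max_filter(seq, k):
    """Sliding-window maximum: window k (odd), stride 1, replicate padding
    so the output length equals the input length."""
    r = k // 2
    padded = [seq[0]] * r + list(seq) + [seq[-1]] * r
    return [max(padded[i:i + k]) for i in range(len(seq))]

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, 4]
two_5 = max_filter(max_filter(data, 5), 5)
three_5 = max_filter(two_5, 5)
print(two_5 == max_filter(data, 9))     # True: two 5-windows cover a 9-window
print(three_5 == max_filter(data, 13))  # True: three 5-windows cover a 13-window
```

Composing max filters adds their radii, so windows of 5, 5 combine to 9 and 5, 5, 5 combine to 13; this is why the serial SPP layout saves computation without changing the result.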
As shown in fig. 3, the step (5) specifically includes the following steps:
(5a) In the first stage, inputting a first training set into a YOLOv5-CA network model for training to obtain an optimal weight;
(5b) In the second stage, the optimal weight obtained in the first stage is used as a pre-training weight of the YOLOv5-CA network model, and the second training set is input into the YOLOv5-CA network model for training to obtain the YOLOv5-CA-TL network model;
(5c) Inputting the labeled test set into the YOLOv5-CA-TL network model for detection, obtaining the wild animals in the pictures, and verifying feasibility. Feasibility is verified according to the detection accuracy: a high detection accuracy indicates the method is feasible, and a low accuracy indicates it is not. For example, if the wild animal in a picture is a tiger and the final detection result is a tiger, the detection is correct; the detection accuracy is computed from the detection results over many pictures.
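The feasibility check described above reduces to classification accuracy over the test pictures. A minimal sketch (the class labels below are illustrative):

```python
def detection_accuracy(predictions, ground_truth):
    """Fraction of pictures whose detected class matches the ground truth."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

truth = ["tiger", "deer", "tiger", "fox", "deer"]
preds = ["tiger", "deer", "fox", "fox", "deer"]
print(detection_accuracy(preds, truth))  # 0.8
```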
In conclusion, the invention takes the collected real wild animal data set of the survey area to be monitored as the research object, and the coordinate attention module CA effectively addresses the low detection accuracy of other algorithms, greatly reducing labor cost compared with traditional methods; the two-stage training method improves the generalization ability of the network and alleviates the problem of insufficient wild animal samples; and compared with algorithms such as Faster R-CNN, SSD, YOLOv3-SPP, YOLOX, and PP-YOLOE, the invention improves the overall performance of target detection and effectively improves detection accuracy for small-sample wild animals.
Claims (5)
1. An improved YOLOv5-based small-sample wild animal detection method, characterized in that the method comprises the following sequential steps:
(1) Downloading an AwA2 animal data set, screening the AwA2 animal data set against the collected real wild animal data set of the survey area to be monitored, selecting images similar to that data set as the experimental data set, and dividing the experimental data set into a first training set and a validation set;
(2) Screening the collected real wild animal data set of the survey area to be monitored, selecting pictures with clear animal subjects and high pixel quality as the small-sample experimental data set, and dividing it into a second training set and a test set;
(3) Labeling the experimental data set and the small sample experimental data set;
(4) Constructing a YOLOv5 network model, adding a coordinate attention module CA into the YOLOv5 network model to obtain a YOLOv5-CA network model, and inputting the labeled first training set into the YOLOv5-CA network model for training;
(5) And obtaining a YOLOv5-CA-TL network model by adopting a two-stage training method, inputting the labeled test set into the YOLOv5-CA-TL network model, detecting wild animals in the picture, and verifying feasibility.
2. The improved YOLOv5-based small-sample wild animal detection method of claim 1, wherein: in step (3), the labeling refers to annotating the animals in the pictures of the experimental data set and the small-sample experimental data set with the LabelImg tool, using the YOLO label format.
3. The improved YOLOv5-based small-sample wild animal detection method of claim 1, wherein step (4) specifically comprises the following steps:
(4a) Replacing a first C3 module and a last C3 module of a backbone network of a YOLOv5 network model with a coordinate attention module CA, and then respectively encoding the feature maps to form two feature maps which are respectively sensitive to direction and position; any intermediate tensor X = [ X ] 1 ,x 2 ,x 3 ...,x C ]∈R C×H×W As input, and output a tensor Y = [ Y ] of the same length 1 ,y 2 ,y 3 ...,y c ]Each channel is encoded in both horizontal and vertical directions using pooling kernels of sizes (H, 1) and (1, w) for X, and the output of the c-th channel of height H is as follows:
the output of the c-th channel with width w is as follows:
wherein H is the height of the pooling nucleus, W is the width of the pooling nucleus, X C Is the tensor of the c channel, h is the height of X;
formulas (1) and (2) are two transformations of feature aggregation, which are respectively aggregated along two spatial directions and return to two-direction perception attention diagrams; the coordinate attention module CA generates two feature layers before concatenation, and then shares a 1 x 1 convolution operation transform, as shown in equation (3):
f = δ(F_1([z^h, z^w])) (3)
wherein δ is a nonlinear activation function, F_1 is a 1 × 1 convolution transform, and f is the intermediate feature map obtained after feature-encoding the spatial information in the horizontal and vertical directions; f is then split along the spatial dimension into 2 separate tensors, f^h ∈ R^((C/r)×H) and f^w ∈ R^((C/r)×W), and two further 1 × 1 convolution transforms F_h and F_w transform f^h and f^w respectively into tensors with the same number of feature layers as X, yielding:
g^h = σ(F_h(f^h)) (4)

g^w = σ(F_w(f^w)) (5)
wherein F_h and F_w are 1 × 1 convolution transforms and σ is the sigmoid activation function; during the transform a reduction ratio r is used to reduce the number of channels of f; the outputs g^h and g^w are then expanded and used as attention weights, and the final output of the coordinate attention module CA is shown in equation (6):

y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j) (6)

wherein y_c(i, j) is the value at position (i, j) in the c-th channel of the output, x_c(i, j) is the value at position (i, j) in the c-th channel of the input, g_c^h(i) is the i-th value along the height in the c-th channel of g^h, and g_c^w(j) is the j-th value along the width in the c-th channel of g^w;
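The coordinate attention computation of step (4a) can be sketched numerically. The NumPy sketch below follows the pooling, shared transform, split, and reweighting steps described above; the random matrices standing in for the learned 1 × 1 convolutions F_1, F_h and F_w, and the reduction ratio r = 8, are illustrative assumptions rather than trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(x, r=8, rng=np.random.default_rng(0)):
    """Minimal NumPy sketch of the CA equations; random weights are
    placeholders for the learned 1x1 convolutions."""
    C, H, W = x.shape
    Cr = max(1, C // r)  # reduced channel count C/r
    # Pooling: average each channel along width (kernel (1, W)) and
    # along height (kernel (H, 1)).
    z_h = x.mean(axis=2)                  # (C, H)
    z_w = x.mean(axis=1)                  # (C, W)
    # Concatenate and apply the shared 1x1 convolution F1 with ReLU as δ.
    F1 = rng.standard_normal((Cr, C))
    f = np.maximum(F1 @ np.concatenate([z_h, z_w], axis=1), 0)  # (C/r, H+W)
    f_h, f_w = f[:, :H], f[:, H:]
    # Per-direction 1x1 convolutions back to C channels, then sigmoid.
    Fh = rng.standard_normal((C, Cr))
    Fw = rng.standard_normal((C, Cr))
    g_h = sigmoid(Fh @ f_h)               # (C, H) attention along height
    g_w = sigmoid(Fw @ f_w)               # (C, W) attention along width
    # Final reweighting: y_c(i, j) = x_c(i, j) * g_h_c(i) * g_w_c(j)
    return x * g_h[:, :, None] * g_w[:, None, :]

x = np.ones((16, 8, 8))
y = coordinate_attention(x)
print(y.shape)  # (16, 8, 8): the output tensor keeps the input shape
```

Note that the output has the same shape as the input, consistent with the module's drop-in replacement of C3 blocks in the backbone.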
(4b) Resizing the images of the labeled first training set to 640 × 640 and inputting them into the YOLOv5-CA network model to obtain 1024 output matrices of size 1 × 1.
4. The improved YOLOv5-based small-sample wild animal detection method of claim 1, wherein: the YOLOv5-CA network model in the step (4) specifically includes:
The first layer: a Focus layer of the deep neural network; the 640 × 640 image is separated into the three RGB channels, and on each channel the Focus layer samples every other pixel of the feature map to obtain 4 independent feature layers; the 4 feature layers are stacked so that the information in the width and height dimensions is moved into the channel dimension, expanding the input channels fourfold; feature extraction then yields 64 output matrices of size 320 × 320 as the input matrices of the second layer;

The second layer: a convolutional layer defining 128 convolution kernels of size 2 × 2, each acting as a filter that learns and extracts features for the deep neural network; with BN normalization and the SiLU activation function, it yields 128 output matrices of size 160 × 160 as the input matrices of the third layer;

The third layer: a coordinate attention module CA, yielding 128 output matrices of size 160 × 160 as the input matrices of the fourth layer;

The fourth layer: a convolutional layer defining 256 convolution kernels of size 2 × 2, each acting as a filter that learns and extracts features for the deep neural network; with BN normalization and the SiLU activation function, it yields 256 output matrices of size 80 × 80 as the input matrices of the fifth layer;

The fifth layer: a C3 module comprising 3 standard convolution layers and several Bottleneck modules; the C3 module is the main module for learning residual features and is split into two branches, one branch passing through the stacked Bottleneck modules and the 3 standard convolution layers, the other through only 1 basic convolution layer; the two branches are finally concatenated on the channel dimension, yielding 256 output matrices of size 80 × 80 as the input matrices of the sixth layer;

The sixth layer: a convolutional layer defining 512 convolution kernels of size 2 × 2, each acting as a filter that learns and extracts features for the deep neural network; with BN normalization and the SiLU activation function, it yields 512 output matrices of size 40 × 40 as the input matrices of the seventh layer;

The seventh layer: a C3 module; the input matrix is split into two branches on the channel dimension, one branch passing through several stacked Bottleneck modules and 3 standard convolution layers, the other through only 1 basic convolution layer; the two branches are finally concatenated on the channel dimension, yielding 512 output matrices of size 40 × 40 as the input matrices of the eighth layer;

The eighth layer: a convolutional layer defining 1024 convolution kernels of size 2 × 2, each acting as a filter that learns and extracts features for the deep neural network; with BN normalization and the SiLU activation function, it yields 1024 output matrices of size 20 × 20 as the input matrices of the ninth layer;

The ninth layer: an SPP module; the input matrix is passed serially through several MaxPool layers of size 5 × 5, yielding 1024 output matrices of size 20 × 20 as the input matrices of the tenth layer;

The tenth layer: a coordinate attention module CA, yielding 1024 output matrices of size 1 × 1.
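The slicing step of the Focus layer described for the first layer can be sketched directly; the NumPy snippet below takes every other pixel in each of the four offset positions and stacks the resulting sub-maps on the channel axis, which turns a 3-channel 640 × 640 input into 12 channels of 320 × 320 (the subsequent convolution then produces the 64 feature maps stated in claim 4):

```python
import numpy as np

def focus_slice(img):
    """Sketch of the Focus layer's slicing step: sample every other pixel
    to get 4 independent feature layers and stack them on the channel
    axis, moving width/height information into the channel dimension."""
    # img: (C, H, W) with even H and W
    return np.concatenate([img[:, ::2, ::2],     # even rows, even cols
                           img[:, 1::2, ::2],    # odd rows, even cols
                           img[:, ::2, 1::2],    # even rows, odd cols
                           img[:, 1::2, 1::2]],  # odd rows, odd cols
                          axis=0)

rgb = np.zeros((3, 640, 640))
out = focus_slice(rgb)
print(out.shape)  # (12, 320, 320): channels x4, width and height halved
```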
5. The improved YOLOv5-based small-sample wild animal detection method of claim 1, wherein: the step (5) specifically comprises the following steps:
(5a) In the first stage, inputting the labeled first training set into a YOLOv5-CA network model for training to obtain the optimal weight;
(5b) In the second stage, the optimal weight obtained in the first stage is used as a pre-training weight of the YOLOv5-CA network model, and the labeled second training set is input into the YOLOv5-CA network model for training to obtain the YOLOv5-CA-TL network model;
(5c) Inputting the labeled test set into the YOLOv5-CA-TL network model for detection, detecting the wild animals in the pictures, and performing feasibility verification.
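The two-stage schedule of claim 5 amounts to training once, then reusing the stage-1 best checkpoint as the pre-training weights for the small-sample stage. The sketch below expresses the two stages as YOLOv5 `train.py` invocations built as command lists; the data-set and model-config file names (`experimental.yaml`, `small_sample.yaml`, `yolov5s-ca.yaml`) and the checkpoint path are hypothetical placeholders, not values from the patent:

```python
def two_stage_commands(stage1_best="runs/train/exp/weights/best.pt"):
    """Return the two hypothetical train.py invocations: stage 1 trains
    YOLOv5-CA on the first training set; stage 2 reuses the stage-1 best
    weights as pre-training weights on the small-sample second training
    set, producing the YOLOv5-CA-TL model."""
    stage1 = ["python", "train.py",
              "--cfg", "yolov5s-ca.yaml",     # assumed CA model config
              "--data", "experimental.yaml",  # assumed first training set
              "--weights", ""]                # empty: train from scratch
    stage2 = ["python", "train.py",
              "--cfg", "yolov5s-ca.yaml",
              "--data", "small_sample.yaml",  # assumed second training set
              "--weights", stage1_best]       # transfer stage-1 weights
    return stage1, stage2

s1, s2 = two_stage_commands()
print(s2[-1])  # runs/train/exp/weights/best.pt
```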
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211018085.7A CN115393618A (en) | 2022-08-24 | 2022-08-24 | Improved YOLOv 5-based small-sample wild animal detection method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115393618A true CN115393618A (en) | 2022-11-25 |
Family
ID=84121016
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211018085.7A Pending CN115393618A (en) | 2022-08-24 | 2022-08-24 | Improved YOLOv 5-based small-sample wild animal detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115393618A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116310894A (en) * | 2023-02-22 | 2023-06-23 | 中交第二公路勘察设计研究院有限公司 | Unmanned aerial vehicle remote sensing-based intelligent recognition method for small-sample and small-target Tibetan antelope |
CN116310894B (en) * | 2023-02-22 | 2024-04-16 | 中交第二公路勘察设计研究院有限公司 | Unmanned aerial vehicle remote sensing-based intelligent recognition method for small-sample and small-target Tibetan antelope |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113065558A (en) | Lightweight small target detection method combined with attention mechanism | |
CN107527007A (en) | For detecting the image processing system of perpetual object | |
CN107092883A (en) | Object identification method for tracing | |
CN111507275B (en) | Video data time sequence information extraction method and device based on deep learning | |
CN111680705B (en) | MB-SSD method and MB-SSD feature extraction network suitable for target detection | |
CN112818969A (en) | Knowledge distillation-based face pose estimation method and system | |
CN108133235A (en) | A kind of pedestrian detection method based on neural network Analysis On Multi-scale Features figure | |
Meng et al. | Investigation and evaluation of algorithms for unmanned aerial vehicle multispectral image registration | |
CN111582401A (en) | Sunflower seed sorting method based on double-branch convolutional neural network | |
CN115393618A (en) | Improved YOLOv 5-based small-sample wild animal detection method | |
CN113011308A (en) | Pedestrian detection method introducing attention mechanism | |
CN115861799A (en) | Light-weight air-to-ground target detection method based on attention gradient | |
Cai et al. | A real-time smoke detection model based on YOLO-smoke algorithm | |
CN115578590A (en) | Image identification method and device based on convolutional neural network model and terminal equipment | |
Sehree et al. | Olive trees cases classification based on deep convolutional neural network from unmanned aerial vehicle imagery | |
WO2022205329A1 (en) | Object detection method, object detection apparatus, and object detection system | |
CN112668675B (en) | Image processing method and device, computer equipment and storage medium | |
Li et al. | Object Detection for UAV Images Based on Improved YOLOv6 | |
CN110991219B (en) | Behavior identification method based on two-way 3D convolution network | |
CN112364864A (en) | License plate recognition method and device, electronic equipment and storage medium | |
CN112308066A (en) | License plate recognition system | |
CN115546569B (en) | Attention mechanism-based data classification optimization method and related equipment | |
CN116563844A (en) | Cherry tomato maturity detection method, device, equipment and storage medium | |
CN115115552A (en) | Image correction model training method, image correction device and computer equipment | |
CN114429578A (en) | Method for inspecting ancient architecture ridge beast decoration |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||