CN117333948A - End-to-end multi-target broiler behavior identification method integrating space-time attention mechanism - Google Patents

End-to-end multi-target broiler behavior identification method integrating space-time attention mechanism Download PDF

Info

Publication number
CN117333948A
CN117333948A CN202311400318.4A CN202311400318A CN117333948A CN 117333948 A CN117333948 A CN 117333948A CN 202311400318 A CN202311400318 A CN 202311400318A CN 117333948 A CN117333948 A CN 117333948A
Authority
CN
China
Prior art keywords
target
chicken
model
frame
behavior recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311400318.4A
Other languages
Chinese (zh)
Inventor
崔笛
胡逸磊
熊家齐
应义斌
泮进明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202311400318.4A priority Critical patent/CN117333948A/en
Publication of CN117333948A publication Critical patent/CN117333948A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an end-to-end multi-target broiler behavior identification method integrating a space-time attention mechanism, which comprises the following steps: acquiring chicken flock videos and building a chicken flock target detection dataset; improving the YOLOv8 neural network and training the improved network on the target detection dataset to obtain a target detection model with inference capability; performing target detection on the chicken flock video to be analyzed, associating chickens across adjacent frames with an improved ByteTrackV2 target tracking algorithm to obtain continuous image sequences of individual chickens, and building a behavior recognition dataset; training a 3D-ResNet50-TSAM model that fuses a space-time attention mechanism on the behavior recognition dataset to obtain a chicken behavior recognition model with inference capability; and combining the target detection model, the improved ByteTrackV2 target tracking algorithm and the chicken behavior recognition model into a three-stage chicken behavior recognition algorithm, CBRNet, which performs end-to-end multi-target broiler behavior recognition on the image sequences in the chicken flock video. The method is intended for behavior recognition of multi-target broilers in complex scenes.

Description

End-to-end multi-target broiler behavior identification method integrating space-time attention mechanism
Technical Field
The invention relates to the field of chicken behavior recognition methods, in particular to an end-to-end multi-target broiler behavior recognition method integrating a space-time attention mechanism.
Background
Broiler production is an important component of poultry farming, and the natural behavior expression of broilers reflects the level of welfare farming, which in turn determines the farming efficiency of enterprises. At present, chicken flock behavior is mainly observed through manual inspection, and sick, disabled, weak and dead chickens are culled according to the behavior of the flock so as to optimize the flock structure; however, this approach suffers from high labor intensity, strong subjectivity and low efficiency, and can hardly meet the development needs of modern poultry enterprises. Computer vision is an emerging non-destructive detection technology with the advantages of high efficiency, zero stress and low cost, and is an effective means for detecting broiler behavior.
Broiler behavior recognition methods based on computer vision generally collect RGB images of the chicken flock with a camera and use image processing algorithms to analyze the behavior of the broilers in the images. Mainstream poultry behavior recognition algorithms currently fall into two main categories: those based on single-frame RGB images and those based on continuous image frames. The first category mainly uses a target detection model (such as YOLO or Mask R-CNN) to extract image features, thereby locating the broilers in the image and classifying their behavior; it is fast and has good real-time performance, but its recognition accuracy is strongly affected by factors such as illumination changes, occlusion between chickens and truncated targets, and the predicted behavior labels tend to jump. The second category mainly combines a target detection model (or a segmentation model) with a target tracking algorithm to continuously track multiple target chickens, and feeds the continuous image sequences of the individual chickens in parallel into a behavior recognition model to classify the behavior of each chicken. This approach exploits the spatial and temporal features of the video simultaneously and can greatly reduce the influence of occlusion and truncated targets on the classification results.
Video-based behavior recognition algorithms fall into four types: two-stream methods, LSTM-based methods, 3D-convolution-based methods and Transformer-based methods. All four model the fused spatial and temporal features of the video and can effectively improve the accuracy and robustness of broiler behavior recognition in complex farming scenes. However, the application of video-based behavior recognition in the poultry industry is currently limited: the continuous image sequence of a single chicken must be cropped manually, or the video must be artificially constrained to contain a single target chicken, before it can be fed into a behavior recognition model for single-chicken behavior classification. An end-to-end, video-based multi-target chicken behavior recognition method therefore needs to be developed to improve the accuracy of flock behavior recognition in complex farming scenes and the precision of individual chicken management.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an end-to-end multi-target broiler behavior identification method integrating a space-time attention mechanism, which is used for behavior identification of multi-target broilers in a complex scene.
An end-to-end multi-target broiler behavior identification method integrating a space-time attention mechanism comprises the following steps:
1) Acquiring chicken flock videos and building a chicken flock target detection dataset;
2) Improving the YOLOv8 neural network, and training the improved YOLOv8 neural network with the target detection dataset of step 1) to obtain a target detection model with inference capability;
3) Performing target detection on the chicken flock video to be analyzed with the target detection model with inference capability obtained in step 2), associating chickens in adjacent frames with an improved ByteTrackV2 target tracking algorithm to obtain continuous image sequences of the chickens, and building a behavior recognition dataset;
4) Training a 3D-ResNet50-TSAM model fused with a space-time attention mechanism on the behavior recognition dataset of step 3) to obtain a chicken behavior recognition model with inference capability;
5) Combining the target detection model with inference capability from step 2), the improved ByteTrackV2 target tracking algorithm from step 3) and the chicken behavior recognition model with inference capability from step 4) into a three-stage chicken behavior recognition algorithm CBRNet, and using CBRNet to perform end-to-end multi-target broiler behavior recognition on the image sequences in the chicken flock video to be analyzed.
In step 2), the improved YOLOv8 neural network comprises: the system comprises a backbone network, a neck network and a detection head, wherein a CBAM attention module is arranged between the backbone network and the neck network;
and the bounding-box regression loss of the detection head uses the loss function L_OACIoU.
In step 2), the loss function L_OACIoU is constructed from the following quantities: IoU, the intersection-over-union of the predicted and ground-truth boxes; h_o and w_o, the height and width of their overlap region; h and w, the height and width of the predicted box; h_gt and w_gt, the height and width of the ground-truth box; h_c and w_c, the height and width of the smallest enclosing rectangle of the two boxes; α, a weight coefficient; and v, a function measuring aspect-ratio similarity.

The loss function L_OACIoU handles the prediction errors of the width, height and aspect ratio of the predicted box more finely, strengthens the detection capability of the improved YOLOv8 model for chickens of different sizes and postures, and provides accurate and stable observations for the subsequent target tracking algorithm.
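The published text names the ingredients of L_OACIoU without reproducing its closed form, so the following PyTorch sketch only computes those named quantities (IoU, the overlap width and height w_o and h_o, the enclosing-box width and height w_c and h_c, and the CIoU-style aspect-ratio term v with its weight α) and combines them in one assumed way; the exact composition used by the patent may differ.

```python
import math
import torch

def oaciou_loss(pred, target, eps=1e-7):
    """Sketch of an OACIoU-style bounding-box loss; boxes given as (x1, y1, x2, y2).

    Only the individual quantities are taken from the patent text; the final
    combination of terms below is an assumption for illustration.
    """
    # widths / heights of predicted and ground-truth boxes
    w, h = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w_gt, h_gt = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]

    # overlap region (w_o, h_o) and IoU
    w_o = (torch.min(pred[..., 2], target[..., 2]) - torch.max(pred[..., 0], target[..., 0])).clamp(0)
    h_o = (torch.min(pred[..., 3], target[..., 3]) - torch.max(pred[..., 1], target[..., 1])).clamp(0)
    inter = w_o * h_o
    union = w * h + w_gt * h_gt - inter + eps
    iou = inter / union

    # smallest enclosing rectangle (w_c, h_c)
    w_c = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    h_c = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])

    # CIoU-style aspect-ratio similarity term v and its weight alpha
    v = (4 / math.pi ** 2) * (torch.atan(w_gt / (h_gt + eps)) - torch.atan(w / (h + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    # assumed composition: IoU term + decoupled overlap width/height penalties
    # (normalized by the enclosing box) + aspect-ratio penalty
    return (1 - iou
            + ((w_gt - w_o) ** 2 + (w - w_o) ** 2) / (w_c ** 2 + eps)
            + ((h_gt - h_o) ** 2 + (h - h_o) ** 2) / (h_c ** 2 + eps)
            + alpha * v)
```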
In step 3), the improved ByteTrackV2 target tracking algorithm computes a total similarity S_T in both the high-score detection-box association stage and the low-score detection-box association stage of the original algorithm:

S_T = β·IoU(B, T) + (1 - β)·S(B)

where B is a detection box, T is a track, IoU(B, T) is the intersection-over-union of the detection box B and the track T, β is a weight coefficient that balances the influence of spatial overlap and saliency, S(B) is the saliency score of detection box B, b_w and b_h are the width and height of box B, (x, y) is a pixel inside box B, and S(x, y) is the saliency score of that pixel. S(x, y) is obtained from a pre-trained image segmentation model U-Net: the current frame is fed to the target detection model to obtain the detection box B of a target chicken, the same frame is fed to the U-Net model to obtain a saliency distribution map, and box B is mapped onto the saliency map to obtain the saliency score S(x, y) of every pixel inside the box.

The improved ByteTrackV2 target tracking algorithm therefore considers both the spatial overlap between the detection box and the track and the saliency score of the target chicken in the association stage. Because the saliency score of a target chicken remains high under occlusion, truncation and similar conditions, the total similarity also stays high, which improves the tracking success rate in complex scenes.
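For illustration, the sketch below computes the total similarity S_T for one detection box against one track prediction; S(B) is taken here to be the mean per-pixel saliency inside the box, which is one natural reading of the description above, and the helper names are illustrative.

```python
import numpy as np

def box_iou(box, track_box):
    """IoU between a detection box B and the track's predicted box T, boxes as (x1, y1, x2, y2)."""
    x1, y1 = max(box[0], track_box[0]), max(box[1], track_box[1])
    x2, y2 = min(box[2], track_box[2]), min(box[3], track_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_b = (box[2] - box[0]) * (box[3] - box[1])
    area_t = (track_box[2] - track_box[0]) * (track_box[3] - track_box[1])
    return inter / (area_b + area_t - inter + 1e-7)

def total_similarity(box, track_box, saliency_map, beta=0.9):
    """S_T = beta * IoU(B, T) + (1 - beta) * S(B).

    saliency_map is the per-pixel saliency distribution predicted by the
    pre-trained U-Net for the current frame; S(B) is assumed here to be the
    mean saliency score S(x, y) of the pixels inside box B.
    """
    x1, y1, x2, y2 = map(int, box)
    s_b = float(saliency_map[y1:y2, x1:x2].mean())   # S(B): mean of S(x, y) over the box
    return beta * box_iou(box, track_box) + (1 - beta) * s_b
```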
In step 4), the 3D-ResNet50-TSAM model fusing the space-time attention mechanism sequentially comprises: a 3D convolution layer, a first set of residual blocks, a second set of residual blocks, a third set of residual blocks, a fourth set of residual blocks, a pooling layer, a full connection layer, and an activation function;
a spatio-temporal attention module TSAM is added to each of the first, second, third and fourth sets of residual blocks.
In step 4), the spatiotemporal attention module TSAM comprises:
for the input feature map sequence {T_1, T_2, …, T_n}, where n is the number of images in the sequence fed to the 3D-ResNet50-TSAM model, each feature map T_i (1 ≤ i ≤ n) undergoes max pooling and average pooling along the channel direction, giving a max-pooled map W_{i-1-max} and an average-pooled map W_{i-1-avg};

W_{i-1-max} and W_{i-1-avg} are added element-wise to give the feature map W_{i-1}; global max pooling and global average pooling are then applied to W_{i-1}, giving a max-pooled map W_{i-2-max} and an average-pooled map W_{i-2-avg};

W_{i-2-max} and W_{i-2-avg} are added to give the feature map W_{i-2}; passing the sequence of W_{i-2} through a sigmoid activation yields the temporal weight W'_{i-2}, which is multiplied with the feature map T_i to give the weighted feature map sequence {T'_1, T'_2, …, T'_n};

each feature map T'_i of the weighted sequence {T'_1, T'_2, …, T'_n} undergoes max pooling and average pooling along the channel direction, giving a max-pooled map W_{i-3-max} and an average-pooled map W_{i-3-avg};

W_{i-3-max} and W_{i-3-avg} are added element-wise to give the feature map W_{i-3}; passing it through a sigmoid activation yields the spatial weight W'_{i-3}, which is multiplied with T'_i to give the final output feature map sequence {T''_1, T''_2, …, T''_n}.
The 3D-ResNet50-TSAM model fused with the space-time attention mechanism can extract deeper and finer chicken features in both the temporal and spatial dimensions, and autonomously learns the important frames of the video and the important regions within each frame.
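A minimal PyTorch sketch of the TSAM described above is given below, assuming the input is the usual 3D-CNN feature tensor of shape (batch, C, n, H, W) with the n feature maps T_1…T_n stacked along the temporal axis; class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class TSAM(nn.Module):
    """Temporal-Spatial Attention Module: temporal weighting followed by spatial weighting."""

    def forward(self, x):                       # x: (B, C, n, H, W)
        # --- temporal attention ---
        # max/avg pooling along the channel direction -> W_{i-1-max} + W_{i-1-avg} = W_{i-1}
        w1 = x.max(dim=1, keepdim=True).values + x.mean(dim=1, keepdim=True)        # (B, 1, n, H, W)
        # global max/avg pooling of each W_{i-1} over H x W -> W_{i-2-max} + W_{i-2-avg} = W_{i-2}
        w2 = w1.amax(dim=(3, 4), keepdim=True) + w1.mean(dim=(3, 4), keepdim=True)  # (B, 1, n, 1, 1)
        t_weight = torch.sigmoid(w2)            # temporal weight W'_{i-2}, one scalar per frame
        x = x * t_weight                        # weighted sequence {T'_1 ... T'_n}

        # --- spatial attention ---
        # max/avg pooling of each T'_i along the channel direction -> W_{i-3}
        w3 = x.max(dim=1, keepdim=True).values + x.mean(dim=1, keepdim=True)        # (B, 1, n, H, W)
        s_weight = torch.sigmoid(w3)            # spatial weight W'_{i-3}
        return x * s_weight                     # final output {T''_1 ... T''_n}

# usage: attn = TSAM(); out = attn(torch.randn(2, 64, 16, 56, 56))
```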
Further preferably, the key of the end-to-end multi-target broiler behavior recognition method integrating the space-time attention mechanism is a three-stage chicken behavior recognition algorithm CBRNet, which mainly comprises the following steps:
CBRNet mainly comprises three modules: a target detection model, a target tracking algorithm and a behavior recognition model;
the target detection model adopts an improved YOLOv8 neural network, the YOLOv8 consists of a Backbone, neck layer and a Head layer, and the improvement of the invention comprises two aspects: firstly, a CBAM attention module is added between a back bone and a Neck layer of YOLOv8, and the CBAM is a lightweight convolution attention module integrating a channel and a spatial attention mechanism, has stronger characteristic expression capability, can help a model learn important areas and important channels of different layers of characteristic diagrams, enhances the distinguishing capability of the model on target chickens and backgrounds in a complex environment, and improves the detection capability and robustness of the model. Second, the loss function of YOLOv8 is optimized. The Loss function of YOLOv8 comprises category classification Loss and boundary box regression Loss, the category classification Loss adopts cross entropy Loss function BCE Loss, the boundary box regression Loss adopts Distribution Focal Loss and cioU Loss, the total Loss function is obtained by weighting and summing the three Loss functions according to a certain weight proportion, the invention decouples the overlapping area of a real box and a predicted box on the basis of cioU Loss, adds penalty items of wide and high overlapping area, and reserves penalty items of length-width ratio to form a Loss function L OACIOU The calculation mode is shown as the following formula, wherein IOU is the cross ratio, h o 、w o Height and width for prediction frame and real frame overlap areaH and w are the height and width of the prediction frame respectively, h gt 、w gt Respectively the height and width of the real frame, h c 、w c The height and width of the smallest bounding rectangle, α being the weight coefficient, v being a function for measuring aspect ratio similarity, are the prediction box and the real box, respectively. Loss function L OACIoU Prediction errors of the width, height and length-width ratio of the prediction frame can be processed more finely, the detection capability of the improved YOLOv8 model on chickens with different sizes and postures is enhanced, and accurate and stable observation values are provided for a subsequent target tracking algorithm.
A chicken flock image dataset is collected and manually annotated to form the target detection dataset, and the improved YOLOv8 neural network is trained on it to obtain a trained target detection model with inference capability.
In the target tracking algorithm (improved ByteTrackV2), the invention adds a saliency scoring module to the IoU-based motion similarity in both the high-score detection-box association stage and the low-score detection-box association stage, i.e. the total similarity simultaneously considers the spatial overlap between the detection box and the track and the saliency score of the target chicken. Because the saliency score of a chicken remains high under occlusion, truncation and similar conditions, a higher total similarity is obtained and the tracking success rate in complex scenes is improved. The total similarity S_T is

S_T = β·IoU(B, T) + (1 - β)·S(B)

where B is a detection box, T is a track, IoU(B, T) is the intersection-over-union of the detection box B and the track T, β is a weight coefficient that balances the influence of spatial overlap and saliency, S(B) is the saliency score of detection box B, b_w and b_h are the width and height of box B, (x, y) is a pixel inside box B, and S(x, y) is the saliency score of that pixel. S(x, y) is obtained from a pre-trained image segmentation model U-Net: the current frame is fed to the target detection model to obtain the detection box B of a target chicken, the same frame is fed to the U-Net model to obtain a saliency distribution map, and box B is mapped onto the saliency map to obtain the saliency score S(x, y) of every pixel inside the box.
The improved ByteTrackV2 target tracking algorithm combined with the target detection model continuously tracks the multiple target chickens, producing continuous image sequences of the chickens.
The behavior recognition model adopts an improved 3D-ResNet50. The 3D-ResNet50 model is a deep residual neural network based on 3D convolution, which can extract the spatial and temporal information of a video simultaneously. The invention adds a spatio-temporal attention module TSAM (Temporal-Spatial Attention Module) to the 3D-ResNet50 model and replaces the ReLU activation function in the 3D residual blocks with the Swish activation function to accelerate convergence, forming the model 3D-ResNet50-TSAM, so that the model can autonomously learn the important positions and key frames in the video. Compared with currently popular spatio-temporal attention modules based on self-attention (such as the Non-Local Block), the TSAM proposed by the invention has low computational complexity, high efficiency and strong interpretability. Its principle is as follows. For the input feature map sequence {T_1, T_2, …, T_n}, where n is the number of images fed to the 3D-ResNet50-TSAM model, each feature map T_i (1 ≤ i ≤ n) has size H×W×C, where H, W and C are the height, width and number of channels of the feature map. The feature map T_i undergoes max pooling and average pooling along the channel direction, giving a max-pooled map W_{i-1-max} of size H×W×1 and an average-pooled map W_{i-1-avg} of size H×W×1; the two are added element-wise to give a feature map W_{i-1} of size H×W×1. Global max pooling and global average pooling applied to W_{i-1} give a max-pooled map W_{i-2-max} of size 1×1×1 and an average-pooled map W_{i-2-avg} of size 1×1×1, which are added to give a feature map W_{i-2} of size 1×1×1. Passing the sequence of W_{i-2} through a sigmoid activation yields the temporal weight W'_{i-2}, which is multiplied with the feature map T_i to give the weighted feature map sequence {T'_1, T'_2, …, T'_n} of size H×W×C; this operation emphasizes the contribution of important time frames and attenuates the interference of weakly correlated frames. On this basis, each feature map T'_i of the sequence {T'_1, T'_2, …, T'_n} undergoes max pooling and average pooling along the channel direction, giving a max-pooled map W_{i-3-max} of size H×W×1 and an average-pooled map W_{i-3-avg} of size H×W×1, which are added element-wise to give a feature map W_{i-3} of size H×W×1. Passing it through a sigmoid activation yields the spatial weight W'_{i-3}, which is multiplied with T'_i to give the final output feature map sequence {T''_1, T''_2, …, T''_n} of size H×W×C, emphasizing the learning of important regions within each time frame.
The 3D-ResNet50-TSAM behavior recognition model uses a weighted cross-entropy loss function L_wce, in which N is the number of behavior categories, w_i is the weight coefficient of the i-th behavior category, y_i is the one-hot encoded value of the i-th ground-truth behavior label, and ŷ_i is the probability of the i-th behavior predicted by the model. L_wce adds a class weight for each behavior category to the cross-entropy loss, which effectively alleviates the imbalance in the number of samples among chicken behavior categories and improves the model's prediction performance on behaviors with few samples. The chicken behaviors to be recognized in the invention comprise six categories: standing, fighting, feeding, drinking, resting and feather preening. The weight coefficients are computed as follows: let the numbers of samples of the six behaviors be (n_1, n_2, n_3, n_4, n_5, n_6) and their mean be m; the corresponding weight coefficient vector is obtained from m and the per-class counts.
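Since the weight formula itself is not reproduced in this text, the sketch below assumes inverse-frequency weighting (w_i = m / n_i, with m the mean class count); the six sample counts are placeholders.

```python
import torch
import torch.nn as nn

# placeholder sample counts for the six behaviors
# (standing, fighting, feeding, drinking, resting, feather preening)
counts = torch.tensor([1200., 150., 900., 600., 1100., 300.])
m = counts.mean()
weights = m / counts                  # assumed weighting: w_i = m / n_i

# L_wce: weighted cross entropy over N = 6 behavior classes
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 6)            # model outputs for a batch of 8 image sequences
labels = torch.randint(0, 6, (8,))    # ground-truth behavior indices
loss = criterion(logits, labels)
```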
Creating the behavior recognition dataset: assuming that a chicken's behavior does not change abruptly within 32 consecutive frames, a 32-frame image sequence of the chicken is cropped on the basis of the continuous tracking produced by the target tracking algorithm; taking the largest image size in the sequence as the reference, images of other sizes are scaled to the reference size with the aspect ratio unchanged, completing image-size normalization. Sixteen frames are then picked from the 32-frame sequence by one of two frame sampling strategies, uniform sampling or random sampling, and assigned a behavior label. Uniform sampling has two modes: the first collects frames {i, i+1, i+2, …, i+15} of the sequence, i ∈ ℕ₀, i ∈ [0, 16]; the second collects the 2i-th or (2i+1)-th frames of the sequence, i ∈ ℕ₀, i ∈ [0, 15]. Random sampling picks 16 frames at random from the sequence. The behavior recognition dataset is built in this way.
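The sampling strategies above can be written directly as follows (Python sketch, 0-based frame indices assumed).

```python
import random

def uniform_sample_window(seq, i):
    """First uniform mode: a contiguous window {i, i+1, ..., i+15}, 0 <= i <= 16 for a 32-frame sequence."""
    assert 0 <= i <= len(seq) - 16
    return seq[i:i + 16]

def uniform_sample_stride(seq, offset=0):
    """Second uniform mode: every other frame, i.e. the 2i-th (offset=0) or (2i+1)-th (offset=1) frames."""
    return seq[offset::2][:16]

def random_sample(seq, k=16):
    """Random mode: pick 16 frames at random (kept in temporal order here)."""
    return [seq[j] for j in sorted(random.sample(range(len(seq)), k))]

# usage on a 32-frame tracked sequence:
# clip = uniform_sample_window(frames, i=4), uniform_sample_stride(frames, 1) or random_sample(frames)
```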
Training the 3D-ResNet50-TSAM model with the behavior recognition dataset yields a chicken behavior recognition model with inference capability.
The target detection model with inference capability, the improved target tracking algorithm and the chicken behavior recognition model with inference capability are cooperatively integrated to form the complete end-to-end multi-target broiler behavior recognition algorithm CBRNet.
Compared with the prior art, the invention has the following advantages:
the invention provides an end-to-end multi-target broiler behavior recognition method integrating a space-time attention mechanism, which forms a complete three-stage broiler behavior recognition algorithm CBRNet by cooperatively integrating a target detection model, a target tracking algorithm and a behavior recognition model. Adding a CBAM attention module into a target detection model, providing a new boundary box loss function, increasing the detection capability of the target detection model on chickens with different sizes and postures, providing an improved calculation method of total similarity between a detection box and a track in a target tracking algorithm, increasing the tracking success rate of the target tracking algorithm on blocked and incomplete targets, adding a space-time attention module into a behavior recognition model, and simultaneously adopting a weighted cross entropy loss function, so that the model can pay attention to important space positions and key frames, effectively deal with the problem of sample imbalance among different behavior categories, thereby improving the accuracy and robustness of the CBRNet behavior recognition algorithm, and finally providing technical support for accurate management of chickens in complex environments and discrimination of sick, weak, incomplete and dead chickens.
Drawings
FIG. 1 is a schematic diagram of a three-stage chicken behavior recognition method in an embodiment of the invention;
FIG. 2 is a diagram of an improved YOLOv8 object detection model architecture in accordance with embodiments of the present invention;
FIG. 3 is a schematic diagram of the L_OACIoU calculation method in the improved YOLOv8 target detection model in an embodiment of the invention;
FIG. 4 is a diagram of an improved ByteTrackV2 target tracking algorithm architecture in accordance with embodiments of the present invention;
FIG. 5 is a diagram of a 3D-ResNet50-TSAM model architecture in accordance with an embodiment of the present invention;
FIG. 6 is a diagram of a residual block architecture modified in the 3D-ResNet50-TSAM model in accordance with an embodiment of the present invention;
FIG. 7 is a diagram of a space-time attention module TSAM architecture in a residual block modified in the 3D-ResNet50-TSAM model in an embodiment of the present invention.
Detailed Description
The following describes the embodiments of the present invention in detail with reference to the drawings.
As shown in fig. 1, the end-to-end three-stage chicken behavior recognition algorithm CBRNet mainly comprises a target detection stage, a target tracking stage and a behavior recognition stage. The target detection model combined with the target tracking algorithm yields continuous image sequences of the chickens in the video, and the behavior recognition model classifies these sequences to obtain a behavior label for each chicken in the video.
The invention takes the performance of the CBRNet algorithm into consideration, and each module in the algorithm is optimized in a targeted manner.
An algorithmic architecture diagram for the improved YOLOv8 object detection model is shown in fig. 2. CBAM attention modules are added between the Backbone and the Neck of YOLOv8, specifically as follows: a CBAM attention module is added between the second C2f component of the Backbone and the second Concat layer of the Neck, a CBAM attention module is added between the third C2f component of the Backbone and the first Concat layer of the Neck, and a CBAM attention module is added between the SPPF component of the Backbone and the fourth Concat layer of the Neck. This helps the model learn the important regions and important channels of feature maps at different levels, enhances the model's ability to distinguish target chickens from the background in complex environments, and improves detection capability and robustness.
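For illustration, a compact PyTorch sketch of a standard CBAM block of the kind inserted at these three points is given below (channel attention followed by spatial attention); the reduction ratio and spatial kernel size are the commonly used defaults rather than values specified by the patent.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention then spatial attention."""

    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP for channel attention
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                              # x: (B, C, H, W)
        # channel attention from global avg- and max-pooled descriptors
        ca = torch.sigmoid(self.mlp(x.mean(dim=(2, 3), keepdim=True)) +
                           self.mlp(x.amax(dim=(2, 3), keepdim=True)))
        x = x * ca
        # spatial attention from channel-wise avg and max maps
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)))
        return x * sa
```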
YOLOv8 uses Distribution Focal Loss and CIoU Loss for the bounding-box regression loss. On the basis of CIoU Loss, the invention decouples the overlap area of the ground-truth box and the predicted box, adds penalty terms for the width and height of the overlap area, and retains the aspect-ratio penalty term, forming the loss function L_OACIoU. The calculation is illustrated in FIG. 3, where IoU is the intersection-over-union, h_o and w_o are the height and width of the overlap region of the predicted and ground-truth boxes, h and w are the height and width of the predicted box, h_gt and w_gt are the height and width of the ground-truth box, h_c and w_c are the height and width of the smallest enclosing rectangle of the predicted and ground-truth boxes, α is a weight coefficient, and v is a function measuring aspect-ratio similarity. The loss function L_OACIoU handles the prediction errors of the width, height and aspect ratio of the predicted box more finely, strengthens the detection capability of the improved YOLOv8 model for chickens of different sizes and postures, and provides accurate and stable observations for the subsequent target tracking algorithm.
The improved YOLOv8 loss function comprises a classification loss and a bounding-box regression loss: the classification loss uses the binary cross-entropy loss BCE Loss, and the bounding-box regression loss uses Distribution Focal Loss and L_OACIoU. The total loss is a weighted sum of the three losses; in this embodiment of the invention the weight coefficients are 0.5, 1.5 and 7.5 respectively.
Chicken flock video data are collected and a chicken flock target detection dataset is built. The improved YOLOv8 neural network is trained with this target detection dataset to obtain a target detection model with inference capability.
As shown in FIG. 4, the improved ByteTrackV2 target tracking algorithm comprises four stages: detection-box processing, motion prediction, high-score detection-box association and low-score detection-box association. In the high-score and low-score association stages the invention adds a saliency scoring module to the IoU-based motion similarity, i.e. the total similarity simultaneously considers the spatial overlap between the detection box and the track and the saliency score of the target chicken. Because the saliency score of a chicken remains high under occlusion, truncation and similar conditions, a higher total similarity is obtained and the tracking success rate in complex scenes is improved. The total similarity S_T is

S_T = β·IoU(B, T) + (1 - β)·S(B)

where B is a detection box, T is a track, IoU(B, T) is the intersection-over-union of the detection box B and the track T, β is a weight coefficient that balances the influence of spatial overlap and saliency, S(B) is the saliency score of detection box B, b_w and b_h are the width and height of box B, (x, y) is a pixel inside box B, and S(x, y) is the saliency score of that pixel. S(x, y) is obtained from a pre-trained image segmentation model U-Net: the current frame is fed to the target detection model to obtain the detection box B of a target chicken, the same frame is fed to the U-Net model to obtain a saliency distribution map, and box B is mapped onto the saliency map to obtain the saliency score S(x, y) of every pixel inside the box. In this embodiment of the invention, β takes the value 0.9.
The improved ByteTrackV2 algorithm combined with the target detection model continuously tracks the multiple target chickens, producing continuous image sequences of the chickens.
As shown in fig. 5, in the overall structure of the 3D-ResNet50-TSAM behavior recognition model the chicken image sequence passes through a 3D convolution layer and four groups of improved residual blocks to extract features; a global average pooling layer (Global AvgPool) compresses the spatial dimensions of the features, a fully connected layer FC fits the nonlinear relations of the features, and finally a softmax function maps the output to probability values, giving the final behavior classification result.
As shown in fig. 6, in the improved residual block architecture of the 3D-ResNet50-TSAM model, residual block A in fig. 5 has feature channel number F=64, residual block B has F=128, residual block C has F=256 and residual block D has F=512. The residual block consists of three groups of CBS components, each comprising a 3D convolution, batch normalization and a Swish activation function; an attention module TSAM is added between the batch normalization layer and the activation function of the third CBS component, and the output of the TSAM is residually connected with the input of the residual block.
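A PyTorch sketch of one way to realize this modified bottleneck block is given below; it reuses the TSAM module sketched earlier, and the kernel sizes and downsampling shortcut are assumed to follow the standard 3D-ResNet50 bottleneck design rather than being specified by the text.

```python
import torch
import torch.nn as nn

class ResidualBlockTSAM(nn.Module):
    """3D bottleneck residual block with Swish activations and a TSAM before the final addition."""

    def __init__(self, in_ch, mid_ch, stride=1):
        super().__init__()
        out_ch = mid_ch * 4
        self.cbs1 = nn.Sequential(nn.Conv3d(in_ch, mid_ch, 1, bias=False),
                                  nn.BatchNorm3d(mid_ch), nn.SiLU())          # Swish == SiLU
        self.cbs2 = nn.Sequential(nn.Conv3d(mid_ch, mid_ch, 3, stride=stride,
                                            padding=1, bias=False),
                                  nn.BatchNorm3d(mid_ch), nn.SiLU())
        self.conv3 = nn.Conv3d(mid_ch, out_ch, 1, bias=False)                 # third CBS: conv + BN,
        self.bn3 = nn.BatchNorm3d(out_ch)                                     # TSAM sits before its Swish
        self.tsam = TSAM()                         # the spatio-temporal attention module sketched above
        self.act = nn.SiLU()
        self.shortcut = (nn.Identity() if in_ch == out_ch and stride == 1 else
                         nn.Sequential(nn.Conv3d(in_ch, out_ch, 1, stride=stride, bias=False),
                                       nn.BatchNorm3d(out_ch)))

    def forward(self, x):
        y = self.tsam(self.bn3(self.conv3(self.cbs2(self.cbs1(x)))))
        return self.act(y + self.shortcut(x))     # TSAM output residually added to the block input
```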
As shown in fig. 7, the spatio-temporal attention module TSAM inside the improved residual block of the improved 3D-ResNet50 model is a fusion of a temporal attention mechanism and a spatial attention mechanism; compared with currently popular spatio-temporal attention modules based on self-attention (such as the Non-Local Block), it has low computational complexity, high efficiency and strong interpretability. Its principle is as follows. For the input feature map sequence {T_1, T_2, …, T_n}, where n is the number of images fed to the 3D-ResNet50-TSAM model, each feature map T_i (1 ≤ i ≤ n) has size H×W×C, where H, W and C are the height, width and number of channels of the feature map. The feature map T_i undergoes max pooling and average pooling along the channel direction, giving a max-pooled map W_{i-1-max} of size H×W×1 and an average-pooled map W_{i-1-avg} of size H×W×1; the two are added element-wise to give a feature map W_{i-1} of size H×W×1. Global max pooling and global average pooling applied to W_{i-1} give a max-pooled map W_{i-2-max} of size 1×1×1 and an average-pooled map W_{i-2-avg} of size 1×1×1, which are added to give a feature map W_{i-2} of size 1×1×1. Passing the sequence of W_{i-2} through a sigmoid activation yields the temporal weight W'_{i-2}, which is multiplied with the feature map T_i to give the weighted feature map sequence {T'_1, T'_2, …, T'_n} of size H×W×C; this operation emphasizes the contribution of important time frames and attenuates the interference of weakly correlated frames. On this basis, each feature map T'_i of the sequence {T'_1, T'_2, …, T'_n} undergoes max pooling and average pooling along the channel direction, giving a max-pooled map W_{i-3-max} of size H×W×1 and an average-pooled map W_{i-3-avg} of size H×W×1, which are added element-wise to give a feature map W_{i-3} of size H×W×1. Passing it through a sigmoid activation yields the spatial weight W'_{i-3}, which is multiplied with T'_i to give the final output feature map sequence {T''_1, T''_2, …, T''_n} of size H×W×C, emphasizing the learning of important regions within each time frame.
The 3D-ResNet50-TSAM behavior recognition model uses a weighted cross-entropy loss function L_wce, defined with N the number of behavior categories, w_i the weight coefficient of the i-th behavior category, y_i the one-hot encoded value of the i-th ground-truth behavior label, and ŷ_i the probability of the i-th behavior predicted by the model. L_wce adds a class weight for each behavior category to the cross-entropy loss, which effectively alleviates the imbalance in the number of samples among chicken behavior categories and improves the model's prediction performance on behaviors with few samples. The chicken behaviors to be recognized in the invention comprise six categories: standing, fighting, feeding, drinking, resting and feather preening. The weight coefficients are computed as follows: let the vector of sample counts of the six behaviors be (n_1, n_2, n_3, n_4, n_5, n_6) and their mean be m; the corresponding weight coefficient vector is obtained from m and the per-class counts.
Creating the behavior recognition dataset: assuming that a chicken's behavior does not change abruptly within 32 consecutive frames, a 32-frame image sequence of each chicken is cropped on the basis of the continuous tracking produced by the target tracking algorithm; taking the largest image size in the sequence as the reference, images of other sizes are scaled to the reference size with the aspect ratio unchanged, completing image-size normalization. Sixteen frames are then picked from the 32-frame sequence by one of two frame sampling strategies, uniform sampling or random sampling, and assigned a behavior label. Uniform sampling has two modes: the first collects frames {i, i+1, i+2, …, i+15} of the sequence, i ∈ ℕ₀, i ∈ [0, 16]; the second collects the 2i-th or (2i+1)-th frames of the sequence, i ∈ ℕ₀, i ∈ [0, 15]. Random sampling picks 16 frames at random from the sequence.
The behavior recognition dataset built in this way is used to train the 3D-ResNet50-TSAM model, yielding a chicken behavior recognition model with inference capability.
The target detection model with inference capability, the improved ByteTrackV2 target tracking algorithm and the chicken behavior recognition model with inference capability together form the three-stage chicken behavior recognition algorithm CBRNet.
Inference flow of the three-stage chicken behavior recognition algorithm CBRNet: continuous 32-frame image sequences of the chickens are obtained cyclically through the target detection model and the target tracking algorithm, the image sizes are normalized, and the input data (B, C, T, H, W) of the behavior recognition model are formed, where B is the batch size, i.e. the number of chickens tracked in the video, C is the number of channels (3 for RGB video), T is the number of video frames, i.e. the length of the image sequence (16), H is the image height and W is the image width; the 16 frames are obtained by uniform or random sampling, and finally the behavior recognition model infers the behavior labels of the multi-target broilers. It should be noted that if the chicken under a given tracking id is lost only briefly due to occlusion, illumination change, motion blur or the like and is then recovered under the same tracking id, the lost frames are replaced by blank frames; if tracking of a target is lost completely, a new tracking id is assigned when the target is tracked again, 16 frames are collected under the new id according to the same logic, and these are fed into the behavior recognition model for behavior classification.
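To make the tensor bookkeeping concrete, the sketch below assembles one behavior-recognition input clip from a per-track frame buffer, substituting blank frames for briefly lost detections as described above; all helper names and the simple frame-picking rule are illustrative assumptions.

```python
import numpy as np
import torch

def build_clip(track_frames, t=16):
    """track_frames: 32 cropped chicken images, already scaled to the reference size (H, W, 3),
    with None marking frames where tracking was briefly lost."""
    ref = next(f for f in track_frames if f is not None)
    frames = [np.zeros_like(ref) if f is None else f for f in track_frames]   # blank frames for losses
    idx = np.linspace(0, len(frames) - 1, t).round().astype(int)              # one simple uniform pick of 16
    clip = np.stack([frames[i] for i in idx])                                 # (T, H, W, C)
    return torch.from_numpy(clip).permute(3, 0, 1, 2).float()                 # (C, T, H, W)

# batching over all tracked chickens in the video gives (B, C, T, H, W), C = 3 for RGB, T = 16:
# batch = torch.stack([build_clip(buf) for buf in per_track_buffers])
```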

Claims (7)

1. An end-to-end multi-target broiler behavior identification method integrating a space-time attention mechanism, characterized by comprising the following steps:

1) Acquiring chicken flock videos and building a chicken flock target detection dataset;

2) Improving the YOLOv8 neural network, and training the improved YOLOv8 neural network with the target detection dataset of step 1) to obtain a target detection model with inference capability;

3) Performing target detection on the chicken flock video to be analyzed with the target detection model with inference capability obtained in step 2), associating chickens in adjacent frames with an improved ByteTrackV2 target tracking algorithm to obtain continuous image sequences of the chickens, and building a behavior recognition dataset;

4) Training a 3D-ResNet50-TSAM model fused with a space-time attention mechanism on the behavior recognition dataset of step 3) to obtain a chicken behavior recognition model with inference capability;

5) Combining the target detection model with inference capability from step 2), the improved ByteTrackV2 target tracking algorithm from step 3) and the chicken behavior recognition model with inference capability from step 4) into a three-stage chicken behavior recognition algorithm CBRNet, and using CBRNet to perform end-to-end multi-target broiler behavior recognition on the image sequences in the chicken flock video to be analyzed.
2. The method for identifying end-to-end multi-target broiler behaviors with fusion of spatiotemporal attention mechanisms according to claim 1, wherein in step 2) the improved YOLOv8 neural network comprises a backbone network, a neck network and a detection head, with a CBAM attention module arranged between the backbone network and the neck network;

and the bounding-box regression loss of the detection head uses the loss function L_OACIoU.
3. The method for identifying end-to-end multi-target broiler behaviors with fusion of spatiotemporal attention mechanisms according to claim 2, wherein in step 2) the loss function L_OACIoU is constructed from the following quantities:

IoU, the intersection-over-union; h_o and w_o, the height and width of the overlap region of the predicted and ground-truth boxes; h and w, the height and width of the predicted box; h_gt and w_gt, the height and width of the ground-truth box; h_c and w_c, the height and width of the smallest enclosing rectangle of the predicted and ground-truth boxes; α, a weight coefficient; and v, a function measuring aspect-ratio similarity.
4. The method for identifying end-to-end multi-target broiler behaviors with fusion of spatiotemporal attention mechanisms according to claim 1, wherein in step 3) the improved ByteTrackV2 target tracking algorithm performs target association on chickens in adjacent frames, specifically comprising:

computing the total similarity S_T in both the high-score detection-box association stage and the low-score detection-box association stage:

S_T = β·IoU(B, T) + (1 - β)·S(B)

wherein B is a detection box, T is a track, IoU(B, T) is the intersection-over-union of the detection box B and the track T, β is a weight coefficient that balances the influence of spatial overlap and saliency, S(B) is the saliency score of detection box B, b_w and b_h are the width and height of box B, (x, y) is a pixel inside box B, and S(x, y) is the saliency score of that pixel.
5. The method for identifying end-to-end multi-target broiler behaviors with fusion of spatiotemporal attention mechanisms according to claim 1, wherein in step 3) S(x, y) is obtained from a pre-trained image segmentation model U-Net: the current frame image is input into the target detection model to obtain the detection box B of a target chicken, the same frame image is input into the image segmentation model U-Net to obtain a saliency distribution map, and box B is mapped onto the saliency distribution map to obtain the saliency score S(x, y) of each pixel inside the box.
6. The method for identifying end-to-end multi-target broiler behaviors with fusion of spatiotemporal attention mechanisms according to claim 1, wherein in step 4) the 3D-ResNet50-TSAM model fused with the spatiotemporal attention mechanism sequentially comprises: a 3D convolution layer, a first set of residual blocks, a second set of residual blocks, a third set of residual blocks, a fourth set of residual blocks, a pooling layer, a fully connected layer, and an activation function;
a spatio-temporal attention module TSAM is added to each of the first, second, third and fourth sets of residual blocks.
7. The method for identifying end-to-end multi-target broiler behaviors with fusion of spatiotemporal attention mechanisms according to claim 6, wherein in step 4), the spatiotemporal attention module TSAM comprises:
for the input feature map sequence { T ] 1 、T 2 …T n -where n is the number of image sequences input to the 3D-ResNet50-TSAM model, the feature map sequence { T } 1 、T 2 …T n Each bit in }Sign T i (1. Ltoreq.i.ltoreq.n) carrying out maximum pooling and average pooling operations along the channel direction to obtain a maximum pooling characteristic diagram W i-1-max Average pooling feature map W i-1-avg
Will W i-1-max Feature map and W i-1-avg The feature images are added in a para mode to obtain a feature image W i-1 Will characteristic diagram W i-1 Performing global maximum pooling and global average pooling operations to obtain a maximum pooling feature map W i-2-max Average pooling feature map W i-2-avg
Will W i-2-max Feature map and W i-2-avg Adding the feature images to obtain a feature image W i-2 Sequence of feature maps W i-2 After the sigmoid activation function, the time weight W 'is obtained' i-2 Weighting time W' i-2 And feature map T i Multiplying to obtain weighted feature diagram sequence { T' 1 、T′ 2 …T′ n };
For the weighted sequence of feature maps { T' 1 、T′ 2 …T′ n Each feature map T 'in }' i Carrying out maximum pooling and average pooling operation along the channel direction to obtain a maximum pooling characteristic diagram W i-3-max Average pooling feature map W i-3-avg
Will characteristic diagram W i-3-max And characteristic diagram W i-3-avg Adding the positions to obtain a characteristic diagram W i-3 After a sigmoid activation function, a spatial weight W 'is obtained' i-3 Weighting the space weight W' i-3 And T' i Multiplying to obtain final output characteristic diagram sequence { T } 1 、T″ 2 …T″ n }。
CN202311400318.4A 2023-10-26 2023-10-26 End-to-end multi-target broiler behavior identification method integrating space-time attention mechanism Pending CN117333948A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311400318.4A CN117333948A (en) 2023-10-26 2023-10-26 End-to-end multi-target broiler behavior identification method integrating space-time attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311400318.4A CN117333948A (en) 2023-10-26 2023-10-26 End-to-end multi-target broiler behavior identification method integrating space-time attention mechanism

Publications (1)

Publication Number Publication Date
CN117333948A true CN117333948A (en) 2024-01-02

Family

ID=89293098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311400318.4A Pending CN117333948A (en) 2023-10-26 2023-10-26 End-to-end multi-target broiler behavior identification method integrating space-time attention mechanism

Country Status (1)

Country Link
CN (1) CN117333948A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117830788A (en) * 2024-03-06 2024-04-05 潍坊科技学院 Image target detection method for multi-source information fusion
CN117830788B (en) * 2024-03-06 2024-05-10 潍坊科技学院 Image target detection method for multi-source information fusion

Similar Documents

Publication Publication Date Title
CN111259850B (en) Pedestrian re-identification method integrating random batch mask and multi-scale representation learning
CN111310862B (en) Image enhancement-based deep neural network license plate positioning method in complex environment
CN109919108B (en) Remote sensing image rapid target detection method based on deep hash auxiliary network
CN107016357B (en) Video pedestrian detection method based on time domain convolutional neural network
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN108830188A (en) Vehicle checking method based on deep learning
CN109684922B (en) Multi-model finished dish identification method based on convolutional neural network
CN111814845B (en) Pedestrian re-identification method based on multi-branch flow fusion model
CN114821014B (en) Multi-mode and countermeasure learning-based multi-task target detection and identification method and device
CN111209832B (en) Auxiliary obstacle avoidance training method, equipment and medium for substation inspection robot
CN109635726B (en) Landslide identification method based on combination of symmetric deep network and multi-scale pooling
CN114170511B (en) CASCADE RCNN-based pavement crack disease identification method
CN117333948A (en) End-to-end multi-target broiler behavior identification method integrating space-time attention mechanism
CN112818905A (en) Finite pixel vehicle target detection method based on attention and spatio-temporal information
CN113326738B (en) Pedestrian target detection and re-identification method based on deep network and dictionary learning
Wei et al. Novel green-fruit detection algorithm based on D2D framework
McLeay et al. Deep convolutional neural networks with transfer learning for waterline detection in mussel farms
CN117315556A (en) Improved Vision Transformer insect fine grain identification method
CN117788810A (en) Learning system for unsupervised semantic segmentation
CN115546668A (en) Marine organism detection method and device and unmanned aerial vehicle
CN111046861B (en) Method for identifying infrared image, method for constructing identification model and application
CN114913485A (en) Multi-level feature fusion weak supervision detection method
CN111401286B (en) Pedestrian retrieval method based on component weight generation network
ALSAADI et al. An automated classification of mammals and reptiles animal classes using deep learning
CN114663916A (en) Thermal infrared human body target identification method based on depth abstract features

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination