CN116563895A - Video-based animal individual identification method - Google Patents

Video-based animal individual identification method

Info

Publication number
CN116563895A
CN116563895A
Authority
CN
China
Prior art keywords
features
node
local
feature
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310840442.6A
Other languages
Chinese (zh)
Inventor
赵启军
陈鹏
李蕾
侯蓉
何梦楠
唐金龙
吴鹏程
彭宗铭
闵清悦
邱涵茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHENGDU RESEARCH BASE OF GIANT PANDA BREEDING
Sichuan University
Original Assignee
CHENGDU RESEARCH BASE OF GIANT PANDA BREEDING
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHENGDU RESEARCH BASE OF GIANT PANDA BREEDING, Sichuan University filed Critical CHENGDU RESEARCH BASE OF GIANT PANDA BREEDING
Priority to CN202310840442.6A priority Critical patent/CN116563895A/en
Publication of CN116563895A publication Critical patent/CN116563895A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 40/00 Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
    • Y02A 40/70 Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in livestock or poultry

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video-based animal individual identification method, which comprises the following steps: step S1, extracting features of each frame in an animal tracking video based on a convolutional neural network; step S2, acquiring local discriminant features in each frame based on a local discriminant feature extraction module; step S3, aggregating the local discriminant features based on a hypergraph neural network to obtain final local features; step S4, aggregating the cross-frame features among the per-frame features extracted in step S1 by temporal average pooling to obtain global features; and step S5, concatenating the local features and the global features to obtain a hybrid feature of the animal, so as to realize individual identification of the animal. When applied, the method improves the accuracy of individual animal identification in unconstrained shooting environments.

Description

Video-based animal individual identification method
Technical Field
The invention relates to the technical field of computer vision, in particular to a video-based animal individual identification method.
Background
With the development of computer technology, computer vision and image processing techniques are increasingly applied to animal individual identification research, and picture-based animal individual identification methods are developing rapidly. Picture-based methods usually rely on certain conditions, such as visibility of the face or of side patterns; commonly studied species include pandas, Siberian tigers, seals and whales. However, clear, unobstructed pictures of an animal's face or whole body are difficult to capture in real scenes, so a large amount of unusable data often has to be discarded, and the performance of recognition models degrades sharply under occlusion, darkness or blur.
At present, video-based re-identification is mainly applied to pedestrians. Compared with pedestrians, animal motion is more random and prone to heavy external occlusion, self-occlusion and motion blur; different individuals of the same species differ only slightly and show no obvious appearance difference, while the same individual can exhibit large intra-class variation across postures. Applying video-based re-identification to individual animal identification therefore faces major challenges, and no related work has been reported in the prior art.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a video-based animal individual identification method which, when applied, improves the accuracy of individual animal identification in unconstrained shooting environments.
The aim of the invention is mainly realized by the following technical scheme:
a video-based animal individual identification method comprising the steps of:
Step S1, extracting features of each frame in the animal tracking video based on a convolutional neural network;
Step S2, acquiring local discriminant features in each frame based on a local discriminant feature extraction module;
Step S3, aggregating the local discriminant features based on a hypergraph neural network to obtain final local features;
Step S4, aggregating the cross-frame features among the per-frame features extracted in step S1 by temporal average pooling to obtain global features;
Step S5, concatenating the local features and the global features to obtain a hybrid feature of the animal, so as to realize individual identification of the animal.
Further, the local discriminant features obtained in step S2 are feature maps on each frame, the feature maps being obtained by searching the information of each frame with a sliding window.
Further, step S2 further includes sliding windows of different sizes over the obtained feature map to output feature maps of different sizes, calculating the activation mean of each window by global average pooling, sorting the activation means, selecting a set number of windows according to the sorting result, cropping the local regions selected by the windows from the input original image, and feeding them into the convolutional neural network again for learning, so as to obtain new feature maps as the local discriminant features.
Further, the step S3 includes the following steps:
Step S31, constructing a hypergraph G = (V, E), where G denotes the hypergraph, V denotes the node set and E denotes the hyperedge set; the nodes of the hypergraph are the local discriminant features obtained in step S2, each hyperedge connects a plurality of nodes, and the nodes connected by each hyperedge are selected according to a set fixed time threshold T_h;
Step S32, updating the node features with a hypergraph neural network to obtain hypergraph features.
Further, in step S31, the nodes connected by each hyperedge are selected according to the set fixed time threshold T_h as follows: for any node, the nodes whose time interval to it lies within the fixed time threshold T_h are first found; the K nearest neighbors of the node among them are then calculated with the K-nearest-neighbor algorithm, and one hyperedge is used to connect these K+1 feature nodes; this operation is performed for all nodes to establish hyperedges, obtaining a plurality of hyperedges.
Further, step S32 includes constructing hyperedge features from the node features and completing the node feature update by aggregating the hyperedge features;
constructing the hyperedge features from the node features comprises: for each node, obtaining all hyperedges containing the node, and for each such hyperedge, averaging the features of the nodes other than this node, thereby constructing the feature of the hyperedge;
completing the node feature update by aggregating the hyperedge features comprises: aggregating the hyperedge features to obtain all the hyperedge information related to the node, concatenating the node's own information with the hyperedge information, and updating the node features with a fully connected layer; the same operation is performed on the nodes of all graphs, and average pooling is performed over the features of all nodes to obtain the final local features.
The invention adopts a dual-branch network structure comprising a local branch and a global branch, which extract the local features and global features of the animal respectively, fully mining fine-grained features while retaining global information and combining them to obtain a more discriminative feature representation. Because animals are more flexible than pedestrians and take on more varied postures, the invention designs a fine-grained feature extraction module on the local branch that adaptively searches for discriminative features in each frame, so as to adapt to complex posture changes and, by further mining discriminative local features, to find the differences between different individuals with similar appearance as far as possible. The invention also establishes temporal dependencies between the local features of different frames with a hypergraph neural network and aggregates the local features, so as to better handle the case where the animal individual is occluded in certain frames.
In summary, compared with the prior art, the invention has the following beneficial effects: inspired by video-based pedestrian re-identification technology, the invention provides a video-based animal individual identification method to address the limitations of picture-based animal individual identification methods.
Drawings
The accompanying drawings, which are included to provide a further understanding of embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention. In the drawings:
FIG. 1 is a flow chart of an embodiment of the present invention;
fig. 2 is a schematic diagram of a network architecture corresponding to the application of an embodiment of the present invention in panda individual identification.
Detailed Description
For the purpose of making apparent the objects, technical solutions and advantages of the present invention, the present invention will be further described in detail with reference to the following examples and the accompanying drawings, wherein the exemplary embodiments of the present invention and the descriptions thereof are for illustrating the present invention only and are not to be construed as limiting the present invention.
Examples:
As shown in fig. 1, the video-based animal individual identification method comprises the following steps: step S1, extracting features of each frame in the animal tracking video based on a convolutional neural network; step S2, acquiring local discriminant features in each frame based on a local discriminant feature extraction module; step S3, aggregating the local discriminant features based on a hypergraph neural network to obtain final local features; step S4, aggregating the cross-frame features among the per-frame features extracted in step S1 by temporal average pooling to obtain global features; and step S5, concatenating the local features and the global features to obtain a hybrid feature of the animal, so as to realize individual identification of the animal. In this embodiment, the temporal average pooling in step S4, i.e. global average pooling along the time dimension, aggregates the cross-frame features to obtain the global feature of the animal. Steps S1 to S3 are executed sequentially, step S4 is executed in parallel with steps S2 and S3, and step S5 is executed after steps S3 and S4 have finished. In this embodiment, the convolutional neural network uses a ResNet-50 network as the backbone to extract the features of each frame.
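As a minimal sketch (not the patented implementation itself), the global branch described above could be written in PyTorch roughly as follows; the class name GlobalBranch, the clip tensor layout and the use of torchvision's ResNet-50 are illustrative assumptions, while only the ResNet-50 backbone and the temporal average pooling of step S4 are taken from the text.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class GlobalBranch(nn.Module):
    """Illustrative sketch: per-frame ResNet-50 features aggregated across
    time by temporal average pooling (global average pooling over frames)."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=None)
        # Drop the classification head; keep the convolutional trunk + spatial GAP.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])

    def forward(self, clip):
        # clip: (B, T, 3, H, W) video-level clip of T frames
        b, t, c, h, w = clip.shape
        frames = clip.view(b * t, c, h, w)
        feats = self.backbone(frames).flatten(1)   # (B*T, 2048) per-frame features
        feats = feats.view(b, t, -1)               # (B, T, 2048)
        f_global = feats.mean(dim=1)               # temporal average pooling -> (B, 2048)
        return feats, f_global

# usage: per_frame, f_global = GlobalBranch()(torch.randn(2, 8, 3, 224, 224))
```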
To relieve the burden of manual identification, many deep-learning-based methods have been studied for automated individual animal identification. Different individuals of the same species often differ only slightly in appearance, so most work distinguishes them based on fine-grained local characteristics of the animals. At present, almost all animal re-identification work is picture-based; no method tailored to video-based individual animal identification has been designed according to the characteristics of animals, and the performance of existing methods drops significantly in unconstrained environments. Therefore, this embodiment proposes to perform individual recognition on video, to enhance the robustness of the recognition model and broaden its application scenarios. In current video-based re-identification work, the subject is typically a pedestrian. Deep-learning-based video re-identification mainly explores how to aggregate a series of image-level features into a video-level feature, i.e. how to perform temporal modeling. Existing temporal modeling methods for video-based pedestrian re-identification fall into five main categories: temporal pooling, temporal attention, optical flow, recurrent neural networks (RNNs) and three-dimensional convolutional neural networks. However, existing methods do not consider temporal modeling of finer-grained parts, making it hard to distinguish individuals with similar appearance and to cope with occlusion, posture change and similar problems. To further capture identifying characteristics between different individuals, some temporal modeling methods combining local and global information have been proposed; they usually obtain local pedestrian features by horizontally slicing the feature map, but animals differ from pedestrians: pedestrians usually have a stable top-to-bottom structure, whereas many animals can take arbitrary postures. Therefore, this embodiment adopts a more flexible way of obtaining local features, searching for discriminative features through adaptive learning by the model.
As shown in fig. 2, when this embodiment is applied to panda individual recognition, given a panda video clip with T frames, the features of each frame are first extracted by a convolutional neural network. Then, in the local branch, a local discriminant feature extraction module (DPSM, Discriminative Patch Search Module) obtains the discriminative local parts of each frame, and a hypergraph neural network module (PHM, Patch Hypergraph Module) aggregates these local key features to obtain the local feature representation of the panda. Meanwhile, in the global branch, temporal average pooling aggregates the cross-frame features to obtain the global feature representation of the panda. Finally, by concatenating the local and global features of the panda, a hybrid feature representation is obtained to achieve individual identification. In FIG. 2, T is the number of frames in the video-level clip, H is the frame height, W is the frame width, F_1 is the feature of the first frame extracted by the convolutional neural network, F_T is the feature of the T-th frame, BN denotes batch normalization, FC denotes the fully connected layer, L_tri denotes the triplet loss, L_ce denotes the cross-entropy loss, f_local denotes the local features, and f_global denotes the global features.
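To make the fusion at the end of FIG. 2 concrete, the following sketch concatenates the local and global features and passes them through batch normalization and a fully connected classifier; the feature dimensions, the number of identities and the class name HybridHead are assumptions for illustration, not values stated in the patent.

```python
import torch
import torch.nn as nn

class HybridHead(nn.Module):
    """Sketch of the fusion head in FIG. 2: concat(f_local, f_global) -> BN -> FC."""
    def __init__(self, local_dim=2048, global_dim=2048, num_ids=60):
        super().__init__()
        self.bn = nn.BatchNorm1d(local_dim + global_dim)
        self.fc = nn.Linear(local_dim + global_dim, num_ids)  # identity logits

    def forward(self, f_local, f_global):
        f_mixed = torch.cat([f_local, f_global], dim=1)  # hybrid feature (step S5)
        logits = self.fc(self.bn(f_mixed))
        return f_mixed, logits
```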
The local discriminant features obtained in step S2 of this embodiment are feature maps on each frame, obtained by searching the information of each frame with a sliding window. Step S2 of this embodiment further includes sliding windows of different sizes over the obtained feature map to output feature maps of different sizes, calculating the activation mean of each window by global average pooling, sorting the activation means, selecting a set number of windows from large to small according to the sorting result, cropping the local regions selected by the windows from the original image, and feeding them into the convolutional neural network again for learning, so as to obtain new feature maps as the local discriminant features.
Different animal individuals of the same species are quite similar, and when this embodiment is applied to panda individual identification, some local features may be invisible in one frame but visible in another due to occlusion. This embodiment distinguishes different animal individuals better by learning fine-grained features. In pedestrian re-identification, local features are often obtained by horizontally slicing the feature map, but animals differ greatly from pedestrians: an animal can undergo large deformation and self-occlusion when its motion amplitude is large. The inventors believe that horizontal slicing does not match the distinguishing characteristics of animals well. Therefore, this embodiment proposes a local discriminant feature extraction module (DPSM, Discriminative Patch Search Module) that searches for key information on each frame through adaptive learning by the network.
Specifically, this embodiment obtains the feature map of each input frame through the convolutional neural network and, in the local discriminant feature extraction module, searches for the most information-rich regions in each frame using the sliding-window strategy from object detection, taking them as local key features; the process is realized with a fully convolutional network. Windows of different sizes are slid over the obtained feature map, outputting feature maps of different sizes. This is because animals move in varied ways, such as standing, eating or sleeping, and thus exhibit rich shape variation; if a sliding window of a single size were used, it would be difficult to match animal features across all these shapes. Then, the activation mean of each window is calculated by global average pooling; the larger the activation mean, the richer the information contained in the window. The activation means are therefore sorted. In addition, visual inspection by the inventors showed that the distinguishing characteristics of the animals are mostly distributed over the head, tail and neck, so the windows selected when this embodiment is applied should not be concentrated in only one area. Non-maximum suppression (NMS) is used to select N windows according to the sorting result; the corresponding local regions are cropped from the input original image and fed into the convolutional neural network again for learning, obtaining new feature maps. In this way, this embodiment can attend as much as possible to the detail differences between different individuals of the same species.
The local discriminant feature extraction module of this embodiment is implemented with a local pooling layer that slides over the feature map using sliding windows of different sizes (i.e. pooling kernels of different sizes) to calculate the activation mean of each window. The means are then sorted, and a fixed number of windows is selected using non-maximum suppression (NMS), each window corresponding to one local discriminant feature.
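A rough sketch of this window-search logic is given below, using torchvision's NMS operator; the window sizes, the number of kept windows, the IoU threshold and the function name search_discriminative_patches are illustrative assumptions rather than the exact patented module.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import nms

def search_discriminative_patches(feature_map, window_sizes=(2, 4, 6),
                                  num_windows=4, iou_thr=0.3):
    """feature_map: (C, H, W) single-frame feature map.
    Returns boxes (x1, y1, x2, y2) in feature-map coordinates of selected regions."""
    fm = feature_map.mean(dim=0, keepdim=True).unsqueeze(0)  # (1, 1, H, W) channel mean
    boxes, scores = [], []
    for k in window_sizes:
        # Activation mean of every k x k window (average pooling per window).
        act = F.avg_pool2d(fm, kernel_size=k, stride=1)[0, 0]  # (H-k+1, W-k+1)
        for y in range(act.shape[0]):
            for x in range(act.shape[1]):
                boxes.append([x, y, x + k, y + k])
                scores.append(act[y, x])
    boxes = torch.tensor(boxes, dtype=torch.float32)
    scores = torch.stack(scores)
    # Sort by activation mean and keep spatially diverse windows via NMS.
    keep = nms(boxes, scores, iou_threshold=iou_thr)[:num_windows]
    return boxes[keep]
```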
Searching the discriminative features of each frame alone does not yet reflect the association between frames. During panda motion, a part that is occluded in one frame may appear in other frames. This embodiment therefore establishes temporal dependencies between different frames, specifically using a hypergraph neural network, to obtain a more comprehensive feature representation. Compared with a standard graph neural network, each edge of a hypergraph can connect multiple nodes, which helps to model complex data relations and yields more powerful feature representations.
Step S3 of this embodiment includes the following steps: Step S31, constructing a hypergraph G = (V, E), where G denotes the hypergraph, V denotes the node set and E denotes the hyperedge set; the nodes of the hypergraph are the local discriminant features obtained in step S2, each hyperedge connects a plurality of nodes, and the nodes connected by each hyperedge are selected according to a set fixed time threshold T_h. Step S32, updating the node features with the hypergraph neural network to obtain the hypergraph features. The node set can be written as V = {v_{t,n} | t = 1, ..., T; n = 1, ..., N}, where V denotes the node set, v_{t,n} denotes the node formed by the n-th selected local discriminant feature of frame t, T is the number of frames in the video-level clip, and N is the number of selected windows per frame.
In step S31 of this embodiment, the nodes connected by each hyperedge are selected according to the set fixed time threshold T_h as follows: for any node, the nodes whose time interval to it lies within the fixed time threshold T_h are first found; the K nearest neighbors of the node among them are then calculated with the K-nearest-neighbor algorithm, and one hyperedge is used to connect these K+1 feature nodes. This operation is performed for all nodes to establish hyperedges, obtaining a plurality of hyperedges. Each node is not limited to a single hyperedge, since it may also be included among the K nearest neighbors of other nodes.
The hyperedge set is expressed as E = { {v} ∪ N_K(v) | v ∈ V }, where E denotes the hyperedge set, ∪ denotes the union of sets, and N_K(v) denotes the set of K neighboring nodes of node v. Thus, the present embodiment completes the construction of the hypergraph.
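One possible way to build the hyperedges of step S31 is sketched below, restricting candidate neighbors to nodes within the time threshold T_h and connecting each node to its K nearest neighbors; the Euclidean distance metric, the default values of t_h and k, and the function name build_hyperedges are assumptions introduced for illustration.

```python
import torch

def build_hyperedges(node_feats, node_frames, t_h=2, k=3):
    """node_feats:  (M, D) features of all local discriminant nodes.
    node_frames: (M,) frame index of each node.
    Returns a list of hyperedges, each a list of node indices (a node plus its K neighbors)."""
    hyperedges = []
    for i in range(node_feats.shape[0]):
        # Candidate nodes whose time interval to node i is within the threshold T_h.
        mask = (node_frames - node_frames[i]).abs() <= t_h
        mask[i] = False
        cand = torch.nonzero(mask, as_tuple=False).flatten()
        if cand.numel() == 0:
            continue
        # K nearest neighbors among the candidates (Euclidean distance in feature space).
        dists = torch.cdist(node_feats[i : i + 1], node_feats[cand]).flatten()
        nearest = cand[dists.argsort()[: min(k, cand.numel())]]
        hyperedges.append([i] + nearest.tolist())   # one hyperedge connects K+1 nodes
    return hyperedges
```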
Step S32 of this embodiment includes constructing hyperedge features from the node features and completing the node feature update by aggregating the hyperedge features. Constructing the hyperedge features from the node features comprises: for each node, obtaining all hyperedges containing the node and, for each such hyperedge, averaging the features of the other nodes in it, thereby constructing the feature of that hyperedge. That is, for a node v contained in a hyperedge e, the hyperedge feature is the mean of the features of the nodes in e other than v.
Completing the node feature update by aggregating the hyperedge features in this embodiment comprises: aggregating the hyperedge features to obtain all the hyperedge information related to the node, and then concatenating the node's own information with the hyperedge information, so that the feature of each node incorporates other temporally adjacent features, enhancing or supplementing the node feature and facilitating the learning of a more complete animal feature representation. The node features are then updated with a fully connected layer; the same operation is performed on the nodes of all graphs, and average pooling is performed over the features of all nodes to obtain the final local features.
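The aggregation of step S32 could look roughly like the following sketch: for each node, the features of the other nodes in every hyperedge containing it are averaged to form hyperedge features, those hyperedge features are aggregated, concatenated with the node's own feature, and passed through a fully connected layer; the feature dimension, the mean-based aggregation of hyperedge information and the class name PatchHypergraphLayer are assumptions, not details fixed by the patent text.

```python
import torch
import torch.nn as nn

class PatchHypergraphLayer(nn.Module):
    """Sketch of one hypergraph node update (step S32)."""
    def __init__(self, dim=2048):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)  # fuses node feature with aggregated hyperedge info

    def forward(self, node_feats, hyperedges):
        # node_feats: (M, D); hyperedges: list of lists of node indices (e.g. from build_hyperedges).
        new_feats = node_feats.clone()
        for i in range(node_feats.shape[0]):
            edge_feats = []
            for e in hyperedges:
                if i in e:
                    others = [j for j in e if j != i]
                    if others:
                        # Hyperedge feature: mean of the other nodes' features in this hyperedge.
                        edge_feats.append(node_feats[others].mean(dim=0))
            if not edge_feats:
                continue
            # Aggregate all hyperedge information related to node i, concat with its own feature.
            agg = torch.stack(edge_feats).mean(dim=0)
            new_feats[i] = self.update(torch.cat([node_feats[i], agg]))
        # Final local feature: average pooling over all updated node features.
        f_local = new_feats.mean(dim=0)
        return new_feats, f_local
```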
In this embodiment, the local features and the global features are concatenated to obtain a hybrid feature of the animal for individual identification, yielding a more representative and more discriminative animal feature.
When this embodiment is applied, the cross-entropy loss L_ce and the triplet loss L_tri jointly supervise training. The triplet loss L_tri increases the separation between different classes, while the cross-entropy loss L_ce helps the model learn inter-class differences from a global perspective; combined, they help learn a more discriminative animal feature representation.
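A minimal sketch of this joint supervision is shown below, combining cross-entropy on the identity logits with a triplet loss on the hybrid features; the margin value, the use of torch.nn.TripletMarginLoss and the unweighted sum are assumptions, since the text does not specify the triplet mining strategy or loss weights.

```python
import torch
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()                 # L_ce: identity classification loss
tri_loss = nn.TripletMarginLoss(margin=0.3)     # L_tri: metric learning on hybrid features

def total_loss(logits, labels, anchor, positive, negative):
    """logits: (B, num_ids); labels: (B,);
    anchor/positive/negative: (B, D) hybrid features of sampled triplets."""
    return ce_loss(logits, labels) + tri_loss(anchor, positive, negative)
```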
This embodiment provides a video-based automatic animal individual identification network framework combining local and global features. A finer-grained animal feature representation is obtained through the local discriminant feature extraction module to improve the discriminative capability of the model, and the hypergraph neural network is further used to model the temporal dependencies between local features, so as to obtain a more comprehensive video-level feature representation and alleviate the performance degradation caused by occlusion.
To evaluate the effectiveness of the method of this embodiment, the inventors constructed an animal video database, with video data from the Chengdu Research Base of Giant Panda Breeding, containing 171 videos of 60 red pandas. The performance on this dataset shows that, when applied, this embodiment can adaptively learn the local discriminant features of the pandas in each frame, even though the pandas exhibit large deformation or posture changes in different frames. The embodiment is evaluated separately according to whether the video clips in the query set and the gallery set come from the same source video, and the evaluation shows that the method achieves high accuracy in individual animal identification, demonstrating its effectiveness. The method is also compared with a picture-based baseline when applied to panda individual identification: the picture baseline has a high probability of incorrect matching, while this method matches correctly, further demonstrating its effectiveness for individual animal identification.
The foregoing description of the embodiments illustrates the objects, technical solutions and advantages of the invention in detail. It should be understood that the foregoing is merely illustrative of the general principles of the invention and is not intended to limit the scope of protection of the invention; any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the invention shall be included within the scope of protection of the invention.

Claims (6)

1. A video-based method for identifying an individual animal comprising the steps of:
Step S1, extracting features of each frame in the animal tracking video based on a convolutional neural network;
Step S2, acquiring local discriminant features in each frame based on a local discriminant feature extraction module;
Step S3, aggregating the local discriminant features based on a hypergraph neural network to obtain final local features;
Step S4, aggregating the cross-frame features among the per-frame features extracted in step S1 by temporal average pooling to obtain global features;
Step S5, concatenating the local features and the global features to obtain a hybrid feature of the animal, so as to realize individual identification of the animal.
2. The method according to claim 1, wherein the local discriminant features obtained in step S2 are feature maps on each frame, the feature maps being obtained by searching the information of each frame with a sliding window.
3. The video-based animal individual identification method according to claim 2, wherein step S2 further comprises sliding windows of different sizes over the obtained feature map to output feature maps of different sizes, calculating the activation mean of each window by global average pooling, sorting the activation means, selecting a set number of windows according to the sorting result, cropping the local regions selected by the windows from the input original image, and feeding them into the convolutional neural network again for learning, so as to obtain new feature maps as the local discriminant features.
4. The video-based animal individual identification method according to claim 1, wherein the step S3 comprises the steps of:
Step S31, constructing a hypergraph G = (V, E), where G denotes the hypergraph, V denotes the node set and E denotes the hyperedge set; the nodes of the hypergraph are the local discriminant features obtained in step S2, each hyperedge connects a plurality of nodes, and the nodes connected by each hyperedge are selected according to a set fixed time threshold T_h;
Step S32, updating the node features with a hypergraph neural network to obtain hypergraph features.
5. The video-based animal individual identification method according to claim 4, wherein in step S31 the nodes connected by each hyperedge are selected according to the set fixed time threshold T_h as follows: for any node, the nodes whose time interval to it lies within the fixed time threshold T_h are first found; the K nearest neighbors of the node among them are then calculated with the K-nearest-neighbor algorithm, and one hyperedge is used to connect these K+1 feature nodes; this operation is performed for all nodes to establish hyperedges, obtaining a plurality of hyperedges.
6. The video-based animal individual identification method according to claim 4, wherein step S32 includes constructing hyperedge features from the node features and completing the node feature update by aggregating the hyperedge features;
constructing the hyperedge features from the node features comprises: for each node, obtaining all hyperedges containing the node, and for each such hyperedge, averaging the features of the nodes other than this node, thereby constructing the feature of the hyperedge;
completing the node feature update by aggregating the hyperedge features comprises: aggregating the hyperedge features to obtain all the hyperedge information related to the node, concatenating the node's own information with the hyperedge information, and updating the node features with a fully connected layer; the same operation is performed on the nodes of all graphs, and average pooling is performed over the features of all nodes to obtain the final local features.
CN202310840442.6A 2023-07-11 2023-07-11 Video-based animal individual identification method Pending CN116563895A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310840442.6A CN116563895A (en) 2023-07-11 2023-07-11 Video-based animal individual identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310840442.6A CN116563895A (en) 2023-07-11 2023-07-11 Video-based animal individual identification method

Publications (1)

Publication Number Publication Date
CN116563895A true CN116563895A (en) 2023-08-08

Family

ID=87503897

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310840442.6A Pending CN116563895A (en) 2023-07-11 2023-07-11 Video-based animal individual identification method

Country Status (1)

Country Link
CN (1) CN116563895A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112418127A (en) * 2020-11-30 2021-02-26 浙江大学 Video sequence coding and decoding method for video pedestrian re-identification
CN115410222A (en) * 2022-07-26 2022-11-29 安徽朗巴智能科技有限公司 Video pedestrian re-recognition network with posture sensing function
CN116229511A (en) * 2023-02-23 2023-06-06 西北大学 Identification re-recognition method based on golden monkey trunk feature extraction

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112418127A (en) * 2020-11-30 2021-02-26 浙江大学 Video sequence coding and decoding method for video pedestrian re-identification
CN115410222A (en) * 2022-07-26 2022-11-29 安徽朗巴智能科技有限公司 Video pedestrian re-recognition network with posture sensing function
CN116229511A (en) * 2023-02-23 2023-06-06 西北大学 Identification re-recognition method based on golden monkey trunk feature extraction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LEI LI et al.: "Video-based Red Panda Individual Identification by Adaptively Aggregating Discriminative Features", IJCNN, pages 1-7 *
YICHAO YAN et al.: "Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification", CVPR, pages 2899-2908 *

Similar Documents

Publication Publication Date Title
WO2019169816A1 (en) Deep neural network for fine recognition of vehicle attributes, and training method thereof
CN110263712B (en) Coarse and fine pedestrian detection method based on region candidates
CN111639564B (en) Video pedestrian re-identification method based on multi-attention heterogeneous network
CN103093198B (en) A kind of crowd density monitoring method and device
CN110929593A (en) Real-time significance pedestrian detection method based on detail distinguishing and distinguishing
CN101470809A (en) Moving object detection method based on expansion mixed gauss model
CN113255616B (en) Video behavior identification method based on deep learning
CN109165658B (en) Strong negative sample underwater target detection method based on fast-RCNN
CN110717863B (en) Single image snow removing method based on generation countermeasure network
CN107194948B (en) Video significance detection method based on integrated prediction and time-space domain propagation
CN103810473A (en) Hidden Markov model based human body object target identification method
CN111652035B (en) Pedestrian re-identification method and system based on ST-SSCA-Net
CN111723773A (en) Remnant detection method, device, electronic equipment and readable storage medium
CN114677323A (en) Semantic vision SLAM positioning method based on target detection in indoor dynamic scene
JP6448212B2 (en) Recognition device and recognition method
CN111461076A (en) Smoke detection method and smoke detection system combining frame difference method and neural network
CN114782997A (en) Pedestrian re-identification method and system based on multi-loss attention adaptive network
CN116704490B (en) License plate recognition method, license plate recognition device and computer equipment
CN111582057B (en) Face verification method based on local receptive field
CN113191352A (en) Water meter pointer reading identification method based on target detection and binary image detection
CN116563895A (en) Video-based animal individual identification method
CN110929632A (en) Complex scene-oriented vehicle target detection method and device
CN108256444B (en) Target detection method for vehicle-mounted vision system
CN117197687A (en) Unmanned aerial vehicle aerial photography-oriented detection method for dense small targets
CN113763418B (en) Multi-target tracking method based on head and shoulder detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20230808