WO2024103417A1 - Behavior recognition method, storage medium and electronic device - Google Patents


Info

Publication number
WO2024103417A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
behavior
initial
similarity
behavior recognition
Prior art date
Application number
PCT/CN2022/133025
Other languages
French (fr)
Chinese (zh)
Inventor
Zhu Fuqiang (朱富强)
Zhan Yang (詹阳)
Zhang Huikang (张慧康)
Zhang Xihao (张曦昊)
Wang Zhen (王振)
Original Assignee
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
Priority to PCT/CN2022/133025
Publication of WO2024103417A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Definitions

  • the present application relates to the field of artificial intelligence technology, and in particular to a behavior recognition method, a storage medium, and an electronic device.
  • Animal behavior reflects, from a macroscopic perspective, information such as an animal's higher central nervous system function, learning and memory ability, psychological state, and motor coordination. Studying animal behavior can assess an animal's adaptation to its environment or its response to pharmacological interventions, and has wide application in toxicology, pharmacology, sports injury, and rehabilitation research.
  • the embodiments of the present application provide a behavior recognition method, a storage medium, and an electronic device, which can improve the accuracy of identifying animal behavior.
  • an embodiment of the present application provides a behavior recognition method, the method comprising:
  • the similarity between the video to be identified and the reference video is obtained through a video discrimination model based on a twin neural network;
  • an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored.
  • When the computer program runs on a computer, the computer executes a behavior recognition method as provided in any embodiment of the present application.
  • an embodiment of the present application further provides an electronic device, including a processor and a memory, wherein the memory has a computer program, and the processor executes a behavior recognition method provided in any embodiment of the present application by calling the computer program.
  • the video discrimination model based on the twin neural network is used to perform similarity recognition between the reference video and the video to be identified to determine whether the video to be identified includes the specified behavior content. In this way, the behavior of the video to be identified can be accurately identified through the reference video, and the video to be identified can be quickly classified.
  • FIG1 is a schematic diagram of an application scenario of a behavior recognition method provided in an embodiment of the present application.
  • FIG2 is a flow chart of a behavior recognition method provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of using a sliding window to capture video clips in the behavior recognition method provided in an embodiment of the present application.
  • FIG4 is a schematic diagram of the structure of a video discrimination model in the behavior recognition method provided in an embodiment of the present application.
  • FIG. 5 is a detailed flowchart of a video identification method provided in an embodiment of the present application.
  • FIG6 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • Artificial Intelligence is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology in computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines so that machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies.
  • the basic technologies of artificial intelligence generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and other technologies.
  • Artificial intelligence software technology mainly includes machine learning (ML), of which deep learning (DL) is a newer research direction, introduced to bring machine learning closer to its original goal: artificial intelligence.
  • Deep learning is mainly applied in computer vision, natural language processing, and related fields.
  • Deep learning is a type of machine learning, and machine learning is a principal route to achieving artificial intelligence.
  • the concept of deep learning originated from the study of artificial neural networks.
  • Multilayer perceptrons with multiple hidden layers are a type of deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features to discover distributed feature representations of data.
  • the motivation for studying deep learning is to establish a neural network that simulates the human brain for analysis and learning. It imitates the mechanism of the human brain to interpret data, such as images, sounds, and text.
  • Animal behavior can be divided into foraging behavior, food storage behavior, attack behavior, defense behavior, reproductive behavior, rhythmic behavior, communication behavior, etc. according to its different manifestations.
  • the neural network model can be trained by machine learning methods, and then the animal behavior can be classified by the trained neural network model.
  • the method of training the neural network model includes supervised machine learning and unsupervised machine learning. Supervised machine learning trains the neural network model through sample data pre-labeled with behavioral labels, so that the neural network model learns the mapping relationship between sample data and its corresponding behavioral labels. Unsupervised machine learning clusters similar sample data through clustering algorithms, thereby realizing the classification of different sample data.
  • related technologies also include key point-based animal recognition methods and video-based classification methods.
  • the key point-based animal recognition method tracks the key points of the animal's body and then classifies the animal's behavior based on the position information of the key points (such as limb joints, etc.).
  • However, this method relies on key-point tracking: when key points are occluded or tracking is lost, the classification of the animal's behavior becomes inaccurate.
  • Key-point tracking also discards background information, so behavior cues related to the background are missed, which likewise makes the behavior classification inaccurate.
  • Although the video-based classification method does not require key-point tracking, it must classify the actions in the video frame by frame based on pixel values and then identify the animal behavior from these per-frame classification results. This entails a large amount of computation, making it difficult to identify animal behavior quickly.
  • the embodiments of the present application provide a behavior recognition method, a storage medium, and an electronic device to quickly and accurately recognize animal behavior. It can be understood that the behavior recognition method provided by the present application can recognize the behavior of various animals and various people. In the following embodiments, the method provided by the embodiments of the present application is described in detail by taking animal behavior as an example.
  • Figure 1 is a schematic diagram of the application scenario of the behavior recognition method provided by the embodiment of the present application.
  • the execution subject of the behavior recognition method is an electronic device.
  • The user selects a video with the specified content as the reference video; then, for each video to be recognized, the two are input into the video discrimination model for similarity discrimination to determine the behavior content included in the video to be recognized. In this way, fast and accurate behavior recognition is achieved.
  • Figure 2 is a schematic diagram of the flow of the behavior recognition method provided in the embodiment of the present application.
  • the specific flow of the behavior recognition method provided in the embodiment of the present application can be as follows:
  • the specified behavior content includes but is not limited to various animal behaviors or various human behaviors. Taking animals as an example, animals can be cats, dogs, monkeys, mice, etc. Common behaviors of animals include but are not limited to: standing up, moving the head, drinking water, hanging, combing hair, walking, resting, eating, licking limbs, etc.
  • obtaining a reference video including but not limited to: obtaining a video including specified behavior content by shooting, obtaining a video including specified behavior content by video editing, etc.
  • a user records a video of a dog eating
  • a user records a video of a cat combing its hair
  • a user records a video of a monkey hanging upside down, etc. as a reference video.
  • a dog eating video is selected from videos of a dog eating, a dog walking, a dog licking its limbs, etc. as a reference video.
  • images related to dog eating are selected from multiple groups of dog images and synthesized to obtain a video of a dog eating as a reference video.
  • a video of a dog eating part is captured from a long recorded video as a reference video.
  • the method of obtaining the video to be identified may include: obtaining the video to be identified by shooting, obtaining the video to be identified by video editing, etc.
  • The video to be identified, that is, the video requiring behavior recognition, has not yet been labeled.
  • In some embodiments, the video to be identified and the reference video concern the same animal or person, which improves the accuracy of behavior recognition for the video to be identified.
  • To this end, the organism contained in the video to be identified can also be identified first.
  • If that organism matches the organism of the reference video, the video is determined as a video to be identified that requires behavior recognition.
  • the similarity between the video to be identified and the reference video is obtained through a video discrimination model based on a twin neural network.
  • a video discrimination model based on a twin neural network includes a dual-branch network architecture with the same parameters.
  • The video to be identified and the reference video are respectively used as inputs to the dual-branch network architecture, which outputs the probability values that the two are similar and dissimilar. The similarity between the video to be identified and the reference video is then evaluated from these probability values.
  • When the probability value of similarity between the video to be identified and the reference video is large, the similarity between the two is high; when the probability value of dissimilarity is large, the similarity between the two is low.
  • Identify whether the video to be identified includes specified behavior content according to the similarity, and obtain an identification result.
  • the similarity between the video to be identified and the reference video is high, it can be determined that the video to be identified and the reference video are the same or similar videos, and both can indicate the specified behavior content. If the similarity between the video to be identified and the reference video is low, it can be determined that the video to be identified and the reference video are not the same or similar videos, and the behavior content indicated by the two is different.
  • the present application is not limited by the execution order of the various steps described. If no conflict occurs, some steps can be performed in other orders or simultaneously.
  • The behavior recognition method in the embodiment of the present application first selects a reference video required by the user, the reference video including the specified behavior content, and then uses the video discrimination model to determine the similarity between the reference video and the video to be identified, so as to perform behavior recognition on the video to be identified.
  • On the one hand, this flexibly identifies the behavior in the video with high accuracy; on the other hand, it avoids tracking key points or classifying the pixel values of individual video frames, reducing the amount of data processed and improving the efficiency of behavior recognition.
  • Obtaining the video to be identified that requires behavior recognition includes:
  • At least one captured video segment is determined as a video to be identified.
  • the length of the initial video is not limited in this application, and an initial video of any length can be selected according to actual needs.
  • the organism to which the initial video belongs is the same as the organism to which the reference video belongs.
  • The length of each captured video segment is the same as the length of the reference video.
  • the lengths of the reference video and the initial video can be determined according to the playback duration or the number of video frames.
  • The number of captured video segments can be one, two, or more, depending on the length of the initial video and the capture method.
  • the length of the initial video is less than the length of the reference video
  • one video segment is captured, and the initial video is interpolated to obtain a video segment having the same length as the reference video as the video to be identified.
  • When the length of the initial video is equal to the length of the reference video, the initial video may be directly used as a video segment.
  • the length of the initial video is greater than the length of the reference video
  • At least two video segments can be captured, where the number of segments is determined by the ratio of the length of the initial video to the length of the reference video: when the ratio is greater than 1 and less than 2, two video segments can be obtained with the aid of frame supplementation; when the ratio is greater than 2, more video segments can be obtained.
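  • The patent does not specify how the interpolation or frame supplementation is performed. As an illustrative sketch only (the function `match_length` and the nearest-neighbour index scheme below are assumptions, not the patent's method), resampling an initial video to the reference length could look like:

```python
import numpy as np

def match_length(frames: np.ndarray, target_len: int) -> np.ndarray:
    """Resample a (T, H, W, C) frame array to target_len frames by
    nearest-neighbour index interpolation: frames are duplicated when
    the clip is shorter than the reference and dropped when longer."""
    idx = np.round(np.linspace(0, len(frames) - 1, target_len)).astype(int)
    return frames[idx]

# Toy example: a 6-frame "video" of 2x2 single-channel frames,
# stretched to a 10-frame reference length.
video = np.arange(6 * 2 * 2).reshape(6, 2, 2, 1)
padded = match_length(video, 10)
```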
  • adjacent video segments in at least two video segments may partially overlap or may not overlap.
  • When adjacent video segments do not overlap, they are connected end to end.
  • When adjacent video segments overlap, they share overlapping video frames, and the number of overlapping frames may be determined according to actual needs.
  • a sliding window can be used to capture at least two video clips as videos to be identified.
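  • The sliding-window capture described above can be sketched as follows (a minimal illustration; the function name and window sizes are assumptions). A stride smaller than the window length yields overlapping clips, while a stride equal to the window length yields clips connected end to end:

```python
import numpy as np

def sliding_clips(video: np.ndarray, win: int, stride: int) -> list:
    """Cut a (T, ...) frame array into clips of length `win`, advancing
    `stride` frames at a time. stride < win -> overlapping clips;
    stride == win -> non-overlapping clips connected end to end."""
    return [video[s:s + win] for s in range(0, len(video) - win + 1, stride)]

video = np.arange(10)                                # stand-in for 10 frames
overlapping = sliding_clips(video, win=4, stride=2)  # adjacent clips share 2 frames
disjoint = sliding_clips(video, win=5, stride=5)     # adjacent clips meet end to end
```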
  • identifying whether the video to be identified includes the specified behavior content according to the similarity, and obtaining the identification result, further comprising:
  • each video frame in the video to be recognized is scored based on the recognition result.
  • the similarity is divided into two cases: the reference video is similar to the video to be recognized, or the reference video is not similar to the video to be recognized.
  • When the two are similar, the recognition result can be set to 1; when the two are not similar, the recognition result can be set to 0.
  • the recognition result is 1 or 0.
  • If a video frame appears in more than one video to be recognized, the average of its scores is taken as its final score. For example, if the score of the second video frame in the first video to be recognized is 0 and its score in the second video to be recognized is 1, then 0.5 is taken as the final score of the second video frame.
  • The total score of the initial video is also determined from the scores of the video frames in the initial video. Specifically, the total score can be determined by taking the mean, median, or mode of the frame scores. Taking the mean as an example: if the initial video has 10 frames with scores 0, 0.5, 1, 1, 0.5, 0, 1, 1, 1, 1, the total score of the initial video is 0.7.
  • a preset threshold is also set to compare with the total score, so that when the total score is greater than the preset threshold, it is determined that the initial video includes the specified behavioral content, that is, the behavior indicated by the initial video is the same as that indicated by the reference video.
  • the preset threshold is set to 0.5, and if the total score of the initial video is 0.7, it is determined that the initial video includes the specified behavioral content.
  • This embodiment obtains the videos to be identified by sliding-window capture from the initial video, determines the total score of the initial video based on the scores of the video frames across the different videos to be identified, and compares the total score with a preset threshold to evaluate the recognition result of the initial video, thereby making the recognition result more accurate.
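  • The scoring scheme above can be sketched as follows (the function name and the span/result representation are assumptions): each clip's 0/1 recognition result is spread over the frames it covers, frames covered by several clips take the average, and the total score is the mean frame score compared against the preset threshold:

```python
import numpy as np

def score_video(num_frames, clip_spans, clip_results, threshold=0.5):
    """clip_spans: (start, end) frame ranges of each captured clip;
    clip_results: the model's 0/1 result per clip. A frame's score is
    the average result of every clip covering it; the video's total
    score is the mean frame score, compared against the threshold."""
    sums = np.zeros(num_frames)
    counts = np.zeros(num_frames)
    for (s, e), r in zip(clip_spans, clip_results):
        sums[s:e] += r
        counts[s:e] += 1
    frame_scores = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
    total = frame_scores.mean()
    return total, total > threshold

# Two overlapping clips over a 10-frame video: frames 4-5 sit in both
# clips, so with clip results 1 and 0 those frames score 0.5.
total, included = score_video(10, [(0, 6), (4, 10)], [1, 0])
```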
  • When adjacent video clips do not overlap, after identifying whether the video to be identified includes the specified behavior content based on the similarity and obtaining the identification result, the method further includes the following.
  • the similarity is divided into two cases, namely similarity and dissimilarity.
  • When the two are not similar, the recognition result of the video to be recognized is set to 0.
  • When the two are similar, the recognition result of the video to be recognized is set to 1.
  • The total score of the initial video is also determined from these scores, where the method of determining the total score includes but is not limited to taking the mean, median, or mode of the scores of the videos to be identified. For example, the total score of the initial video can be obtained by averaging the scores of the videos to be identified.
  • a preset threshold is set to compare with the total score to determine the recognition result of the initial video. Please refer to the above content for details, which will not be repeated here.
  • the method further includes:
  • the initial video is labeled according to the specified behavioral content. After the recognition result of the initial video is determined, the initial video is also labeled. Specifically, if the initial video includes the specified behavioral content, the initial video is assigned a corresponding behavioral label. If the initial video does not include the specified behavioral content, the initial video is not assigned a corresponding behavioral label. For example, the reference video includes eating behavioral content. If the initial video includes the specified behavioral content, the initial video is assigned an eating behavioral label. Otherwise, the initial video is not labeled, or is labeled with a non-eating behavioral label.
  • the video discrimination model includes a first feature extraction branch, a second feature extraction branch, a feature fusion module, and a similarity discrimination module.
  • the first feature extraction branch and the second feature extraction branch have the same structure and share network parameters.
  • the similarity between the video to be identified and the reference video is obtained through the video discrimination model based on the twin neural network, including:
  • the fused features are input into the similarity judgment module for similarity judgment to obtain the similarity between the video to be identified and the reference video.
  • FIG 4 is a schematic diagram of the structure of the video discrimination model in the behavior recognition method provided in the embodiment of the present application.
  • the video discrimination model includes a first feature extraction branch and a second feature extraction branch with the same network parameters and structure, and the first feature extraction branch and the second feature extraction branch are also connected to the feature fusion module, and the feature fusion module is connected to the similarity discrimination module.
  • the reference video is input into the first feature extraction branch for feature extraction to obtain the first frame sequence feature
  • the video to be identified is input into the second feature extraction branch for feature extraction to obtain the second frame sequence feature.
  • the reference video can also be input into the second feature extraction branch for feature extraction, and the video to be identified can be input into the first feature extraction branch for feature extraction. It is not limited here which feature extraction branch the two are input into.
  • the reference video and the video to be identified are also converted into a frame sequence as input.
  • the feature fusion module obtains the fused feature by performing vector subtraction on the first frame sequence feature and the second frame sequence feature.
  • the similarity determination module is used to determine the probability value of similarity or dissimilarity of fusion features.
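  • The twin structure described above (shared-weight branches, vector-subtraction fusion, probability output) can be illustrated with a deliberately tiny numerical sketch. The layer sizes, the single projection standing in for each feature extraction branch, and the softmax head are all assumptions for illustration, not the patent's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)
W_feat = rng.normal(size=(32, 8))   # shared branch weights: flattened 4x8 "video" -> 8-d feature
W_head = rng.normal(size=(8, 2))    # head: fused feature -> [dissimilar, similar] logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def extract(video):
    # Both branches call this one function, i.e. they share parameters
    # (the "twin" property of the network).
    return np.tanh(video.reshape(-1) @ W_feat)

def discriminate(reference, candidate):
    fused = extract(reference) - extract(candidate)  # vector-subtraction fusion
    return softmax(fused @ W_head)                   # probabilities of (dissimilar, similar)

ref = rng.normal(size=(4, 8))       # 4 frames of 8 "pixels" each
p_same = discriminate(ref, ref)     # identical inputs -> zero fused vector
```

One sanity check the subtraction fusion admits: with identical inputs the fused vector is all zeros, so the head necessarily outputs equal probabilities.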
  • This embodiment extracts the temporal features of the entire segment of the reference video and the video to be identified through the first feature extraction branch and the second feature extraction branch, and then performs similarity determination based on the temporal features, taking into account the temporal and spatial dependencies between behavioral features, so that the similarity determination result is more accurate.
  • The method provided by this embodiment can quickly determine the similarity between the video to be identified and the reference video, reducing the amount of model computation and improving the model's discrimination speed.
  • the first feature extraction branch includes a first feature extraction layer based on a self-attention mechanism
  • the second feature extraction branch includes a second feature extraction layer based on a self-attention mechanism
  • the first frame sequence features of the reference video are extracted through a first feature extraction layer based on the self-attention mechanism
  • the second frame sequence features of the video to be identified are extracted through a second feature extraction layer based on the self-attention mechanism.
  • the mutual influence between the video frames can be taken into account, and the accuracy of the judgment can be improved when the similarity is judged based on the first frame sequence features and the second frame sequence features.
  • Before obtaining the reference video, the method further includes:
  • the parameters of the initial model are adjusted according to the second video sample to obtain a video discrimination model.
  • the first video samples include, but are not limited to, ImageNet dataset (a large-scale image recognition database in computer vision research), Kinetics-700 dataset (a human behavior dataset), etc.
  • the model parameters of the initial model are obtained by pre-training the initial model based on the twin neural network through the first video samples.
  • the second video samples include video samples of designated behavior content.
  • the second video samples may include video samples of dogs eating, video samples of cats eating, video samples of chickens eating, etc.
  • the second video samples also include video samples that do not include designated behavior content.
  • the second video samples may also include video samples of pandas hanging, video samples of dogs rolling, video samples of cats licking limbs, etc.
  • Video samples with the same or different behavioral content in the second video samples are also combined in pairs and respectively input into the first and second feature extraction branches of the initial model for training, until the model converges or a specified number of training iterations is reached, yielding the video discrimination model.
  • In this way, the scale of training samples can be expanded, and the initial model can be trained with a small amount of labeled second video samples, thereby improving the speed of model training.
  • adjusting parameters of the initial model according to the second video sample to obtain a video discrimination model includes:
  • the parameters of the feature fusion module and the similarity discrimination module are adjusted according to the second video sample to obtain a video discrimination model.
  • the video discrimination model is obtained by adjusting the parameters of the feature fusion module and the similarity discrimination module.
  • Specifically, the network parameters of the first and second feature extraction branches can be frozen, and the second video samples input into the initial model for training, so that only the network parameters of the feature fusion module and the similarity discrimination module are optimized during training. This minimizes the loss of those two modules and achieves fine-tuning of their network parameters.
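  • Freezing the branches while updating only the head can be shown with a toy logistic head. Everything here (the sizes, the logistic loss, the learning rate) is an illustrative assumption rather than the patent's training setup; the point is only that the branch weights stay untouched while the head fits the labelled pair:

```python
import numpy as np

rng = np.random.default_rng(1)
W_branch = rng.normal(size=(6, 3))   # pre-trained branch weights: frozen
w_head = np.zeros(3)                 # similarity head: the only part updated
W_before = W_branch.copy()           # snapshot to confirm freezing

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(a, b):
    fused = (a @ W_branch) - (b @ W_branch)   # frozen branches + subtraction fusion
    return sigmoid(fused @ w_head), fused

# One labelled pair; label 1.0 means "same behavior content".
a, b, y = rng.normal(size=6), rng.normal(size=6), 1.0
for _ in range(200):
    p, fused = forward(a, b)
    w_head -= 0.5 * (p - y) * fused  # logistic-loss gradient w.r.t. the head only

p_final, _ = forward(a, b)           # should now lean toward "similar"
```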
  • model training is implemented based on a small amount of labeled second video samples through pre-training and model fine-tuning, which reduces the scale requirement for training data and improves the efficiency of model training.
  • the specified behavior includes a specified behavior of a specified animal
  • obtaining a second video sample corresponding to the specified behavior content includes:
  • a video sample of a specified animal with a specified behavior can be obtained as the second video sample, for example, various videos of various dogs eating can be obtained as the second video sample.
  • various dogs can be divided according to their breeds, and different dogs have different eating actions.
  • the trained video discrimination model can be applied to dog video discrimination with higher accuracy.
  • A video of that animal's behavior can also be selected as the reference video so that the similarity between the reference video and the video to be identified is determined accurately; when the animal in the video to be identified is not the animal to which the reference video belongs, it can be directly determined that the video to be identified is not similar to the reference video.
  • data augmentation is also performed on the designated animal video samples marked with designated behaviors to expand the sample size and improve the generalization ability of the model.
  • the data augmentation methods include but are not limited to: video cropping, video frame extraction, image translation, image rotation, image scaling, etc.
  • data augmentation processing is performed on the designated animal video sample to obtain a second video sample with a larger data volume to fine-tune the initial model, so that the generalization ability of the model is better.
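  • Three of the listed augmentations can be sketched directly on a frame array (the function and the specific crop/shift amounts are illustrative assumptions):

```python
import numpy as np

def augment(frames: np.ndarray) -> list:
    """Apply video cropping, video frame extraction, and image
    translation to a (T, H, W) grayscale frame stack."""
    t, h, w = frames.shape
    cropped = frames[:, h // 4: 3 * h // 4, w // 4: 3 * w // 4]  # centre crop
    subsampled = frames[::2]                      # keep every other frame
    shifted = np.roll(frames, shift=1, axis=2)    # translate 1 px right (wraps)
    return [cropped, subsampled, shifted]

video = np.arange(8 * 4 * 4).reshape(8, 4, 4)     # toy 8-frame 4x4 video
cropped, subsampled, shifted = augment(video)
```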
  • FIG5 is a detailed flow chart of the video identification method provided in the embodiment of the present application.
  • the contents indicated in the diagram are as follows:
  • Stage 1: Build a video discrimination model based on a twin neural network.
  • The video discrimination model includes a first feature extraction branch, a second feature extraction branch, a feature fusion module and a similarity discrimination module.
  • Stage 2: Train the video discrimination model.
  • Stage 3: Apply the video discrimination model to predict similarity.
  • a video discrimination model based on a twin neural network is constructed to perform behavior recognition, which is universal for various behaviors and has high recognition accuracy.
  • the following table provides the recognition results of eight behaviors on an animal behavior dataset:
  • The behavior recognition method proposed in the embodiments of the present application constructs a video discrimination model based on a twin neural network. Pre-training the video discrimination model reduces the sample data required for training, and combining the sample data in pairs as input to the dual-branch network greatly expands the effective scale of the sample data, improving the generalization ability of the model. Determining the similarity between the reference video and the video to be identified through the video discrimination model performs the similarity judgment in the time dimension, reducing the computation consumed and quickly yielding the judgment result for the video to be identified. In addition, behavior recognition of initial videos of varying lengths is supported, improving the flexibility of behavior recognition.
  • the present application also provides a behavior recognition device, which is applied to an electronic device and includes:
  • a first video acquisition module is used to acquire a reference video, where the reference video includes specified behavior content
  • the second video acquisition module is used to acquire a video to be identified that requires behavior recognition;
  • a similarity discrimination module is used to obtain the similarity between the video to be identified and the reference video through a video discrimination model based on a twin neural network;
  • the behavior recognition module is used to identify whether the video to be recognized includes the specified behavior content according to the similarity and obtain the recognition result.
  • the second video acquisition module is further used to:
  • At least one captured video segment is determined as a video to be identified.
  • the behavior identification module is further used to:
  • the behavior recognition module is further configured to:
  • the initial video is labeled according to the specified behavioral content.
  • the behavior identification module is further used to:
  • the second video acquisition module is further used to:
  • the original video is determined as the initial video for behavior recognition.
  • the length of the initial video is shorter than the length of the reference video
  • the second video acquisition module is further configured to:
  • the initial video is supplemented with frames according to the length of the reference video to obtain a video clip.
  • the video discrimination model includes a first feature extraction branch, a second feature extraction branch, a feature fusion module, and a similarity discrimination module.
  • the first feature extraction branch and the second feature extraction branch have the same structure and share network parameters.
  • the similarity discrimination module is further used to:
  • the fused features are input into the similarity judgment module for similarity judgment to obtain the similarity between the video to be identified and the reference video.
  • the first feature extraction branch includes a first feature extraction layer based on a self-attention mechanism
  • the second feature extraction branch includes a second feature extraction layer based on a self-attention mechanism
  • the similarity determination module is further used to:
  • the parameters of the initial model are adjusted according to the second video sample to obtain a video discrimination model.
  • the similarity determination module is further used to:
  • the parameters of the feature fusion module and the similarity discrimination module are adjusted according to the second video sample to obtain a video discrimination model.
  • before adjusting parameters of the feature fusion module and the similarity discrimination module according to the second video sample to obtain the video discrimination model, the similarity discrimination module is further used to:
  • Parameters of the first feature extraction branch and the second feature extraction branch are frozen.
  • the similarity determination module is further used to:
  • behavior recognition device provided in the embodiment of the present application and the behavior recognition method in the above embodiment are of the same concept, and any method provided in the behavior recognition method embodiment can be implemented through the behavior recognition device, and the same technical effect can be achieved.
  • the specific implementation process is detailed in the behavior recognition method embodiment, and will not be repeated here.
  • the embodiment of the present application also provides an electronic device, which may be a smart phone, a folding screen mobile phone, a tablet computer, a PDA, a desktop computer and the like.
  • Figure 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • the electronic device 300 includes a processor 301 having one or more processing cores, a memory 302 having one or more computer-readable storage media, and a computer program stored in the memory 302 and executable on the processor.
  • the processor 301 is electrically connected to the memory 302.
  • the electronic device structure shown in the figure does not constitute a limitation on the electronic device, and may include more or fewer components than shown, or a combination of certain components, or different component arrangements.
  • the processor 301 is the control center of the electronic device 300. It uses various interfaces and lines to connect various parts of the entire electronic device 300, executes various functions of the electronic device 300 and processes data by running or loading software programs and/or modules stored in the memory 302, and calling data stored in the memory 302, thereby monitoring the electronic device 300 as a whole.
  • the processor 301 in the electronic device 300 will load instructions corresponding to the processes of one or more application programs into the memory 302 according to the following steps, and the processor 301 will run the application programs stored in the memory 302 to implement various functions:
  • the similarity between the video to be identified and the reference video is obtained through a video discrimination model based on a twin neural network;
  • By constructing a video discrimination model based on a twin neural network, pre-training the video discrimination model to reduce the sample data required for training, and combining the sample data in pairs as the input of the dual-branch network, the electronic device can greatly expand the scale of the sample data and improve the generalization ability of the model. By determining the similarity between the reference video and the video to be identified through the video discrimination model and performing similarity discrimination on the two in the time dimension, the amount of computation consumed by the similarity discrimination can be reduced, so that the discrimination result for the video to be identified is obtained quickly. In addition, behavior recognition of initial videos of varying lengths can be realized, improving the flexibility of behavior recognition.
  • the embodiment of the present application provides a computer-readable storage medium.
  • a person skilled in the art can understand that all or part of the steps in the above-mentioned embodiment method can be completed by instructing related hardware through a program.
  • the program can be stored in a computer-readable storage medium. When the program is executed, it includes the following steps:
  • the similarity between the video to be identified and the reference video is obtained through a video discrimination model based on a twin neural network;
  • the above-mentioned storage medium may be ROM/RAM, a magnetic disk, an optical disk, etc. Since the computer program stored in the storage medium can execute the steps in any behavior recognition method provided in the embodiments of the present application, the beneficial effects that can be achieved by any behavior recognition method provided in the embodiments of the present application can be achieved, as detailed in the previous embodiments, which will not be repeated here.
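The pairwise combination of sample data described above (combining labeled clips in pairs as inputs to the dual-branch network) can be sketched as follows; the function and data names here are illustrative assumptions, not taken from the patent.

```python
from itertools import combinations

def make_training_pairs(samples):
    """Combine labeled video samples in pairs for a dual-branch network.

    `samples` is a list of (clip, behavior_label) tuples. Each unordered
    pair of clips becomes one training example whose target is 1 when the
    two clips show the same behavior and 0 otherwise.
    """
    pairs = []
    for (clip_a, label_a), (clip_b, label_b) in combinations(samples, 2):
        target = 1 if label_a == label_b else 0
        pairs.append((clip_a, clip_b, target))
    return pairs

samples = [("clip1", "eating"), ("clip2", "walking"),
           ("clip3", "eating"), ("clip4", "resting")]
pairs = make_training_pairs(samples)
print(len(pairs))  # 6 pairs from 4 samples
```

Pairing n samples yields n(n-1)/2 training pairs, which is how pairwise combination expands the effective scale of the sample data.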


Abstract

Disclosed in the present application are a behavior recognition method, a storage medium and an electronic device. The method comprises: acquiring a reference video, wherein the reference video comprises specified behavior content; acquiring a video to be subjected to recognition that requires behavior recognition; acquiring the degree of similarity between the video to be subjected to recognition and the reference video by means of a video discrimination model based on a siamese neural network; and according to the degree of similarity, recognizing whether the video to be subjected to recognition comprises specified behavior content so as to obtain a recognition result.

Description

Behavior Recognition Method, Storage Medium and Electronic Device

Technical Field

The present application relates to the field of artificial intelligence technology, and in particular to a behavior recognition method, a storage medium, and an electronic device.

Background Art
Animal behavior macroscopically reflects information such as an animal's higher central nervous function, learning and memory ability, psychological state, and motor coordination. Studying animal behavior makes it possible to assess an animal's adaptation to the environment or to pharmacological interventions, and has wide applications in toxicology, pharmacology, sports injury and recovery, and other fields.

With the rapid development of artificial intelligence technology, supervised learning methods based on artificial intelligence can classify animal behaviors. However, such methods can only assign an animal behavior to one of a set of fixed categories.
Technical Problem

Animal behaviors of unknown categories are difficult to classify accurately.

Technical Solution
The embodiments of the present application provide a behavior recognition method, a storage medium, and an electronic device, which can improve the accuracy of recognizing animal behavior.

In a first aspect, an embodiment of the present application provides a behavior recognition method, the method comprising:

obtaining a reference video, where the reference video includes specified behavior content;

obtaining a video to be identified that requires behavior recognition;

obtaining the similarity between the video to be identified and the reference video through a video discrimination model based on a twin neural network;

identifying, according to the similarity, whether the video to be identified includes the specified behavior content, to obtain a recognition result.
In a second aspect, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program runs on a computer, the computer is caused to execute the behavior recognition method provided in any embodiment of the present application.

In a third aspect, an embodiment of the present application further provides an electronic device, including a processor and a memory, the memory storing a computer program, and the processor executing, by calling the computer program, the behavior recognition method provided in any embodiment of the present application.

Beneficial Effects

For a reference video that includes specified behavior content, similarity recognition is performed on the reference video and the video to be identified through a video discrimination model based on a twin neural network, so as to determine whether the video to be identified includes the specified behavior content. Performing behavior recognition on the video to be identified by means of a reference video can, on the one hand, accurately recognize the behavior in the video to be identified and, on the other hand, quickly classify the video to be identified.
BRIEF DESCRIPTION OF THE DRAWINGS

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative work.

FIG. 1 is a schematic diagram of an application scenario of the behavior recognition method provided in an embodiment of the present application.

FIG. 2 is a schematic flowchart of the behavior recognition method provided in an embodiment of the present application.

FIG. 3 is a schematic diagram of capturing video segments using a sliding window in the behavior recognition method provided in an embodiment of the present application.

FIG. 4 is a schematic diagram of the structure of the video discrimination model in the behavior recognition method provided in an embodiment of the present application.

FIG. 5 is a detailed flowchart of the video discrimination method provided in an embodiment of the present application.

FIG. 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.

DETAILED DESCRIPTION
The technical solutions in the embodiments of the present application are described clearly and completely below in conjunction with the drawings. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. Based on the embodiments in the present application, all other embodiments obtained by those skilled in the art without creative work fall within the scope of protection of the present application.

Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in conjunction with the embodiment may be included in at least one embodiment of the present application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
Artificial intelligence (AI) comprises theories, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.

Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes machine learning (ML), of which deep learning (DL) is a newer research direction, introduced to bring machine learning closer to its original goal, artificial intelligence. At present, deep learning is mainly applied in fields such as computer vision and natural language processing.

Deep learning is a type of machine learning, and machine learning is the necessary path to artificial intelligence. The concept of deep learning originated from research on artificial neural networks; a multilayer perceptron with multiple hidden layers is one kind of deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, in order to discover distributed feature representations of data. The motivation for studying deep learning is to build neural networks that simulate the human brain for analysis and learning, imitating the mechanisms of the human brain to interpret data such as images, sounds, and text.
According to its different manifestations, animal behavior can be divided into foraging behavior, food storage behavior, attack behavior, defense behavior, reproductive behavior, rhythmic behavior, communication behavior, and so on.

In fields such as toxicology, pharmacology, sports injury and recovery, and neuroscience, evaluating animal behavior can provide key information for related research. With the development of artificial intelligence technology, machine learning methods have also been adopted in related technologies to recognize animal behavior. Specifically, a neural network model can be trained by machine learning, and the trained model can then classify animal behavior. Methods for training the model include supervised and unsupervised machine learning. Supervised machine learning trains the neural network model on sample data pre-labeled with behavior labels, so that the model learns the mapping between the sample data and the corresponding behavior labels. Unsupervised machine learning clusters similar sample data through clustering algorithms, thereby classifying different sample data.

However, with unsupervised machine learning methods it is difficult to quickly find the animal behaviors that users need from large amounts of data.
For supervised machine learning, related technologies also include key-point-based animal recognition methods and video-based classification methods.

The key-point-based animal recognition method tracks key points on the animal's body and then classifies the animal's behavior based on the position information of the key points (such as limb joints). However, this method relies on key-point tracking: when the key points are occluded or tracking is lost, the classification of the animal's behavior becomes inaccurate. In addition, key-point tracking discards background information, so animal behavior information related to the background is omitted, which also makes the classification less accurate.

Although the video-based classification method does not require key-point tracking, it must classify actions frame by frame based on the pixel values of each video frame and then recognize the animal behavior from the action classification results. This incurs a large amount of computation and makes it difficult to recognize animal behavior quickly.
To solve the problems in the related art, the embodiments of the present application provide a behavior recognition method, a storage medium, and an electronic device for recognizing animal behavior quickly and accurately. Understandably, the behavior recognition method provided in the present application can recognize the behavior of various animals as well as various people. In the following embodiments, the method is described in detail using animal behavior as an example.

First, please refer to FIG. 1, which is a schematic diagram of an application scenario of the behavior recognition method provided in an embodiment of the present application. The method is executed by an electronic device. First, the user selects a video with specified content as a reference video; then, for the video to be identified, the two videos are input into the video discrimination model for similarity discrimination to determine the behavior content included in the video to be identified. In this way, fast and accurate behavior recognition is achieved.
Specifically, please refer to FIG. 2, which is a schematic flowchart of the behavior recognition method provided in an embodiment of the present application. The specific flow of the method can be as follows:

101. Obtain a reference video, where the reference video includes specified behavior content.

The specified behavior content includes but is not limited to various animal behaviors or various human behaviors. Taking animals as an example, the animal can be a cat, dog, monkey, mouse, etc. Common animal behaviors include but are not limited to: standing up, head movement, drinking, hanging, grooming, walking, resting, eating, limb licking, etc.

Exemplarily, there are multiple ways to obtain the reference video, including but not limited to: shooting a video that includes the specified behavior content, or obtaining such a video through video editing.

For example, a user records a video of a dog eating, a video of a cat grooming, or a video of a monkey hanging upside down as the reference video. As another example, a dog-eating video is selected as the reference video from videos of a dog eating, walking, licking its limbs, and so on. As yet another example, images related to a dog eating are selected from multiple groups of dog images and synthesized into a video of the dog eating as the reference video. As yet another example, the dog-eating portion is cut from a long recorded video as the reference video.
102. Obtain a video to be identified that requires behavior recognition.

Ways of obtaining the video to be identified may include shooting or video editing. The video to be identified that requires behavior recognition has not been labeled.

In this embodiment, the video to be identified and the reference video belong to the same animal or the same kind of people, which improves the accuracy of behavior recognition on the video to be identified.

Exemplarily, before obtaining the video to be identified, the organism contained in a candidate video can first be recognized; when the organism contained in the candidate video is consistent with the organism in the reference video, the candidate video is determined as the video to be identified that requires behavior recognition.
103. Obtain the similarity between the video to be identified and the reference video through a video discrimination model based on a twin neural network.

This embodiment proposes a video discrimination model based on a twin neural network. The model comprises a dual-branch network architecture whose branches share the same parameters. The video to be identified and the reference video are fed as the two inputs of the dual-branch architecture, and the model outputs the probability that the two are similar and the probability that they are dissimilar; the similarity between the two videos is then evaluated from these probability values.

For example, a large probability of similarity between the video to be identified and the reference video indicates high similarity between the two, and a large probability of dissimilarity indicates low similarity.
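As an illustration of the dual-branch idea above, the following minimal NumPy sketch shares one set of embedding weights across both branches, fuses the two embeddings, and outputs similar/dissimilar probabilities. The weights are random and the shapes are invented for illustration, so this is a structural sketch rather than the patent's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared weights: both branches of a twin (siamese) network use the
# same parameters, so the two clips are embedded in the same space.
W_embed = rng.standard_normal((64, 32))  # feature extraction (shared)
W_head = rng.standard_normal((64, 2))    # similarity head: 2 logits

def embed(clip_features):
    """One branch: map raw clip features to an embedding (shared weights)."""
    return np.tanh(clip_features @ W_embed)

def similarity_probs(clip_a, clip_b):
    """Fuse the two embeddings and output (similar, dissimilar) probabilities."""
    fused = np.concatenate([embed(clip_a), embed(clip_b)])  # feature fusion
    logits = fused @ W_head
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()  # softmax over {similar, dissimilar}

reference = rng.standard_normal(64)
candidate = rng.standard_normal(64)
p_similar, p_dissimilar = similarity_probs(reference, candidate)
```

A clip would then be judged to contain the specified behavior when the similar probability exceeds the dissimilar probability, or a chosen threshold.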
104. According to the similarity, identify whether the video to be identified includes the specified behavior content, and obtain a recognition result.

If the similarity between the video to be identified and the reference video is high, it can be determined that the two are the same or similar videos, and both indicate the specified behavior content. If the similarity is low, it can be determined that the two are not the same or similar videos, and the behavior content they indicate differs.

In specific implementations, the present application is not limited by the described execution order of the steps; where no conflict arises, some steps may be performed in other orders or simultaneously.

The behavior recognition method in the embodiments of the present application first selects a reference video required by the user, the reference video including specified behavior content, and then uses the video discrimination model to determine the similarity between the reference video and the video to be identified, so as to perform behavior recognition on the video to be identified. On the one hand, this allows the behavior in a video to be discriminated flexibly and with high accuracy; on the other hand, it avoids key-point tracking and pixel-value classification of video frames, reduces the amount of data to be processed, and improves the efficiency of behavior recognition.
In some embodiments, obtaining the video to be identified that requires behavior recognition includes:

obtaining an initial video that requires behavior recognition;

extracting at least one video segment from the initial video according to the length of the reference video;

determining the at least one extracted video segment as the video to be identified.
The length of the initial video is not limited in the present application; an initial video of any length can be selected according to actual needs. The organism in the initial video is the same as the organism in the reference video.

After the initial video is segmented according to the length of the reference video, each obtained video segment has the same length as the reference video. The lengths of the reference video and the initial video can be determined by playback duration or by the number of video frames.

The number of extracted video segments can be one, two, or more, depending on the length of the initial video and the extraction method.
As one implementation, when the length of the initial video is less than the length of the reference video, one video segment is extracted: the initial video is frame-padded to obtain a single video segment of the same length as the reference video, which serves as the video to be identified.

As another implementation, when the length of the initial video equals the length of the reference video, the initial video can be used directly as one video segment.

As yet another implementation, when the length of the initial video is greater than the length of the reference video, at least two video segments can be extracted, where the number of video segments is determined by the ratio of the length of the initial video to the length of the reference video. When the ratio is greater than 1 and less than 2, two video segments can be obtained by frame padding; when the ratio is greater than 2, multiple video segments can be obtained.
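A minimal sketch of the frame-padding step, assuming the simple strategy of repeating the last frame (the patent states only that frames are supplemented according to the reference length, not which frames are added):

```python
def pad_to_length(frames, target_len):
    """Frame-pad a clip shorter than the reference by repeating its last frame.

    `frames` is a list of frames; the strategy of duplicating the final
    frame is an illustrative assumption, not specified by the patent.
    """
    if len(frames) >= target_len:
        return list(frames)
    return list(frames) + [frames[-1]] * (target_len - len(frames))

clip = ["f0", "f1", "f2"]       # a 3-frame initial video
padded = pad_to_length(clip, 5)  # padded to a 5-frame reference length
```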
示例性地,在视频片段至少有两个时,至少两个视频片段中的相邻的视频片段可以部分重合,也可以不重合。其中,在相邻的视频片段不重合时,相邻的视频片段首尾相接。在相邻的视频片段重合时,相邻的视频片段具有重叠的视频帧,该重叠的视频帧的数量可根据实际需求确定。Exemplarily, when there are at least two video segments, adjacent video segments in at least two video segments may partially overlap or may not overlap. When adjacent video segments do not overlap, the adjacent video segments are connected end to end. When adjacent video segments overlap, the adjacent video segments have overlapping video frames, and the number of the overlapping video frames may be determined according to actual needs.
具体地,可以使用滑动窗口截取至少两个视频片段作为待识别视频,请参阅图3,图3为本申请实施例提供的行为识别方法中使用滑动窗口截取视频片段的示意图。其中,若参考视频的帧数为10,初始视频的帧数为35帧,则将滑动窗口的长度设为10帧,每隔N帧移动滑动窗口截取视频片段,其中,如图3(a)所示,若N等于10,则截取的视频片段相互之前不重合,如图3(b)所示,若N小于10,则截取的视频片段相互之间重合。以N=1为例,本申请实施例中可以设置滑动窗口每隔一帧移动以截取视频片段,从而得到多个视频片段,其中,视频片段之间的重合度可以根据实际需求选择,此处并不进行限定。Specifically, a sliding window can be used to capture at least two video clips as videos to be identified. Please refer to Figure 3, which is a schematic diagram of using a sliding window to capture video clips in the behavior recognition method provided in the embodiment of the present application. If the number of frames of the reference video is 10 and the number of frames of the initial video is 35, the length of the sliding window is set to 10 frames, and the sliding window is moved every N frames to capture video clips. As shown in Figure 3 (a), if N is equal to 10, the captured video clips do not overlap with each other. As shown in Figure 3 (b), if N is less than 10, the captured video clips overlap with each other. Taking N=1 as an example, in the embodiment of the present application, the sliding window can be set to move every frame to capture video clips, thereby obtaining multiple video clips, wherein the overlap between video clips can be selected according to actual needs and is not limited here.
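A minimal sketch of the sliding-window cropping described above, treating a video as a plain list of frames. The text states only that frame padding is used for short videos; padding a partial tail window by repeating the last frame is an assumption made for illustration:

```python
def sliding_window_clips(frames, window, stride):
    """Split a frame list into fixed-length clips with a sliding window.

    Follows the scheme described above: the window length equals the
    reference-video length, the window advances `stride` (N) frames at a
    time, and when the video (or its tail) is shorter than the window
    the missing frames are filled by repeating the last frame.
    """
    if len(frames) < window:  # ratio < 1: pad up to a single clip
        frames = frames + [frames[-1]] * (window - len(frames))
    clips, start = [], 0
    while start < len(frames):
        clip = frames[start:start + window]
        if len(clip) < window:  # tail clip: pad with the final frame
            clip = clip + [clip[-1]] * (window - len(clip))
        clips.append(clip)
        if start + window >= len(frames):
            break
        start += stride
    return clips
```

With the example from the text (35 frames, 10-frame window, N = 10) this yields four clips, the last one padded; with N = 1 the clips overlap frame by frame.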
In some embodiments, when adjacent video segments partially overlap, after identifying whether each video to be identified includes the specified behavior content according to the similarity and obtaining the recognition results, the method further includes:
determining a score for each video frame of the initial video according to the recognition results of the same video frame across different videos to be identified;
determining a total score of the initial video according to the scores of the video frames of the initial video;
if the total score is greater than a preset threshold, determining that the initial video includes the specified behavior content;
if the total score is not greater than the preset threshold, determining that the initial video does not include the specified behavior content.
In this embodiment, after the recognition result of each video to be identified is determined from its similarity to the reference video, every video frame in that video is scored according to the recognition result. For example, the similarity outcome is reduced to two cases: the reference video is similar to the video to be identified, or it is not. When the two are similar, the recognition result may be set to 1; when they are not, it may be set to 0. The recognition result (1 or 0) of each video to be identified is then assigned as a score to every frame it contains, so each frame receives a score of 0 or 1 from every video to be identified that covers it. When a frame thus receives multiple scores, the average of those scores is taken as its final score. For example, if the second video frame scores 0 in the first video to be identified and 1 in the second, its final score is 0.5.
After the final score of every frame is determined, the total score of the initial video is computed from the per-frame scores, for example by taking their mean, median, or another statistic. Taking the mean as an example, if the initial video has 10 frames scored 0, 0.5, 1, 1, 0.5, 0, 1, 1, 1 and 1, the total score of the initial video is 0.7.
In this embodiment, a preset threshold is also compared against the total score: when the total score exceeds the threshold, the initial video is determined to include the specified behavior content, i.e. the behavior shown in the initial video matches that of the reference video; when it does not, the initial video is determined not to include the specified behavior content. For example, with a preset threshold of 0.5 and a total score of 0.7, the initial video is determined to include the specified behavior content.
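The frame-scoring and thresholding steps above can be sketched as follows; the 0/1 clip-result encoding and mean aggregation follow the text, while the function name and argument layout are illustrative:

```python
def score_initial_video(clip_spans, clip_results, num_frames, threshold=0.5):
    """Aggregate per-clip recognition results into a verdict on the
    initial video, for the overlapping-clip case described above.

    clip_spans   -- (start, end) frame range covered by each clip
    clip_results -- 1 if a clip was judged similar to the reference, else 0
    Each frame's score is the mean of the results of all clips covering
    it, and the video's total score is the mean over all frames.
    """
    frame_scores = []
    for f in range(num_frames):
        hits = [r for (s, e), r in zip(clip_spans, clip_results) if s <= f < e]
        frame_scores.append(sum(hits) / len(hits) if hits else 0.0)
    total = sum(frame_scores) / num_frames
    return total, total > threshold
```

As in the worked example, a frame covered by one matching clip and one non-matching clip receives a score of 0.5.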
By cropping the videos to be identified from the initial video with a sliding window, determining the total score of the initial video from the scores that each frame receives across the different videos to be identified, and comparing the total score with a preset threshold, this embodiment makes the recognition result for the initial video more accurate.
In some embodiments, when adjacent video segments do not overlap, after identifying whether each video to be identified includes the specified behavior content according to the similarity and obtaining the recognition results, the method further includes:
determining a total score of the initial video according to the recognition results of the videos to be identified;
if the total score is greater than a preset threshold, determining that the initial video includes the specified behavior content;
if the total score is not greater than the preset threshold, determining that the initial video does not include the specified behavior content.
When adjacent video segments do not overlap, after the recognition result of each video to be identified is determined, the similarity outcome is again reduced to two cases, similar and dissimilar: when the reference video is dissimilar to a video to be identified, the recognition result of that video is set to 0; when they are similar, it is set to 1.
After the scores of the different videos to be identified are determined, the total score of the initial video is derived from them, for example by taking the mean, median, or another statistic of the per-video scores. For instance, the total score of the initial video may be obtained by averaging the scores of the videos to be identified.
As above, a preset threshold is compared against the total score to determine the recognition result for the initial video. The details are the same as described above and are not repeated here.
In some embodiments, after it is determined that the initial video includes the specified behavior content, the method further includes:
labeling the initial video according to the specified behavior content. That is, after the recognition result of the initial video is determined, the initial video is also labeled. Specifically, if the initial video includes the specified behavior content, a corresponding behavior label is assigned to it; if it does not, no such label is assigned. For example, if the behavior content of the reference video is eating and the initial video includes the specified behavior content, an "eating" behavior label is assigned to the initial video; otherwise the initial video is left unlabeled, or is labeled as non-eating.
By labeling the initial videos in this way, videos showing the same behavior as the reference video can easily be selected from a large number of initial videos, improving the efficiency of video screening and querying.
In some embodiments, the video discrimination model includes a first feature extraction branch, a second feature extraction branch, a feature fusion module and a similarity discrimination module. The two feature extraction branches have identical structures and share network parameters. Obtaining the similarity between the video to be identified and the reference video through the Siamese-network-based video discrimination model includes:
inputting the reference video into the first feature extraction branch for feature extraction to obtain first frame-sequence features;
inputting the video to be identified into the second feature extraction branch for feature extraction to obtain second frame-sequence features;
inputting the first and second frame-sequence features into the feature fusion module for feature fusion to obtain fused features;
inputting the fused features into the similarity discrimination module for similarity judgment to obtain the similarity between the video to be identified and the reference video.
Refer to FIG. 4, which is a schematic structural diagram of the video discrimination model in the behavior recognition method provided by an embodiment of the present application. The model includes a first feature extraction branch and a second feature extraction branch with identical network parameters and structures; both branches are connected to the feature fusion module, which in turn is connected to the similarity discrimination module.
When the similarity between the reference video and the video to be identified is obtained through the Siamese-network-based video discrimination model, the reference video is input into the first feature extraction branch to obtain the first frame-sequence features, and the video to be identified is input into the second feature extraction branch to obtain the second frame-sequence features. It will be understood that the reference video could equally be input into the second branch and the video to be identified into the first; which branch receives which input is not limited here.
Exemplarily, before being input into the first and second feature extraction branches, the reference video and the video to be identified are converted into frame sequences. When the first and second frame-sequence features are input into the feature fusion module for feature fusion, the module obtains the fused features by vector subtraction of the first frame-sequence features and the second frame-sequence features.
The similarity discrimination module is used to output the probability that the fused features indicate similarity, or the probability that they indicate dissimilarity.
By extracting the temporal features of the entire reference video and the entire video to be identified through the first and second feature extraction branches and then judging similarity from those temporal features, this embodiment accounts for the spatio-temporal dependencies between behavior features, making the similarity judgment more accurate. Moreover, compared with related art that must process the video frame by frame, the method of this embodiment can quickly judge the similarity between the video to be identified and the reference video, reducing the computation required by the model and increasing its discrimination speed.
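The forward pass described above (shared branches, fusion by vector subtraction, similarity head) can be sketched in a few lines. The feature extractor here is a deliberate stand-in (a simple mean over per-frame feature vectors, not the self-attention network of the patent), and `w` and `b` are hypothetical learned parameters of a logistic similarity head:

```python
import math

def extract_features(frames):
    # Stand-in for the shared feature extraction branch: the patent uses
    # a self-attention network over the frame sequence; here we simply
    # average the per-frame feature vectors into one sequence feature.
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]

def siamese_similarity(ref_frames, query_frames, w, b):
    # The two branches share parameters, so one function serves both the
    # reference video and the video to be identified.
    ref_feat = extract_features(ref_frames)
    query_feat = extract_features(query_frames)
    # Feature fusion by vector subtraction, as described in the text.
    fused = [r - q for r, q in zip(ref_feat, query_feat)]
    # Similarity head: a logistic layer over the fused feature.
    z = sum(wi * fi for wi, fi in zip(w, fused)) + b
    return 1.0 / (1.0 + math.exp(-z))
```

Because fusion is a subtraction, two identical inputs yield an all-zero fused vector, so the similarity output depends only on the bias of the head in that case.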
In some embodiments, the first feature extraction branch includes a first feature extraction layer based on a self-attention mechanism, and the second feature extraction branch includes a second feature extraction layer based on a self-attention mechanism.
Specifically, the first frame-sequence features of the reference video are extracted by the first self-attention-based feature extraction layer, and the second frame-sequence features of the video to be identified are extracted by the second self-attention-based feature extraction layer.
Capturing the correlations between video frames with self-attention-based feature extraction layers takes the mutual influence of the frames into account, which improves accuracy when similarity is judged from the first and second frame-sequence features.
In some embodiments, before the reference video is obtained, the method further includes:
obtaining a pre-trained initial model, the initial model being pre-trained on first video samples of different behavior contents;
obtaining second video samples corresponding to the specified behavior content;
adjusting the parameters of the initial model according to the second video samples to obtain the video discrimination model.
The first video samples include, but are not limited to, the ImageNet dataset (a large image recognition database used in computer vision research), the Kinetics-700 dataset (a human behavior dataset), and the like. The Siamese-network-based initial model is pre-trained on the first video samples to obtain its model parameters.
The second video samples include video samples of the specified behavior content; for example, when the specified behavior content is eating, they may include videos of dogs eating, cats eating, chickens eating, and so on. Naturally, the second video samples also include video samples that do not show the specified behavior content; for example, when the specified behavior content is eating, they may additionally include videos of pandas hanging, dogs rolling, cats licking their limbs, and so on.
When the parameters of the initial model are adjusted using the second video samples, video samples with the same or different behavior contents are combined pairwise and fed into the first and second feature extraction branches of the initial model respectively for training, until the model converges or a specified number of training iterations is reached, yielding the video discrimination model.
In this embodiment, combining the two classes of second video samples pairwise and training the Siamese-network-based video discrimination model on the pairs expands the effective scale of the training set, allows the initial model to be trained with only a small number of labeled second video samples, and speeds up model training.
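The pairwise combination above can be sketched as follows; the sample representation (an identifier plus a behavior label) is an assumption for illustration:

```python
from itertools import combinations

def make_training_pairs(samples):
    """Combine labeled clips pairwise into Siamese training examples.

    samples: (clip_id, behavior_label) tuples. Returns
    (clip_a, clip_b, target) triples with target 1 when both clips show
    the same behavior and 0 otherwise; n labeled clips yield n*(n-1)/2
    pairs, which is how pairing enlarges a small labeled set.
    """
    return [(a, b, 1 if la == lb else 0)
            for (a, la), (b, lb) in combinations(samples, 2)]
```

Each triple would then feed one clip into each feature extraction branch, with the 0/1 target supervising the similarity head.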
In some embodiments, adjusting the parameters of the initial model according to the second video samples to obtain the video discrimination model includes:
adjusting the parameters of the feature fusion module and the similarity discrimination module according to the second video samples to obtain the video discrimination model.
That is, when the initial model is fine-tuned with the second video samples, the video discrimination model is obtained by adjusting the parameters of the feature fusion module and the similarity discrimination module.
Exemplarily, the network parameters of the first and second feature extraction branches may be frozen while the second video samples are fed into the initial model for training, so that during training only the network parameters of the feature fusion module and the similarity discrimination module are optimized to minimize their loss, thereby fine-tuning those parameters.
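The freezing scheme can be mimicked with a plain-Python gradient step over a named parameter dictionary; the parameter names and the scalar-parameter representation are illustrative, not the patent's actual implementation:

```python
def finetune_step(params, grads, frozen_prefixes, lr=0.1):
    """One gradient-descent step that mimics the freezing scheme above:
    any parameter whose name starts with a frozen prefix (the two
    feature extraction branches) is left untouched; only the fusion
    module and similarity head parameters are updated."""
    def is_frozen(name):
        return any(name.startswith(p) for p in frozen_prefixes)
    return {name: (value if is_frozen(name) else value - lr * grads[name])
            for name, value in params.items()}
```

In a deep-learning framework the same effect is usually obtained by disabling gradient computation for the branch parameters before fine-tuning.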
In the embodiments of the present application, pre-training followed by fine-tuning enables model training with only a small number of labeled second video samples, reducing the required scale of training data and improving training efficiency.
In some embodiments, the specified behavior includes a specified behavior of a specified animal, and obtaining the second video samples corresponding to the specified behavior content includes:
obtaining video samples of the specified animal labeled with the specified behavior;
performing data augmentation on the video samples of the specified animal to obtain the second video samples.
Here, video samples showing a specified behavior of a particular animal may be obtained as the second video samples; for example, various videos of various dogs eating. The dogs may be divided by breed, and different dogs eat with somewhat different motions. By turning such videos into second video samples and fine-tuning the parameters of the initial model on them, the trained video discrimination model achieves higher accuracy when applied to discriminating dog videos.
It will be understood that, once the animal of the second video samples is determined, a video of that animal's behavior may also be selected as the reference video so that the similarity between the reference video and the video to be identified can be judged accurately; when the animal in the video to be identified is not the animal of the reference video, the two can be directly judged dissimilar.
In this embodiment, data augmentation is also applied to the specified-animal video samples labeled with the specified behavior, to enlarge the sample set and improve the generalization ability of the model. Augmentation methods include, but are not limited to, video cropping, frame extraction, image translation, image rotation and image scaling.
By augmenting the specified-animal video samples to obtain a larger set of second video samples for fine-tuning the initial model, this embodiment gives the model better generalization ability.
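Two of the listed augmentation methods, frame extraction and image translation, can be sketched on plain Python lists; the data representations (a video as a list of frames, a frame as a list of pixel rows) are simplifications for illustration:

```python
def extract_every_nth_frame(video, n):
    """Frame-extraction augmentation: keep every n-th frame to create a
    shorter, faster-motion variant of a clip."""
    return video[::n]

def translate_frame(frame, dx, fill=0):
    """Image-translation augmentation: shift each pixel row of a frame
    dx pixels to the right (negative dx shifts left), filling the
    vacated pixels with `fill`."""
    width = len(frame[0])
    out = []
    for row in frame:
        if dx >= 0:
            out.append([fill] * dx + row[:width - dx])
        else:
            out.append(row[-dx:] + [fill] * (-dx))
    return out
```

Applying several such transforms to each labeled clip multiplies the number of second video samples available for fine-tuning.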
The foregoing is now described in more detail. Refer to FIG. 5, which is a detailed flowchart of the video discrimination method provided by an embodiment of the present application. The flowchart indicates the following:
Stage 1: constructing the Siamese-network-based video discrimination model.
The video discrimination model includes a first feature extraction branch, a second feature extraction branch, a feature fusion module and a similarity discrimination module.
Stage 2: training the video discrimination model.
201. Obtain a pre-trained initial model.
202. Obtain second video samples corresponding to the specified behavior content.
203. Freeze the network parameters of the first and second feature extraction branches, and train the initial model with the second video samples to fine-tune the network parameters of the feature fusion module and the similarity discrimination module.
Stage 3: applying the video discrimination model for similarity prediction.
204. Obtain a reference video, the reference video including the specified behavior content.
205. Obtain an initial video, and crop at least one video segment from the initial video according to the length of the reference video as the video(s) to be identified.
206. Input the reference video into the first feature extraction branch to obtain first frame-sequence features.
207. Input the video to be identified into the second feature extraction branch to obtain second frame-sequence features.
208. Input the first and second frame-sequence features into the feature fusion module for feature fusion to obtain fused features.
209. Input the fused features into the similarity discrimination module for similarity judgment to obtain the similarity between the reference video and the video to be identified.
210. Score each video to be identified according to the similarity.
211. Determine the score of each video frame of the initial video according to the scores that the same video frame receives in different videos to be identified.
212. Determine the total score of the initial video according to the scores of its video frames.
213. If the total score is greater than a preset threshold, determine that the initial video includes the specified behavior content.
214. Assign the label of the specified behavior to the initial video.
In the embodiments of the present application, behavior recognition is performed by constructing a Siamese-network-based video discrimination model, which generalizes across behaviors and achieves high recognition accuracy. The table below gives recognition results for eight behaviors on an animal behavior dataset:
Behavior         Precision   Recall   F1-score
Get up           0.69        0.88     0.78
Head movement    0.48        0.78     0.60
Drinking         0.10        1.00     0.18
Hanging          0.96        0.94     0.95
Grooming         0.95        0.77     0.85
Walking          0.84        0.85     0.85
Resting          0.98        0.96     0.97
Eating           0.91        0.85     0.88
As can be seen from the above, the behavior recognition method proposed in the embodiments of the present invention constructs a Siamese-network-based video discrimination model and pre-trains it to reduce the amount of sample data needed for training, while pairwise combination of the sample data as input to the dual-branch network greatly expands the effective scale of the sample data and improves the generalization ability of the model. Determining the similarity between the reference video and the video to be identified with the video discrimination model, so that the two are compared in the temporal dimension, reduces the computation consumed by similarity judgment and quickly yields the discrimination result for the video to be identified. In addition, behavior recognition can be performed on initial videos of arbitrary duration, improving the flexibility of behavior recognition.
An embodiment of the present application further provides a behavior recognition apparatus, applied to an electronic device, including:
a first video acquisition module, configured to obtain a reference video, the reference video including specified behavior content;
a second video acquisition module, configured to obtain a video to be identified on which behavior recognition is to be performed;
a similarity discrimination module, configured to obtain the similarity between the video to be identified and the reference video through a Siamese-network-based video discrimination model;
a behavior recognition module, configured to identify, according to the similarity, whether the video to be identified includes the specified behavior content, and obtain a recognition result.
In some embodiments, the second video acquisition module is further configured to:
obtain an initial video on which behavior recognition is to be performed;
crop at least one video segment from the initial video according to the length of the reference video;
determine the cropped at least one video segment as the video(s) to be identified.
In some embodiments, when the videos to be identified include at least two video segments and adjacent segments among them partially overlap, after identifying whether each video to be identified includes the specified behavior content according to the similarity and obtaining the recognition results, the behavior recognition module is further configured to:
determine the score of each video frame of the initial video according to the recognition results of the same video frame across different videos to be identified;
determine the total score of the initial video according to the scores of the video frames of the initial video;
if the total score is greater than a preset threshold, determine that the initial video includes the specified behavior content;
if the total score is not greater than the preset threshold, determine that the initial video does not include the specified behavior content.
In some embodiments, after determining that the initial video includes the specified behavior content, the behavior recognition module is further configured to:
label the initial video according to the specified behavior content.
In some embodiments, when the videos to be identified include at least two video segments and adjacent segments among them do not overlap, after identifying whether each video to be identified includes the specified behavior content according to the similarity and obtaining the recognition results, the behavior recognition module is further configured to:
determine the total score of the initial video according to the recognition results of the videos to be identified;
if the total score is greater than a preset threshold, determine that the initial video includes the specified behavior content;
if the total score is not greater than the preset threshold, determine that the initial video does not include the specified behavior content.
在一些实施例中,获取需要进行行为识别的初始视频之前,第二视频获取模块还用于:In some embodiments, before acquiring the initial video for behavior recognition, the second video acquisition module is further used to:
获取原始视频;Get the original video;
确定原始视频所属的生物与参考视频所属的生物是否相同;Determine whether the organism to which the original video belongs is the same as the organism to which the reference video belongs;
若是,则将原始视频确定为需要进行行为识别的初始视频。If so, the original video is determined as the initial video for behavior recognition.
在一些实施例中,初始视频的长度小于参考视频的长度,第二视频获取模块还用于:In some embodiments, the length of the initial video is shorter than the length of the reference video, and the second video acquisition module is further configured to:
按照参考视频的长度对初始视频进行补帧处理,得到一个视频片段。The initial video is supplemented with frames according to the length of the reference video to obtain a video clip.
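The frame-padding step can be sketched as follows. Repeating the final frame is just one possible padding strategy, chosen here for illustration; the embodiment does not specify how the supplementary frames are produced.

```python
def pad_to_reference(frames, ref_len):
    """Pad a short clip to the reference length by repeating its last frame."""
    if not frames:
        raise ValueError("cannot pad an empty clip")
    if len(frames) >= ref_len:
        return frames[:ref_len]
    return frames + [frames[-1]] * (ref_len - len(frames))

clip = ["f0", "f1", "f2"]
print(pad_to_reference(clip, 5))  # -> ['f0', 'f1', 'f2', 'f2', 'f2']
```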
在一些实施例中,视频判别模型包括第一特征提取分支、第二特征提取分支、特征融合模块以及相似度判别模块,第一特征提取分支和第二特征提取分支的结构相同,且网络参数共享,相似度判别模块还用于:In some embodiments, the video discrimination model includes a first feature extraction branch, a second feature extraction branch, a feature fusion module, and a similarity discrimination module. The first feature extraction branch and the second feature extraction branch have the same structure and share network parameters. The similarity discrimination module is further used to:
将参考视频输入第一特征提取分支进行特征提取,得到第一帧序列特征;Inputting the reference video into the first feature extraction branch to extract features, and obtaining first frame sequence features;
将待识别视频输入第二特征提取分支进行特征提取,得到第二帧序列特征;Inputting the video to be identified into the second feature extraction branch to extract features, and obtaining second frame sequence features;
将第一帧序列特征和第二帧序列特征输入特征融合模块进行特征融合处理,得到融合特征;Inputting the first frame sequence feature and the second frame sequence feature into the feature fusion module for feature fusion processing to obtain fusion features;
将融合特征输入相似度判别模块进行相似度判别,得到待识别视频与参考视频的相似度。The fused features are input into the similarity judgment module for similarity judgment to obtain the similarity between the video to be identified and the reference video.
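The two-branch flow above can be sketched in miniature. The toy linear "extractor" and sigmoid head below are illustrative stand-ins (real branches would be deep networks), but they show the essentials: both branches apply the same shared weights, the two frame-sequence features are fused (here by vector subtraction, one fusion option named in this application), and the fused vector is mapped to a similarity in (0, 1).

```python
import math

def extract(seq, weights):
    """Toy shared-parameter feature extractor: a weighted sum per frame."""
    return [sum(w * x for w, x in zip(weights, frame)) for frame in seq]

def similarity(ref_seq, query_seq, weights, head_w, head_b=0.0):
    f1 = extract(ref_seq, weights)            # first branch
    f2 = extract(query_seq, weights)          # second branch, SAME weights
    fused = [a - b for a, b in zip(f1, f2)]   # feature fusion by subtraction
    logit = sum(hw * f for hw, f in zip(head_w, fused)) + head_b
    return 1.0 / (1.0 + math.exp(-logit))     # sigmoid -> similarity in (0, 1)

w = [0.5, -0.2]
clip = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.5]]
# Identical clips fuse to the zero vector, so similarity is sigmoid(0) = 0.5.
print(similarity(clip, clip, w, [1.0, 1.0, 1.0]))  # -> 0.5
```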
在一些实施例中,第一特征提取分支包括基于自注意力机制的第一特征提取层,第二特征提取分支包括基于自注意力机制的第二特征提取层。In some embodiments, the first feature extraction branch includes a first feature extraction layer based on a self-attention mechanism, and the second feature extraction branch includes a second feature extraction layer based on a self-attention mechanism.
在一些实施例中,获取参考视频之前,相似度判别模块还用于:In some embodiments, before obtaining the reference video, the similarity determination module is further used to:
获取预训练的初始模型,初始模型根据不同行为内容的第一视频样本预训练得到;Obtaining a pre-trained initial model, where the initial model is pre-trained according to first video samples of different behavior contents;
获取对应指定行为内容的第二视频样本;Obtain a second video sample corresponding to the specified behavior content;
根据第二视频样本对初始模型进行参数调整,得到视频判别模型。The parameters of the initial model are adjusted according to the second video sample to obtain a video discrimination model.
在一些实施例中,相似度判别模块还用于:In some embodiments, the similarity determination module is further used to:
根据第二视频样本对特征融合模块和相似度判别模块进行参数调整,得到视频判别模型。The parameters of the feature fusion module and the similarity discrimination module are adjusted according to the second video sample to obtain a video discrimination model.
在一些实施例中,根据第二视频样本对特征融合模块和相似度判别模块进行参数调整,得到视频判别模型之前,相似度判别模块还用于:In some embodiments, before adjusting parameters of the feature fusion module and the similarity discrimination module according to the second video sample to obtain the video discrimination model, the similarity discrimination module is further used to:
冻结第一特征提取分支和第二特征提取分支的参数。Freeze parameters of the first feature extraction branch and the second feature extraction branch.
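One way to realize the freezing step, sketched without a specific deep-learning framework (the dictionary-of-named-parameters layout is an assumption for illustration): parameters whose names belong to the two feature extraction branches are excluded from the set handed to the optimizer, so only the feature fusion module and the similarity discrimination module are adjusted.

```python
def trainable_parameters(named_params, frozen_prefixes=("branch1.", "branch2.")):
    """Keep only parameters outside the frozen feature-extraction branches."""
    return {name: p for name, p in named_params.items()
            if not name.startswith(frozen_prefixes)}

params = {
    "branch1.layer0.weight": [0.1],
    "branch2.layer0.weight": [0.1],   # shared with branch1 in practice
    "fusion.weight": [0.3],
    "similarity.weight": [0.7],
}
print(sorted(trainable_parameters(params)))  # -> ['fusion.weight', 'similarity.weight']
```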
在一些实施例中,相似度判别模块还用于:In some embodiments, the similarity determination module is further used to:
获取标记为指定行为的指定动物视频样本;Get a sample of videos of a specified animal marked with a specified behavior;
对指定动物视频样本进行数据增广处理,得到第二视频样本。Perform data augmentation processing on the specified animal video sample to obtain a second video sample.
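The augmentation step can be sketched as below. Horizontal flipping and time reversal are merely example transforms — the embodiments do not enumerate which augmentations are used, and time reversal may be unsuitable for direction-sensitive behaviors:

```python
def augment_clip(frames):
    """Produce simple augmented copies of a clip (a list of 2D frames)."""
    flipped = [[row[::-1] for row in frame] for frame in frames]  # mirror each frame
    reversed_time = frames[::-1]                                  # reverse frame order
    return [flipped, reversed_time]

clip = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # two 2x2 "frames"
aug = augment_clip(clip)
print(aug[0][0])  # -> [[2, 1], [4, 3]]
```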
应当说明的是,本申请实施例提供的行为识别装置与上文实施例中的行为识别方法属于同一构思,通过该行为识别装置可以实现行为识别方法实施例中提供的任一方法,且能达到相同的技术效果。其具体实现过程详见行为识别方法实施例,此处不再赘述。It should be noted that the behavior recognition device provided in the embodiment of the present application and the behavior recognition method in the above embodiment are of the same concept, and any method provided in the behavior recognition method embodiment can be implemented through the behavior recognition device, and the same technical effect can be achieved. The specific implementation process is detailed in the behavior recognition method embodiment, and will not be repeated here.
本申请实施例还提供一种电子设备,该电子设备可以是智能手机、折叠屏手机、平板电脑、掌上电脑、台式电脑等设备。如图6所示,图6为本申请实施例提供的电子设备的结构示意图。该电子设备300包括有一个或者一个以上处理核心的处理器301、有一个或一个以上计算机可读存储介质的存储器302及存储在存储器302上并可在处理器上运行的计算机程序。其中,处理器301与存储器302电性连接。本领域技术人员可以理解,图中示出的电子设备结构并不构成对电子设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。The embodiment of the present application also provides an electronic device, which may be a smart phone, a folding screen mobile phone, a tablet computer, a PDA, a desktop computer and the like. As shown in Figure 6, Figure 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application. The electronic device 300 includes a processor 301 having one or more processing cores, a memory 302 having one or more computer-readable storage media, and a computer program stored in the memory 302 and executable on the processor. Among them, the processor 301 is electrically connected to the memory 302. Those skilled in the art will appreciate that the electronic device structure shown in the figure does not constitute a limitation on the electronic device, and may include more or fewer components than shown, or a combination of certain components, or different component arrangements.
处理器301是电子设备300的控制中心,利用各种接口和线路连接整个电子设备300的各个部分,通过运行或加载存储在存储器302内的软件程序和/或模块,以及调用存储在存储器302内的数据,执行电子设备300的各种功能和处理数据,从而对电子设备300进行整体监控。The processor 301 is the control center of the electronic device 300. It uses various interfaces and lines to connect various parts of the entire electronic device 300, executes various functions of the electronic device 300 and processes data by running or loading software programs and/or modules stored in the memory 302, and calling data stored in the memory 302, thereby monitoring the electronic device 300 as a whole.
在本申请实施例中,电子设备300中的处理器301会按照如下的步骤,将一个或一个以上的应用程序的进程对应的指令加载到存储器302中,并由处理器301来运行存储在存储器302中的应用程序,从而实现各种功能:In the embodiment of the present application, the processor 301 in the electronic device 300 will load instructions corresponding to the processes of one or more application programs into the memory 302 according to the following steps, and the processor 301 will run the application programs stored in the memory 302 to implement various functions:
获取参考视频,参考视频包括指定的行为内容;Obtain a reference video, where the reference video includes specified behavior content;
获取需要进行行为识别的待识别视频;Obtain the video to be identified for behavior recognition;
通过基于孪生神经网络的视频判别模型,获取待识别视频与参考视频的相似度;The similarity between the video to be identified and the reference video is obtained through a video discrimination model based on a Siamese neural network;
根据相似度识别待识别视频是否包括指定的行为内容,得到识别结果。According to the similarity, it is determined whether the video to be identified includes the specified behavior content to obtain the identification result.
以上各个操作的具体实施可参见前面的实施例,在此不再赘述。The specific implementation of the above operations can be found in the previous embodiments, which will not be described in detail here.
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。In the above embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in a certain embodiment, reference can be made to the relevant descriptions of other embodiments.
由上可知,本实施例提供的电子设备,通过构建基于孪生神经网络的视频判别模型,并通过对视频判别模型进行预训练的方式以减小训练模型所需的样本数据,且通过对样本数据进行成对组合以作为双分支网络的输入,能够极大程度地扩展样本数据的规模,以提高模型的泛化能力。且通过视频判别模型确定参考视频与待识别视频之间的相似度,以在时序维度对两者进行相似度判别,能够减小相似度判别耗用的计算量,从而快速地得到对待识别视频的判别结果。另外,还能够实现对不定时长的初始视频进行行为识别,提高了行为识别的灵活性。As can be seen from the above, the electronic device provided in this embodiment constructs a video discrimination model based on a Siamese neural network and pre-trains it to reduce the amount of sample data required for training; by combining sample data into pairs as input to the dual-branch network, the scale of the sample data can be greatly expanded, improving the generalization ability of the model. By using the video discrimination model to determine the similarity between the reference video and the video to be identified, similarity discrimination is performed on the two in the temporal dimension, which reduces the computation required and quickly yields a discrimination result for the video to be identified. In addition, behavior recognition can be performed on initial videos of arbitrary length, improving the flexibility of behavior recognition.
本领域普通技术人员可以理解,上述实施例的各种方法中的全部或部分步骤可以通过指令来完成,或通过指令控制相关的硬件来完成,该指令可以存储于一计算机可读存储介质中,并由处理器进行加载和执行。A person of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be completed by instructions, or by controlling related hardware through instructions. The instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
为此,本申请实施例提供一种计算机可读存储介质。本领域普通技术人员可以理解,实现上述实施例方法中的全部或部分步骤是可以通过程序来指令相关的硬件完成的,该程序可以存储于一计算机可读取存储介质中,该程序在执行时,包括如下步骤:To this end, an embodiment of the present application provides a computer-readable storage medium. A person of ordinary skill in the art can understand that all or part of the steps in the methods of the above embodiments can be completed by instructing the relevant hardware through a program; the program can be stored in a computer-readable storage medium and, when executed, includes the following steps:
获取参考视频,参考视频包括指定的行为内容;Obtain a reference video, where the reference video includes specified behavior content;
获取需要进行行为识别的待识别视频;Obtain the video to be identified for behavior recognition;
通过基于孪生神经网络的视频判别模型,获取待识别视频与参考视频的相似度;The similarity between the video to be identified and the reference video is obtained through a video discrimination model based on a Siamese neural network;
根据相似度识别待识别视频是否包括指定的行为内容,得到识别结果。According to the similarity, it is determined whether the video to be identified includes the specified behavior content to obtain the identification result.
以上各个操作的具体实施可参见前面的实施例,在此不再赘述。The specific implementation of the above operations can be found in the previous embodiments, which will not be described in detail here.
上述的存储介质可以为ROM/RAM、磁碟、光盘等。由于该存储介质中所存储的计算机程序,可以执行本申请实施例所提供的任一种行为识别方法中的步骤,因此,可以实现本申请实施例所提供的任一种行为识别方法所能实现的有益效果,详见前面的实施例,在此不再赘述。The above-mentioned storage medium may be ROM/RAM, a magnetic disk, an optical disk, etc. Since the computer program stored in the storage medium can execute the steps in any behavior recognition method provided in the embodiments of the present application, the beneficial effects that can be achieved by any behavior recognition method provided in the embodiments of the present application can be achieved, as detailed in the previous embodiments, which will not be repeated here.
以上对本申请实施例所提供的一种行为识别方法、存储介质及电子设备进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上,本说明书内容不应理解为对本申请的限制。The behavior recognition method, storage medium and electronic device provided in the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application; the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, those skilled in the art may, based on the ideas of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (20)

  1. 一种行为识别方法,其中,包括:A behavior recognition method, comprising:
    获取参考视频,所述参考视频包括指定的行为内容;Acquire a reference video, wherein the reference video includes specified behavior content;
    获取需要进行行为识别的待识别视频;Obtain the video to be identified for behavior recognition;
    通过基于孪生神经网络的视频判别模型,获取所述待识别视频与所述参考视频的相似度;Obtaining the similarity between the video to be identified and the reference video through a video discrimination model based on a Siamese neural network;
    根据所述相似度识别所述待识别视频是否包括所述指定的行为内容,得到识别结果。It is determined whether the video to be identified includes the specified behavior content according to the similarity to obtain an identification result.
  2. 根据权利要求1所述的行为识别方法,其中,所述获取需要进行行为识别的待识别视频,包括:According to the behavior recognition method of claim 1, wherein the step of obtaining a video to be recognized that requires behavior recognition comprises:
    获取需要进行行为识别的初始视频;Obtain the initial video for behavior recognition;
    根据所述参考视频的长度从所述初始视频中截取至少一个视频片段;Extracting at least one video segment from the initial video according to the length of the reference video;
    将截取的所述至少一个视频片段确定为所述待识别视频。The at least one captured video segment is determined as the video to be identified.
  3. 根据权利要求2所述的行为识别方法,其中,当所述待识别视频包括至少两个视频片段时,所述至少两个视频片段中相邻的视频片段部分重合,所述根据所述相似度识别所述待识别视频是否包括所述指定的行为内容,得到识别结果之后,还包括:According to the behavior recognition method of claim 2, when the video to be recognized includes at least two video segments, adjacent video segments of the at least two video segments partially overlap, and the identifying whether the video to be recognized includes the specified behavior content according to the similarity, after obtaining the recognition result, further includes:
    根据同一视频帧在不同待识别视频中的识别结果,确定所述初始视频中各视频帧的评分;Determining the score of each video frame in the initial video according to the recognition results of the same video frame in different videos to be recognized;
    根据所述初始视频中各视频帧的评分,确定所述初始视频的总评分;Determining a total score of the initial video according to the score of each video frame in the initial video;
    若所述总评分大于预设阈值,则确定所述初始视频包括所述指定的行为内容;If the total score is greater than a preset threshold, determining that the initial video includes the specified behavioral content;
    若所述总评分不大于所述预设阈值,则确定所述初始视频不包括所述指定的行为内容。If the total score is not greater than the preset threshold, it is determined that the initial video does not include the specified behavior content.
  4. 根据权利要求3所述的行为识别方法,其中,所述确定所述初始视频包括所述指定的行为内容之后,还包括:The behavior recognition method according to claim 3, wherein after determining that the initial video includes the specified behavior content, it also includes:
    按照所述指定的行为内容对所述初始视频进行标识。The initial video is marked according to the specified behavior content.
  5. 根据权利要求2所述的行为识别方法,其中,当所述待识别视频包括至少两个视频片段时,所述至少两个视频片段中相邻的视频片段不重合,所述根据所述相似度识别所述待识别视频是否包括所述指定的行为内容,得到识别结果之后,还包括:According to the behavior recognition method of claim 2, when the video to be recognized includes at least two video segments, adjacent video segments of the at least two video segments do not overlap, and the step of identifying whether the video to be recognized includes the specified behavior content according to the similarity, after obtaining the recognition result, further comprises:
    根据各待识别视频的识别结果,确定初始视频的总评分;Determine the total score of the initial video according to the recognition results of each video to be recognized;
    若所述总评分大于预设阈值,则确定所述初始视频包括所述指定的行为内容;If the total score is greater than a preset threshold, determining that the initial video includes the specified behavioral content;
    若所述总评分不大于所述预设阈值,则确定所述初始视频不包括所述指定的行为内容。If the total score is not greater than the preset threshold, it is determined that the initial video does not include the specified behavior content.
  6. 根据权利要求2所述的行为识别方法,其中,所述获取需要进行行为识别的初始视频之前,还包括:The behavior recognition method according to claim 2, wherein, before obtaining the initial video for behavior recognition, the method further comprises:
    获取原始视频;Get the original video;
    确定所述原始视频所属的生物与所述参考视频所属的生物是否相同;Determining whether the organism to which the original video belongs is the same as the organism to which the reference video belongs;
    若是,则将所述原始视频确定为需要进行行为识别的初始视频。If so, the original video is determined as the initial video for which behavior recognition is required.
  7. 根据权利要求2所述的行为识别方法,其中,所述初始视频的长度小于所述参考视频的长度,所述根据所述参考视频的长度从所述初始视频中截取至少一个视频片段,包括:According to the behavior recognition method of claim 2, wherein the length of the initial video is less than the length of the reference video, and the extracting at least one video segment from the initial video according to the length of the reference video comprises:
    按照所述参考视频的长度对所述初始视频进行补帧处理,得到一个视频片段。The initial video is subjected to frame interpolation processing according to the length of the reference video to obtain a video clip.
  8. 根据权利要求1所述的行为识别方法,其中,所述视频判别模型包括第一特征提取分支、第二特征提取分支、特征融合模块以及相似度判别模块,所述第一特征提取分支和所述第二特征提取分支的结构相同,且网络参数共享,所述通过基于孪生神经网络的视频判别模型,获取所述待识别视频与所述参考视频的相似度,包括:According to the behavior recognition method of claim 1, wherein the video discrimination model includes a first feature extraction branch, a second feature extraction branch, a feature fusion module and a similarity discrimination module, the first feature extraction branch and the second feature extraction branch have the same structure and share network parameters, and the similarity between the video to be identified and the reference video is obtained by using the video discrimination model based on the Siamese neural network, including:
    将所述参考视频输入所述第一特征提取分支进行特征提取,得到第一帧序列特征;Inputting the reference video into the first feature extraction branch to perform feature extraction to obtain first frame sequence features;
    将所述待识别视频输入所述第二特征提取分支进行特征提取,得到第二帧序列特征;Inputting the to-be-recognized video into the second feature extraction branch to perform feature extraction to obtain a second frame sequence feature;
    将所述第一帧序列特征和所述第二帧序列特征输入所述特征融合模块进行特征融合处理,得到融合特征;Inputting the first frame sequence feature and the second frame sequence feature into the feature fusion module for feature fusion processing to obtain a fusion feature;
    将所述融合特征输入所述相似度判别模块进行相似度判别,得到所述待识别视频与所述参考视频的相似度。The fusion feature is input into the similarity determination module for similarity determination to obtain the similarity between the video to be identified and the reference video.
  9. 根据权利要求8所述的行为识别方法,其中,所述第一特征提取分支包括基于自注意力机制的第一特征提取层,所述第二特征提取分支包括基于自注意力机制的第二特征提取层。According to the behavior recognition method according to claim 8, wherein the first feature extraction branch includes a first feature extraction layer based on a self-attention mechanism, and the second feature extraction branch includes a second feature extraction layer based on a self-attention mechanism.
  10. 根据权利要求8所述的行为识别方法,其中,所述将所述第一帧序列特征和所述第二帧序列特征输入所述特征融合模块进行特征融合处理,得到融合特征,包括:According to the behavior recognition method of claim 8, wherein the step of inputting the first frame sequence feature and the second frame sequence feature into the feature fusion module for feature fusion processing to obtain the fusion feature comprises:
    将所述第一帧序列特征和所述第二帧序列特征输入所述特征融合模块进行向量相减,得到所述融合特征。The first frame sequence feature and the second frame sequence feature are input into the feature fusion module to perform vector subtraction to obtain the fusion feature.
  11. 根据权利要求8所述的行为识别方法,其中,所述获取参考视频之前,还包括:The behavior recognition method according to claim 8, wherein before obtaining the reference video, the method further comprises:
    获取预训练的初始模型,所述初始模型根据不同行为内容的第一视频样本预训练得到;Obtaining a pre-trained initial model, where the initial model is pre-trained according to first video samples with different behavioral contents;
    获取对应指定行为内容的第二视频样本;Obtain a second video sample corresponding to the specified behavior content;
    根据所述第二视频样本对所述初始模型进行参数调整,得到所述视频判别模型。The parameters of the initial model are adjusted according to the second video sample to obtain the video discrimination model.
  12. 根据权利要求11所述的行为识别方法,其中,所述根据所述第二视频样本对所述初始模型进行参数调整,得到所述视频判别模型,包括:According to the behavior recognition method of claim 11, wherein the step of adjusting parameters of the initial model according to the second video sample to obtain the video discrimination model comprises:
    根据所述第二视频样本对所述特征融合模块和所述相似度判别模块进行参数调整,得到所述视频判别模型。The parameters of the feature fusion module and the similarity discrimination module are adjusted according to the second video sample to obtain the video discrimination model.
  13. 根据权利要求12所述的行为识别方法,其中,所述根据所述第二视频样本对所述特征融合模块和所述相似度判别模块进行参数调整,得到所述视频判别模型之前,还包括:According to the behavior recognition method of claim 12, before adjusting the parameters of the feature fusion module and the similarity discrimination module according to the second video sample to obtain the video discrimination model, the method further comprises:
    冻结所述第一特征提取分支和所述第二特征提取分支的参数。Parameters of the first feature extraction branch and the second feature extraction branch are frozen.
  14. 根据权利要求11所述的行为识别方法,其中,所述指定行为包括指定动物的指定行为,所述获取对应指定行为内容的第二视频样本,包括:According to the behavior recognition method of claim 11, wherein the specified behavior includes a specified behavior of a specified animal, and the step of obtaining a second video sample corresponding to the specified behavior content includes:
    获取标记为所述指定行为的指定动物视频样本;Obtaining a specified animal video sample marked with the specified behavior;
    对所述指定动物视频样本进行数据增广处理,得到所述第二视频样本。Perform data augmentation processing on the designated animal video sample to obtain the second video sample.
  15. 一种行为识别装置,其中,包括:A behavior recognition device, comprising:
    第一视频获取模块,用于获取参考视频,所述参考视频包括指定的行为内容;A first video acquisition module, used to acquire a reference video, wherein the reference video includes specified behavior content;
    第二视频获取模块,用于获取需要进行行为识别的待识别视频;The second video acquisition module is used to acquire a video to be identified that needs to be identified;
    相似度判别模块,用于通过基于孪生神经网络的视频判别模型,获取所述待识别视频与所述参考视频的相似度;A similarity discrimination module, used to obtain the similarity between the video to be identified and the reference video through a video discrimination model based on a Siamese neural network;
    行为识别模块,用于根据所述相似度识别所述待识别视频是否包括所述指定的行为内容,得到识别结果。The behavior recognition module is used to recognize whether the video to be recognized includes the specified behavior content according to the similarity, and obtain a recognition result.
  16. 根据权利要求15所述的行为识别装置,其中,第二视频获取模块还用于:The behavior recognition device according to claim 15, wherein the second video acquisition module is further used for:
    获取需要进行行为识别的初始视频;Obtain the initial video for behavior recognition;
    根据所述参考视频的长度从所述初始视频中截取至少一个视频片段;Extracting at least one video segment from the initial video according to the length of the reference video;
    将截取的所述至少一个视频片段确定为所述待识别视频。The at least one captured video segment is determined as the video to be identified.
  17. 根据权利要求16所述的行为识别装置,其中,当所述待识别视频包括至少两个视频片段时,所述至少两个视频片段中相邻的视频片段部分重合,所述根据所述相似度识别所述待识别视频是否包括所述指定的行为内容,得到识别结果之后,行为识别模块还用于:According to the behavior recognition device of claim 16, when the video to be recognized includes at least two video segments, adjacent video segments of the at least two video segments partially overlap, and after the similarity is used to identify whether the video to be recognized includes the specified behavior content, and the recognition result is obtained, the behavior recognition module is further used to:
    根据同一视频帧在不同待识别视频中的识别结果,确定所述初始视频中各视频帧的评分;Determining the score of each video frame in the initial video according to the recognition results of the same video frame in different videos to be recognized;
    根据所述初始视频中各视频帧的评分,确定所述初始视频的总评分;Determining a total score of the initial video according to the score of each video frame in the initial video;
    若所述总评分大于预设阈值,则确定所述初始视频包括所述指定的行为内容;If the total score is greater than a preset threshold, determining that the initial video includes the specified behavioral content;
    若所述总评分不大于所述预设阈值,则确定所述初始视频不包括所述指定的行为内容。If the total score is not greater than the preset threshold, it is determined that the initial video does not include the specified behavior content.
  18. 根据权利要求16所述的行为识别装置,其中,所述获取需要进行行为识别的初始视频之前,第二视频获取模块还用于:According to the behavior recognition device of claim 16, wherein, before acquiring the initial video for behavior recognition, the second video acquisition module is further used to:
    获取原始视频;Get the original video;
    确定所述原始视频所属的生物与所述参考视频所属的生物是否相同;Determining whether the organism to which the original video belongs is the same as the organism to which the reference video belongs;
    若是,则将所述原始视频确定为需要进行行为识别的初始视频。If so, the original video is determined as the initial video for which behavior recognition is required.
  19. 一种计算机可读存储介质,其上存储有计算机程序,其中,当所述计算机程序在计算机上运行时,使得所述计算机执行如权利要求1至18任一项所述的行为识别方法。A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is run on a computer, the computer is caused to execute the behavior recognition method as claimed in any one of claims 1 to 18.
  20. 一种电子设备,包括处理器和存储器,所述存储器存储有计算机程序,其中,所述处理器通过调用所述计算机程序,用于执行如权利要求1至18任一项所述的行为识别方法。An electronic device comprises a processor and a memory, wherein the memory stores a computer program, wherein the processor is used to execute the behavior recognition method as claimed in any one of claims 1 to 18 by calling the computer program.
PCT/CN2022/133025 2022-11-18 2022-11-18 Behavior recognition method, storage medium and electronic device WO2024103417A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/133025 WO2024103417A1 (en) 2022-11-18 2022-11-18 Behavior recognition method, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/133025 WO2024103417A1 (en) 2022-11-18 2022-11-18 Behavior recognition method, storage medium and electronic device

Publications (1)

Publication Number Publication Date
WO2024103417A1 true WO2024103417A1 (en) 2024-05-23

Family

ID=91083541

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/133025 WO2024103417A1 (en) 2022-11-18 2022-11-18 Behavior recognition method, storage medium and electronic device

Country Status (1)

Country Link
WO (1) WO2024103417A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388888A (en) * 2018-03-23 2018-08-10 腾讯科技(深圳)有限公司 A kind of vehicle identification method, device and storage medium
CN111028222A (en) * 2019-12-11 2020-04-17 广州视源电子科技股份有限公司 Video detection method and device, computer storage medium and related equipment
CN111523510A (en) * 2020-05-08 2020-08-11 国家邮政局邮政业安全中心 Behavior recognition method, behavior recognition device, behavior recognition system, electronic equipment and storage medium
WO2020196985A1 (en) * 2019-03-27 2020-10-01 연세대학교 산학협력단 Apparatus and method for video action recognition and action section detection
CN113177450A (en) * 2021-04-20 2021-07-27 北京有竹居网络技术有限公司 Behavior recognition method and device, electronic equipment and storage medium
CN114202719A (en) * 2021-11-12 2022-03-18 中原动力智能机器人有限公司 Video sample labeling method and device, computer equipment and storage medium
CN115035463A (en) * 2022-08-09 2022-09-09 阿里巴巴(中国)有限公司 Behavior recognition method, device, equipment and storage medium


Similar Documents

Publication Publication Date Title
He et al. Automatic depression recognition using CNN with attention mechanism from videos
Dalvi et al. A survey of ai-based facial emotion recognition: Features, ml & dl techniques, age-wise datasets and future directions
Zhang et al. Multimodal learning for facial expression recognition
Wang et al. Unsupervised learning of visual representations using videos
Sariyanidi et al. Automatic analysis of facial affect: A survey of registration, representation, and recognition
Chen et al. Classification of drinking and drinker-playing in pigs by a video-based deep learning method
Kollias et al. Training deep neural networks with different datasets in-the-wild: The emotion recognition paradigm
Chen et al. Automatic social signal analysis: Facial expression recognition using difference convolution neural network
Salunke et al. A new approach for automatic face emotion recognition and classification based on deep networks
Mici et al. A self-organizing neural network architecture for learning human-object interactions
KR102128158B1 (en) Emotion recognition apparatus and method based on spatiotemporal attention
CN113673244B (en) Medical text processing method, medical text processing device, computer equipment and storage medium
Wang et al. Cross-agent action recognition
CN114511912A (en) Cross-library micro-expression recognition method and device based on double-current convolutional neural network
Ousmane et al. Automatic recognition system of emotions expressed through the face using machine learning: Application to police interrogation simulation
Qiao et al. Automated individual cattle identification using video data: a unified deep learning architecture approach
Samadiani et al. A novel video emotion recognition system in the wild using a random forest classifier
Sara et al. A deep learning facial expression recognition based scoring system for restaurants
CN115713807A (en) Behavior recognition method, storage medium, and electronic device
Jagadeesh et al. Dynamic FERNet: Deep learning with optimal feature selection for face expression recognition in video
Wang et al. Deep learning (DL)-enabled system for emotional big data
WO2024103417A1 (en) Behavior recognition method, storage medium and electronic device
Liu Improved convolutional neural networks for course teaching quality assessment
Kumbhar et al. Gender and Age Detection using Deep Learning
Laptev Modeling and visual recognition of human actions and interactions

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22965596

Country of ref document: EP

Kind code of ref document: A1