WO2024103417A1 - Behavior recognition method, storage medium and electronic device - Google Patents


Info

Publication number
WO2024103417A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
behavior
initial
similarity
behavior recognition
Prior art date
Application number
PCT/CN2022/133025
Other languages
French (fr)
Chinese (zh)
Inventor
Zhu Fuqiang (朱富强)
Zhan Yang (詹阳)
Zhang Huikang (张慧康)
Zhang Xihao (张曦昊)
Wang Zhen (王振)
Original Assignee
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
Priority to PCT/CN2022/133025
Publication of WO2024103417A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Definitions

  • the present application relates to the field of artificial intelligence technology, and in particular to a behavior recognition method, a storage medium, and an electronic device.
  • Animal behavior reflects, from a macroscopic perspective, information such as an animal's higher central nervous system function, learning and memory ability, psychological state, and motor coordination. Studying animal behavior can assess an animal's adaptation to its environment or its response to pharmacological interventions, and has wide application in toxicology, pharmacology, sports injury, and rehabilitation research.
  • the embodiments of the present application provide a behavior recognition method, a storage medium, and an electronic device, which can improve the accuracy of identifying animal behavior.
  • an embodiment of the present application provides a behavior recognition method, the method comprising:
  • the similarity between the video to be identified and the reference video is obtained through a video discrimination model based on a twin neural network;
  • an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored.
  • When the computer program runs on a computer, the computer executes a behavior recognition method as provided in any embodiment of the present application.
  • an embodiment of the present application further provides an electronic device, including a processor and a memory, wherein the memory has a computer program, and the processor executes a behavior recognition method provided in any embodiment of the present application by calling the computer program.
  • the video discrimination model based on the twin neural network is used to perform similarity recognition between the reference video and the video to be identified to determine whether the video to be identified includes the specified behavior content. In this way, the behavior of the video to be identified can be accurately identified through the reference video, and the video to be identified can be quickly classified.
  • FIG1 is a schematic diagram of an application scenario of a behavior recognition method provided in an embodiment of the present application.
  • FIG2 is a flow chart of a behavior recognition method provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of using a sliding window to capture video clips in the behavior recognition method provided in an embodiment of the present application.
  • FIG4 is a schematic diagram of the structure of a video discrimination model in the behavior recognition method provided in an embodiment of the present application.
  • FIG. 5 is a detailed flowchart of a video identification method provided in an embodiment of the present application.
  • FIG6 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • Artificial Intelligence is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology in computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines so that machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies.
  • the basic technologies of artificial intelligence generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and other technologies.
  • Artificial intelligence software technology mainly includes machine learning (ML), of which deep learning (DL) is a newer research direction, introduced to bring machine learning closer to its original goal: artificial intelligence.
  • Deep learning is mainly applied in computer vision, natural language processing, and related fields.
  • Deep learning is a type of machine learning, and machine learning is a principal route to achieving artificial intelligence.
  • the concept of deep learning originated from the study of artificial neural networks.
  • Multilayer perceptrons with multiple hidden layers are a type of deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features to discover distributed feature representations of data.
  • the motivation for studying deep learning is to establish a neural network that simulates the human brain for analysis and learning. It imitates the mechanism of the human brain to interpret data, such as images, sounds, and text.
  • Animal behavior can be divided into foraging behavior, food storage behavior, attack behavior, defense behavior, reproductive behavior, rhythmic behavior, communication behavior, etc. according to its different manifestations.
  • the neural network model can be trained by machine learning methods, and then the animal behavior can be classified by the trained neural network model.
  • the method of training the neural network model includes supervised machine learning and unsupervised machine learning. Supervised machine learning trains the neural network model through sample data pre-labeled with behavioral labels, so that the neural network model learns the mapping relationship between sample data and its corresponding behavioral labels. Unsupervised machine learning clusters similar sample data through clustering algorithms, thereby realizing the classification of different sample data.
  • related technologies also include key point-based animal recognition methods and video-based classification methods.
  • the key point-based animal recognition method tracks the key points of the animal's body and then classifies the animal's behavior based on the position information of the key points (such as limb joints, etc.).
  • However, this method relies on key-point tracking: when key points are occluded or tracking is lost, the classification of the animal's behavior becomes inaccurate.
  • Key-point tracking also discards background information, so behavior cues related to the background are missed, which likewise makes the behavior classification inaccurate.
  • Although the video-based classification method does not require key-point tracking, it must classify the actions in the video frame by frame based on pixel values and then identify the animal behavior from these per-frame classification results. This entails a large amount of computation, making it difficult to identify animal behavior quickly.
  • the embodiments of the present application provide a behavior recognition method, a storage medium, and an electronic device to quickly and accurately recognize animal behavior. It can be understood that the behavior recognition method provided by the present application can recognize the behavior of various animals and various people. In the following embodiments, the method provided by the embodiments of the present application is described in detail by taking animal behavior as an example.
  • Figure 1 is a schematic diagram of the application scenario of the behavior recognition method provided by the embodiment of the present application.
  • the execution subject of the behavior recognition method is an electronic device.
  • The user selects a video with the specified content as the reference video; then, for each video to be recognized, the two are input into the video discrimination model for similarity discrimination to determine the behavior content included in the video to be recognized. In this way, fast and accurate behavior recognition is achieved.
  • Figure 2 is a schematic diagram of the flow of the behavior recognition method provided in the embodiment of the present application.
  • the specific flow of the behavior recognition method provided in the embodiment of the present application can be as follows:
  • the specified behavior content includes but is not limited to various animal behaviors or various human behaviors. Taking animals as an example, animals can be cats, dogs, monkeys, mice, etc. Common behaviors of animals include but are not limited to: standing up, moving the head, drinking water, hanging, combing hair, walking, resting, eating, licking limbs, etc.
  • obtaining a reference video including but not limited to: obtaining a video including specified behavior content by shooting, obtaining a video including specified behavior content by video editing, etc.
  • a user records a video of a dog eating
  • a user records a video of a cat combing its hair
  • a user records a video of a monkey hanging upside down, etc. as a reference video.
  • a dog eating video is selected from videos of a dog eating, a dog walking, a dog licking its limbs, etc. as a reference video.
  • images related to dog eating are selected from multiple groups of dog images and synthesized to obtain a video of a dog eating as a reference video.
  • a video of a dog eating part is captured from a long recorded video as a reference video.
  • the method of obtaining the video to be identified may include: obtaining the video to be identified by shooting, obtaining the video to be identified by video editing, etc.
  • The video to be identified, that is, the video requiring behavior recognition, has not yet been labeled.
  • In some embodiments, the video to be identified and the reference video concern the same animal or person, which improves the accuracy of behavior recognition for the video to be identified.
  • To this end, the organism contained in the video to be identified can also be identified first.
  • If that organism matches the organism of the reference video, the video is determined as a video to be identified that requires behavior recognition.
  • the similarity between the video to be identified and the reference video is obtained through a video discrimination model based on a twin neural network.
  • a video discrimination model based on a twin neural network includes a dual-branch network architecture with the same parameters.
  • The video to be identified and the reference video are respectively used as inputs to the dual-branch network architecture, which outputs the probability values that the two are similar and dissimilar. The similarity between the video to be identified and the reference video is then evaluated from these probability values.
  • When the probability value of similarity between the video to be identified and the reference video is large, the similarity between the two is high; when the probability value of dissimilarity is large, the similarity between the two is low.
  • Identify whether the video to be identified includes specified behavior content according to the similarity, and obtain an identification result.
  • the similarity between the video to be identified and the reference video is high, it can be determined that the video to be identified and the reference video are the same or similar videos, and both can indicate the specified behavior content. If the similarity between the video to be identified and the reference video is low, it can be determined that the video to be identified and the reference video are not the same or similar videos, and the behavior content indicated by the two is different.
  • the present application is not limited by the execution order of the various steps described. If no conflict occurs, some steps can be performed in other orders or simultaneously.
  • The behavior recognition method in the embodiment of the present application first selects a reference video required by the user, the reference video including the specified behavior content, and then uses the video discrimination model to determine the similarity between the reference video and the video to be identified, so as to perform behavior recognition on the video to be identified.
  • On the one hand, this flexibly identifies the behavior in the video with high accuracy; on the other hand, it avoids tracking key points or classifying the pixel values of individual video frames, reducing the amount of data processed and improving the efficiency of behavior recognition.
  • Obtaining the video to be identified that requires behavior recognition includes:
  • At least one captured video segment is determined as a video to be identified.
  • the length of the initial video is not limited in this application, and an initial video of any length can be selected according to actual needs.
  • the organism to which the initial video belongs is the same as the organism to which the reference video belongs.
  • The length of each captured video segment is the same as the length of the reference video.
  • the lengths of the reference video and the initial video can be determined according to the playback duration or the number of video frames.
  • The number of captured video segments can be one, two, or more, depending on the length of the initial video and the capture method.
  • the length of the initial video is less than the length of the reference video
  • one video segment is captured, and the initial video is interpolated to obtain a video segment having the same length as the reference video as the video to be identified.
  • When the length of the initial video is equal to the length of the reference video, the initial video may be directly used as a video segment.
  • the length of the initial video is greater than the length of the reference video
  • At least two video segments can be captured, where the number of segments is determined by the ratio of the length of the initial video to the length of the reference video: when the ratio is greater than 1 and less than 2, two video segments can be obtained with the aid of frame supplementation; when the ratio is greater than 2, more video segments can be obtained.
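  • The patent does not specify how the interpolation or frame supplementation is performed. As an illustrative sketch only (the function `match_length` and the nearest-neighbour index scheme below are assumptions, not the patent's method), resampling an initial video to the reference length could look like:

```python
import numpy as np

def match_length(frames: np.ndarray, target_len: int) -> np.ndarray:
    """Resample a (T, H, W, C) frame array to target_len frames by
    nearest-neighbour index interpolation: frames are duplicated when
    the clip is shorter than the reference and dropped when longer."""
    idx = np.round(np.linspace(0, len(frames) - 1, target_len)).astype(int)
    return frames[idx]

# Toy example: a 6-frame "video" of 2x2 single-channel frames,
# stretched to a 10-frame reference length.
video = np.arange(6 * 2 * 2).reshape(6, 2, 2, 1)
padded = match_length(video, 10)
```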
  • adjacent video segments in at least two video segments may partially overlap or may not overlap.
  • When adjacent video segments do not overlap, they are connected end to end.
  • When adjacent video segments overlap, they share overlapping video frames, and the number of overlapping frames may be determined according to actual needs.
  • a sliding window can be used to capture at least two video clips as videos to be identified.
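  • The sliding-window capture described above can be sketched as follows (a minimal illustration; the function name and window sizes are assumptions). A stride smaller than the window length yields overlapping clips, while a stride equal to the window length yields clips connected end to end:

```python
import numpy as np

def sliding_clips(video: np.ndarray, win: int, stride: int) -> list:
    """Cut a (T, ...) frame array into clips of length `win`, advancing
    `stride` frames at a time. stride < win -> overlapping clips;
    stride == win -> non-overlapping clips connected end to end."""
    return [video[s:s + win] for s in range(0, len(video) - win + 1, stride)]

video = np.arange(10)                                # stand-in for 10 frames
overlapping = sliding_clips(video, win=4, stride=2)  # adjacent clips share 2 frames
disjoint = sliding_clips(video, win=5, stride=5)     # adjacent clips meet end to end
```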
  • identifying whether the video to be identified includes the specified behavior content according to the similarity, and obtaining the identification result, further comprising:
  • each video frame in the video to be recognized is scored based on the recognition result.
  • the similarity is divided into two cases: the reference video is similar to the video to be recognized, or the reference video is not similar to the video to be recognized.
  • When the two are similar, the recognition result can be set to 1; when the two are not similar, the recognition result can be set to 0.
  • the recognition result is 1 or 0.
  • If a video frame appears in more than one video to be recognized, the average of its scores is taken as its final score. For example, if the score of the second video frame in the first video to be recognized is 0 and its score in the second video to be recognized is 1, then 0.5 is taken as the final score of the second video frame.
  • The total score of the initial video is also determined from the scores of the video frames in the initial video. Specifically, the total score can be determined by taking the mean, median, or mode of the frame scores. Taking the mean as an example: if the initial video has 10 frames with scores 0, 0.5, 1, 1, 0.5, 0, 1, 1, 1, 1, the total score of the initial video is 0.7.
  • a preset threshold is also set to compare with the total score, so that when the total score is greater than the preset threshold, it is determined that the initial video includes the specified behavioral content, that is, the behavior indicated by the initial video is the same as that indicated by the reference video.
  • the preset threshold is set to 0.5, and if the total score of the initial video is 0.7, it is determined that the initial video includes the specified behavioral content.
  • This embodiment obtains the videos to be identified by sliding-window capture from the initial video, determines the total score of the initial video based on the scores of the video frames across the different videos to be identified, and compares the total score with a preset threshold to evaluate the recognition result of the initial video, thereby making the recognition result more accurate.
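  • The scoring scheme above can be sketched as follows (the function name and the span/result representation are assumptions): each clip's 0/1 recognition result is spread over the frames it covers, frames covered by several clips take the average, and the total score is the mean frame score compared against the preset threshold:

```python
import numpy as np

def score_video(num_frames, clip_spans, clip_results, threshold=0.5):
    """clip_spans: (start, end) frame ranges of each captured clip;
    clip_results: the model's 0/1 result per clip. A frame's score is
    the average result of every clip covering it; the video's total
    score is the mean frame score, compared against the threshold."""
    sums = np.zeros(num_frames)
    counts = np.zeros(num_frames)
    for (s, e), r in zip(clip_spans, clip_results):
        sums[s:e] += r
        counts[s:e] += 1
    frame_scores = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
    total = frame_scores.mean()
    return total, total > threshold

# Two overlapping clips over a 10-frame video: frames 4-5 sit in both
# clips, so with clip results 1 and 0 those frames score 0.5.
total, included = score_video(10, [(0, 6), (4, 10)], [1, 0])
```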
  • When adjacent video clips do not overlap, after identifying whether the video to be identified includes the specified behavior content based on the similarity and obtaining the identification result, the method further includes the following.
  • the similarity is divided into two cases, namely similarity and dissimilarity.
  • When the two are not similar, the recognition result of the video to be recognized is set to 0.
  • When the two are similar, the recognition result of the video to be recognized is set to 1.
  • The total score of the initial video is also determined from these scores, where the method of determining the total score includes but is not limited to taking the mean, median, or mode of the scores of the videos to be identified. For example, the total score of the initial video can be obtained by averaging the scores of the videos to be identified.
  • a preset threshold is set to compare with the total score to determine the recognition result of the initial video. Please refer to the above content for details, which will not be repeated here.
  • the method further includes:
  • the initial video is labeled according to the specified behavioral content. After the recognition result of the initial video is determined, the initial video is also labeled. Specifically, if the initial video includes the specified behavioral content, the initial video is assigned a corresponding behavioral label. If the initial video does not include the specified behavioral content, the initial video is not assigned a corresponding behavioral label. For example, the reference video includes eating behavioral content. If the initial video includes the specified behavioral content, the initial video is assigned an eating behavioral label. Otherwise, the initial video is not labeled, or is labeled with a non-eating behavioral label.
  • the video discrimination model includes a first feature extraction branch, a second feature extraction branch, a feature fusion module, and a similarity discrimination module.
  • the first feature extraction branch and the second feature extraction branch have the same structure and share network parameters.
  • the similarity between the video to be identified and the reference video is obtained through the video discrimination model based on the twin neural network, including:
  • the fused features are input into the similarity judgment module for similarity judgment to obtain the similarity between the video to be identified and the reference video.
  • FIG 4 is a schematic diagram of the structure of the video discrimination model in the behavior recognition method provided in the embodiment of the present application.
  • the video discrimination model includes a first feature extraction branch and a second feature extraction branch with the same network parameters and structure, and the first feature extraction branch and the second feature extraction branch are also connected to the feature fusion module, and the feature fusion module is connected to the similarity discrimination module.
  • the reference video is input into the first feature extraction branch for feature extraction to obtain the first frame sequence feature
  • the video to be identified is input into the second feature extraction branch for feature extraction to obtain the second frame sequence feature.
  • the reference video can also be input into the second feature extraction branch for feature extraction, and the video to be identified can be input into the first feature extraction branch for feature extraction. It is not limited here which feature extraction branch the two are input into.
  • the reference video and the video to be identified are also converted into a frame sequence as input.
  • the feature fusion module obtains the fused feature by performing vector subtraction on the first frame sequence feature and the second frame sequence feature.
  • the similarity determination module is used to determine the probability value of similarity or dissimilarity of fusion features.
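  • The twin structure described above (shared-weight branches, vector-subtraction fusion, probability output) can be illustrated with a deliberately tiny numerical sketch. The layer sizes, the single projection standing in for each feature extraction branch, and the softmax head are all assumptions for illustration, not the patent's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)
W_feat = rng.normal(size=(32, 8))   # shared branch weights: flattened 4x8 "video" -> 8-d feature
W_head = rng.normal(size=(8, 2))    # head: fused feature -> [dissimilar, similar] logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def extract(video):
    # Both branches call this one function, i.e. they share parameters
    # (the "twin" property of the network).
    return np.tanh(video.reshape(-1) @ W_feat)

def discriminate(reference, candidate):
    fused = extract(reference) - extract(candidate)  # vector-subtraction fusion
    return softmax(fused @ W_head)                   # probabilities of (dissimilar, similar)

ref = rng.normal(size=(4, 8))       # 4 frames of 8 "pixels" each
p_same = discriminate(ref, ref)     # identical inputs -> zero fused vector
```

One sanity check the subtraction fusion admits: with identical inputs the fused vector is all zeros, so the head necessarily outputs equal probabilities.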
  • This embodiment extracts the temporal features of the entire segment of the reference video and the video to be identified through the first feature extraction branch and the second feature extraction branch, and then performs similarity determination based on the temporal features, taking into account the temporal and spatial dependencies between behavioral features, so that the similarity determination result is more accurate.
  • The method provided by this embodiment can quickly determine the similarity between the video to be identified and the reference video, reducing the amount of model computation and improving the model's discrimination speed.
  • the first feature extraction branch includes a first feature extraction layer based on a self-attention mechanism
  • the second feature extraction branch includes a second feature extraction layer based on a self-attention mechanism
  • the first frame sequence features of the reference video are extracted through a first feature extraction layer based on the self-attention mechanism
  • the second frame sequence features of the video to be identified are extracted through a second feature extraction layer based on the self-attention mechanism.
  • the mutual influence between the video frames can be taken into account, and the accuracy of the judgment can be improved when the similarity is judged based on the first frame sequence features and the second frame sequence features.
  • Before obtaining the reference video, the method further includes:
  • the parameters of the initial model are adjusted according to the second video sample to obtain a video discrimination model.
  • the first video samples include, but are not limited to, ImageNet dataset (a large-scale image recognition database in computer vision research), Kinetics-700 dataset (a human behavior dataset), etc.
  • the model parameters of the initial model are obtained by pre-training the initial model based on the twin neural network through the first video samples.
  • the second video samples include video samples of designated behavior content.
  • the second video samples may include video samples of dogs eating, video samples of cats eating, video samples of chickens eating, etc.
  • the second video samples also include video samples that do not include designated behavior content.
  • the second video samples may also include video samples of pandas hanging, video samples of dogs rolling, video samples of cats licking limbs, etc.
  • Video samples with the same or different behavioral content in the second video samples are also combined in pairs and respectively input into the first and second feature extraction branches of the initial model for training, until the model converges or a specified number of training iterations is reached, yielding the video discrimination model.
  • In this way, the scale of training samples can be expanded, and the initial model can be trained with a small amount of labeled second video samples, thereby improving the speed of model training.
  • adjusting parameters of the initial model according to the second video sample to obtain a video discrimination model includes:
  • the parameters of the feature fusion module and the similarity discrimination module are adjusted according to the second video sample to obtain a video discrimination model.
  • the video discrimination model is obtained by adjusting the parameters of the feature fusion module and the similarity discrimination module.
  • Specifically, the network parameters of the first and second feature extraction branches can be frozen, and the second video samples input into the initial model for training, so that only the network parameters of the feature fusion module and the similarity discrimination module are optimized during training. This minimizes the loss of those two modules and achieves fine-tuning of their network parameters.
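  • Freezing the branches while updating only the head can be shown with a toy logistic head. Everything here (the sizes, the logistic loss, the learning rate) is an illustrative assumption rather than the patent's training setup; the point is only that the branch weights stay untouched while the head fits the labelled pair:

```python
import numpy as np

rng = np.random.default_rng(1)
W_branch = rng.normal(size=(6, 3))   # pre-trained branch weights: frozen
w_head = np.zeros(3)                 # similarity head: the only part updated
W_before = W_branch.copy()           # snapshot to confirm freezing

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(a, b):
    fused = (a @ W_branch) - (b @ W_branch)   # frozen branches + subtraction fusion
    return sigmoid(fused @ w_head), fused

# One labelled pair; label 1.0 means "same behavior content".
a, b, y = rng.normal(size=6), rng.normal(size=6), 1.0
for _ in range(200):
    p, fused = forward(a, b)
    w_head -= 0.5 * (p - y) * fused  # logistic-loss gradient w.r.t. the head only

p_final, _ = forward(a, b)           # should now lean toward "similar"
```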
  • model training is implemented based on a small amount of labeled second video samples through pre-training and model fine-tuning, which reduces the scale requirement for training data and improves the efficiency of model training.
  • the specified behavior includes a specified behavior of a specified animal
  • obtaining a second video sample corresponding to the specified behavior content includes:
  • a video sample of a specified animal with a specified behavior can be obtained as the second video sample, for example, various videos of various dogs eating can be obtained as the second video sample.
  • various dogs can be divided according to their breeds, and different dogs have different eating actions.
  • the trained video discrimination model can be applied to dog video discrimination with higher accuracy.
  • A video of that animal's behavior can also be selected as the reference video so that the similarity between the reference video and the video to be identified is determined accurately; when the animal in the video to be identified is not the animal to which the reference video belongs, it can be directly determined that the video to be identified is not similar to the reference video.
  • data augmentation is also performed on the designated animal video samples marked with designated behaviors to expand the sample size and improve the generalization ability of the model.
  • the data augmentation methods include but are not limited to: video cropping, video frame extraction, image translation, image rotation, image scaling, etc.
  • data augmentation processing is performed on the designated animal video sample to obtain a second video sample with a larger data volume to fine-tune the initial model, so that the generalization ability of the model is better.
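  • Three of the listed augmentations can be sketched directly on a frame array (the function and the specific crop/shift amounts are illustrative assumptions):

```python
import numpy as np

def augment(frames: np.ndarray) -> list:
    """Apply video cropping, video frame extraction, and image
    translation to a (T, H, W) grayscale frame stack."""
    t, h, w = frames.shape
    cropped = frames[:, h // 4: 3 * h // 4, w // 4: 3 * w // 4]  # centre crop
    subsampled = frames[::2]                      # keep every other frame
    shifted = np.roll(frames, shift=1, axis=2)    # translate 1 px right (wraps)
    return [cropped, subsampled, shifted]

video = np.arange(8 * 4 * 4).reshape(8, 4, 4)     # toy 8-frame 4x4 video
cropped, subsampled, shifted = augment(video)
```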
  • FIG5 is a detailed flow chart of the video identification method provided in the embodiment of the present application.
  • the contents indicated in the diagram are as follows:
  • Stage 1: Build a video discrimination model based on a twin neural network.
  • The video discrimination model includes a first feature extraction branch, a second feature extraction branch, a feature fusion module and a similarity discrimination module.
  • Stage 2: Train the video discrimination model.
  • Stage 3: Apply the video discrimination model to predict similarity.
  • a video discrimination model based on a twin neural network is constructed to perform behavior recognition, which is universal for various behaviors and has high recognition accuracy.
  • the following table provides the recognition results of eight behaviors on an animal behavior dataset:
  • The behavior recognition method proposed in the embodiments of the present application constructs a video discrimination model based on a twin neural network. Pre-training the video discrimination model reduces the sample data required for training, and combining the sample data in pairs as input to the dual-branch network greatly expands the effective scale of the sample data, improving the generalization ability of the model. Determining the similarity between the reference video and the video to be identified through the video discrimination model performs the similarity judgment in the time dimension, reducing the computation consumed and quickly yielding the judgment result for the video to be identified. In addition, behavior recognition of initial videos of varying lengths is supported, improving the flexibility of behavior recognition.
  • the present application also provides a behavior recognition device, which is applied to an electronic device and includes:
  • a first video acquisition module is used to acquire a reference video, where the reference video includes specified behavior content
  • the second video acquisition module is used to acquire a video to be identified that requires behavior recognition;
  • a similarity discrimination module is used to obtain the similarity between the video to be identified and the reference video through a video discrimination model based on a twin neural network;
  • the behavior recognition module is used to identify whether the video to be recognized includes the specified behavior content according to the similarity and obtain the recognition result.
  • the second video acquisition module is further used to:
  • At least one captured video segment is determined as a video to be identified.
  • the behavior identification module is further used to:
  • the behavior recognition module is further configured to:
  • the initial video is labeled according to the specified behavioral content.
  • the behavior identification module is further used to:
  • the second video acquisition module is further used to:
  • the original video is determined as the initial video for behavior recognition.
  • the length of the initial video is shorter than the length of the reference video
  • the second video acquisition module is further configured to:
  • the initial video is supplemented with frames according to the length of the reference video to obtain a video clip.
  • the video discrimination model includes a first feature extraction branch, a second feature extraction branch, a feature fusion module, and a similarity discrimination module.
  • the first feature extraction branch and the second feature extraction branch have the same structure and share network parameters.
  • the similarity discrimination module is further used to:
  • the fused features are input into the similarity judgment module for similarity judgment to obtain the similarity between the video to be identified and the reference video.
  • the first feature extraction branch includes a first feature extraction layer based on a self-attention mechanism
  • the second feature extraction branch includes a second feature extraction layer based on a self-attention mechanism
  • the similarity determination module is further used to:
  • the parameters of the initial model are adjusted according to the second video sample to obtain a video discrimination model.
  • the similarity determination module is further used to:
  • the parameters of the feature fusion module and the similarity discrimination module are adjusted according to the second video sample to obtain a video discrimination model.
  • before adjusting parameters of the feature fusion module and the similarity discrimination module according to the second video sample to obtain the video discrimination model, the similarity discrimination module is further used to:
  • Parameters of the first feature extraction branch and the second feature extraction branch are frozen.
  • the similarity determination module is further used to:
  • behavior recognition device provided in the embodiment of the present application and the behavior recognition method in the above embodiment are of the same concept, and any method provided in the behavior recognition method embodiment can be implemented through the behavior recognition device, and the same technical effect can be achieved.
  • the specific implementation process is detailed in the behavior recognition method embodiment, and will not be repeated here.
  • the embodiment of the present application also provides an electronic device, which may be a smart phone, a folding screen mobile phone, a tablet computer, a PDA, a desktop computer and the like.
  • Figure 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • the electronic device 300 includes a processor 301 having one or more processing cores, a memory 302 having one or more computer-readable storage media, and a computer program stored in the memory 302 and executable on the processor.
  • the processor 301 is electrically connected to the memory 302.
  • the electronic device structure shown in the figure does not constitute a limitation on the electronic device, and may include more or fewer components than shown, or a combination of certain components, or different component arrangements.
  • the processor 301 is the control center of the electronic device 300. It uses various interfaces and lines to connect various parts of the entire electronic device 300, executes various functions of the electronic device 300 and processes data by running or loading software programs and/or modules stored in the memory 302, and calling data stored in the memory 302, thereby monitoring the electronic device 300 as a whole.
  • the processor 301 in the electronic device 300 will load instructions corresponding to the processes of one or more application programs into the memory 302 according to the following steps, and the processor 301 will run the application programs stored in the memory 302 to implement various functions:
  • the similarity between the video to be identified and the reference video is obtained through a video discrimination model based on a twin neural network;
  • By constructing a video discrimination model based on a twin neural network, pre-training the video discrimination model to reduce the sample data required for training, and combining the sample data in pairs as the input of the dual-branch network, the electronic device can greatly expand the scale of the sample data and improve the generalization ability of the model. By determining the similarity between the reference video and the video to be identified through the video discrimination model and performing similarity discrimination on the two in the time dimension, the amount of computation consumed by the similarity discrimination can be reduced, so that the discrimination result for the video to be identified is obtained quickly. In addition, behavior recognition of initial videos of varying lengths can be realized, improving the flexibility of behavior recognition.
  • the embodiment of the present application provides a computer-readable storage medium.
  • a person skilled in the art can understand that all or part of the steps in the above-mentioned embodiment method can be completed by instructing related hardware through a program.
  • the program can be stored in a computer-readable storage medium. When the program is executed, it includes the following steps:
  • the similarity between the video to be identified and the reference video is obtained through a video discrimination model based on a twin neural network;
  • the above-mentioned storage medium may be ROM/RAM, a magnetic disk, an optical disk, etc. Since the computer program stored in the storage medium can execute the steps in any behavior recognition method provided in the embodiments of the present application, the beneficial effects that can be achieved by any behavior recognition method provided in the embodiments of the present application can be achieved, as detailed in the previous embodiments, which will not be repeated here.
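The pairwise combination of sample data described above (combining labeled clips in pairs as inputs to the dual-branch network) can be sketched as follows; the function and data names here are illustrative assumptions, not taken from the patent.

```python
from itertools import combinations

def make_training_pairs(samples):
    """Combine labeled video samples in pairs for a dual-branch network.

    `samples` is a list of (clip, behavior_label) tuples. Each unordered
    pair of clips becomes one training example whose target is 1 when the
    two clips show the same behavior and 0 otherwise.
    """
    pairs = []
    for (clip_a, label_a), (clip_b, label_b) in combinations(samples, 2):
        target = 1 if label_a == label_b else 0
        pairs.append((clip_a, clip_b, target))
    return pairs

samples = [("clip1", "eating"), ("clip2", "walking"),
           ("clip3", "eating"), ("clip4", "resting")]
pairs = make_training_pairs(samples)
print(len(pairs))  # 6 pairs from 4 samples
```

Pairing n samples yields n(n-1)/2 training pairs, which is how pairwise combination expands the effective scale of the sample data.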


Abstract

Disclosed in the present application are a behavior recognition method, a storage medium and an electronic device. The method comprises: acquiring a reference video, wherein the reference video comprises specified behavior content; acquiring a video to be subjected to recognition that requires behavior recognition; acquiring the degree of similarity between the video to be subjected to recognition and the reference video by means of a video discrimination model based on a siamese neural network; and according to the degree of similarity, recognizing whether the video to be subjected to recognition comprises specified behavior content so as to obtain a recognition result.

Description

Behavior Recognition Method, Storage Medium and Electronic Device

Technical Field

The present application relates to the field of artificial intelligence technology, and in particular to a behavior recognition method, a storage medium, and an electronic device.

Background Art
Animal behavior macroscopically reflects information such as an animal's higher central nervous function, learning and memory ability, psychological state, and motor coordination. Studying animal behavior makes it possible to assess an animal's adaptation to the environment or to pharmacological interventions, and has wide applications in toxicology, pharmacology, sports injury and recovery, and other fields.

With the rapid development of artificial intelligence technology, supervised learning methods based on artificial intelligence can classify animal behaviors. However, such methods can only assign an animal behavior to one of a set of fixed categories.
Technical Problem

Animal behaviors of unknown categories are difficult to classify accurately.

Technical Solution
The embodiments of the present application provide a behavior recognition method, a storage medium, and an electronic device, which can improve the accuracy of recognizing animal behavior.

In a first aspect, an embodiment of the present application provides a behavior recognition method, the method comprising:

obtaining a reference video, where the reference video includes specified behavior content;

obtaining a video to be identified that requires behavior recognition;

obtaining the similarity between the video to be identified and the reference video through a video discrimination model based on a twin neural network;

identifying, according to the similarity, whether the video to be identified includes the specified behavior content, to obtain a recognition result.
In a second aspect, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program runs on a computer, the computer is caused to execute the behavior recognition method provided in any embodiment of the present application.

In a third aspect, an embodiment of the present application further provides an electronic device, including a processor and a memory, the memory storing a computer program, and the processor executing, by calling the computer program, the behavior recognition method provided in any embodiment of the present application.

Beneficial Effects

For a reference video that includes specified behavior content, similarity recognition is performed on the reference video and the video to be identified through a video discrimination model based on a twin neural network, so as to determine whether the video to be identified includes the specified behavior content. Performing behavior recognition on the video to be identified by means of a reference video can, on the one hand, accurately recognize the behavior in the video to be identified and, on the other hand, quickly classify the video to be identified.
BRIEF DESCRIPTION OF THE DRAWINGS

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative work.

FIG. 1 is a schematic diagram of an application scenario of the behavior recognition method provided in an embodiment of the present application.

FIG. 2 is a schematic flowchart of the behavior recognition method provided in an embodiment of the present application.

FIG. 3 is a schematic diagram of capturing video segments using a sliding window in the behavior recognition method provided in an embodiment of the present application.

FIG. 4 is a schematic diagram of the structure of the video discrimination model in the behavior recognition method provided in an embodiment of the present application.

FIG. 5 is a detailed flowchart of the video discrimination method provided in an embodiment of the present application.

FIG. 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.

DETAILED DESCRIPTION
The technical solutions in the embodiments of the present application are described clearly and completely below in conjunction with the drawings. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. Based on the embodiments in the present application, all other embodiments obtained by those skilled in the art without creative work fall within the scope of protection of the present application.

Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in conjunction with the embodiment may be included in at least one embodiment of the present application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
Artificial intelligence (AI) comprises theories, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.

Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes machine learning (ML), of which deep learning (DL) is a newer research direction, introduced to bring machine learning closer to its original goal, artificial intelligence. At present, deep learning is mainly applied in fields such as computer vision and natural language processing.

Deep learning is a type of machine learning, and machine learning is the necessary path to artificial intelligence. The concept of deep learning originated from research on artificial neural networks; a multilayer perceptron with multiple hidden layers is one kind of deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, in order to discover distributed feature representations of data. The motivation for studying deep learning is to build neural networks that simulate the human brain for analysis and learning, imitating the mechanisms of the human brain to interpret data such as images, sounds, and text.
According to its different manifestations, animal behavior can be divided into foraging behavior, food storage behavior, attack behavior, defense behavior, reproductive behavior, rhythmic behavior, communication behavior, and so on.

In fields such as toxicology, pharmacology, sports injury and recovery, and neuroscience, evaluating animal behavior can provide key information for related research. With the development of artificial intelligence technology, machine learning methods have also been adopted in related technologies to recognize animal behavior. Specifically, a neural network model can be trained by machine learning, and the trained model can then classify animal behavior. Methods for training the model include supervised and unsupervised machine learning. Supervised machine learning trains the neural network model on sample data pre-labeled with behavior labels, so that the model learns the mapping between the sample data and the corresponding behavior labels. Unsupervised machine learning clusters similar sample data through clustering algorithms, thereby classifying different sample data.

However, with unsupervised machine learning methods it is difficult to quickly find the animal behaviors that users need from large amounts of data.
For supervised machine learning, related technologies also include key-point-based animal recognition methods and video-based classification methods.

The key-point-based animal recognition method tracks key points on the animal's body and then classifies the animal's behavior based on the position information of the key points (such as limb joints). However, this method relies on key-point tracking: when the key points are occluded or tracking is lost, the classification of the animal's behavior becomes inaccurate. In addition, key-point tracking discards background information, so animal behavior information related to the background is omitted, which also makes the classification less accurate.

Although the video-based classification method does not require key-point tracking, it must classify actions frame by frame based on the pixel values of each video frame and then recognize the animal behavior from the action classification results. This incurs a large amount of computation and makes it difficult to recognize animal behavior quickly.
To solve the problems in the related art, the embodiments of the present application provide a behavior recognition method, a storage medium, and an electronic device for recognizing animal behavior quickly and accurately. Understandably, the behavior recognition method provided in the present application can recognize the behavior of various animals as well as various people. In the following embodiments, the method is described in detail using animal behavior as an example.

First, please refer to FIG. 1, which is a schematic diagram of an application scenario of the behavior recognition method provided in an embodiment of the present application. The method is executed by an electronic device. First, the user selects a video with specified content as a reference video; then, for the video to be identified, the two videos are input into the video discrimination model for similarity discrimination to determine the behavior content included in the video to be identified. In this way, fast and accurate behavior recognition is achieved.
Specifically, please refer to FIG. 2, which is a schematic flowchart of the behavior recognition method provided in an embodiment of the present application. The specific flow of the method can be as follows:

101. Obtain a reference video, where the reference video includes specified behavior content.

The specified behavior content includes but is not limited to various animal behaviors or various human behaviors. Taking animals as an example, the animal can be a cat, dog, monkey, mouse, etc. Common animal behaviors include but are not limited to: standing up, head movement, drinking, hanging, grooming, walking, resting, eating, limb licking, etc.

Exemplarily, there are multiple ways to obtain the reference video, including but not limited to: shooting a video that includes the specified behavior content, or obtaining such a video through video editing.

For example, a user records a video of a dog eating, a video of a cat grooming, or a video of a monkey hanging upside down as the reference video. As another example, a dog-eating video is selected as the reference video from videos of a dog eating, walking, licking its limbs, and so on. As yet another example, images related to a dog eating are selected from multiple groups of dog images and synthesized into a video of the dog eating as the reference video. As yet another example, the dog-eating portion is cut from a long recorded video as the reference video.
102. Obtain a video to be identified that requires behavior recognition.

Ways of obtaining the video to be identified may include shooting or video editing. The video to be identified that requires behavior recognition has not been labeled.

In this embodiment, the video to be identified and the reference video belong to the same animal or the same kind of people, which improves the accuracy of behavior recognition on the video to be identified.

Exemplarily, before obtaining the video to be identified, the organism contained in a candidate video can first be recognized; when the organism contained in the candidate video is consistent with the organism in the reference video, the candidate video is determined as the video to be identified that requires behavior recognition.
103. Obtain the similarity between the video to be identified and the reference video through a video discrimination model based on a twin neural network.

This embodiment proposes a video discrimination model based on a twin neural network. The model comprises a dual-branch network architecture whose branches share the same parameters. The video to be identified and the reference video are fed as the two inputs of the dual-branch architecture, and the model outputs the probability that the two are similar and the probability that they are dissimilar; the similarity between the two videos is then evaluated from these probability values.

For example, a large probability of similarity between the video to be identified and the reference video indicates high similarity between the two, and a large probability of dissimilarity indicates low similarity.
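As an illustration of the dual-branch idea above, the following minimal NumPy sketch shares one set of embedding weights across both branches, fuses the two embeddings, and outputs similar/dissimilar probabilities. The weights are random and the shapes are invented for illustration, so this is a structural sketch rather than the patent's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared weights: both branches of a twin (siamese) network use the
# same parameters, so the two clips are embedded in the same space.
W_embed = rng.standard_normal((64, 32))  # feature extraction (shared)
W_head = rng.standard_normal((64, 2))    # similarity head: 2 logits

def embed(clip_features):
    """One branch: map raw clip features to an embedding (shared weights)."""
    return np.tanh(clip_features @ W_embed)

def similarity_probs(clip_a, clip_b):
    """Fuse the two embeddings and output (similar, dissimilar) probabilities."""
    fused = np.concatenate([embed(clip_a), embed(clip_b)])  # feature fusion
    logits = fused @ W_head
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()  # softmax over {similar, dissimilar}

reference = rng.standard_normal(64)
candidate = rng.standard_normal(64)
p_similar, p_dissimilar = similarity_probs(reference, candidate)
```

A clip would then be judged to contain the specified behavior when the similar probability exceeds the dissimilar probability, or a chosen threshold.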
104. According to the similarity, identify whether the video to be identified includes the specified behavior content, and obtain a recognition result.

If the similarity between the video to be identified and the reference video is high, it can be determined that the two are the same or similar videos, and both indicate the specified behavior content. If the similarity is low, it can be determined that the two are not the same or similar videos, and the behavior content they indicate differs.

In specific implementations, the present application is not limited by the described execution order of the steps; where no conflict arises, some steps may be performed in other orders or simultaneously.

The behavior recognition method in the embodiments of the present application first selects a reference video required by the user, the reference video including specified behavior content, and then uses the video discrimination model to determine the similarity between the reference video and the video to be identified, so as to perform behavior recognition on the video to be identified. On the one hand, this allows the behavior in a video to be discriminated flexibly and with high accuracy; on the other hand, it avoids key-point tracking and pixel-value classification of video frames, reduces the amount of data to be processed, and improves the efficiency of behavior recognition.
In some embodiments, obtaining the video to be identified that requires behavior recognition includes:

obtaining an initial video that requires behavior recognition;

extracting at least one video segment from the initial video according to the length of the reference video;

determining the at least one extracted video segment as the video to be identified.
The length of the initial video is not limited in the present application; an initial video of any length can be selected according to actual needs. The organism in the initial video is the same as the organism in the reference video.

After the initial video is segmented according to the length of the reference video, each obtained video segment has the same length as the reference video. The lengths of the reference video and the initial video can be determined by playback duration or by the number of video frames.

The number of extracted video segments can be one, two, or more, depending on the length of the initial video and the extraction method.
As one implementation, when the length of the initial video is less than the length of the reference video, one video segment is extracted: the initial video is frame-padded to obtain a single video segment of the same length as the reference video, which serves as the video to be identified.

As another implementation, when the length of the initial video equals the length of the reference video, the initial video can be used directly as one video segment.

As yet another implementation, when the length of the initial video is greater than the length of the reference video, at least two video segments can be extracted, where the number of video segments is determined by the ratio of the length of the initial video to the length of the reference video. When the ratio is greater than 1 and less than 2, two video segments can be obtained by frame padding; when the ratio is greater than 2, multiple video segments can be obtained.
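A minimal sketch of the frame-padding step, assuming the simple strategy of repeating the last frame (the patent states only that frames are supplemented according to the reference length, not which frames are added):

```python
def pad_to_length(frames, target_len):
    """Frame-pad a clip shorter than the reference by repeating its last frame.

    `frames` is a list of frames; the strategy of duplicating the final
    frame is an illustrative assumption, not specified by the patent.
    """
    if len(frames) >= target_len:
        return list(frames)
    return list(frames) + [frames[-1]] * (target_len - len(frames))

clip = ["f0", "f1", "f2"]       # a 3-frame initial video
padded = pad_to_length(clip, 5)  # padded to a 5-frame reference length
```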
示例性地,在视频片段至少有两个时,至少两个视频片段中的相邻的视频片段可以部分重合,也可以不重合。其中,在相邻的视频片段不重合时,相邻的视频片段首尾相接。在相邻的视频片段重合时,相邻的视频片段具有重叠的视频帧,该重叠的视频帧的数量可根据实际需求确定。Exemplarily, when there are at least two video segments, adjacent video segments in at least two video segments may partially overlap or may not overlap. When adjacent video segments do not overlap, the adjacent video segments are connected end to end. When adjacent video segments overlap, the adjacent video segments have overlapping video frames, and the number of the overlapping video frames may be determined according to actual needs.
具体地,可以使用滑动窗口截取至少两个视频片段作为待识别视频,请参阅图3,图3为本申请实施例提供的行为识别方法中使用滑动窗口截取视频片段的示意图。其中,若参考视频的帧数为10,初始视频的帧数为35帧,则将滑动窗口的长度设为10帧,每隔N帧移动滑动窗口截取视频片段,其中,如图3(a)所示,若N等于10,则截取的视频片段相互之前不重合,如图3(b)所示,若N小于10,则截取的视频片段相互之间重合。以N=1为例,本申请实施例中可以设置滑动窗口每隔一帧移动以截取视频片段,从而得到多个视频片段,其中,视频片段之间的重合度可以根据实际需求选择,此处并不进行限定。Specifically, a sliding window can be used to capture at least two video clips as videos to be identified. Please refer to Figure 3, which is a schematic diagram of using a sliding window to capture video clips in the behavior recognition method provided in the embodiment of the present application. If the number of frames of the reference video is 10 and the number of frames of the initial video is 35, the length of the sliding window is set to 10 frames, and the sliding window is moved every N frames to capture video clips. As shown in Figure 3 (a), if N is equal to 10, the captured video clips do not overlap with each other. As shown in Figure 3 (b), if N is less than 10, the captured video clips overlap with each other. Taking N=1 as an example, in the embodiment of the present application, the sliding window can be set to move every frame to capture video clips, thereby obtaining multiple video clips, wherein the overlap between video clips can be selected according to actual needs and is not limited here.
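A minimal sketch of the sliding-window cropping described above, treating a video as a plain list of frames. The text states only that frame padding is used for short videos; padding a partial tail window by repeating the last frame is an assumption made for illustration:

```python
def sliding_window_clips(frames, window, stride):
    """Split a frame list into fixed-length clips with a sliding window.

    Follows the scheme described above: the window length equals the
    reference-video length, the window advances `stride` (N) frames at a
    time, and when the video (or its tail) is shorter than the window
    the missing frames are filled by repeating the last frame.
    """
    if len(frames) < window:  # ratio < 1: pad up to a single clip
        frames = frames + [frames[-1]] * (window - len(frames))
    clips, start = [], 0
    while start < len(frames):
        clip = frames[start:start + window]
        if len(clip) < window:  # tail clip: pad with the final frame
            clip = clip + [clip[-1]] * (window - len(clip))
        clips.append(clip)
        if start + window >= len(frames):
            break
        start += stride
    return clips
```

With the example from the text (35 frames, 10-frame window, N = 10) this yields four clips, the last one padded; with N = 1 the clips overlap frame by frame.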
In some embodiments, when adjacent video segments partially overlap, after identifying whether each video to be identified includes the specified behavior content according to the similarity and obtaining the recognition results, the method further includes:
determining a score for each video frame of the initial video according to the recognition results of the same video frame across different videos to be identified;
determining a total score of the initial video according to the scores of the video frames of the initial video;
if the total score is greater than a preset threshold, determining that the initial video includes the specified behavior content;
if the total score is not greater than the preset threshold, determining that the initial video does not include the specified behavior content.
In this embodiment, after the recognition result of each video to be identified is determined from its similarity to the reference video, every video frame in that video is scored according to the recognition result. For example, the similarity outcome is reduced to two cases: the reference video is similar to the video to be identified, or it is not. When the two are similar, the recognition result may be set to 1; when they are not, it may be set to 0. The recognition result (1 or 0) of each video to be identified is then assigned as a score to every frame it contains, so each frame receives a score of 0 or 1 from every video to be identified that covers it. When a frame thus receives multiple scores, the average of those scores is taken as its final score. For example, if the second video frame scores 0 in the first video to be identified and 1 in the second, its final score is 0.5.
After the final score of every frame is determined, the total score of the initial video is computed from the per-frame scores, for example by taking their mean, median, or another statistic. Taking the mean as an example, if the initial video has 10 frames scored 0, 0.5, 1, 1, 0.5, 0, 1, 1, 1 and 1, the total score of the initial video is 0.7.
In this embodiment, a preset threshold is also compared against the total score: when the total score exceeds the threshold, the initial video is determined to include the specified behavior content, i.e. the behavior shown in the initial video matches that of the reference video; when it does not, the initial video is determined not to include the specified behavior content. For example, with a preset threshold of 0.5 and a total score of 0.7, the initial video is determined to include the specified behavior content.
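The frame-scoring and thresholding steps above can be sketched as follows; the 0/1 clip-result encoding and mean aggregation follow the text, while the function name and argument layout are illustrative:

```python
def score_initial_video(clip_spans, clip_results, num_frames, threshold=0.5):
    """Aggregate per-clip recognition results into a verdict on the
    initial video, for the overlapping-clip case described above.

    clip_spans   -- (start, end) frame range covered by each clip
    clip_results -- 1 if a clip was judged similar to the reference, else 0
    Each frame's score is the mean of the results of all clips covering
    it, and the video's total score is the mean over all frames.
    """
    frame_scores = []
    for f in range(num_frames):
        hits = [r for (s, e), r in zip(clip_spans, clip_results) if s <= f < e]
        frame_scores.append(sum(hits) / len(hits) if hits else 0.0)
    total = sum(frame_scores) / num_frames
    return total, total > threshold
```

As in the worked example, a frame covered by one matching clip and one non-matching clip receives a score of 0.5.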
By cropping the videos to be identified from the initial video with a sliding window, determining the total score of the initial video from the scores that each frame receives across the different videos to be identified, and comparing the total score with a preset threshold, this embodiment makes the recognition result for the initial video more accurate.
In some embodiments, when adjacent video segments do not overlap, after identifying whether each video to be identified includes the specified behavior content according to the similarity and obtaining the recognition results, the method further includes:
determining a total score of the initial video according to the recognition results of the videos to be identified;
if the total score is greater than a preset threshold, determining that the initial video includes the specified behavior content;
if the total score is not greater than the preset threshold, determining that the initial video does not include the specified behavior content.
When adjacent video segments do not overlap, after the recognition result of each video to be identified is determined, the similarity outcome is again reduced to two cases, similar and dissimilar: when the reference video is dissimilar to a video to be identified, the recognition result of that video is set to 0; when they are similar, it is set to 1.
After the scores of the different videos to be identified are determined, the total score of the initial video is derived from them, for example by taking the mean, median, or another statistic of the per-video scores. For instance, the total score of the initial video may be obtained by averaging the scores of the videos to be identified.
As above, a preset threshold is compared against the total score to determine the recognition result for the initial video. The details are the same as described above and are not repeated here.
In some embodiments, after it is determined that the initial video includes the specified behavior content, the method further includes:
labeling the initial video according to the specified behavior content. That is, after the recognition result of the initial video is determined, the initial video is also labeled. Specifically, if the initial video includes the specified behavior content, a corresponding behavior label is assigned to it; if it does not, no such label is assigned. For example, if the behavior content of the reference video is eating and the initial video includes the specified behavior content, an "eating" behavior label is assigned to the initial video; otherwise the initial video is left unlabeled, or is labeled as non-eating.
By labeling the initial videos in this way, videos showing the same behavior as the reference video can easily be selected from a large number of initial videos, improving the efficiency of video screening and querying.
In some embodiments, the video discrimination model includes a first feature extraction branch, a second feature extraction branch, a feature fusion module and a similarity discrimination module. The two feature extraction branches have identical structures and share network parameters. Obtaining the similarity between the video to be identified and the reference video through the Siamese-network-based video discrimination model includes:
inputting the reference video into the first feature extraction branch for feature extraction to obtain first frame-sequence features;
inputting the video to be identified into the second feature extraction branch for feature extraction to obtain second frame-sequence features;
inputting the first and second frame-sequence features into the feature fusion module for feature fusion to obtain fused features;
inputting the fused features into the similarity discrimination module for similarity judgment to obtain the similarity between the video to be identified and the reference video.
Refer to FIG. 4, which is a schematic structural diagram of the video discrimination model in the behavior recognition method provided by an embodiment of the present application. The model includes a first feature extraction branch and a second feature extraction branch with identical network parameters and structures; both branches are connected to the feature fusion module, which in turn is connected to the similarity discrimination module.
When the similarity between the reference video and the video to be identified is obtained through the Siamese-network-based video discrimination model, the reference video is input into the first feature extraction branch to obtain the first frame-sequence features, and the video to be identified is input into the second feature extraction branch to obtain the second frame-sequence features. It will be understood that the reference video could equally be input into the second branch and the video to be identified into the first; which branch receives which input is not limited here.
Exemplarily, before being input into the first and second feature extraction branches, the reference video and the video to be identified are converted into frame sequences. When the first and second frame-sequence features are input into the feature fusion module for feature fusion, the module obtains the fused features by vector subtraction of the first frame-sequence features and the second frame-sequence features.
The similarity discrimination module is used to output the probability that the fused features indicate similarity, or the probability that they indicate dissimilarity.
By extracting the temporal features of the entire reference video and the entire video to be identified through the first and second feature extraction branches and then judging similarity from those temporal features, this embodiment accounts for the spatio-temporal dependencies between behavior features, making the similarity judgment more accurate. Moreover, compared with related art that must process the video frame by frame, the method of this embodiment can quickly judge the similarity between the video to be identified and the reference video, reducing the computation required by the model and increasing its discrimination speed.
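The forward pass described above (shared branches, fusion by vector subtraction, similarity head) can be sketched in a few lines. The feature extractor here is a deliberate stand-in (a simple mean over per-frame feature vectors, not the self-attention network of the patent), and `w` and `b` are hypothetical learned parameters of a logistic similarity head:

```python
import math

def extract_features(frames):
    # Stand-in for the shared feature extraction branch: the patent uses
    # a self-attention network over the frame sequence; here we simply
    # average the per-frame feature vectors into one sequence feature.
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]

def siamese_similarity(ref_frames, query_frames, w, b):
    # The two branches share parameters, so one function serves both the
    # reference video and the video to be identified.
    ref_feat = extract_features(ref_frames)
    query_feat = extract_features(query_frames)
    # Feature fusion by vector subtraction, as described in the text.
    fused = [r - q for r, q in zip(ref_feat, query_feat)]
    # Similarity head: a logistic layer over the fused feature.
    z = sum(wi * fi for wi, fi in zip(w, fused)) + b
    return 1.0 / (1.0 + math.exp(-z))
```

Because fusion is a subtraction, two identical inputs yield an all-zero fused vector, so the similarity output depends only on the bias of the head in that case.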
In some embodiments, the first feature extraction branch includes a first feature extraction layer based on a self-attention mechanism, and the second feature extraction branch includes a second feature extraction layer based on a self-attention mechanism.
Specifically, the first frame-sequence features of the reference video are extracted by the first self-attention-based feature extraction layer, and the second frame-sequence features of the video to be identified are extracted by the second self-attention-based feature extraction layer.
Capturing the correlations between video frames with self-attention-based feature extraction layers takes the mutual influence of the frames into account, which improves accuracy when similarity is judged from the first and second frame-sequence features.
In some embodiments, before the reference video is obtained, the method further includes:
obtaining a pre-trained initial model, the initial model being pre-trained on first video samples of different behavior contents;
obtaining second video samples corresponding to the specified behavior content;
adjusting the parameters of the initial model according to the second video samples to obtain the video discrimination model.
The first video samples include, but are not limited to, the ImageNet dataset (a large image recognition database used in computer vision research), the Kinetics-700 dataset (a human behavior dataset), and the like. The Siamese-network-based initial model is pre-trained on the first video samples to obtain its model parameters.
The second video samples include video samples of the specified behavior content; for example, when the specified behavior content is eating, they may include videos of dogs eating, cats eating, chickens eating, and so on. Naturally, the second video samples also include video samples that do not show the specified behavior content; for example, when the specified behavior content is eating, they may additionally include videos of pandas hanging, dogs rolling, cats licking their limbs, and so on.
When the parameters of the initial model are adjusted using the second video samples, video samples with the same or different behavior contents are combined pairwise and fed into the first and second feature extraction branches of the initial model respectively for training, until the model converges or a specified number of training iterations is reached, yielding the video discrimination model.
In this embodiment, combining the two classes of second video samples pairwise and training the Siamese-network-based video discrimination model on the pairs expands the effective scale of the training set, allows the initial model to be trained with only a small number of labeled second video samples, and speeds up model training.
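The pairwise combination above can be sketched as follows; the sample representation (an identifier plus a behavior label) is an assumption for illustration:

```python
from itertools import combinations

def make_training_pairs(samples):
    """Combine labeled clips pairwise into Siamese training examples.

    samples: (clip_id, behavior_label) tuples. Returns
    (clip_a, clip_b, target) triples with target 1 when both clips show
    the same behavior and 0 otherwise; n labeled clips yield n*(n-1)/2
    pairs, which is how pairing enlarges a small labeled set.
    """
    return [(a, b, 1 if la == lb else 0)
            for (a, la), (b, lb) in combinations(samples, 2)]
```

Each triple would then feed one clip into each feature extraction branch, with the 0/1 target supervising the similarity head.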
In some embodiments, adjusting the parameters of the initial model according to the second video samples to obtain the video discrimination model includes:
adjusting the parameters of the feature fusion module and the similarity discrimination module according to the second video samples to obtain the video discrimination model.
That is, when the initial model is fine-tuned with the second video samples, the video discrimination model is obtained by adjusting the parameters of the feature fusion module and the similarity discrimination module.
Exemplarily, the network parameters of the first and second feature extraction branches may be frozen while the second video samples are fed into the initial model for training, so that during training only the network parameters of the feature fusion module and the similarity discrimination module are optimized to minimize their loss, thereby fine-tuning those parameters.
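The freezing scheme can be mimicked with a plain-Python gradient step over a named parameter dictionary; the parameter names and the scalar-parameter representation are illustrative, not the patent's actual implementation:

```python
def finetune_step(params, grads, frozen_prefixes, lr=0.1):
    """One gradient-descent step that mimics the freezing scheme above:
    any parameter whose name starts with a frozen prefix (the two
    feature extraction branches) is left untouched; only the fusion
    module and similarity head parameters are updated."""
    def is_frozen(name):
        return any(name.startswith(p) for p in frozen_prefixes)
    return {name: (value if is_frozen(name) else value - lr * grads[name])
            for name, value in params.items()}
```

In a deep-learning framework the same effect is usually obtained by disabling gradient computation for the branch parameters before fine-tuning.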
In the embodiments of the present application, pre-training followed by fine-tuning enables model training with only a small number of labeled second video samples, reducing the required scale of training data and improving training efficiency.
In some embodiments, the specified behavior includes a specified behavior of a specified animal, and obtaining the second video samples corresponding to the specified behavior content includes:
obtaining video samples of the specified animal labeled with the specified behavior;
performing data augmentation on the video samples of the specified animal to obtain the second video samples.
Here, video samples showing a specified behavior of a particular animal may be obtained as the second video samples; for example, various videos of various dogs eating. The dogs may be divided by breed, and different dogs eat with somewhat different motions. By turning such videos into second video samples and fine-tuning the parameters of the initial model on them, the trained video discrimination model achieves higher accuracy when applied to discriminating dog videos.
It will be understood that, once the animal of the second video samples is determined, a video of that animal's behavior may also be selected as the reference video so that the similarity between the reference video and the video to be identified can be judged accurately; when the animal in the video to be identified is not the animal of the reference video, the two can be directly judged dissimilar.
In this embodiment, data augmentation is also applied to the specified-animal video samples labeled with the specified behavior, to enlarge the sample set and improve the generalization ability of the model. Augmentation methods include, but are not limited to, video cropping, frame extraction, image translation, image rotation and image scaling.
By augmenting the specified-animal video samples to obtain a larger set of second video samples for fine-tuning the initial model, this embodiment gives the model better generalization ability.
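Two of the listed augmentation methods, frame extraction and image translation, can be sketched on plain Python lists; the data representations (a video as a list of frames, a frame as a list of pixel rows) are simplifications for illustration:

```python
def extract_every_nth_frame(video, n):
    """Frame-extraction augmentation: keep every n-th frame to create a
    shorter, faster-motion variant of a clip."""
    return video[::n]

def translate_frame(frame, dx, fill=0):
    """Image-translation augmentation: shift each pixel row of a frame
    dx pixels to the right (negative dx shifts left), filling the
    vacated pixels with `fill`."""
    width = len(frame[0])
    out = []
    for row in frame:
        if dx >= 0:
            out.append([fill] * dx + row[:width - dx])
        else:
            out.append(row[-dx:] + [fill] * (-dx))
    return out
```

Applying several such transforms to each labeled clip multiplies the number of second video samples available for fine-tuning.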
The foregoing is now described in more detail. Refer to FIG. 5, which is a detailed flowchart of the video discrimination method provided by an embodiment of the present application. The flowchart indicates the following:
Stage 1: constructing the Siamese-network-based video discrimination model.
The video discrimination model includes a first feature extraction branch, a second feature extraction branch, a feature fusion module and a similarity discrimination module.
Stage 2: training the video discrimination model.
201. Obtain a pre-trained initial model.
202. Obtain second video samples corresponding to the specified behavior content.
203. Freeze the network parameters of the first and second feature extraction branches, and train the initial model with the second video samples to fine-tune the network parameters of the feature fusion module and the similarity discrimination module.
Stage 3: applying the video discrimination model for similarity prediction.
204. Obtain a reference video, the reference video including the specified behavior content.
205. Obtain an initial video, and crop at least one video segment from the initial video according to the length of the reference video as the video(s) to be identified.
206. Input the reference video into the first feature extraction branch to obtain first frame-sequence features.
207. Input the video to be identified into the second feature extraction branch to obtain second frame-sequence features.
208. Input the first and second frame-sequence features into the feature fusion module for feature fusion to obtain fused features.
209. Input the fused features into the similarity discrimination module for similarity judgment to obtain the similarity between the reference video and the video to be identified.
210. Score each video to be identified according to the similarity.
211. Determine the score of each video frame of the initial video according to the scores that the same video frame receives in different videos to be identified.
212. Determine the total score of the initial video according to the scores of its video frames.
213. If the total score is greater than a preset threshold, determine that the initial video includes the specified behavior content.
214. Assign the label of the specified behavior to the initial video.
In the embodiments of the present application, behavior recognition is performed by constructing a Siamese-network-based video discrimination model, which generalizes across behaviors and achieves high recognition accuracy. The table below gives recognition results for eight behaviors on an animal behavior dataset:
Behavior         Precision   Recall   F1-score
Get up           0.69        0.88     0.78
Head movement    0.48        0.78     0.60
Drinking         0.10        1.00     0.18
Hanging          0.96        0.94     0.95
Grooming         0.95        0.77     0.85
Walking          0.84        0.85     0.85
Resting          0.98        0.96     0.97
Eating           0.91        0.85     0.88
As can be seen from the above, the behavior recognition method proposed in the embodiments of the present invention constructs a Siamese-network-based video discrimination model and pre-trains it to reduce the amount of sample data needed for training, while pairwise combination of the sample data as input to the dual-branch network greatly expands the effective scale of the sample data and improves the generalization ability of the model. Determining the similarity between the reference video and the video to be identified with the video discrimination model, so that the two are compared in the temporal dimension, reduces the computation consumed by similarity judgment and quickly yields the discrimination result for the video to be identified. In addition, behavior recognition can be performed on initial videos of arbitrary duration, improving the flexibility of behavior recognition.
An embodiment of the present application further provides a behavior recognition apparatus, applied to an electronic device, including:
a first video acquisition module, configured to obtain a reference video, the reference video including specified behavior content;
a second video acquisition module, configured to obtain a video to be identified on which behavior recognition is to be performed;
a similarity discrimination module, configured to obtain the similarity between the video to be identified and the reference video through a Siamese-network-based video discrimination model;
a behavior recognition module, configured to identify, according to the similarity, whether the video to be identified includes the specified behavior content, and obtain a recognition result.
In some embodiments, the second video acquisition module is further configured to:
obtain an initial video on which behavior recognition is to be performed;
crop at least one video segment from the initial video according to the length of the reference video;
determine the cropped at least one video segment as the video(s) to be identified.
In some embodiments, when the videos to be identified include at least two video segments and adjacent segments among them partially overlap, after identifying whether each video to be identified includes the specified behavior content according to the similarity and obtaining the recognition results, the behavior recognition module is further configured to:
determine the score of each video frame of the initial video according to the recognition results of the same video frame across different videos to be identified;
determine the total score of the initial video according to the scores of the video frames of the initial video;
if the total score is greater than a preset threshold, determine that the initial video includes the specified behavior content;
if the total score is not greater than the preset threshold, determine that the initial video does not include the specified behavior content.
In some embodiments, after determining that the initial video includes the specified behavior content, the behavior recognition module is further configured to:
label the initial video according to the specified behavior content.
In some embodiments, when the videos to be identified include at least two video segments and adjacent segments among them do not overlap, after identifying whether each video to be identified includes the specified behavior content according to the similarity and obtaining the recognition results, the behavior recognition module is further configured to:
determine the total score of the initial video according to the recognition results of the videos to be identified;
if the total score is greater than a preset threshold, determine that the initial video includes the specified behavior content;
if the total score is not greater than the preset threshold, determine that the initial video does not include the specified behavior content.
在一些实施例中,获取需要进行行为识别的初始视频之前,第二视频获取模块还用于:In some embodiments, before acquiring the initial video for behavior recognition, the second video acquisition module is further used to:
获取原始视频;Get the original video;
确定原始视频所属的生物与参考视频所属的生物是否相同;Determine whether the organism to which the original video belongs is the same as the organism to which the reference video belongs;
若是,则将原始视频确定为需要进行行为识别的初始视频。If so, the original video is determined as the initial video for behavior recognition.
在一些实施例中,初始视频的长度小于参考视频的长度,第二视频获取模块还用于:In some embodiments, the length of the initial video is shorter than the length of the reference video, and the second video acquisition module is further configured to:
按照参考视频的长度对初始视频进行补帧处理,得到一个视频片段。The initial video is supplemented with frames according to the length of the reference video to obtain a video clip.
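The frame-padding step can be sketched as follows. Repeating the final frame is just one possible padding strategy, chosen here for illustration; the embodiment does not specify how the supplementary frames are produced.

```python
def pad_to_reference(frames, ref_len):
    """Pad a short clip to the reference length by repeating its last frame."""
    if not frames:
        raise ValueError("cannot pad an empty clip")
    if len(frames) >= ref_len:
        return frames[:ref_len]
    return frames + [frames[-1]] * (ref_len - len(frames))

clip = ["f0", "f1", "f2"]
print(pad_to_reference(clip, 5))  # -> ['f0', 'f1', 'f2', 'f2', 'f2']
```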
在一些实施例中,视频判别模型包括第一特征提取分支、第二特征提取分支、特征融合模块以及相似度判别模块,第一特征提取分支和第二特征提取分支的结构相同,且网络参数共享,相似度判别模块还用于:In some embodiments, the video discrimination model includes a first feature extraction branch, a second feature extraction branch, a feature fusion module, and a similarity discrimination module. The first feature extraction branch and the second feature extraction branch have the same structure and share network parameters. The similarity discrimination module is further used to:
将参考视频输入第一特征提取分支进行特征提取,得到第一帧序列特征;Inputting the reference video into the first feature extraction branch to extract features, and obtaining first frame sequence features;
将待识别视频输入第二特征提取分支进行特征提取,得到第二帧序列特征;Inputting the video to be identified into the second feature extraction branch to extract features, and obtaining second frame sequence features;
将第一帧序列特征和第二帧序列特征输入特征融合模块进行特征融合处理,得到融合特征;Inputting the first frame sequence feature and the second frame sequence feature into the feature fusion module for feature fusion processing to obtain fusion features;
将融合特征输入相似度判别模块进行相似度判别,得到待识别视频与参考视频的相似度。The fused features are input into the similarity judgment module for similarity judgment to obtain the similarity between the video to be identified and the reference video.
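The two-branch flow above can be sketched in miniature. The toy linear "extractor" and sigmoid head below are illustrative stand-ins (real branches would be deep networks), but they show the essentials: both branches apply the same shared weights, the two frame-sequence features are fused (here by vector subtraction, one fusion option named in this application), and the fused vector is mapped to a similarity in (0, 1).

```python
import math

def extract(seq, weights):
    """Toy shared-parameter feature extractor: a weighted sum per frame."""
    return [sum(w * x for w, x in zip(weights, frame)) for frame in seq]

def similarity(ref_seq, query_seq, weights, head_w, head_b=0.0):
    f1 = extract(ref_seq, weights)            # first branch
    f2 = extract(query_seq, weights)          # second branch, SAME weights
    fused = [a - b for a, b in zip(f1, f2)]   # feature fusion by subtraction
    logit = sum(hw * f for hw, f in zip(head_w, fused)) + head_b
    return 1.0 / (1.0 + math.exp(-logit))     # sigmoid -> similarity in (0, 1)

w = [0.5, -0.2]
clip = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.5]]
# Identical clips fuse to the zero vector, so similarity is sigmoid(0) = 0.5.
print(similarity(clip, clip, w, [1.0, 1.0, 1.0]))  # -> 0.5
```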
在一些实施例中,第一特征提取分支包括基于自注意力机制的第一特征提取层,第二特征提取分支包括基于自注意力机制的第二特征提取层。In some embodiments, the first feature extraction branch includes a first feature extraction layer based on a self-attention mechanism, and the second feature extraction branch includes a second feature extraction layer based on a self-attention mechanism.
在一些实施例中,获取参考视频之前,相似度判别模块还用于:In some embodiments, before obtaining the reference video, the similarity determination module is further used to:
获取预训练的初始模型,初始模型根据不同行为内容的第一视频样本预训练得到;Obtaining a pre-trained initial model, where the initial model is pre-trained according to first video samples of different behavior contents;
获取对应指定行为内容的第二视频样本;Obtain a second video sample corresponding to the specified behavior content;
根据第二视频样本对初始模型进行参数调整,得到视频判别模型。The parameters of the initial model are adjusted according to the second video sample to obtain a video discrimination model.
在一些实施例中,相似度判别模块还用于:In some embodiments, the similarity determination module is further used to:
根据第二视频样本对特征融合模块和相似度判别模块进行参数调整,得到视频判别模型。The parameters of the feature fusion module and the similarity discrimination module are adjusted according to the second video sample to obtain a video discrimination model.
在一些实施例中,根据第二视频样本对特征融合模块和相似度判别模块进行参数调整,得到视频判别模型之前,相似度判别模块还用于:In some embodiments, before adjusting parameters of the feature fusion module and the similarity discrimination module according to the second video sample to obtain the video discrimination model, the similarity discrimination module is further used to:
冻结第一特征提取分支和第二特征提取分支的参数。Freeze parameters of the first feature extraction branch and the second feature extraction branch.
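One way to realize the freezing step, sketched without a specific deep-learning framework (the dictionary-of-named-parameters layout is an assumption for illustration): parameters whose names belong to the two feature extraction branches are excluded from the set handed to the optimizer, so only the feature fusion module and the similarity discrimination module are adjusted.

```python
def trainable_parameters(named_params, frozen_prefixes=("branch1.", "branch2.")):
    """Keep only parameters outside the frozen feature-extraction branches."""
    return {name: p for name, p in named_params.items()
            if not name.startswith(frozen_prefixes)}

params = {
    "branch1.layer0.weight": [0.1],
    "branch2.layer0.weight": [0.1],   # shared with branch1 in practice
    "fusion.weight": [0.3],
    "similarity.weight": [0.7],
}
print(sorted(trainable_parameters(params)))  # -> ['fusion.weight', 'similarity.weight']
```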
在一些实施例中,相似度判别模块还用于:In some embodiments, the similarity determination module is further used to:
获取标记为指定行为的指定动物视频样本;Get a sample of videos of a specified animal marked with a specified behavior;
对指定动物视频样本进行数据增广处理,得到第二视频样本。Perform data augmentation processing on the specified animal video sample to obtain a second video sample.
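The augmentation step can be sketched as below. Horizontal flipping and time reversal are merely example transforms — the embodiments do not enumerate which augmentations are used, and time reversal may be unsuitable for direction-sensitive behaviors:

```python
def augment_clip(frames):
    """Produce simple augmented copies of a clip (a list of 2D frames)."""
    flipped = [[row[::-1] for row in frame] for frame in frames]  # mirror each frame
    reversed_time = frames[::-1]                                  # reverse frame order
    return [flipped, reversed_time]

clip = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # two 2x2 "frames"
aug = augment_clip(clip)
print(aug[0][0])  # -> [[2, 1], [4, 3]]
```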
应当说明的是,本申请实施例提供的行为识别装置与上文实施例中的行为识别方法属于同一构思,通过该行为识别装置可以实现行为识别方法实施例中提供的任一方法,且能达到相同的技术效果。其具体实现过程详见行为识别方法实施例,此处不再赘述。It should be noted that the behavior recognition device provided in the embodiment of the present application and the behavior recognition method in the above embodiment are of the same concept, and any method provided in the behavior recognition method embodiment can be implemented through the behavior recognition device, and the same technical effect can be achieved. The specific implementation process is detailed in the behavior recognition method embodiment, and will not be repeated here.
本申请实施例还提供一种电子设备,该电子设备可以是智能手机、折叠屏手机、平板电脑、掌上电脑、台式电脑等设备。如图6所示,图6为本申请实施例提供的电子设备的结构示意图。该电子设备300包括有一个或者一个以上处理核心的处理器301、有一个或一个以上计算机可读存储介质的存储器302及存储在存储器302上并可在处理器上运行的计算机程序。其中,处理器301与存储器302电性连接。本领域技术人员可以理解,图中示出的电子设备结构并不构成对电子设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。The embodiment of the present application also provides an electronic device, which may be a smart phone, a folding screen mobile phone, a tablet computer, a PDA, a desktop computer and the like. As shown in Figure 6, Figure 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application. The electronic device 300 includes a processor 301 having one or more processing cores, a memory 302 having one or more computer-readable storage media, and a computer program stored in the memory 302 and executable on the processor. Among them, the processor 301 is electrically connected to the memory 302. Those skilled in the art will appreciate that the electronic device structure shown in the figure does not constitute a limitation on the electronic device, and may include more or fewer components than shown, or a combination of certain components, or different component arrangements.
处理器301是电子设备300的控制中心,利用各种接口和线路连接整个电子设备300的各个部分,通过运行或加载存储在存储器302内的软件程序和/或模块,以及调用存储在存储器302内的数据,执行电子设备300的各种功能和处理数据,从而对电子设备300进行整体监控。The processor 301 is the control center of the electronic device 300. It uses various interfaces and lines to connect various parts of the entire electronic device 300, executes various functions of the electronic device 300 and processes data by running or loading software programs and/or modules stored in the memory 302, and calling data stored in the memory 302, thereby monitoring the electronic device 300 as a whole.
在本申请实施例中,电子设备300中的处理器301会按照如下的步骤,将一个或一个以上的应用程序的进程对应的指令加载到存储器302中,并由处理器301来运行存储在存储器302中的应用程序,从而实现各种功能:In the embodiment of the present application, the processor 301 in the electronic device 300 will load instructions corresponding to the processes of one or more application programs into the memory 302 according to the following steps, and the processor 301 will run the application programs stored in the memory 302 to implement various functions:
获取参考视频,参考视频包括指定的行为内容;Obtain a reference video, where the reference video includes specified behavior content;
获取需要进行行为识别的待识别视频;Obtain the video to be identified for behavior recognition;
通过基于孪生神经网络的视频判别模型,获取待识别视频与参考视频的相似度;The similarity between the video to be identified and the reference video is obtained through a video discrimination model based on a Siamese neural network;
根据相似度识别待识别视频是否包括指定的行为内容,得到识别结果。According to the similarity, it is determined whether the video to be identified includes the specified behavior content to obtain the identification result.
以上各个操作的具体实施可参见前面的实施例,在此不再赘述。The specific implementation of the above operations can be found in the previous embodiments, which will not be described in detail here.
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。In the above embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in a certain embodiment, reference can be made to the relevant descriptions of other embodiments.
由上可知,本实施例提供的电子设备,通过构建基于孪生神经网络的视频判别模型,并通过对视频判别模型进行预训练的方式以减小训练模型所需的样本数据,且通过对样本数据进行成对组合以作为双分支网络的输入,能够极大程度地扩展样本数据的规模,以提高模型的泛化能力。且通过视频判别模型确定参考视频与待识别视频之间的相似度,以在时序维度对两者进行相似度判别,能够减小相似度判别耗用的计算量,从而快速地得到对待识别视频的判别结果。另外,还能够实现对不定时长的初始视频进行行为识别,提高了行为识别的灵活性。As can be seen from the above, the electronic device provided in this embodiment constructs a video discrimination model based on a Siamese neural network and pre-trains it to reduce the amount of sample data required for training; by combining sample data into pairs as input to the dual-branch network, the scale of the sample data can be greatly expanded, improving the generalization ability of the model. By using the video discrimination model to determine the similarity between the reference video and the video to be identified, similarity discrimination is performed on the two in the temporal dimension, which reduces the computation required and quickly yields a discrimination result for the video to be identified. In addition, behavior recognition can be performed on initial videos of arbitrary length, improving the flexibility of behavior recognition.
本领域普通技术人员可以理解,上述实施例的各种方法中的全部或部分步骤可以通过指令来完成,或通过指令控制相关的硬件来完成,该指令可以存储于一计算机可读存储介质中,并由处理器进行加载和执行。A person of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be completed by instructions, or by controlling related hardware through instructions. The instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
为此,本申请实施例提供一种计算机可读存储介质。本领域普通技术人员可以理解,实现上述实施例方法中的全部或部分步骤是可以通过程序来指令相关的硬件完成的,该程序可以存储于一计算机可读取存储介质中,该程序在执行时,包括如下步骤:To this end, an embodiment of the present application provides a computer-readable storage medium. A person of ordinary skill in the art can understand that all or part of the steps in the methods of the above embodiments can be completed by instructing the relevant hardware through a program; the program can be stored in a computer-readable storage medium and, when executed, includes the following steps:
获取参考视频,参考视频包括指定的行为内容;Obtain a reference video, where the reference video includes specified behavior content;
获取需要进行行为识别的待识别视频;Obtain the video to be identified for behavior recognition;
通过基于孪生神经网络的视频判别模型,获取待识别视频与参考视频的相似度;The similarity between the video to be identified and the reference video is obtained through a video discrimination model based on a Siamese neural network;
根据相似度识别待识别视频是否包括指定的行为内容,得到识别结果。According to the similarity, it is determined whether the video to be identified includes the specified behavior content to obtain the identification result.
以上各个操作的具体实施可参见前面的实施例,在此不再赘述。The specific implementation of the above operations can be found in the previous embodiments, which will not be described in detail here.
上述的存储介质可以为ROM/RAM、磁碟、光盘等。由于该存储介质中所存储的计算机程序,可以执行本申请实施例所提供的任一种行为识别方法中的步骤,因此,可以实现本申请实施例所提供的任一种行为识别方法所能实现的有益效果,详见前面的实施例,在此不再赘述。The above-mentioned storage medium may be ROM/RAM, a magnetic disk, an optical disk, etc. Since the computer program stored in the storage medium can execute the steps in any behavior recognition method provided in the embodiments of the present application, the beneficial effects that can be achieved by any behavior recognition method provided in the embodiments of the present application can be achieved, as detailed in the previous embodiments, which will not be repeated here.
以上对本申请实施例所提供的一种行为识别方法、存储介质及电子设备进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上,本说明书内容不应理解为对本申请的限制。The behavior recognition method, storage medium and electronic device provided in the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application; the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, those skilled in the art may, based on the ideas of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (20)

  1. 一种行为识别方法,其中,包括:A behavior recognition method, comprising:
    获取参考视频,所述参考视频包括指定的行为内容;Acquire a reference video, wherein the reference video includes specified behavior content;
    获取需要进行行为识别的待识别视频;Obtain the video to be identified for behavior recognition;
    通过基于孪生神经网络的视频判别模型,获取所述待识别视频与所述参考视频的相似度;Obtaining the similarity between the video to be identified and the reference video through a video discrimination model based on a Siamese neural network;
    根据所述相似度识别所述待识别视频是否包括所述指定的行为内容,得到识别结果。It is determined whether the video to be identified includes the specified behavior content according to the similarity to obtain an identification result.
  2. 根据权利要求1所述的行为识别方法,其中,所述获取需要进行行为识别的待识别视频,包括:According to the behavior recognition method of claim 1, wherein the step of obtaining a video to be recognized that requires behavior recognition comprises:
    获取需要进行行为识别的初始视频;Obtain the initial video for behavior recognition;
    根据所述参考视频的长度从所述初始视频中截取至少一个视频片段;Extracting at least one video segment from the initial video according to the length of the reference video;
    将截取的所述至少一个视频片段确定为所述待识别视频。The at least one captured video segment is determined as the video to be identified.
  3. 根据权利要求2所述的行为识别方法,其中,当所述待识别视频包括至少两个视频片段时,所述至少两个视频片段中相邻的视频片段部分重合,所述根据所述相似度识别所述待识别视频是否包括所述指定的行为内容,得到识别结果之后,还包括:According to the behavior recognition method of claim 2, when the video to be recognized includes at least two video segments, adjacent video segments of the at least two video segments partially overlap, and the identifying whether the video to be recognized includes the specified behavior content according to the similarity, after obtaining the recognition result, further includes:
    根据同一视频帧在不同待识别视频中的识别结果,确定所述初始视频中各视频帧的评分;Determining the score of each video frame in the initial video according to the recognition results of the same video frame in different videos to be recognized;
    根据所述初始视频中各视频帧的评分,确定所述初始视频的总评分;Determining a total score of the initial video according to the score of each video frame in the initial video;
    若所述总评分大于预设阈值,则确定所述初始视频包括所述指定的行为内容;If the total score is greater than a preset threshold, determining that the initial video includes the specified behavioral content;
    若所述总评分不大于所述预设阈值,则确定所述初始视频不包括所述指定的行为内容。If the total score is not greater than the preset threshold, it is determined that the initial video does not include the specified behavior content.
  4. 根据权利要求3所述的行为识别方法,其中,所述确定所述初始视频包括所述指定的行为内容之后,还包括:The behavior recognition method according to claim 3, wherein after determining that the initial video includes the specified behavior content, it also includes:
    按照所述指定的行为内容对所述初始视频进行标识。The initial video is marked according to the specified behavior content.
  5. 根据权利要求2所述的行为识别方法,其中,当所述待识别视频包括至少两个视频片段时,所述至少两个视频片段中相邻的视频片段不重合,所述根据所述相似度识别所述待识别视频是否包括所述指定的行为内容,得到识别结果之后,还包括:According to the behavior recognition method of claim 2, when the video to be recognized includes at least two video segments, adjacent video segments of the at least two video segments do not overlap, and the step of identifying whether the video to be recognized includes the specified behavior content according to the similarity, after obtaining the recognition result, further comprises:
    根据各待识别视频的识别结果,确定初始视频的总评分;Determine the total score of the initial video according to the recognition results of each video to be recognized;
    若所述总评分大于预设阈值,则确定所述初始视频包括所述指定的行为内容;If the total score is greater than a preset threshold, determining that the initial video includes the specified behavioral content;
    若所述总评分不大于所述预设阈值,则确定所述初始视频不包括所述指定的行为内容。If the total score is not greater than the preset threshold, it is determined that the initial video does not include the specified behavior content.
  6. 根据权利要求2所述的行为识别方法,其中,所述获取需要进行行为识别的初始视频之前,还包括:The behavior recognition method according to claim 2, wherein, before obtaining the initial video for behavior recognition, the method further comprises:
    获取原始视频;Get the original video;
    确定所述原始视频所属的生物与所述参考视频所属的生物是否相同;Determining whether the organism to which the original video belongs is the same as the organism to which the reference video belongs;
    若是,则将所述原始视频确定为需要进行行为识别的初始视频。If so, the original video is determined as the initial video for which behavior recognition is required.
  7. 根据权利要求2所述的行为识别方法,其中,所述初始视频的长度小于所述参考视频的长度,所述根据所述参考视频的长度从所述初始视频中截取至少一个视频片段,包括:According to the behavior recognition method of claim 2, wherein the length of the initial video is less than the length of the reference video, and the extracting at least one video segment from the initial video according to the length of the reference video comprises:
    按照所述参考视频的长度对所述初始视频进行补帧处理,得到一个视频片段。The initial video is subjected to frame interpolation processing according to the length of the reference video to obtain a video clip.
  8. 根据权利要求1所述的行为识别方法,其中,所述视频判别模型包括第一特征提取分支、第二特征提取分支、特征融合模块以及相似度判别模块,所述第一特征提取分支和所述第二特征提取分支的结构相同,且网络参数共享,所述通过基于孪生神经网络的视频判别模型,获取所述待识别视频与所述参考视频的相似度,包括:According to the behavior recognition method of claim 1, wherein the video discrimination model includes a first feature extraction branch, a second feature extraction branch, a feature fusion module and a similarity discrimination module, the first feature extraction branch and the second feature extraction branch have the same structure and share network parameters, and the similarity between the video to be identified and the reference video is obtained by using the video discrimination model based on the Siamese neural network, including:
    将所述参考视频输入所述第一特征提取分支进行特征提取,得到第一帧序列特征;Inputting the reference video into the first feature extraction branch to perform feature extraction to obtain first frame sequence features;
    将所述待识别视频输入所述第二特征提取分支进行特征提取,得到第二帧序列特征;Inputting the to-be-recognized video into the second feature extraction branch to perform feature extraction to obtain a second frame sequence feature;
    将所述第一帧序列特征和所述第二帧序列特征输入所述特征融合模块进行特征融合处理,得到融合特征;Inputting the first frame sequence feature and the second frame sequence feature into the feature fusion module for feature fusion processing to obtain a fusion feature;
    将所述融合特征输入所述相似度判别模块进行相似度判别,得到所述待识别视频与所述参考视频的相似度。The fusion feature is input into the similarity determination module for similarity determination to obtain the similarity between the video to be identified and the reference video.
  9. 根据权利要求8所述的行为识别方法,其中,所述第一特征提取分支包括基于自注意力机制的第一特征提取层,所述第二特征提取分支包括基于自注意力机制的第二特征提取层。According to the behavior recognition method according to claim 8, wherein the first feature extraction branch includes a first feature extraction layer based on a self-attention mechanism, and the second feature extraction branch includes a second feature extraction layer based on a self-attention mechanism.
  10. 根据权利要求8所述的行为识别方法,其中,所述将所述第一帧序列特征和所述第二帧序列特征输入所述特征融合模块进行特征融合处理,得到融合特征,包括:According to the behavior recognition method of claim 8, wherein the step of inputting the first frame sequence feature and the second frame sequence feature into the feature fusion module for feature fusion processing to obtain the fusion feature comprises:
    将所述第一帧序列特征和所述第二帧序列特征输入所述特征融合模块进行向量相减,得到所述融合特征。The first frame sequence feature and the second frame sequence feature are input into the feature fusion module to perform vector subtraction to obtain the fusion feature.
  11. 根据权利要求8所述的行为识别方法,其中,所述获取参考视频之前,还包括:The behavior recognition method according to claim 8, wherein before obtaining the reference video, the method further comprises:
    获取预训练的初始模型,所述初始模型根据不同行为内容的第一视频样本预训练得到;Obtaining a pre-trained initial model, where the initial model is pre-trained according to first video samples with different behavioral contents;
    获取对应指定行为内容的第二视频样本;Obtain a second video sample corresponding to the specified behavior content;
    根据所述第二视频样本对所述初始模型进行参数调整,得到所述视频判别模型。The parameters of the initial model are adjusted according to the second video sample to obtain the video discrimination model.
  12. 根据权利要求11所述的行为识别方法,其中,所述根据所述第二视频样本对所述初始模型进行参数调整,得到所述视频判别模型,包括:According to the behavior recognition method of claim 11, wherein the step of adjusting parameters of the initial model according to the second video sample to obtain the video discrimination model comprises:
    根据所述第二视频样本对所述特征融合模块和所述相似度判别模块进行参数调整,得到所述视频判别模型。The parameters of the feature fusion module and the similarity discrimination module are adjusted according to the second video sample to obtain the video discrimination model.
  13. 根据权利要求12所述的行为识别方法,其中,所述根据所述第二视频样本对所述特征融合模块和所述相似度判别模块进行参数调整,得到所述视频判别模型之前,还包括:According to the behavior recognition method of claim 12, before adjusting the parameters of the feature fusion module and the similarity discrimination module according to the second video sample to obtain the video discrimination model, the method further comprises:
    冻结所述第一特征提取分支和所述第二特征提取分支的参数。Parameters of the first feature extraction branch and the second feature extraction branch are frozen.
  14. 根据权利要求11所述的行为识别方法,其中,所述指定行为包括指定动物的指定行为,所述获取对应指定行为内容的第二视频样本,包括:According to the behavior recognition method of claim 11, wherein the specified behavior includes a specified behavior of a specified animal, and the step of obtaining a second video sample corresponding to the specified behavior content includes:
    获取标记为所述指定行为的指定动物视频样本;Obtaining a specified animal video sample marked with the specified behavior;
    对所述指定动物视频样本进行数据增广处理,得到所述第二视频样本。Perform data augmentation processing on the designated animal video sample to obtain the second video sample.
  15. 一种行为识别装置,其中,包括:A behavior recognition device, comprising:
    第一视频获取模块,用于获取参考视频,所述参考视频包括指定的行为内容;A first video acquisition module, used to acquire a reference video, wherein the reference video includes specified behavior content;
    第二视频获取模块,用于获取需要进行行为识别的待识别视频;The second video acquisition module is used to acquire a video to be identified that needs to be identified;
    相似度判别模块,用于通过基于孪生神经网络的视频判别模型,获取所述待识别视频与所述参考视频的相似度;A similarity discrimination module, used to obtain the similarity between the video to be identified and the reference video through a video discrimination model based on a Siamese neural network;
    行为识别模块,用于根据所述相似度识别所述待识别视频是否包括所述指定的行为内容,得到识别结果。The behavior recognition module is used to recognize whether the video to be recognized includes the specified behavior content according to the similarity, and obtain a recognition result.
  16. 根据权利要求15所述的行为识别装置,其中,第二视频获取模块还用于:The behavior recognition device according to claim 15, wherein the second video acquisition module is further used for:
    获取需要进行行为识别的初始视频;Obtain the initial video for behavior recognition;
    根据所述参考视频的长度从所述初始视频中截取至少一个视频片段;Extracting at least one video segment from the initial video according to the length of the reference video;
    将截取的所述至少一个视频片段确定为所述待识别视频。The at least one captured video segment is determined as the video to be identified.
  17. 根据权利要求16所述的行为识别装置,其中,当所述待识别视频包括至少两个视频片段时,所述至少两个视频片段中相邻的视频片段部分重合,所述根据所述相似度识别所述待识别视频是否包括所述指定的行为内容,得到识别结果之后,行为识别模块还用于:According to the behavior recognition device of claim 16, when the video to be recognized includes at least two video segments, adjacent video segments of the at least two video segments partially overlap, and after the similarity is used to identify whether the video to be recognized includes the specified behavior content, and the recognition result is obtained, the behavior recognition module is further used to:
    根据同一视频帧在不同待识别视频中的识别结果,确定所述初始视频中各视频帧的评分;Determining the score of each video frame in the initial video according to the recognition results of the same video frame in different videos to be recognized;
    根据所述初始视频中各视频帧的评分,确定所述初始视频的总评分;Determining a total score of the initial video according to the score of each video frame in the initial video;
    若所述总评分大于预设阈值,则确定所述初始视频包括所述指定的行为内容;If the total score is greater than a preset threshold, determining that the initial video includes the specified behavioral content;
    若所述总评分不大于所述预设阈值,则确定所述初始视频不包括所述指定的行为内容。If the total score is not greater than the preset threshold, it is determined that the initial video does not include the specified behavior content.
  18. 根据权利要求16所述的行为识别装置,其中,所述获取需要进行行为识别的初始视频之前,第二视频获取模块还用于:According to the behavior recognition device of claim 16, wherein, before acquiring the initial video for behavior recognition, the second video acquisition module is further used to:
    获取原始视频;Get the original video;
    确定所述原始视频所属的生物与所述参考视频所属的生物是否相同;Determining whether the organism to which the original video belongs is the same as the organism to which the reference video belongs;
    若是,则将所述原始视频确定为需要进行行为识别的初始视频。If so, the original video is determined as the initial video for which behavior recognition is required.
  19. 一种计算机可读存储介质,其上存储有计算机程序,其中,当所述计算机程序在计算机上运行时,使得所述计算机执行如权利要求1至18任一项所述的行为识别方法。A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is run on a computer, the computer is caused to execute the behavior recognition method as claimed in any one of claims 1 to 18.
  20. 一种电子设备,包括处理器和存储器,所述存储器存储有计算机程序,其中,所述处理器通过调用所述计算机程序,用于执行如权利要求1至18任一项所述的行为识别方法。An electronic device comprises a processor and a memory, wherein the memory stores a computer program, wherein the processor is used to execute the behavior recognition method as claimed in any one of claims 1 to 18 by calling the computer program.
PCT/CN2022/133025 2022-11-18 2022-11-18 Behavior recognition method, storage medium and electronic device WO2024103417A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/133025 WO2024103417A1 (en) 2022-11-18 2022-11-18 Behavior recognition method, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/133025 WO2024103417A1 (en) 2022-11-18 2022-11-18 Behavior recognition method, storage medium and electronic device

Publications (1)

Publication Number Publication Date
WO2024103417A1 true WO2024103417A1 (en) 2024-05-23

Family

ID=91083541

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/133025 WO2024103417A1 (en) 2022-11-18 2022-11-18 Behavior recognition method, storage medium and electronic device

Country Status (1)

Country Link
WO (1) WO2024103417A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388888A (en) * 2018-03-23 2018-08-10 腾讯科技(深圳)有限公司 A kind of vehicle identification method, device and storage medium
CN111028222A (en) * 2019-12-11 2020-04-17 广州视源电子科技股份有限公司 Video detection method and device, computer storage medium and related equipment
CN111523510A (en) * 2020-05-08 2020-08-11 国家邮政局邮政业安全中心 Behavior recognition method, behavior recognition device, behavior recognition system, electronic equipment and storage medium
WO2020196985A1 (en) * 2019-03-27 2020-10-01 연세대학교 산학협력단 Apparatus and method for video action recognition and action section detection
CN113177450A (en) * 2021-04-20 2021-07-27 北京有竹居网络技术有限公司 Behavior recognition method and device, electronic equipment and storage medium
CN114202719A (en) * 2021-11-12 2022-03-18 中原动力智能机器人有限公司 Video sample labeling method and device, computer equipment and storage medium
CN115035463A (en) * 2022-08-09 2022-09-09 阿里巴巴(中国)有限公司 Behavior recognition method, device, equipment and storage medium


Similar Documents

Publication Publication Date Title
He et al. Automatic depression recognition using CNN with attention mechanism from videos
Dalvi et al. A survey of ai-based facial emotion recognition: Features, ml & dl techniques, age-wise datasets and future directions
Zhang et al. Multimodal learning for facial expression recognition
Wang et al. Unsupervised learning of visual representations using videos
Sariyanidi et al. Automatic analysis of facial affect: A survey of registration, representation, and recognition
Chen et al. Classification of drinking and drinker-playing in pigs by a video-based deep learning method
Kollias et al. Training deep neural networks with different datasets in-the-wild: The emotion recognition paradigm
Chen et al. Automatic social signal analysis: Facial expression recognition using difference convolution neural network
Salunke et al. A new approach for automatic face emotion recognition and classification based on deep networks
Mici et al. A self-organizing neural network architecture for learning human-object interactions
KR102128158B1 (en) Emotion recognition apparatus and method based on spatiotemporal attention
CN113673244B (en) Medical text processing method, medical text processing device, computer equipment and storage medium
Wang et al. Cross-agent action recognition
CN114511912A (en) Cross-library micro-expression recognition method and device based on double-current convolutional neural network
Ousmane et al. Automatic recognition system of emotions expressed through the face using machine learning: Application to police interrogation simulation
Qiao et al. Automated individual cattle identification using video data: a unified deep learning architecture approach
Samadiani et al. A novel video emotion recognition system in the wild using a random forest classifier
Sara et al. A deep learning facial expression recognition based scoring system for restaurants
CN115713807A (en) Behavior recognition method, storage medium, and electronic device
Jagadeesh et al. Dynamic FERNet: Deep learning with optimal feature selection for face expression recognition in video
Wang et al. Deep learning (DL)-enabled system for emotional big data
WO2024103417A1 (en) Behavior recognition method, storage medium and electronic device
Liu Improved convolutional neural networks for course teaching quality assessment
Kumbhar et al. Gender and Age Detection using Deep Learning
Laptev Modeling and visual recognition of human actions and interactions

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22965596

Country of ref document: EP

Kind code of ref document: A1