CN115035463A - Behavior recognition method, device, equipment and storage medium - Google Patents

Behavior recognition method, device, equipment and storage medium

Info

Publication number
CN115035463A
Authority
CN
China
Prior art keywords
video
category
behavior
training sample
sample
Prior art date
Legal status
Granted
Application number
CN202210952356.XA
Other languages
Chinese (zh)
Other versions
CN115035463B (en)
Inventor
岑俊
张士伟
吕逸良
赵德丽
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210952356.XA
Publication of CN115035463A
Application granted
Publication of CN115035463B
Legal status: Active

Classifications

    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 10/40: Extraction of image or video features
    • G06V 10/764: Image or video recognition or understanding using machine-learning classification, e.g. of video objects
    • G06V 10/765: Classification using rules for classification or partitioning the feature space
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/20: Recognition of movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a behavior recognition method, apparatus, device and storage medium, wherein the behavior recognition method comprises the following steps: inputting a video to be recognized into a recognition model to obtain the video feature corresponding to the video to be recognized, wherein the recognition model is used for recognizing at least one behavior category; obtaining target sample features corresponding to a plurality of training sample videos, wherein the plurality of training sample videos correspond to the at least one behavior category; determining the uncertainty corresponding to the video to be recognized according to the video feature and the obtained target sample features; determining the domain category of the video to be recognized according to the uncertainty, wherein the domain category is either the intra-domain category containing the at least one behavior category or the out-of-domain category not containing the at least one behavior category; and, if the video to be recognized belongs to the intra-domain category, determining the target behavior category of the video to be recognized under the intra-domain category. By determining the domain category of the video to be recognized from the uncertainty, the scheme improves the accuracy of the behavior recognition result.

Description

Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a behavior recognition method, a behavior recognition device, behavior recognition equipment and a storage medium.
Background
Classification recognition, such as behavior classification recognition, is an important application direction of deep learning technology, and is also a basic task in video analysis. Behavior recognition for a video refers to analyzing the category of the motion of the target person in the video. To complete the behavior recognition task, a recognition model is generally trained on a large amount of labeled training data, so that the recognition model can learn the feature information of different behavior categories and thereby complete the behavior recognition task for the video.
The traditional behavior recognition task trains a model on a large amount of labeled training data, all of which belongs to in-domain data (in-distribution data). The trained model can therefore only correctly classify in-domain data; data that never appeared during training, i.e., out-of-domain data (out-of-distribution data), is misclassified into one of the in-domain classes. This limits practical application and degrades the accuracy of the recognition result.
Disclosure of Invention
The embodiment of the invention provides a behavior recognition method, a behavior recognition device, equipment and a storage medium, which are used for improving the accuracy of a behavior recognition result.
In a first aspect, an embodiment of the present invention provides a behavior identification method, where the method includes:
inputting a video to be identified into an identification model to obtain video characteristics corresponding to the video to be identified, wherein the identification model is used for identifying at least one behavior category;
obtaining target sample characteristics corresponding to a plurality of training sample videos respectively, wherein the plurality of training sample videos correspond to the at least one behavior category;
determining the uncertainty corresponding to the video to be recognized according to the video characteristics and the target sample characteristics corresponding to the training sample videos;
determining a domain category of the video to be recognized according to the uncertainty, wherein the domain category is an intra-domain category containing the at least one behavior category or an extra-domain category not containing the at least one behavior category;
and if the video to be identified belongs to the intra-domain category, determining a target behavior category corresponding to the video to be identified under the intra-domain category.
In a second aspect, an embodiment of the present invention provides a behavior recognition apparatus, where the apparatus includes:
the system comprises an extraction module, a recognition module and a processing module, wherein the extraction module is used for inputting a video to be recognized into a recognition model so as to obtain video characteristics corresponding to the video to be recognized, and the recognition model is used for recognizing at least one behavior category;
the determining module is used for acquiring target sample characteristics corresponding to a plurality of training sample videos respectively, wherein the plurality of training sample videos correspond to the at least one behavior category; according to the video features and the target sample features corresponding to the training sample videos, determining the uncertainty corresponding to the video to be identified;
the identification module is used for determining the domain category of the video to be identified according to the uncertainty, wherein the domain category is an in-domain category containing the at least one behavior category or an out-of-domain category not containing the at least one behavior category; and if the video to be identified belongs to the intra-domain category, determining a target behavior category corresponding to the video to be identified under the intra-domain category.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor, a communication interface; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the behaviour recognition method according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to implement at least the behavior recognition method according to the first aspect.
In a fifth aspect, an embodiment of the present invention provides a behavior identification method, where the method includes:
receiving a request triggered by user equipment by calling a behavior recognition service, wherein the request comprises a video to be recognized;
executing the following steps by utilizing the processing resource corresponding to the behavior recognition service:
inputting a video to be recognized into a recognition model to obtain video characteristics corresponding to the video to be recognized, wherein the recognition model is used for recognizing at least one behavior category;
obtaining target sample characteristics corresponding to a plurality of training sample videos respectively, wherein the plurality of training sample videos correspond to the at least one behavior category;
determining the uncertainty corresponding to the video to be recognized according to the video characteristics and the target sample characteristics corresponding to the training sample videos;
determining a domain category of the video to be identified according to the uncertainty, wherein the domain category is an intra-domain category containing the at least one behavior category or an extra-domain category not containing the at least one behavior category;
and if the video to be identified belongs to the intra-domain category, determining a target behavior category corresponding to the video to be identified under the intra-domain category.
In the embodiment of the invention, when the action category of the target person in the video needs to be identified, firstly, the video to be identified is input into the identification model to obtain the video characteristics corresponding to the video to be identified. The recognition model is used for recognizing at least one behavior category, and target sample characteristics corresponding to a plurality of training sample videos can be obtained based on the trained recognition model, wherein the training sample videos correspond to the at least one behavior category recognizable by the recognition model. And then, according to the video characteristics of the video to be recognized and the target sample characteristics corresponding to the training sample videos, determining the uncertainty corresponding to the video to be recognized, according to the uncertainty, determining whether the video to be recognized belongs to the intra-domain category or the out-of-domain category, and if the video to be recognized belongs to the intra-domain category, determining the target behavior category corresponding to the video to be recognized under the intra-domain category.
In the above scheme, the trained recognition model is used to obtain the target sample characteristics corresponding to each of the multiple training sample videos of at least one behavior category, which means that the recognition model pays attention to the differences of the characteristics of different training sample videos in the same behavior category, and the learning of the differences is helpful to better and accurately recognize whether the video to be recognized belongs to the intra-domain category (i.e., whether the video belongs to one of the at least one behavior category). Specifically, based on the target sample characteristics and the video characteristics of the video to be recognized, the uncertainty index for distinguishing whether the video to be recognized belongs to the intra-domain category or the out-domain category can be determined, so that the domain classification of the video to be recognized is completed, the video to be recognized belonging to the out-domain category can be prevented from being classified to a certain intra-domain category in a wrong manner, and the accuracy of the behavior recognition result can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a flowchart of a behavior recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a calculation process of uncertainty provided by an embodiment of the present invention;
fig. 3 is a schematic application diagram of a behavior recognition method according to an embodiment of the present invention;
FIG. 4 is a flowchart of a recognition model training method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an implementation of the embodiment shown in FIG. 4;
FIG. 6 is a flowchart of another recognition model training method according to an embodiment of the present invention;
FIG. 7 is a schematic illustration of an implementation of the embodiment shown in FIG. 6;
fig. 8 is a schematic application diagram of a behavior recognition method according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a behavior recognition apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device provided in this embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.
The following concepts involved in the embodiments of the present invention will be explained.
Behavior recognition, which is related to behavior recognition for video in the present embodiment, refers to analyzing the category of the motion of the target person in the video, that is, recognizing the behavior category of the target person in the video, such as: fist making, clapping, etc. To complete the behavior recognition task, a recognition model is usually trained based on a large amount of labeled training data, the recognition model can learn feature information of different behavior classes, and then the recognition of at least one behavior class can be completed by using the recognition model.
The intra-domain category refers to at least one behavior category that can be recognized by the trained recognition model, that is, at least one behavior category labeled corresponding to a large number of training sample videos in the model training process.
The out-of-domain category refers to a category of behavior other than at least one category of behavior that the recognition model can recognize, as opposed to the in-domain category.
Inter-class information is the information required to distinguish different behavior classes. The more inter-class information the recognition model extracts from the training sample videos, the stronger its ability to distinguish the different in-domain classes, and the better the classification effect.
Intra-class information refers to the differing information contained in different training sample videos of the same behavior class. Different training sample videos belonging to the same behavior category each carry their own distinctive (differential) feature information. This differential information has no effect on classifying the different in-domain behavior classes, but it plays an important role in distinguishing in-domain from out-of-domain categories.
Closed-set behavior recognition: using a recognition model trained on a large number of training sample videos corresponding to at least one behavior category to recognize which of the at least one behavior category an input video belongs to is called closed-set behavior recognition. The closed set refers to classification and recognition confined to the at least one (in-domain) behavior category.
Open-set behavior recognition: closed-set behavior recognition can only correctly classify the in-domain behavior classes, and will misclassify data that never appeared during training, i.e., out-of-domain data, into some in-domain class, which limits practical application. Open-set behavior recognition therefore requires the recognition model both to correctly classify the in-domain classes and to recognize out-of-domain data. That is to say, in practical applications there is generally no guarantee that every video input into the recognition model belongs to an in-domain category; when an input video m does not belong to an in-domain category, conventional closed-set behavior recognition may wrongly assign video m to one of the in-domain behavior categories, producing an erroneous prediction of the behavior category of video m.
The embodiment of the invention provides a behavior recognition scheme, which can also be called open set behavior recognition, and the open set behavior recognition not only can correctly classify the in-domain classes but also can recognize the out-of-domain classes through a trained recognition model. The overall idea of the scheme is as follows: firstly, based on the video characteristics corresponding to the video to be recognized and the target sample characteristics corresponding to a plurality of training sample videos for training the recognition model, the uncertainty corresponding to the video to be recognized, namely the probability that the video to be recognized belongs to the category outside the domain, is determined. And then, determining whether the video to be identified belongs to the intra-domain category or the out-of-domain category according to the uncertainty, and further determining the corresponding behavior category of the video to be identified under the intra-domain category if the video to be identified belongs to the intra-domain category. Therefore, the method can avoid the error classification of the class outside the domain into a class in a certain domain, and improve the accuracy of the behavior recognition result.
The behavior recognition method provided by the embodiment of the invention can be executed by an electronic device, the electronic device can be a server or a user terminal, and the server can be a physical server or a virtual server (virtual machine) of a cloud.
Fig. 1 is a flowchart of a behavior recognition method according to an embodiment of the present invention, as shown in fig. 1, the method includes the following steps:
101. and inputting the video to be recognized into the recognition model to obtain the video characteristics corresponding to the video to be recognized, wherein the recognition model is used for recognizing at least one behavior category.
102. And acquiring target sample characteristics corresponding to the training sample videos respectively, wherein the training sample videos correspond to at least one behavior category.
103. And according to the video characteristics and the target sample characteristics corresponding to the training sample videos, determining the uncertainty corresponding to the video to be identified.
104. And determining the domain category of the video to be recognized according to the uncertainty, wherein the domain category is an in-domain category containing at least one behavior category recognizable by the recognition model or an out-of-domain category not containing at least one behavior category recognizable by the recognition model.
105. And if the video to be identified belongs to the intra-domain category, determining the corresponding target behavior category of the video to be identified under the intra-domain category.
In this embodiment, a recognition model capable of recognizing at least one behavior category may be trained in advance based on a plurality of training sample videos corresponding to at least one behavior category. Optionally, the recognition model is composed of three functional modules: sampler, characteristic extractor, classifier. The model training process for the recognition model will be described in detail below. Since the model training process is similar to the process of using the model to perform behavior recognition on the video to be recognized, only the process of using the model will be described first.
In practical applications, the recognition model may adopt, for example, model structures such as the Temporal Shift Module (TSM) or the Inflated 3D ConvNet (I3D), but is not limited thereto.
In open set behavior recognition, one important role that the recognition model plays is feature extraction, including: and extracting the video features of the video to be identified and extracting the video features corresponding to the training sample videos.
For convenience of distinguishing, in this embodiment, video features corresponding to a plurality of training sample videos are referred to as target sample features, and the video features hereinafter refer to video features of videos to be recognized.
When the video features of the video to be recognized are extracted, the working process of the recognition model is as follows:
sampling an input video to be identified through a sampler to obtain an image set containing a plurality of frames of images;
processing the plurality of frames of images in the image set through a feature extractor to obtain a plurality of feature matrixes corresponding to the plurality of frames of images;
and performing pooling processing on the plurality of feature matrixes to obtain video features corresponding to the video to be identified.
Assuming that the length of the input video to be recognized is 10 seconds and the frame rate is 30 fps, the original input video has 300 frames in total. Assuming the preset sampling mode is fixed-interval sampling, 16 frames are sampled from the 300 frames based on the set fixed sampling interval, expressed as:
Ts = [1, 20, …, 300], where Ts represents the image set containing the sampled frames, and 1, 20, … are the indices of the sampled frames. In this way, the sampler samples, from the global perspective of the video to be recognized, a set of frames reflecting the global information of the video. In an alternative embodiment, after the image set is sampled, each frame in the set may further be scaled so that the scaled image size matches the input image size required by the recognition model.
Then, each frame image included in the image set is input to a feature extractor of the recognition model to extract a feature matrix (which may also be referred to as a feature map) corresponding to each frame image, thereby obtaining 16 feature matrices corresponding to 16 frame images.
And then performing pooling (Pooling) processing such as global average pooling on the obtained 16 feature matrices to obtain a pooled feature matrix, wherein the pooled feature matrix is the video feature of the video to be identified.
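As an illustration of this sampler, feature extractor and pooling flow, the following is a minimal sketch in PyTorch; the backbone argument stands in for any per-frame feature extractor, and all names here are illustrative assumptions rather than taken from the patent:

```python
import torch

def extract_video_feature(frames, backbone, num_samples=16):
    """Sample frames at a fixed interval, extract a feature matrix per
    frame, and global-average-pool the matrices into one video feature.

    frames:   (T, C, H, W) tensor holding the decoded video
    backbone: any per-frame feature extractor mapping (N, C, H, W) -> (N, D)
    """
    t = frames.shape[0]
    # Sampler step: fixed-interval sampling, e.g. 16 of 300 frames.
    idx = torch.linspace(0, t - 1, num_samples).long()
    sampled = frames[idx]                    # (num_samples, C, H, W)
    feats = backbone(sampled)                # (num_samples, D) feature matrices
    # Pooling step: global average pooling over the sampled frames.
    return feats.mean(dim=0)                 # (D,) video feature

# Stand-in backbone for illustration: averages each frame over H and W.
frames = torch.rand(300, 3, 224, 224)        # a 10 s clip at 30 fps
video_feature = extract_video_feature(frames, lambda b: b.mean(dim=(2, 3)))
```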
The process of obtaining the target sample characteristics corresponding to each of the plurality of training sample videos is similar to the process of extracting the video characteristics of the video to be recognized, and specifically, the plurality of training sample videos can be respectively input into a trained recognition model to obtain the target sample characteristics of each of the plurality of training sample videos.
After the video features and the target sample features corresponding to the training sample videos are obtained, the probability that the video to be identified belongs to the class outside the domain can be determined according to the video features and the target sample features. In this embodiment, the uncertainty represents the probability that the video to be recognized belongs to the out-of-domain category.
In a specific implementation process, optionally, sample statistical characteristics corresponding to a plurality of training sample videos may be determined according to target sample characteristics corresponding to the plurality of training sample videos; and then, according to the video characteristics and the sample statistical characteristics corresponding to the plurality of training sample videos, determining the uncertainty corresponding to the video to be identified. Wherein, the sample statistical characteristics include: and the mean value and covariance matrix of the target sample characteristics corresponding to the training sample videos respectively.
Optionally, the method for determining the uncertainty corresponding to the video to be recognized according to the video features and the target sample features corresponding to the multiple training sample videos includes the following steps:
determining sample statistical characteristics corresponding to the training sample videos according to target sample characteristics corresponding to the training sample videos, wherein the sample statistical characteristics comprise mean values and covariance matrixes of the target sample characteristics;
determining a feature difference value between the video feature and a mean of the plurality of target sample features;
and determining the uncertainty corresponding to the video to be identified according to the characteristic difference and the covariance matrix.
Optionally, determining an uncertainty corresponding to the video to be identified according to the feature difference and the covariance matrix, including:
and determining the product of the transpose of the characteristic difference value, the inverse of the covariance matrix and the characteristic difference value as the corresponding uncertainty of the video to be identified.
Based on the above, the uncertainty corresponding to the video to be identified can be calculated based on equation (1), which is the Mahalanobis distance:

U(x) = (x - μ)^T Σ^{-1} (x - μ)   (1)

where U(x) represents the uncertainty of the video to be identified, x represents the video feature of the video to be identified, m represents the number of target sample features (i.e., the number of training sample videos), μ represents the mean of the m target sample features, (x - μ) represents the feature difference, (x - μ)^T represents the transpose of the feature difference, and Σ^{-1} represents the inverse of the covariance matrix of the m target sample features.
For ease of understanding, the process of calculating the uncertainty corresponding to the video to be identified is described with reference to fig. 2.
As shown in fig. 2, assuming that there are m training sample videos, inputting the m training sample videos into the recognition model yields m target sample features E1, E2, …, Em. The mean μ of the m target sample features is obtained by averaging E1, E2, …, Em, and the covariance matrix Σ of E1, E2, …, Em is computed. Then the feature difference (x - μ) between the video feature x of the video to be identified and the mean μ is calculated. Finally, the product of the transpose (x - μ)^T, the inverse covariance matrix Σ^{-1} and the feature difference (x - μ) is determined as the uncertainty U(x) of the video to be identified.
The uncertainty is used for representing the probability that the video to be identified belongs to the category outside the domain, in practical application, an uncertainty threshold value can be preset, and the video to be identified is determined to belong to the category inside the domain or the category outside the domain according to the actually calculated magnitude relation between the uncertainty and the uncertainty threshold value. Such as: when the uncertainty corresponding to the video to be recognized is larger than an uncertainty threshold value, determining that the video to be recognized belongs to the category outside the domain; and when the uncertainty corresponding to the video to be identified is less than or equal to the uncertainty threshold value, determining that the video to be identified belongs to the intra-domain category.
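As a concrete illustration, the following sketch computes the equation (1) uncertainty and applies a preset threshold; it assumes NumPy, and the threshold value and random data are purely illustrative:

```python
import numpy as np

def uncertainty(x, train_feats):
    """Equation (1): Mahalanobis-distance uncertainty.

    x:           (D,) video feature of the video to be identified
    train_feats: (m, D) target sample features of the m training sample videos
    """
    mu = train_feats.mean(axis=0)            # mean of the m target sample features
    cov = np.cov(train_feats, rowvar=False)  # (D, D) covariance matrix
    diff = x - mu                            # feature difference
    # (x - mu)^T Sigma^{-1} (x - mu); pinv guards against a singular covariance.
    return float(diff @ np.linalg.pinv(cov) @ diff)

# Illustrative usage: m = 100 training sample features, D = 32.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(100, 32))
x = rng.normal(size=32)                      # feature of the video to be identified
LAMBDA = 60.0                                # preset uncertainty threshold (illustrative)
in_domain = uncertainty(x, train_feats) <= LAMBDA
```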
Optionally, if the video to be recognized belongs to the class outside the domain, outputting prompt information for prompting a new behavior class mark to the video to be recognized, so as to perform optimization training on the recognition model.
And if the video to be identified belongs to the intra-domain category, determining the corresponding target behavior category of the video to be identified under the intra-domain category.
The method for determining the target behavior category corresponding to the video to be recognized under the intra-domain category comprises the following steps:
acquiring target class characteristics corresponding to at least one behavior class, wherein the target class characteristics are parameters trained in the recognition model;
and determining the target behavior category corresponding to the video to be identified under the intra-domain category according to the similarity between the video characteristics and the target category characteristics corresponding to the at least one behavior category. In an alternative embodiment, the similarity corresponding to the video feature and the target class feature corresponding to each of the at least one behavior class may be determined by calculating a cosine distance between the video feature and the target class feature.
The target category characteristics of a certain behavior category can be understood as the general characteristics of the behavior category. When the recognition model is trained, a category feature is generally initialized randomly for each behavior category to represent the general features of each behavior category, and then the category feature is continuously adjusted in the training process, so that the target category feature which can truly and correctly represent the general features of each behavior category is obtained after the recognition model is trained.
Because the information contained in the target category features of different categories is different, the target behavior category corresponding to the video to be identified can be determined according to the similarity between the video features and the target category features.
Here, the similarity between the video feature and the target category feature of the target behavior category is greater than the similarity between the video feature and the target category features of the other behavior categories.
In practical application, according to the uncertainty of the video to be recognized, the behavior of the video to be recognized can be recognized based on the following equation (2):

ŷ = argmax_k sim(x, c_k), if U(x) ≤ λ;   ŷ = y_out, if U(x) > λ   (2)

where ŷ represents the behavior recognition result of the video to be recognized, x represents the video feature of the video to be recognized, c_k represents the target class feature corresponding to behavior class k, sim(x, c_k) represents the similarity between the video feature x and the target class feature c_k, U(x) represents the uncertainty of the video to be identified, λ represents the uncertainty threshold, and y_out indicates that the video to be identified belongs to the out-of-domain category and may be a preset value, such as 0.

With respect to equation (2), specifically: if the uncertainty U(x) of the video to be identified is greater than the uncertainty threshold λ, the behavior recognition result ŷ of the video to be recognized is y_out. If U(x) is less than or equal to λ, ŷ is the behavior class whose target class feature has the maximum similarity with the video feature x.

For example, when the uncertainty U(x) of the video to be identified is less than or equal to the uncertainty threshold λ, assume the recognition model can recognize N behavior classes, namely behavior class 1, behavior class 2, …, behavior class N, and any behavior class k (k = 1, 2, …, N) has a corresponding target class feature c_k.

Assume the similarity between the video feature x and the target class feature c_1 of behavior class 1 is sim(x, c_1) = S1; the similarity between x and the target class feature c_2 of behavior class 2 is sim(x, c_2) = S2; and so on, the similarity between x and the target class feature c_N of behavior class N is sim(x, c_N) = SN.

After the N similarities S1, S2, …, SN between the video feature x and the N target class features corresponding to the N behavior classes are determined, assuming S2 is the maximum among the N similarities, the target behavior class of the video to be recognized under the intra-domain category is determined to be behavior class 2.
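The intra-domain decision described above can be sketched as follows; cosine similarity is used as in the alternative embodiment mentioned earlier, and the function and label names are illustrative assumptions:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def recognize(x, class_feats, u, lam, out_of_domain_label=0):
    """Open-set decision in the spirit of equation (2).

    x:           (D,) video feature of the video to be identified
    class_feats: dict mapping behavior-class label k to its target class feature c_k
    u:           uncertainty of the video from equation (1)
    lam:         preset uncertainty threshold
    """
    if u > lam:
        return out_of_domain_label           # out-of-domain category
    # Intra-domain: the class whose target class feature is most similar wins.
    sims = {k: cosine_sim(x, c) for k, c in class_feats.items()}
    return max(sims, key=sims.get)
```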
The identification process of the behavior category in the video provided by the above embodiment may be executed with reference to fig. 3, and fig. 3 is an application schematic diagram of the behavior identification method provided by the embodiment of the present invention.
In summary, in the scheme provided by the embodiment of the present invention, the uncertainty of the video to be identified is determined according to the features of the video to be identified and the respective target sample features corresponding to the plurality of training sample videos; and then, determining whether the video to be identified belongs to the intra-domain category or the out-of-domain category according to the uncertainty, and further determining the corresponding target behavior category of the video to be identified under the intra-domain category when the video to be identified belongs to the intra-domain category. Therefore, the behavior type identification can be accurately carried out on the video to be identified which belongs to the intra-domain type, whether the video to be identified belongs to the out-of-domain type can be effectively detected, the video to be identified which belongs to the out-of-domain type can be prevented from being wrongly classified into a certain intra-domain type, and the accuracy of the behavior identification result can be improved.
The above describes the use process of the recognition model, and the following describes the training process of the recognition model.
Fig. 4 is a flowchart of a recognition model training method according to an embodiment of the present invention, as shown in fig. 4, which may include the following steps:
401. and acquiring a first training sample video and a plurality of second training sample videos corresponding to the first behavior category.
402. And extracting a first sample characteristic corresponding to the first training sample video and a plurality of second sample characteristics corresponding to the plurality of second training sample videos through the recognition model.
403. Determining a first similarity between the first sample feature and the class feature corresponding to the first behavior class currently learned by the recognition model, and a second similarity between the first sample feature and features in the first feature set respectively, wherein the first feature set comprises a plurality of second sample features.
404. Determining a second feature set, wherein the second feature set comprises class features corresponding to a second behavior class currently learned by the recognition model and sample features corresponding to a training sample video corresponding to the second behavior class; wherein the second behavior category comprises at least one behavior category except the first behavior category.
405. A third similarity between the first sample feature and the features in the second feature set is determined.
406. Determining a loss function corresponding to the first training sample video according to the first similarity, the second similarity and the third similarity by taking the first similarity and the second similarity close to a set similarity threshold as constraints; and training the recognition model according to the loss function.
Wherein, the recognition model is trained according to the loss function, i.e. the model parameters are adjusted by back propagation. The model parameters comprise category characteristics corresponding to all behavior categories, and the category characteristics of all behavior categories can be adjusted through continuous learning during training of the recognition model.
The recognition model trained by the embodiment can be used to execute the behavior recognition scheme provided by the embodiment shown in fig. 1.
Specifically, in order to train the recognition model, a large number of training sample videos (i.e., a plurality of training sample videos above) corresponding to at least one behavior class need to be obtained in advance. When training the recognition model, for each behavior category, a corresponding training set may be set, where the training set includes a plurality of training sample videos.
Since the processing procedure for any training sample video corresponding to any behavior class is the same when training the recognition model, this embodiment takes the processing procedure corresponding to one behavior class as an example, and is exemplarily described with reference to fig. 5.
Assume a recognition model is to be trained to perform behavior recognition on N behavior classes. One of the N behavior classes is referred to in this embodiment as the first behavior class, and the N-1 behavior classes other than the first behavior class are referred to as second behavior classes. Assume the training set corresponding to the first behavior category includes M training sample videos; one of the M training sample videos is referred to in this embodiment as the first training sample video q, and all training sample videos other than the first training sample video are referred to as second training sample videos.
First, a first training sample video q and a plurality of second training sample videos corresponding to a first behavior category are obtained.
Optionally, a first training set corresponding to the first behavior category may be obtained first; image augmentation processing is then performed on the training sample videos contained in the first training set to obtain a second training set; the first training sample video q and the plurality of second training sample videos are then obtained from the second training set. The image augmentation processing includes at least one of the following spatial-dimension operations: flipping, rotation, gray-level adjustment, brightness adjustment, and the like. The number of samples can be enlarged through image augmentation. It should be noted that, since the training data are videos, image augmentation here means applying the same processing to the multiple frames contained in one video: assuming one training sample video contains 100 frames and the chosen augmentation is flipping, all 100 frames are uniformly flipped to obtain another, augmented training sample video (a sketch of this clip-consistent augmentation follows below).
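The sketch below, assuming NumPy video arrays, applies one randomly chosen spatial operation uniformly to all frames of a clip; the set of operations shown is illustrative:

```python
import numpy as np

def augment_video(frames, rng):
    """Apply one randomly chosen spatial augmentation uniformly to every
    frame, so the augmented clip stays temporally consistent.

    frames: (T, H, W, C) uint8 array of one decoded training sample video
    """
    op = rng.choice(["flip", "rotate", "brightness"])
    if op == "flip":
        return frames[:, :, ::-1, :]               # same horizontal flip, all frames
    if op == "rotate":
        return np.rot90(frames, k=1, axes=(1, 2))  # same 90-degree rotation, all frames
    scale = rng.uniform(0.7, 1.3)                  # same brightness factor, all frames
    return np.clip(frames.astype(np.float32) * scale, 0, 255).astype(np.uint8)
```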
And then, extracting a first sample feature corresponding to the first training sample video q and a plurality of second sample features corresponding to the plurality of second training sample videos through the recognition model. The extraction process can refer to the related description in the foregoing embodiments, which is not repeated herein.
In fact, in the training process of the recognition model, the sample characteristics of the training sample video corresponding to the second behavior category are also extracted through the recognition model.
In this embodiment, for convenience of description, two feature sets are defined, namely a first feature set and a second feature set. Wherein the first feature set includes a plurality of second sample features extracted in step 402, that is, sample features corresponding to a plurality of second training sample videos belonging to the first behavior category; the second feature set comprises class features corresponding to a second behavior class learned currently by the recognition model and sample features of the training sample video corresponding to the second behavior class. It should be noted that the category features corresponding to the second behavior category currently learned by the recognition model are actually N-1 category features corresponding to N-1 behavior categories other than the first behavior category, and may be understood as a feature set.
And then, carrying out similarity calculation on the first sample characteristic and the class characteristic corresponding to the first action class currently learned by the recognition model so as to determine a first similarity between the first sample characteristic and the class characteristic corresponding to the first action class currently learned by the recognition model. Since the class features corresponding to different behavior classes are used to distinguish different behavior classes, the calculation of the first similarity is actually used to measure the probability that the first training sample video q belongs to the first behavior class.
And performing similarity calculation on the first sample characteristic and each characteristic in the first characteristic set, and determining second similarity between the first sample characteristic and each characteristic in the first characteristic set. Since the first feature set corresponds to a plurality of second training sample videos in the first behavior class, the calculation of the second similarity is actually used for distinguishing the differences between different training sample videos in the first behavior class.
And performing similarity calculation on the first sample characteristic and each characteristic in the second characteristic set, and determining a third similarity between the first sample characteristic and each characteristic in the second characteristic set. Since the second feature set includes class features of the second behavior classes and sample features of the training sample videos of the second behavior classes, the third similarity is actually calculated to measure differences between the first training sample video q in the first behavior class and the second behavior classes.
Alternatively, the similarity in the present embodiment may be determined by calculating the cosine distance between the features.
And then, determining a loss function corresponding to the first training sample video according to the first similarity, the second similarity and the third similarity by taking the first similarity and the second similarity close to a set similarity threshold as constraints, and training the recognition model according to the loss function. Wherein the similarity threshold is set to a value less than 1.
It can be understood that, under the constraint that the set similarity threshold is smaller than 1, the difference between the first sample feature and the class feature of the first behavior class, as well as the differences between the first sample feature and the features in the first feature set (i.e., the second sample features), can be retained, i.e., the model can learn richer intra-class information. Learning this intra-class information is very important for distinguishing in-domain data from out-of-domain data.
Alternatively, the loss function in the present embodiment is expressed by equation (3):

L = -(1/|P⁺|) Σ_{p ∈ P⁺} log( exp(-|sim(x, p) - α| / τ) / ( exp(-|sim(x, p) - α| / τ) + Σ_{n ∈ Q} exp(sim(x, n) / τ) ) ),   with P⁺ = {c_y} ∪ P   (3)

where x represents the first sample feature; c_y represents the class feature of the first behavior class currently learned by the recognition model; P represents the first feature set, and p represents one feature taken from P⁺ (that is, the class feature c_y or one feature of the first feature set); Q represents the second feature set, and n represents one feature in the second feature set; α represents the set similarity threshold; τ is the temperature coefficient, with a range such as (0, 1); sim(x, c_y) represents the first similarity, sim(x, p) represents the second similarity, and sim(x, n) represents the third similarity.

Based on the loss function of equation (3), the first similarity and the second similarity are pulled close to the set similarity threshold α, which is less than 1, rather than close to 1. The sample features of different training sample videos within the same behavior category are thus allowed to differ from the class feature of the behavior category to which they belong, and the differences between the sample features of different training sample videos of the same behavior category are learned, so that the different intra-class information carried by each training sample video of the first behavior category can be retained.
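The following PyTorch sketch follows the form of equation (3) as reconstructed above; it is an illustrative implementation under that assumption, with hypothetical names and shapes, not the patent's verbatim training code:

```python
import torch
import torch.nn.functional as F

def open_set_loss(x, c_y, pos_feats, neg_feats, alpha=0.8, tau=0.1):
    """Contrastive loss sketch following the form of equation (3).

    x:         (D,) first sample feature
    c_y:       (D,) class feature of the first behavior class
    pos_feats: (P, D) first feature set (other samples of the same class)
    neg_feats: (Q, D) second feature set (other classes' class/sample features)
    alpha:     set similarity threshold (< 1), target for positive similarities
    tau:       temperature coefficient
    """
    x = F.normalize(x, dim=0)
    pos = F.normalize(torch.cat([c_y.unsqueeze(0), pos_feats], dim=0), dim=1)
    neg = F.normalize(neg_feats, dim=1)
    # Positives are pulled toward alpha (not toward 1), preserving intra-class variation.
    pos_logits = -(pos @ x - alpha).abs() / tau      # (P+1,)
    neg_logits = (neg @ x) / tau                     # (Q,)
    denom = pos_logits.exp() + neg_logits.exp().sum()
    return -(pos_logits - denom.log()).mean()

# Illustrative shapes: D = 128, 8 same-class samples, 64 negative features.
x = torch.randn(128, requires_grad=True)
loss = open_set_loss(x, torch.randn(128), torch.randn(8, 128), torch.randn(64, 128))
loss.backward()
```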
In summary, in this embodiment, based on the setting of the similarity threshold smaller than 1, for a plurality of training sample videos corresponding to a certain behavior category, the model can learn the difference features of different training sample videos of the same behavior category, that is, the identification model learns the intra-class information, thereby being beneficial to improving the accuracy of the open-set behavior identification.
Fig. 6 is a flowchart of another recognition model training method according to an embodiment of the present invention, as shown in fig. 6, which may include the following steps:
601. and acquiring a first training sample video and a plurality of second training sample videos corresponding to the first behavior category.
602. And extracting a first sample characteristic corresponding to the first training sample video and a plurality of second sample characteristics corresponding to the plurality of second training sample videos through the recognition model.
603. Carrying out image frame disorder processing on the first training sample video to obtain a third training sample video; and extracting a third sample characteristic corresponding to the third training sample video through the recognition model.
604. Determining a first similarity between the first sample feature and a category feature corresponding to a first behavior category currently learned by the recognition model, and a second similarity between the first sample feature and a feature in the first feature set respectively, wherein the first feature set comprises a plurality of second sample features and third sample features.
605. Determining a second feature set, wherein the second feature set comprises class features corresponding to a second behavior class currently learned by the recognition model and sample features corresponding to a training sample video corresponding to the second behavior class; wherein the second behavior category comprises at least one behavior category other than the first behavior category.
606. A third similarity between the first sample feature and the features in the second feature set is determined.
607. Determining a loss function corresponding to the first training sample video according to the first similarity, the second similarity and the third similarity by taking the first similarity and the second similarity close to a set similarity threshold as constraints; and training the recognition model according to the loss function.
The specific processes of steps 601-602 and 605-607 can refer to the foregoing embodiments, which are not described herein again.
In this embodiment, a third training sample video is introduced into the model training process. The third training sample video is obtained from the first training sample video through image frame disorder processing. This frame-disordering augments the training sample video in the time dimension, which strengthens the recognition model's ability to extract time-dimension information, makes the features learned by the recognition model contain richer inter-class information, and enhances both the model's classification of the in-domain classes and its domain classification (distinguishing in-domain from out-of-domain categories).
Specifically, with reference to fig. 7, first, a first training sample video and a plurality of second training sample videos corresponding to a first behavior category are obtained, and image frame out-of-order processing is performed on the first training sample video to obtain a third training sample video. Then, a first sample feature corresponding to the first training sample video, a plurality of second sample features corresponding to the plurality of second training sample videos, and a third sample feature corresponding to the third training sample video are extracted through the recognition model. The third sample feature is added to the first feature set, so that the first feature set comprises the plurality of second sample features and the third sample feature; the second feature set comprises the category features corresponding to the second behavior category currently learned by the recognition model and the sample features of the training sample videos belonging to the second behavior category.
Then, a first similarity between the first sample feature and the category feature corresponding to the first behavior category currently learned by the recognition model is calculated, second similarities between the first sample feature and the features in the first feature set are calculated, and third similarities between the first sample feature and the features in the second feature set are calculated.
Finally, a loss function corresponding to the first training sample video is determined according to the first similarity, the second similarities and the third similarities, with the constraint that the first similarity and the second similarities approach the set similarity threshold; and the recognition model is trained according to the loss function.
In summary, on the basis of the embodiment shown in fig. 4, this embodiment adds to the first feature set the third sample feature corresponding to the third training sample video, which is obtained by performing image frame out-of-order processing on the first training sample video. Constraining the similarity between the first sample feature and the third sample feature to be close to the set similarity threshold ensures, on the one hand, that the spatial information of the third training sample video remains similar to that of the first training sample video and, on the other hand, that their temporal information differs. As a result, the training method in this embodiment makes the features learned by the recognition model contain richer intra-class and inter-class information, so that the domain category of the video to be recognized can be determined reliably, the classification of intra-domain categories is enhanced, and the performance of open-set behavior recognition is greatly improved.
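To make the loss concrete, the following is a minimal PyTorch sketch of one plausible instantiation. The embodiment states only that the first and second similarities are pulled toward a set similarity threshold while similarity to the second feature set is penalized; the exact functional form, the threshold value `tau`, the `margin` term, and all function and tensor names below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def shuffle_frames(video: torch.Tensor) -> torch.Tensor:
    # video: (T, C, H, W). Permuting the frame order keeps the spatial
    # content of each frame but destroys the temporal structure, which is
    # the effect the image frame out-of-order processing relies on.
    perm = torch.randperm(video.shape[0])
    return video[perm]

def training_loss(anchor_feat, class_feat, first_set, second_set,
                  tau: float = 0.7, margin: float = 0.0) -> torch.Tensor:
    # anchor_feat: (D,)    first sample feature
    # class_feat:  (D,)    category feature of the first behavior category
    # first_set:   (N1, D) second sample features plus the third (shuffled) one
    # second_set:  (N2, D) category and sample features of the other categories
    f = F.normalize(anchor_feat, dim=-1)
    s1 = torch.dot(f, F.normalize(class_feat, dim=-1))   # first similarity
    s2 = F.normalize(first_set, dim=-1) @ f              # second similarities
    s3 = F.normalize(second_set, dim=-1) @ f             # third similarities
    # Pull s1 and s2 toward tau rather than toward 1, preserving some
    # intra-class diversity, and push similarities to other categories down.
    return (s1 - tau).pow(2) + (s2 - tau).pow(2).mean() \
           + F.relu(s3 - margin).pow(2).mean()
```

Pulling the similarities toward a threshold below 1, instead of maximizing them outright, is what lets the first feature set absorb the frame-shuffled sample: its spatial agreement keeps it near the anchor, while its scrambled temporal information keeps it from collapsing onto the anchor.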
As described above, the behavior recognition method provided in the embodiments of the present invention may be executed in the cloud, where a plurality of computing nodes (cloud servers) may be deployed, each computing node having processing resources such as computing and storage resources. In the cloud, a plurality of computing nodes may be organized to provide a certain service; of course, one computing node may also provide one or more services. The cloud may provide a service by exposing a service interface, which a user calls to use the corresponding service. The service interface may take the form of a Software Development Kit (SDK), an Application Programming Interface (API), or the like.
In the scheme provided by the embodiments of the present invention, the cloud may provide a service interface of a behavior recognition service, and a user calls the service interface through user equipment to send a behavior recognition request to the cloud, the request including the video to be recognized. The cloud determines a computing node that responds to the request, and performs the following steps using the processing resources of that computing node:
inputting the video to be recognized into a recognition model to obtain video features corresponding to the video to be recognized, wherein the recognition model is used for recognizing at least one behavior category;
obtaining target sample features corresponding to a plurality of training sample videos respectively, wherein the plurality of training sample videos correspond to the at least one behavior category;
determining an uncertainty corresponding to the video to be recognized according to the video features and the target sample features corresponding to the plurality of training sample videos;
determining a domain category of the video to be recognized according to the uncertainty, wherein the domain category is an intra-domain category containing the at least one behavior category or an extra-domain category not containing the at least one behavior category;
and if the video to be recognized belongs to the intra-domain category, determining a target behavior category corresponding to the video to be recognized under the intra-domain category.
The above implementation process may refer to the related descriptions in the foregoing embodiments and is not repeated here.
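As a concrete reading of these steps, the sketch below computes the uncertainty from the sample statistical features and then makes the domain decision and the intra-domain classification. It is a minimal sketch assuming the uncertainty is a Mahalanobis-style distance built from the mean and covariance matrix of the target sample features, and assuming cosine similarity against the trained target category features; the patent does not fix these exact formulas, and all names and the threshold are illustrative.

```python
import numpy as np

def fit_sample_statistics(sample_feats: np.ndarray):
    # sample_feats: (N, D) target sample features of the training sample videos.
    mean = sample_feats.mean(axis=0)
    cov = np.cov(sample_feats, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])        # regularize for invertibility
    return mean, np.linalg.inv(cov)

def uncertainty(video_feat: np.ndarray, mean, cov_inv) -> float:
    # Mahalanobis distance of the video feature from the training
    # distribution: a large distance means high uncertainty, i.e. the
    # video likely falls outside the learned behavior categories.
    d = video_feat - mean
    return float(d @ cov_inv @ d)

def recognize(video_feat, mean, cov_inv, class_feats, threshold):
    # class_feats: (K, D) target category features trained in the model.
    if uncertainty(video_feat, mean, cov_inv) > threshold:
        return None   # extra-domain: prompt for a new behavior category label
    unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = unit(class_feats) @ unit(video_feat)   # similarity per category
    return int(np.argmax(sims))                   # target behavior category
```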
For ease of understanding, an example is described with reference to fig. 8. A user may invoke the behavior recognition service interface through the user equipment E1 illustrated in fig. 8; the interface may take the form of an SDK interface, an API interface, or the like, and fig. 8 illustrates an API interface through which a service request containing the video to be recognized is uploaded. In the cloud, as shown in the figure, assume that the behavior recognition service is provided by the service cluster E2, which includes at least one computing node. After receiving the request, the service cluster E2 executes the steps described in the foregoing embodiments; when the video to be recognized belongs to the intra-domain category, it obtains the target behavior category under the intra-domain category and sends the target behavior category corresponding to the video to be recognized to the user equipment E1. The user equipment E1 displays the target behavior category, on the basis of which the user can perform further editing and the like.
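From the client side, such a call could look like the fragment below. The endpoint URL, field names, and response schema are purely hypothetical, since the patent does not define a concrete API contract; the fragment only illustrates the request/response flow between the user equipment and the service cluster.

```python
import requests

# Hypothetical endpoint and payload; the real service interface (SDK or API)
# is not specified by the patent.
with open("query_video.mp4", "rb") as f:
    resp = requests.post(
        "https://example.com/api/v1/behavior-recognition",
        files={"video": f},
        timeout=60,
    )
result = resp.json()
if result.get("in_domain"):
    print("target behavior category:", result["category"])
else:
    print("extra-domain video; consider labeling a new behavior category")
```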
The behavior recognition apparatus according to one or more embodiments of the present invention will be described in detail below. Those skilled in the art will appreciate that each of these apparatuses can be constructed from commercially available hardware components configured to perform the steps taught in this disclosure.
Fig. 9 is a schematic structural diagram of a behavior recognition apparatus according to an embodiment of the present invention. As shown in fig. 9, the apparatus includes: an extraction module 11, a determining module 12, and a recognition module 13.
The extraction module 11 is configured to input a video to be recognized into a recognition model to obtain video features corresponding to the video to be recognized, where the recognition model is configured to recognize at least one behavior category.
The determining module 12 is configured to obtain target sample features corresponding to a plurality of training sample videos, where the plurality of training sample videos correspond to the at least one behavior category; and to determine the uncertainty corresponding to the video to be recognized according to the video features and the target sample features corresponding to the plurality of training sample videos.
The recognition module 13 is configured to determine, according to the uncertainty, a domain category of the video to be recognized, where the domain category is an intra-domain category that includes the at least one behavior category or an extra-domain category that does not include the at least one behavior category; and, if the video to be recognized belongs to the intra-domain category, to determine a target behavior category corresponding to the video to be recognized under the intra-domain category.
Optionally, the recognition module 13 is specifically configured to obtain target category features corresponding to the at least one behavior category, where the target category features are parameters trained in the recognition model; and to determine the target behavior category corresponding to the video to be recognized under the intra-domain category according to the similarity between the video features and the target category features corresponding to the at least one behavior category.
Optionally, the determining module 12 is specifically configured to determine, according to the target sample features corresponding to the plurality of training sample videos, sample statistical features corresponding to the plurality of training sample videos; and to determine the uncertainty corresponding to the video to be recognized according to the video features and the sample statistical features. The sample statistical features include the mean and the covariance matrix of the target sample features corresponding to the plurality of training sample videos.
Optionally, the determining module 12 is further specifically configured to input the plurality of training sample videos into the trained recognition model respectively, so as to obtain the target sample features corresponding to the plurality of training sample videos.
Optionally, the recognition module 13 is further specifically configured to, if the video to be recognized belongs to the extra-domain category, output prompt information prompting that the video to be recognized be labeled with a new behavior category.
Optionally, the apparatus further includes a training module, configured to: obtain a first training sample video and a plurality of second training sample videos corresponding to a first behavior category; extract, through the recognition model, a first sample feature corresponding to the first training sample video and a plurality of second sample features corresponding to the plurality of second training sample videos; determine a first similarity between the first sample feature and the category feature corresponding to the first behavior category currently learned by the recognition model, and second similarities between the first sample feature and the features in a first feature set respectively, where the first feature set includes the plurality of second sample features; determine a second feature set, where the second feature set includes category features corresponding to a second behavior category currently learned by the recognition model and sample features of the training sample videos belonging to the second behavior category, and the second behavior category includes a behavior category of the at least one behavior category other than the first behavior category; determine third similarities between the first sample feature and the features in the second feature set; determine a loss function corresponding to the first training sample video according to the first similarity, the second similarities and the third similarities, with the constraint that the first similarity and the second similarities approach a set similarity threshold; and train the recognition model according to the loss function.
Optionally, the training module is further configured to perform image frame out-of-order processing on the first training sample video to obtain a third training sample video; extract, through the recognition model, a third sample feature corresponding to the third training sample video; and add the third sample feature to the first feature set.
Optionally, the training module is further configured to obtain a first training set corresponding to the first behavior category, and to perform image augmentation processing on the training sample videos contained in the first training set to obtain a second training set; the first training sample video and the plurality of second training sample videos are taken from the second training set.
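The image augmentation processing mentioned here is not specified further in the patent; below is a sketch of one plausible choice, namely spatial augmentation applied consistently across all frames of a video so that temporal structure is preserved. The crop size, scale range, and flip probability are assumptions.

```python
import torch
import torchvision.transforms.functional as TF
from torchvision import transforms

def augment_video(video: torch.Tensor) -> torch.Tensor:
    # video: (T, C, H, W). Sample one set of augmentation parameters per
    # video so that every frame is transformed identically; per-frame random
    # parameters would scramble the spatial alignment between frames.
    i, j, h, w = transforms.RandomResizedCrop.get_params(
        video[0], scale=(0.8, 1.0), ratio=(3 / 4, 4 / 3))
    video = torch.stack(
        [TF.resized_crop(frame, i, j, h, w, [224, 224]) for frame in video])
    if torch.rand(1).item() < 0.5:
        video = TF.hflip(video)   # flip all frames together
    return video
```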
The apparatus shown in fig. 9 may perform the steps in the foregoing embodiments; for details of the execution process and the technical effects, reference is made to the descriptions in the foregoing embodiments, which are not repeated here.
In one possible design, the structure of the behavior recognition apparatus shown in fig. 9 may be implemented as an electronic device. As shown in fig. 10, the electronic device may include: a processor 21, a memory 22, and a communication interface 23. The memory 22 stores executable code which, when executed by the processor 21, causes the processor 21 to implement at least the behavior recognition method provided in the foregoing embodiments.
In addition, an embodiment of the present invention provides a non-transitory machine-readable storage medium having stored thereon executable code, which, when executed by a processor of an electronic device, causes the processor to implement at least the behavior recognition method as provided in the foregoing embodiments.
In an optional embodiment, the electronic device for executing the behavior recognition method provided in the embodiments of the present invention may be an Extended Reality (XR) device, where XR is a collective term covering virtual reality, augmented reality, mixed reality, and similar forms.
The above-described apparatus embodiments are merely illustrative, wherein the units described as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented on a necessary general-purpose hardware platform, or by a combination of hardware and software. Based on this understanding, the above technical solutions, in essence or in the parts contributing to the prior art, may be embodied in the form of a computer program product, which may be carried on one or more computer-usable storage media (including, without limitation, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (13)

1. A behavior recognition method, comprising:
inputting a video to be recognized into a recognition model to obtain video features corresponding to the video to be recognized, wherein the recognition model is used for recognizing at least one behavior category;
obtaining target sample features corresponding to a plurality of training sample videos respectively, wherein the plurality of training sample videos correspond to the at least one behavior category;
determining an uncertainty corresponding to the video to be recognized according to the video features and the target sample features corresponding to the plurality of training sample videos;
determining a domain category of the video to be recognized according to the uncertainty, wherein the domain category is an intra-domain category containing the at least one behavior category or an extra-domain category not containing the at least one behavior category;
and if the video to be recognized belongs to the intra-domain category, determining a target behavior category corresponding to the video to be recognized under the intra-domain category.
2. The method according to claim 1, wherein the determining a target behavior category corresponding to the video to be recognized under the intra-domain category comprises:
obtaining target category features corresponding to the at least one behavior category, wherein the target category features are parameters trained in the recognition model;
and determining the target behavior category corresponding to the video to be recognized under the intra-domain category according to the similarity between the video features and the target category features corresponding to the at least one behavior category.
3. The method according to claim 1, wherein the determining the uncertainty corresponding to the video to be recognized according to the video features and the target sample features corresponding to the plurality of training sample videos comprises:
determining sample statistical features corresponding to the plurality of training sample videos according to the target sample features corresponding to the plurality of training sample videos;
and determining the uncertainty corresponding to the video to be recognized according to the video features and the sample statistical features corresponding to the plurality of training sample videos.
4. The method of claim 3, wherein the sample statistical features comprise the mean and the covariance matrix of the target sample features corresponding to the plurality of training sample videos.
5. The method according to any one of claims 1 to 4, wherein the obtaining target sample features corresponding to the plurality of training sample videos respectively comprises:
inputting the plurality of training sample videos into the trained recognition model respectively to obtain the target sample features corresponding to the plurality of training sample videos.
6. The method according to any one of claims 1 to 4, wherein the training process of the recognition model comprises:
acquiring a first training sample video and a plurality of second training sample videos corresponding to a first behavior category;
extracting a first sample feature corresponding to the first training sample video and a plurality of second sample features corresponding to the plurality of second training sample videos through the recognition model;
determining a first similarity between the first sample feature and the category feature corresponding to the first behavior category currently learned by the recognition model, and second similarities between the first sample feature and the features in a first feature set respectively, wherein the first feature set comprises the plurality of second sample features;
determining a second feature set, wherein the second feature set comprises category features corresponding to a second behavior category currently learned by the recognition model and sample features of the training sample videos belonging to the second behavior category; and the second behavior category comprises a behavior category of the at least one behavior category other than the first behavior category;
determining third similarities between the first sample feature and the features in the second feature set;
determining a loss function corresponding to the first training sample video according to the first similarity, the second similarities and the third similarities, with the constraint that the first similarity and the second similarities approach a set similarity threshold;
and training the recognition model according to the loss function.
7. The method of claim 6, further comprising:
performing image frame out-of-order processing on the first training sample video to obtain a third training sample video;
extracting a third sample characteristic corresponding to the third training sample video through the recognition model;
adding the third sample feature to the first feature set.
8. The method of claim 6, further comprising:
acquiring a first training set corresponding to the first behavior category;
performing image augmentation processing on the training sample videos contained in the first training set to obtain a second training set; wherein the first training sample video and the plurality of second training sample videos are taken from the second training set.
9. The method of claim 1, further comprising:
and if the video to be recognized belongs to the extra-domain category, outputting prompt information prompting that the video to be recognized be labeled with a new behavior category.
10. A behavior recognition apparatus, comprising:
an extraction module, configured to input a video to be recognized into a recognition model to obtain video features corresponding to the video to be recognized, wherein the recognition model is used for recognizing at least one behavior category;
a determining module, configured to obtain target sample features corresponding to a plurality of training sample videos respectively, wherein the plurality of training sample videos correspond to the at least one behavior category; and to determine an uncertainty corresponding to the video to be recognized according to the video features and the target sample features corresponding to the plurality of training sample videos;
and a recognition module, configured to determine a domain category of the video to be recognized according to the uncertainty, wherein the domain category is an intra-domain category containing the at least one behavior category or an extra-domain category not containing the at least one behavior category; and, if the video to be recognized belongs to the intra-domain category, to determine a target behavior category corresponding to the video to be recognized under the intra-domain category.
11. An electronic device, comprising: a memory, a processor, and a communication interface; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to carry out the behavior recognition method according to any one of claims 1 to 9.
12. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the behavior recognition method of any one of claims 1 to 9.
13. A method of behavior recognition, comprising:
receiving a request triggered by user equipment by calling a behavior recognition service, wherein the request comprises a video to be recognized;
executing the following steps by utilizing the processing resource corresponding to the behavior recognition service:
inputting the video to be recognized into a recognition model to obtain video features corresponding to the video to be recognized, wherein the recognition model is used for recognizing at least one behavior category;
obtaining target sample features corresponding to a plurality of training sample videos respectively, wherein the plurality of training sample videos correspond to the at least one behavior category;
determining an uncertainty corresponding to the video to be recognized according to the video features and the target sample features corresponding to the plurality of training sample videos;
determining a domain category of the video to be recognized according to the uncertainty, wherein the domain category is an intra-domain category containing the at least one behavior category or an extra-domain category not containing the at least one behavior category;
and if the video to be recognized belongs to the intra-domain category, determining a target behavior category corresponding to the video to be recognized under the intra-domain category.
CN202210952356.XA 2022-08-09 2022-08-09 Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium Active CN115035463B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210952356.XA CN115035463B (en) 2022-08-09 2022-08-09 Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210952356.XA CN115035463B (en) 2022-08-09 2022-08-09 Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115035463A 2022-09-09
CN115035463B CN115035463B (en) 2023-01-17

Family

ID=83130037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210952356.XA Active CN115035463B (en) 2022-08-09 2022-08-09 Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115035463B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102237089A (en) * 2011-08-15 2011-11-09 哈尔滨工业大学 Method for reducing error identification rate of text irrelevant speaker identification system
CN112949780A (en) * 2020-04-21 2021-06-11 佳都科技集团股份有限公司 Feature model training method, device, equipment and storage medium
CN111612093A (en) * 2020-05-29 2020-09-01 Oppo广东移动通信有限公司 Video classification method, video classification device, electronic equipment and storage medium
US20220147765A1 (en) * 2020-11-10 2022-05-12 Nec Laboratories America, Inc. Face recognition from unseen domains via learning of semantic features
WO2022105179A1 (en) * 2020-11-23 2022-05-27 平安科技(深圳)有限公司 Biological feature image recognition method and apparatus, and electronic device and readable storage medium
US20220245422A1 (en) * 2021-01-27 2022-08-04 Royal Bank Of Canada System and method for machine learning architecture for out-of-distribution data detection
CN113076994A (en) * 2021-03-31 2021-07-06 南京邮电大学 Open-set domain self-adaptive image classification method and system
CN114241260A (en) * 2021-12-14 2022-03-25 四川大学 Open set target detection and identification method based on deep neural network
CN114332529A (en) * 2021-12-21 2022-04-12 北京达佳互联信息技术有限公司 Training method and device for image classification model, electronic equipment and storage medium
CN114330522A (en) * 2021-12-22 2022-04-12 上海高德威智能交通系统有限公司 Training method, device and equipment of image recognition model and storage medium
CN114612995A (en) * 2022-03-25 2022-06-10 新疆联海创智信息科技有限公司 Face feature recognition method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KIMIN LEE et al.: "A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks", arXiv *
TECHAPANURAK E et al.: "Hyperparameter-Free Out-of-Distribution Detection Using Softmax of Scaled Cosine Similarity", arXiv *
YANG DONGHUN et al.: "Ensemble-Based Out-of-Distribution Detection", Electronics *
YANG Liu et al.: "Open-Music: an open-set recognition algorithm for electromagnetic targets based on metric learning and feature subspace projection", Acta Electronica Sinica *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024103417A1 (en) * 2022-11-18 2024-05-23 中国科学院深圳先进技术研究院 Behavior recognition method, storage medium and electronic device

Also Published As

Publication number Publication date
CN115035463B (en) 2023-01-17

Similar Documents

Publication Publication Date Title
CN111898696A (en) Method, device, medium and equipment for generating pseudo label and label prediction model
CN111652290B (en) Method and device for detecting countermeasure sample
WO2018196718A1 (en) Image disambiguation method and device, storage medium, and electronic device
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN115035463B (en) Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium
CN116524593A (en) Dynamic gesture recognition method, system, equipment and medium
CN111738319A (en) Clustering result evaluation method and device based on large-scale samples
CN116311214A (en) License plate recognition method and device
CN113408282B (en) Method, device, equipment and storage medium for topic model training and topic prediction
CN113761282B (en) Video duplicate checking method and device, electronic equipment and storage medium
CN110795410A (en) Multi-field text classification method
CN111242114B (en) Character recognition method and device
CN114882334B (en) Method for generating pre-training model, model training method and device
CN115063858A (en) Video facial expression recognition model training method, device, equipment and storage medium
CN112989869B (en) Optimization method, device, equipment and storage medium of face quality detection model
CN113642443A (en) Model testing method and device, electronic equipment and storage medium
CN111325068A (en) Video description method and device based on convolutional neural network
CN112016540B (en) Behavior identification method based on static image
CN115988100B (en) Gateway management method for intelligent perception of Internet of things of equipment based on multi-protocol self-adaption
CN110909688B (en) Face detection small model optimization training method, face detection method and computer system
CN114942986B (en) Text generation method, text generation device, computer equipment and computer readable storage medium
CN116704244A (en) Course domain schematic diagram object detection method, system, equipment and storage medium
CN117037294A (en) Method, apparatus, device and medium for training and identifying living models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant