CN115035463A - Behavior recognition method, device, equipment and storage medium - Google Patents

Behavior recognition method, device, equipment and storage medium

Info

Publication number
CN115035463A
Authority
CN
China
Prior art keywords
video
category
behavior
training sample
sample
Prior art date
Legal status
Granted
Application number
CN202210952356.XA
Other languages
Chinese (zh)
Other versions
CN115035463B (en)
Inventor
岑俊
张士伟
吕逸良
赵德丽
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210952356.XA
Publication of CN115035463A
Application granted
Publication of CN115035463B
Legal status: Active

Classifications

    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 10/40: Extraction of image or video features
    • G06V 10/764: Image or video recognition or understanding using machine-learning classification, e.g. of video objects
    • G06V 10/765: Classification using rules for classification or partitioning the feature space
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/20: Recognition of movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a behavior recognition method, apparatus, device and storage medium, wherein the behavior recognition method comprises the following steps: inputting a video to be recognized into a recognition model to obtain the video feature corresponding to the video to be recognized, wherein the recognition model is used for recognizing at least one behavior category; obtaining target sample features corresponding to a plurality of training sample videos, wherein the plurality of training sample videos correspond to the at least one behavior category; determining the uncertainty corresponding to the video to be recognized according to the video feature and the obtained target sample features; determining the domain category of the video to be recognized according to the uncertainty, wherein the domain category is either the intra-domain category containing the at least one behavior category or the out-of-domain category not containing the at least one behavior category; and, if the video to be recognized belongs to the intra-domain category, determining the target behavior category of the video to be recognized under the intra-domain category. By determining the domain category of the video to be recognized from the uncertainty, the scheme improves the accuracy of the behavior recognition result.

Description

Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a behavior recognition method, a behavior recognition device, behavior recognition equipment and a storage medium.
Background
Classification recognition, such as behavior classification recognition, is an important application direction of deep learning technology, and is also a basic task in video analysis. Behavior recognition for a video refers to analyzing the category of the motion of the target person in the video. To complete the behavior recognition task, a recognition model is generally trained on a large amount of labeled training data, so that the recognition model can learn the feature information of different behavior categories and thereby complete the behavior recognition task for the video.
The traditional behavior recognition task trains a model on a large amount of labeled training data, all of which belongs to in-domain data (in-distribution data). The trained model can therefore only correctly classify in-domain data; data that never appeared during training, i.e., out-of-domain data (out-of-distribution data), is misclassified into one of the in-domain classes. This limits practical application and degrades the accuracy of the recognition result.
Disclosure of Invention
The embodiment of the invention provides a behavior recognition method, a behavior recognition device, equipment and a storage medium, which are used for improving the accuracy of a behavior recognition result.
In a first aspect, an embodiment of the present invention provides a behavior identification method, where the method includes:
inputting a video to be identified into an identification model to obtain video characteristics corresponding to the video to be identified, wherein the identification model is used for identifying at least one behavior category;
obtaining target sample characteristics corresponding to a plurality of training sample videos respectively, wherein the plurality of training sample videos correspond to the at least one behavior category;
determining the uncertainty corresponding to the video to be recognized according to the video characteristics and the target sample characteristics corresponding to the training sample videos;
determining a domain category of the video to be recognized according to the uncertainty, wherein the domain category is an intra-domain category containing the at least one behavior category or an extra-domain category not containing the at least one behavior category;
and if the video to be identified belongs to the intra-domain category, determining a target behavior category corresponding to the video to be identified under the intra-domain category.
In a second aspect, an embodiment of the present invention provides a behavior recognition apparatus, where the apparatus includes:
the system comprises an extraction module, a recognition module and a processing module, wherein the extraction module is used for inputting a video to be recognized into a recognition model so as to obtain video characteristics corresponding to the video to be recognized, and the recognition model is used for recognizing at least one behavior category;
the determining module is used for acquiring target sample characteristics corresponding to a plurality of training sample videos respectively, wherein the plurality of training sample videos correspond to the at least one behavior category; according to the video features and the target sample features corresponding to the training sample videos, determining the uncertainty corresponding to the video to be identified;
the identification module is used for determining the domain category of the video to be identified according to the uncertainty, wherein the domain category is an in-domain category containing the at least one behavior category or an out-of-domain category not containing the at least one behavior category; and if the video to be identified belongs to the intra-domain category, determining a target behavior category corresponding to the video to be identified under the intra-domain category.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor, a communication interface; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the behaviour recognition method according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to implement at least the behavior recognition method according to the first aspect.
In a fifth aspect, an embodiment of the present invention provides a behavior identification method, where the method includes:
receiving a request triggered by user equipment by calling a behavior recognition service, wherein the request comprises a video to be recognized;
executing the following steps by utilizing the processing resource corresponding to the behavior recognition service:
inputting a video to be recognized into a recognition model to obtain video characteristics corresponding to the video to be recognized, wherein the recognition model is used for recognizing at least one behavior category;
obtaining target sample characteristics corresponding to a plurality of training sample videos respectively, wherein the plurality of training sample videos correspond to the at least one behavior category;
determining the uncertainty corresponding to the video to be recognized according to the video characteristics and the target sample characteristics corresponding to the training sample videos;
determining a domain category of the video to be identified according to the uncertainty, wherein the domain category is an intra-domain category containing the at least one behavior category or an extra-domain category not containing the at least one behavior category;
and if the video to be identified belongs to the intra-domain category, determining a target behavior category corresponding to the video to be identified under the intra-domain category.
In the embodiment of the invention, when the action category of the target person in the video needs to be identified, firstly, the video to be identified is input into the identification model to obtain the video characteristics corresponding to the video to be identified. The recognition model is used for recognizing at least one behavior category, and target sample characteristics corresponding to a plurality of training sample videos can be obtained based on the trained recognition model, wherein the training sample videos correspond to the at least one behavior category recognizable by the recognition model. And then, according to the video characteristics of the video to be recognized and the target sample characteristics corresponding to the training sample videos, determining the uncertainty corresponding to the video to be recognized, according to the uncertainty, determining whether the video to be recognized belongs to the intra-domain category or the out-of-domain category, and if the video to be recognized belongs to the intra-domain category, determining the target behavior category corresponding to the video to be recognized under the intra-domain category.
In the above scheme, the trained recognition model is used to obtain the target sample characteristics corresponding to each of the multiple training sample videos of at least one behavior category, which means that the recognition model pays attention to the differences of the characteristics of different training sample videos in the same behavior category, and the learning of the differences is helpful to better and accurately recognize whether the video to be recognized belongs to the intra-domain category (i.e., whether the video belongs to one of the at least one behavior category). Specifically, based on the target sample characteristics and the video characteristics of the video to be recognized, the uncertainty index for distinguishing whether the video to be recognized belongs to the intra-domain category or the out-domain category can be determined, so that the domain classification of the video to be recognized is completed, the video to be recognized belonging to the out-domain category can be prevented from being classified to a certain intra-domain category in a wrong manner, and the accuracy of the behavior recognition result can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a flowchart of a behavior recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a calculation process of uncertainty provided by an embodiment of the present invention;
fig. 3 is a schematic application diagram of a behavior recognition method according to an embodiment of the present invention;
FIG. 4 is a flowchart of a recognition model training method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an implementation of the embodiment shown in FIG. 4;
FIG. 6 is a flowchart of another recognition model training method according to an embodiment of the present invention;
FIG. 7 is a schematic illustration of an implementation of the embodiment shown in FIG. 6;
fig. 8 is a schematic application diagram of a behavior recognition method according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a behavior recognition apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device provided in this embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.
The following concepts involved in the embodiments of the present invention will be explained.
Behavior recognition, which is related to behavior recognition for video in the present embodiment, refers to analyzing the category of the motion of the target person in the video, that is, recognizing the behavior category of the target person in the video, such as: fist making, clapping, etc. To complete the behavior recognition task, a recognition model is usually trained based on a large amount of labeled training data, the recognition model can learn feature information of different behavior classes, and then the recognition of at least one behavior class can be completed by using the recognition model.
The intra-domain category refers to at least one behavior category that can be recognized by the trained recognition model, that is, at least one behavior category labeled corresponding to a large number of training sample videos in the model training process.
The out-of-domain category refers to a category of behavior other than at least one category of behavior that the recognition model can recognize, as opposed to the in-domain category.
Inter-class information is the information required to distinguish different behavior classes. The more inter-class information the recognition model extracts from the training sample videos, the stronger its ability to distinguish the different in-domain classes, and the better the classification effect.
Intra-class information refers to the differing information contained in different training sample videos of the same behavior class. Different training sample videos belonging to the same behavior category each carry their own distinctive (differential) feature information. This differential information has no effect on classifying the different in-domain behavior classes, but it plays an important role in distinguishing in-domain from out-of-domain categories.
Closed-set behavior recognition: using a recognition model trained on a large number of training sample videos corresponding to at least one behavior category to recognize which of the at least one behavior category an input video belongs to is called closed-set behavior recognition. The closed set refers to classification and recognition confined to the at least one (in-domain) behavior category.
Open-set behavior recognition: closed-set behavior recognition can only correctly classify the in-domain behavior classes, and will misclassify data that never appeared during training, i.e., out-of-domain data, into some in-domain class, which limits practical application. Open-set behavior recognition therefore requires the recognition model both to correctly classify the in-domain classes and to recognize out-of-domain data. That is to say, in practical applications there is generally no guarantee that every video input into the recognition model belongs to an in-domain category; when an input video m does not belong to an in-domain category, conventional closed-set behavior recognition may wrongly assign video m to one of the in-domain behavior categories, producing an erroneous prediction of the behavior category of video m.
The embodiment of the invention provides a behavior recognition scheme, which can also be called open set behavior recognition, and the open set behavior recognition not only can correctly classify the in-domain classes but also can recognize the out-of-domain classes through a trained recognition model. The overall idea of the scheme is as follows: firstly, based on the video characteristics corresponding to the video to be recognized and the target sample characteristics corresponding to a plurality of training sample videos for training the recognition model, the uncertainty corresponding to the video to be recognized, namely the probability that the video to be recognized belongs to the category outside the domain, is determined. And then, determining whether the video to be identified belongs to the intra-domain category or the out-of-domain category according to the uncertainty, and further determining the corresponding behavior category of the video to be identified under the intra-domain category if the video to be identified belongs to the intra-domain category. Therefore, the method can avoid the error classification of the class outside the domain into a class in a certain domain, and improve the accuracy of the behavior recognition result.
The behavior recognition method provided by the embodiment of the invention can be executed by an electronic device, the electronic device can be a server or a user terminal, and the server can be a physical server or a virtual server (virtual machine) of a cloud.
Fig. 1 is a flowchart of a behavior recognition method according to an embodiment of the present invention, as shown in fig. 1, the method includes the following steps:
101. and inputting the video to be recognized into the recognition model to obtain the video characteristics corresponding to the video to be recognized, wherein the recognition model is used for recognizing at least one behavior category.
102. And acquiring target sample characteristics corresponding to the training sample videos respectively, wherein the training sample videos correspond to at least one behavior category.
103. And according to the video characteristics and the target sample characteristics corresponding to the training sample videos, determining the uncertainty corresponding to the video to be identified.
104. And determining the domain category of the video to be recognized according to the uncertainty, wherein the domain category is an in-domain category containing at least one behavior category recognizable by the recognition model or an out-of-domain category not containing at least one behavior category recognizable by the recognition model.
105. And if the video to be identified belongs to the intra-domain category, determining the corresponding target behavior category of the video to be identified under the intra-domain category.
In this embodiment, a recognition model capable of recognizing at least one behavior category may be trained in advance based on a plurality of training sample videos corresponding to at least one behavior category. Optionally, the recognition model is composed of three functional modules: sampler, characteristic extractor, classifier. The model training process for the recognition model will be described in detail below. Since the model training process is similar to the process of using the model to perform behavior recognition on the video to be recognized, only the process of using the model will be described first.
In practical applications, the recognition model may adopt, for example, model structures such as the Temporal Shift Module (TSM) or the Inflated 3D ConvNet (I3D), but is not limited thereto.
In open set behavior recognition, one important role that the recognition model plays is feature extraction, including: and extracting the video features of the video to be identified and extracting the video features corresponding to the training sample videos.
For convenience of distinguishing, in this embodiment, video features corresponding to a plurality of training sample videos are referred to as target sample features, and the video features hereinafter refer to video features of videos to be recognized.
When the video features of the video to be recognized are extracted, the working process of the recognition model is as follows:
sampling an input video to be identified through a sampler to obtain an image set containing a plurality of frames of images;
processing the plurality of frames of images in the image set through a feature extractor to obtain a plurality of feature matrixes corresponding to the plurality of frames of images;
and performing pooling processing on the plurality of feature matrixes to obtain video features corresponding to the video to be identified.
Assuming that the length of the input video to be recognized is 10 seconds and the frame rate is 30 fps, the original input video has 300 frames in total. Assuming the preset sampling mode is fixed-interval sampling, 16 frames are sampled from the 300 frames based on the set fixed sampling interval, expressed as:
Ts = [1, 20, …, 300], where Ts represents the image set containing the sampled frames, and 1, 20, … are the indices of the sampled frames. In this way, the sampler samples, from the global perspective of the video to be recognized, a set of frames reflecting the global information of the video. In an alternative embodiment, after the image set is sampled, each frame in the set may further be scaled so that the scaled image size matches the input image size required by the recognition model.
Then, each frame image included in the image set is input to a feature extractor of the recognition model to extract a feature matrix (which may also be referred to as a feature map) corresponding to each frame image, thereby obtaining 16 feature matrices corresponding to 16 frame images.
And then performing pooling (Pooling) processing such as global average pooling on the obtained 16 feature matrices to obtain a pooled feature matrix, wherein the pooled feature matrix is the video feature of the video to be identified.
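As an illustration of this sampler, feature extractor and pooling flow, the following is a minimal sketch in PyTorch; the backbone argument stands in for any per-frame feature extractor, and all names here are illustrative assumptions rather than taken from the patent:

```python
import torch

def extract_video_feature(frames, backbone, num_samples=16):
    """Sample frames at a fixed interval, extract a feature matrix per
    frame, and global-average-pool the matrices into one video feature.

    frames:   (T, C, H, W) tensor holding the decoded video
    backbone: any per-frame feature extractor mapping (N, C, H, W) -> (N, D)
    """
    t = frames.shape[0]
    # Sampler step: fixed-interval sampling, e.g. 16 of 300 frames.
    idx = torch.linspace(0, t - 1, num_samples).long()
    sampled = frames[idx]                    # (num_samples, C, H, W)
    feats = backbone(sampled)                # (num_samples, D) feature matrices
    # Pooling step: global average pooling over the sampled frames.
    return feats.mean(dim=0)                 # (D,) video feature

# Stand-in backbone for illustration: averages each frame over H and W.
frames = torch.rand(300, 3, 224, 224)        # a 10 s clip at 30 fps
video_feature = extract_video_feature(frames, lambda b: b.mean(dim=(2, 3)))
```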
The process of obtaining the target sample characteristics corresponding to each of the plurality of training sample videos is similar to the process of extracting the video characteristics of the video to be recognized, and specifically, the plurality of training sample videos can be respectively input into a trained recognition model to obtain the target sample characteristics of each of the plurality of training sample videos.
After the video features and the target sample features corresponding to the training sample videos are obtained, the probability that the video to be identified belongs to the class outside the domain can be determined according to the video features and the target sample features. In this embodiment, the uncertainty represents the probability that the video to be recognized belongs to the out-of-domain category.
In a specific implementation process, optionally, sample statistical characteristics corresponding to a plurality of training sample videos may be determined according to target sample characteristics corresponding to the plurality of training sample videos; and then, according to the video characteristics and the sample statistical characteristics corresponding to the plurality of training sample videos, determining the uncertainty corresponding to the video to be identified. Wherein, the sample statistical characteristics include: and the mean value and covariance matrix of the target sample characteristics corresponding to the training sample videos respectively.
Optionally, the method for determining the uncertainty corresponding to the video to be recognized according to the video features and the target sample features corresponding to the multiple training sample videos includes the following steps:
determining sample statistical characteristics corresponding to the training sample videos according to target sample characteristics corresponding to the training sample videos, wherein the sample statistical characteristics comprise mean values and covariance matrixes of the target sample characteristics;
determining a feature difference value between the video feature and a mean of the plurality of target sample features;
and determining the uncertainty corresponding to the video to be identified according to the characteristic difference and the covariance matrix.
Optionally, determining an uncertainty corresponding to the video to be identified according to the feature difference and the covariance matrix, including:
and determining the product of the transpose of the characteristic difference value, the inverse of the covariance matrix and the characteristic difference value as the corresponding uncertainty of the video to be identified.
Based on the above, the uncertainty corresponding to the video to be identified can be calculated based on equation (1), which is the Mahalanobis distance:

U(x) = (x - μ)^T Σ^{-1} (x - μ)   (1)

where U(x) represents the uncertainty of the video to be identified, x represents the video feature of the video to be identified, m represents the number of target sample features (i.e., the number of training sample videos), μ represents the mean of the m target sample features, (x - μ) represents the feature difference, (x - μ)^T represents the transpose of the feature difference, and Σ^{-1} represents the inverse of the covariance matrix of the m target sample features.
For ease of understanding, the process of calculating the uncertainty corresponding to the video to be identified is described with reference to fig. 2.
As shown in fig. 2, assuming that there are m training sample videos, inputting the m training sample videos into the recognition model yields m target sample features E1, E2, …, Em. The mean μ of the m target sample features is obtained by averaging E1, E2, …, Em, and the covariance matrix Σ of E1, E2, …, Em is computed. Then the feature difference (x - μ) between the video feature x of the video to be identified and the mean μ is calculated. Finally, the product of the transpose (x - μ)^T, the inverse covariance matrix Σ^{-1} and the feature difference (x - μ) is determined as the uncertainty U(x) of the video to be identified.
The uncertainty is used for representing the probability that the video to be identified belongs to the category outside the domain, in practical application, an uncertainty threshold value can be preset, and the video to be identified is determined to belong to the category inside the domain or the category outside the domain according to the actually calculated magnitude relation between the uncertainty and the uncertainty threshold value. Such as: when the uncertainty corresponding to the video to be recognized is larger than an uncertainty threshold value, determining that the video to be recognized belongs to the category outside the domain; and when the uncertainty corresponding to the video to be identified is less than or equal to the uncertainty threshold value, determining that the video to be identified belongs to the intra-domain category.
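As a concrete illustration, the following sketch computes the equation (1) uncertainty and applies a preset threshold; it assumes NumPy, and the threshold value and random data are purely illustrative:

```python
import numpy as np

def uncertainty(x, train_feats):
    """Equation (1): Mahalanobis-distance uncertainty.

    x:           (D,) video feature of the video to be identified
    train_feats: (m, D) target sample features of the m training sample videos
    """
    mu = train_feats.mean(axis=0)            # mean of the m target sample features
    cov = np.cov(train_feats, rowvar=False)  # (D, D) covariance matrix
    diff = x - mu                            # feature difference
    # (x - mu)^T Sigma^{-1} (x - mu); pinv guards against a singular covariance.
    return float(diff @ np.linalg.pinv(cov) @ diff)

# Illustrative usage: m = 100 training sample features, D = 32.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(100, 32))
x = rng.normal(size=32)                      # feature of the video to be identified
LAMBDA = 60.0                                # preset uncertainty threshold (illustrative)
in_domain = uncertainty(x, train_feats) <= LAMBDA
```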
Optionally, if the video to be recognized belongs to the class outside the domain, outputting prompt information for prompting a new behavior class mark to the video to be recognized, so as to perform optimization training on the recognition model.
And if the video to be identified belongs to the intra-domain category, determining the corresponding target behavior category of the video to be identified under the intra-domain category.
The method for determining the target behavior category corresponding to the video to be recognized under the intra-domain category comprises the following steps:
acquiring target class characteristics corresponding to at least one behavior class, wherein the target class characteristics are parameters trained in the recognition model;
and determining the target behavior category corresponding to the video to be identified under the intra-domain category according to the similarity between the video characteristics and the target category characteristics corresponding to the at least one behavior category. In an alternative embodiment, the similarity corresponding to the video feature and the target class feature corresponding to each of the at least one behavior class may be determined by calculating a cosine distance between the video feature and the target class feature.
The target category characteristics of a certain behavior category can be understood as the general characteristics of the behavior category. When the recognition model is trained, a category feature is generally initialized randomly for each behavior category to represent the general features of each behavior category, and then the category feature is continuously adjusted in the training process, so that the target category feature which can truly and correctly represent the general features of each behavior category is obtained after the recognition model is trained.
Because the information contained in the target category features of different categories is different, the target behavior category corresponding to the video to be identified can be determined according to the similarity between the video features and the target category features.
Here, the similarity between the video feature and the target category feature of the target behavior category is greater than the similarity between the video feature and the target category features of the other behavior categories.
In practical application, according to the uncertainty of the video to be recognized, the behavior of the video to be recognized can be recognized based on the following equation (2):

ŷ = argmax_k sim(x, c_k), if U(x) ≤ λ;   ŷ = y_out, if U(x) > λ   (2)

where ŷ represents the behavior recognition result of the video to be recognized, x represents the video feature of the video to be recognized, c_k represents the target class feature corresponding to behavior class k, sim(x, c_k) represents the similarity between the video feature x and the target class feature c_k, U(x) represents the uncertainty of the video to be identified, λ represents the uncertainty threshold, and y_out indicates that the video to be identified belongs to the out-of-domain category and may be a preset value, such as 0.

With respect to equation (2), specifically: if the uncertainty U(x) of the video to be identified is greater than the uncertainty threshold λ, the behavior recognition result ŷ of the video to be recognized is y_out. If U(x) is less than or equal to λ, ŷ is the behavior class whose target class feature has the maximum similarity with the video feature x.

For example, when the uncertainty U(x) of the video to be identified is less than or equal to the uncertainty threshold λ, assume the recognition model can recognize N behavior classes, namely behavior class 1, behavior class 2, …, behavior class N, and any behavior class k (k = 1, 2, …, N) has a corresponding target class feature c_k.

Assume the similarity between the video feature x and the target class feature c_1 of behavior class 1 is sim(x, c_1) = S1; the similarity between x and the target class feature c_2 of behavior class 2 is sim(x, c_2) = S2; and so on, the similarity between x and the target class feature c_N of behavior class N is sim(x, c_N) = SN.

After the N similarities S1, S2, …, SN between the video feature x and the N target class features corresponding to the N behavior classes are determined, assuming S2 is the maximum among the N similarities, the target behavior class of the video to be recognized under the intra-domain category is determined to be behavior class 2.
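The intra-domain decision described above can be sketched as follows; cosine similarity is used as in the alternative embodiment mentioned earlier, and the function and label names are illustrative assumptions:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def recognize(x, class_feats, u, lam, out_of_domain_label=0):
    """Open-set decision in the spirit of equation (2).

    x:           (D,) video feature of the video to be identified
    class_feats: dict mapping behavior-class label k to its target class feature c_k
    u:           uncertainty of the video from equation (1)
    lam:         preset uncertainty threshold
    """
    if u > lam:
        return out_of_domain_label           # out-of-domain category
    # Intra-domain: the class whose target class feature is most similar wins.
    sims = {k: cosine_sim(x, c) for k, c in class_feats.items()}
    return max(sims, key=sims.get)
```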
The identification process of the behavior category in the video provided by the above embodiment may be executed with reference to fig. 3, and fig. 3 is an application schematic diagram of the behavior identification method provided by the embodiment of the present invention.
In summary, in the scheme provided by the embodiment of the present invention, the uncertainty of the video to be identified is determined according to the features of the video to be identified and the respective target sample features corresponding to the plurality of training sample videos; and then, determining whether the video to be identified belongs to the intra-domain category or the out-of-domain category according to the uncertainty, and further determining the corresponding target behavior category of the video to be identified under the intra-domain category when the video to be identified belongs to the intra-domain category. Therefore, the behavior type identification can be accurately carried out on the video to be identified which belongs to the intra-domain type, whether the video to be identified belongs to the out-of-domain type can be effectively detected, the video to be identified which belongs to the out-of-domain type can be prevented from being wrongly classified into a certain intra-domain type, and the accuracy of the behavior identification result can be improved.
The above describes the use process of the recognition model, and the following describes the training process of the recognition model.
Fig. 4 is a flowchart of a recognition model training method according to an embodiment of the present invention, as shown in fig. 4, which may include the following steps:
401. and acquiring a first training sample video and a plurality of second training sample videos corresponding to the first behavior category.
402. And extracting a first sample characteristic corresponding to the first training sample video and a plurality of second sample characteristics corresponding to the plurality of second training sample videos through the recognition model.
403. Determining a first similarity between the first sample feature and the class feature corresponding to the first behavior class currently learned by the recognition model, and a second similarity between the first sample feature and features in the first feature set respectively, wherein the first feature set comprises a plurality of second sample features.
404. Determining a second feature set, wherein the second feature set comprises class features corresponding to a second behavior class currently learned by the recognition model and sample features corresponding to a training sample video corresponding to the second behavior class; wherein the second behavior category comprises at least one behavior category except the first behavior category.
405. A third similarity between the first sample feature and the features in the second feature set is determined.
406. Determining a loss function corresponding to the first training sample video according to the first similarity, the second similarity and the third similarity by taking the first similarity and the second similarity close to a set similarity threshold as constraints; and training the recognition model according to the loss function.
Wherein, the recognition model is trained according to the loss function, i.e. the model parameters are adjusted by back propagation. The model parameters comprise category characteristics corresponding to all behavior categories, and the category characteristics of all behavior categories can be adjusted through continuous learning during training of the recognition model.
The recognition model trained by the embodiment can be used to execute the behavior recognition scheme provided by the embodiment shown in fig. 1.
Specifically, in order to train the recognition model, a large number of training sample videos (i.e., a plurality of training sample videos above) corresponding to at least one behavior class need to be obtained in advance. When training the recognition model, for each behavior category, a corresponding training set may be set, where the training set includes a plurality of training sample videos.
Since the processing procedure for any training sample video corresponding to any behavior class is the same when training the recognition model, this embodiment takes the processing procedure corresponding to one behavior class as an example, and is exemplarily described with reference to fig. 5.
Assume a recognition model is to be trained to perform behavior recognition on N behavior classes. One of the N behavior classes is referred to in this embodiment as the first behavior class, and the N-1 behavior classes other than the first behavior class are referred to as second behavior classes. Assume the training set corresponding to the first behavior category includes M training sample videos; one of the M training sample videos is referred to in this embodiment as the first training sample video q, and all training sample videos other than the first training sample video are referred to as second training sample videos.
First, a first training sample video q and a plurality of second training sample videos corresponding to a first behavior category are obtained.
Optionally, a first training set corresponding to the first behavior category may be obtained first; image augmentation processing is then performed on the training sample videos contained in the first training set to obtain a second training set; the first training sample video q and the plurality of second training sample videos are then obtained from the second training set. The image augmentation processing includes at least one of the following spatial-dimension operations: flipping, rotation, gray-level adjustment, brightness adjustment, and the like. The number of samples can be enlarged through image augmentation. It should be noted that, since the training data are videos, image augmentation here means applying the same processing to the multiple frames contained in one video: assuming one training sample video contains 100 frames and the chosen augmentation is flipping, all 100 frames are uniformly flipped to obtain another, augmented training sample video (a sketch of this clip-consistent augmentation follows below).
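The sketch below, assuming NumPy video arrays, applies one randomly chosen spatial operation uniformly to all frames of a clip; the set of operations shown is illustrative:

```python
import numpy as np

def augment_video(frames, rng):
    """Apply one randomly chosen spatial augmentation uniformly to every
    frame, so the augmented clip stays temporally consistent.

    frames: (T, H, W, C) uint8 array of one decoded training sample video
    """
    op = rng.choice(["flip", "rotate", "brightness"])
    if op == "flip":
        return frames[:, :, ::-1, :]               # same horizontal flip, all frames
    if op == "rotate":
        return np.rot90(frames, k=1, axes=(1, 2))  # same 90-degree rotation, all frames
    scale = rng.uniform(0.7, 1.3)                  # same brightness factor, all frames
    return np.clip(frames.astype(np.float32) * scale, 0, 255).astype(np.uint8)
```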
And then, extracting a first sample feature corresponding to the first training sample video q and a plurality of second sample features corresponding to the plurality of second training sample videos through the recognition model. The extraction process can refer to the related description in the foregoing embodiments, which is not repeated herein.
In fact, in the training process of the recognition model, the sample characteristics of the training sample video corresponding to the second behavior category are also extracted through the recognition model.
In this embodiment, for convenience of description, two feature sets are defined, namely a first feature set and a second feature set. Wherein the first feature set includes a plurality of second sample features extracted in step 402, that is, sample features corresponding to a plurality of second training sample videos belonging to the first behavior category; the second feature set comprises class features corresponding to a second behavior class learned currently by the recognition model and sample features of the training sample video corresponding to the second behavior class. It should be noted that the category features corresponding to the second behavior category currently learned by the recognition model are actually N-1 category features corresponding to N-1 behavior categories other than the first behavior category, and may be understood as a feature set.
And then, carrying out similarity calculation on the first sample characteristic and the class characteristic corresponding to the first action class currently learned by the recognition model so as to determine a first similarity between the first sample characteristic and the class characteristic corresponding to the first action class currently learned by the recognition model. Since the class features corresponding to different behavior classes are used to distinguish different behavior classes, the calculation of the first similarity is actually used to measure the probability that the first training sample video q belongs to the first behavior class.
And performing similarity calculation on the first sample characteristic and each characteristic in the first characteristic set, and determining second similarity between the first sample characteristic and each characteristic in the first characteristic set. Since the first feature set corresponds to a plurality of second training sample videos in the first behavior class, the calculation of the second similarity is actually used for distinguishing the differences between different training sample videos in the first behavior class.
And performing similarity calculation on the first sample characteristic and each characteristic in the second characteristic set, and determining a third similarity between the first sample characteristic and each characteristic in the second characteristic set. Since the second feature set includes class features of the second behavior classes and sample features of the training sample videos of the second behavior classes, the third similarity is actually calculated to measure differences between the first training sample video q in the first behavior class and the second behavior classes.
Alternatively, the similarity in the present embodiment may be determined by calculating the cosine distance between the features.
And then, determining a loss function corresponding to the first training sample video according to the first similarity, the second similarity and the third similarity by taking the first similarity and the second similarity close to a set similarity threshold as constraints, and training the recognition model according to the loss function. Wherein the similarity threshold is set to a value less than 1.
It can be understood that, under the constraint that the set similarity threshold is smaller than 1, the difference between the first sample feature and the class feature of the first behavior class, as well as the differences between the first sample feature and the features in the first feature set (i.e., the second sample features), can be retained, i.e., the model can learn richer intra-class information. Learning this intra-class information is very important for distinguishing in-domain data from out-of-domain data.
Alternatively, the loss function in the present embodiment is expressed by equation (3):

L = -(1/|P⁺|) Σ_{p ∈ P⁺} log( exp(-|sim(x, p) - α| / τ) / ( exp(-|sim(x, p) - α| / τ) + Σ_{n ∈ Q} exp(sim(x, n) / τ) ) ),   with P⁺ = {c_y} ∪ P   (3)

where x represents the first sample feature; c_y represents the class feature of the first behavior class currently learned by the recognition model; P represents the first feature set, and p represents one feature taken from P⁺ (that is, the class feature c_y or one feature of the first feature set); Q represents the second feature set, and n represents one feature in the second feature set; α represents the set similarity threshold; τ is the temperature coefficient, with a range such as (0, 1); sim(x, c_y) represents the first similarity, sim(x, p) represents the second similarity, and sim(x, n) represents the third similarity.

Based on the loss function of equation (3), the first similarity and the second similarity are pulled close to the set similarity threshold α, which is less than 1, rather than close to 1. The sample features of different training sample videos within the same behavior category are thus allowed to differ from the class feature of the behavior category to which they belong, and the differences between the sample features of different training sample videos of the same behavior category are learned, so that the different intra-class information carried by each training sample video of the first behavior category can be retained.
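The following PyTorch sketch follows the form of equation (3) as reconstructed above; it is an illustrative implementation under that assumption, with hypothetical names and shapes, not the patent's verbatim training code:

```python
import torch
import torch.nn.functional as F

def open_set_loss(x, c_y, pos_feats, neg_feats, alpha=0.8, tau=0.1):
    """Contrastive loss sketch following the form of equation (3).

    x:         (D,) first sample feature
    c_y:       (D,) class feature of the first behavior class
    pos_feats: (P, D) first feature set (other samples of the same class)
    neg_feats: (Q, D) second feature set (other classes' class/sample features)
    alpha:     set similarity threshold (< 1), target for positive similarities
    tau:       temperature coefficient
    """
    x = F.normalize(x, dim=0)
    pos = F.normalize(torch.cat([c_y.unsqueeze(0), pos_feats], dim=0), dim=1)
    neg = F.normalize(neg_feats, dim=1)
    # Positives are pulled toward alpha (not toward 1), preserving intra-class variation.
    pos_logits = -(pos @ x - alpha).abs() / tau      # (P+1,)
    neg_logits = (neg @ x) / tau                     # (Q,)
    denom = pos_logits.exp() + neg_logits.exp().sum()
    return -(pos_logits - denom.log()).mean()

# Illustrative shapes: D = 128, 8 same-class samples, 64 negative features.
x = torch.randn(128, requires_grad=True)
loss = open_set_loss(x, torch.randn(128), torch.randn(8, 128), torch.randn(64, 128))
loss.backward()
```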
In summary, in this embodiment, based on the setting of the similarity threshold smaller than 1, for a plurality of training sample videos corresponding to a certain behavior category, the model can learn the difference features of different training sample videos of the same behavior category, that is, the identification model learns the intra-class information, thereby being beneficial to improving the accuracy of the open-set behavior identification.
Fig. 6 is a flowchart of another recognition model training method according to an embodiment of the present invention, as shown in fig. 6, which may include the following steps:
601. and acquiring a first training sample video and a plurality of second training sample videos corresponding to the first behavior category.
602. And extracting a first sample characteristic corresponding to the first training sample video and a plurality of second sample characteristics corresponding to the plurality of second training sample videos through the recognition model.
603. Carrying out image frame disorder processing on the first training sample video to obtain a third training sample video; and extracting a third sample characteristic corresponding to the third training sample video through the recognition model.
604. Determining a first similarity between the first sample feature and a category feature corresponding to a first behavior category currently learned by the recognition model, and a second similarity between the first sample feature and a feature in the first feature set respectively, wherein the first feature set comprises a plurality of second sample features and third sample features.
605. Determining a second feature set, wherein the second feature set comprises class features corresponding to a second behavior class currently learned by the recognition model and sample features corresponding to a training sample video corresponding to the second behavior class; wherein the second behavior category comprises at least one behavior category other than the first behavior category.
606. A third similarity between the first sample feature and the features in the second feature set is determined.
607. Determining a loss function corresponding to the first training sample video according to the first similarity, the second similarity and the third similarity by taking the first similarity and the second similarity close to a set similarity threshold as constraints; and training the recognition model according to the loss function.
The specific processes of steps 601-602 and 605-607 can refer to the foregoing embodiments, which are not described herein again.
In this embodiment, a third training sample video is introduced into the model training process. The third training sample video is obtained from the first training sample video through image frame disorder processing. This frame-disordering augments the training sample video in the time dimension, which strengthens the recognition model's ability to extract time-dimension information, makes the features learned by the recognition model contain richer inter-class information, and enhances both the model's classification of the in-domain classes and its domain classification (distinguishing in-domain from out-of-domain categories).
Specifically, with reference to fig. 7, first, a first training sample video and a plurality of second training sample videos corresponding to a first behavior category are obtained, and image frame out-of-order processing is performed on the first training sample video to obtain a third training sample video. Then, a first sample feature corresponding to the first training sample video, a plurality of second sample features corresponding to the plurality of second training sample videos, and a third sample feature corresponding to the third training sample video are extracted through the recognition model. The third sample feature is added to the first feature set, so that the first feature set comprises the plurality of second sample features and the third sample feature; the second feature set comprises the category features corresponding to the second behavior category currently learned by the recognition model and the sample features of the training sample videos belonging to the second behavior category.
Then, a first similarity between the first sample feature and the category feature corresponding to the first behavior category currently learned by the recognition model is calculated, second similarities between the first sample feature and the features in the first feature set are calculated, and third similarities between the first sample feature and the features in the second feature set are calculated.
Finally, a loss function corresponding to the first training sample video is determined according to the first similarity, the second similarities and the third similarities, with the constraint that the first similarity and the second similarities approach the set similarity threshold; and the recognition model is trained according to the loss function.
In summary, on the basis of the embodiment shown in fig. 4, this embodiment adds to the first feature set the third sample feature corresponding to the third training sample video, which is obtained by performing image frame out-of-order processing on the first training sample video. Constraining the similarity between the first sample feature and the third sample feature to be close to the set similarity threshold ensures, on the one hand, that the spatial information of the third training sample video remains similar to that of the first training sample video and, on the other hand, that their temporal information differs. As a result, the training method in this embodiment makes the features learned by the recognition model contain richer intra-class and inter-class information, so that the domain category of the video to be recognized can be determined reliably, the classification of intra-domain categories is enhanced, and the performance of open-set behavior recognition is greatly improved.
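To make the loss concrete, the following is a minimal PyTorch sketch of one plausible instantiation. The embodiment states only that the first and second similarities are pulled toward a set similarity threshold while similarity to the second feature set is penalized; the exact functional form, the threshold value `tau`, the `margin` term, and all function and tensor names below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def shuffle_frames(video: torch.Tensor) -> torch.Tensor:
    # video: (T, C, H, W). Permuting the frame order keeps the spatial
    # content of each frame but destroys the temporal structure, which is
    # the effect the image frame out-of-order processing relies on.
    perm = torch.randperm(video.shape[0])
    return video[perm]

def training_loss(anchor_feat, class_feat, first_set, second_set,
                  tau: float = 0.7, margin: float = 0.0) -> torch.Tensor:
    # anchor_feat: (D,)    first sample feature
    # class_feat:  (D,)    category feature of the first behavior category
    # first_set:   (N1, D) second sample features plus the third (shuffled) one
    # second_set:  (N2, D) category and sample features of the other categories
    f = F.normalize(anchor_feat, dim=-1)
    s1 = torch.dot(f, F.normalize(class_feat, dim=-1))   # first similarity
    s2 = F.normalize(first_set, dim=-1) @ f              # second similarities
    s3 = F.normalize(second_set, dim=-1) @ f             # third similarities
    # Pull s1 and s2 toward tau rather than toward 1, preserving some
    # intra-class diversity, and push similarities to other categories down.
    return (s1 - tau).pow(2) + (s2 - tau).pow(2).mean() \
           + F.relu(s3 - margin).pow(2).mean()
```

Pulling the similarities toward a threshold below 1, instead of maximizing them outright, is what lets the first feature set absorb the frame-shuffled sample: its spatial agreement keeps it near the anchor, while its scrambled temporal information keeps it from collapsing onto the anchor.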
As described above, the behavior recognition method provided in the embodiments of the present invention may be executed in the cloud, where a plurality of computing nodes (cloud servers) may be deployed, each computing node having processing resources such as computing and storage resources. In the cloud, a plurality of computing nodes may be organized to provide a certain service; of course, one computing node may also provide one or more services. The cloud may provide a service by exposing a service interface, which a user calls to use the corresponding service. The service interface may take the form of a Software Development Kit (SDK), an Application Programming Interface (API), or the like.
In the scheme provided by the embodiments of the present invention, the cloud may provide a service interface of a behavior recognition service, and a user calls the service interface through user equipment to send a behavior recognition request to the cloud, the request including the video to be recognized. The cloud determines a computing node that responds to the request, and performs the following steps using the processing resources of that computing node:
inputting the video to be recognized into a recognition model to obtain video features corresponding to the video to be recognized, wherein the recognition model is used for recognizing at least one behavior category;
obtaining target sample features corresponding to a plurality of training sample videos respectively, wherein the plurality of training sample videos correspond to the at least one behavior category;
determining an uncertainty corresponding to the video to be recognized according to the video features and the target sample features corresponding to the plurality of training sample videos;
determining a domain category of the video to be recognized according to the uncertainty, wherein the domain category is an intra-domain category containing the at least one behavior category or an extra-domain category not containing the at least one behavior category;
and if the video to be recognized belongs to the intra-domain category, determining a target behavior category corresponding to the video to be recognized under the intra-domain category.
The above implementation process may refer to the related descriptions in the foregoing embodiments and is not repeated here.
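As a concrete reading of these steps, the sketch below computes the uncertainty from the sample statistical features and then makes the domain decision and the intra-domain classification. It is a minimal sketch assuming the uncertainty is a Mahalanobis-style distance built from the mean and covariance matrix of the target sample features, and assuming cosine similarity against the trained target category features; the patent does not fix these exact formulas, and all names and the threshold are illustrative.

```python
import numpy as np

def fit_sample_statistics(sample_feats: np.ndarray):
    # sample_feats: (N, D) target sample features of the training sample videos.
    mean = sample_feats.mean(axis=0)
    cov = np.cov(sample_feats, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])        # regularize for invertibility
    return mean, np.linalg.inv(cov)

def uncertainty(video_feat: np.ndarray, mean, cov_inv) -> float:
    # Mahalanobis distance of the video feature from the training
    # distribution: a large distance means high uncertainty, i.e. the
    # video likely falls outside the learned behavior categories.
    d = video_feat - mean
    return float(d @ cov_inv @ d)

def recognize(video_feat, mean, cov_inv, class_feats, threshold):
    # class_feats: (K, D) target category features trained in the model.
    if uncertainty(video_feat, mean, cov_inv) > threshold:
        return None   # extra-domain: prompt for a new behavior category label
    unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = unit(class_feats) @ unit(video_feat)   # similarity per category
    return int(np.argmax(sims))                   # target behavior category
```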
For ease of understanding, an example is described with reference to fig. 8. A user may invoke the behavior recognition service interface through the user equipment E1 illustrated in fig. 8; the interface may take the form of an SDK interface, an API interface, or the like, and fig. 8 illustrates an API interface through which a service request containing the video to be recognized is uploaded. In the cloud, as shown in the figure, assume that the behavior recognition service is provided by the service cluster E2, which includes at least one computing node. After receiving the request, the service cluster E2 executes the steps described in the foregoing embodiments; when the video to be recognized belongs to the intra-domain category, it obtains the target behavior category under the intra-domain category and sends the target behavior category corresponding to the video to be recognized to the user equipment E1. The user equipment E1 displays the target behavior category, on the basis of which the user can perform further editing and the like.
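From the client side, such a call could look like the fragment below. The endpoint URL, field names, and response schema are purely hypothetical, since the patent does not define a concrete API contract; the fragment only illustrates the request/response flow between the user equipment and the service cluster.

```python
import requests

# Hypothetical endpoint and payload; the real service interface (SDK or API)
# is not specified by the patent.
with open("query_video.mp4", "rb") as f:
    resp = requests.post(
        "https://example.com/api/v1/behavior-recognition",
        files={"video": f},
        timeout=60,
    )
result = resp.json()
if result.get("in_domain"):
    print("target behavior category:", result["category"])
else:
    print("extra-domain video; consider labeling a new behavior category")
```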
The behavior recognition apparatus according to one or more embodiments of the present invention will be described in detail below. Those skilled in the art will appreciate that each of these apparatuses can be constructed from commercially available hardware components configured to perform the steps taught in this disclosure.
Fig. 9 is a schematic structural diagram of a behavior recognition apparatus according to an embodiment of the present invention. As shown in fig. 9, the apparatus includes: an extraction module 11, a determining module 12, and a recognition module 13.
The extraction module 11 is configured to input a video to be recognized into a recognition model to obtain video features corresponding to the video to be recognized, where the recognition model is configured to recognize at least one behavior category.
The determining module 12 is configured to obtain target sample features corresponding to a plurality of training sample videos, where the plurality of training sample videos correspond to the at least one behavior category; and to determine the uncertainty corresponding to the video to be recognized according to the video features and the target sample features corresponding to the plurality of training sample videos.
The recognition module 13 is configured to determine, according to the uncertainty, a domain category of the video to be recognized, where the domain category is an intra-domain category that includes the at least one behavior category or an extra-domain category that does not include the at least one behavior category; and, if the video to be recognized belongs to the intra-domain category, to determine a target behavior category corresponding to the video to be recognized under the intra-domain category.
Optionally, the recognition module 13 is specifically configured to obtain target category features corresponding to the at least one behavior category, where the target category features are parameters trained in the recognition model; and to determine the target behavior category corresponding to the video to be recognized under the intra-domain category according to the similarity between the video features and the target category features corresponding to the at least one behavior category.
Optionally, the determining module 12 is specifically configured to determine, according to the target sample features corresponding to the plurality of training sample videos, sample statistical features corresponding to the plurality of training sample videos; and to determine the uncertainty corresponding to the video to be recognized according to the video features and the sample statistical features. The sample statistical features include the mean and the covariance matrix of the target sample features corresponding to the plurality of training sample videos.
Optionally, the determining module 12 is further specifically configured to input the plurality of training sample videos into the trained recognition model respectively, so as to obtain the target sample features corresponding to the plurality of training sample videos.
Optionally, the recognition module 13 is further specifically configured to, if the video to be recognized belongs to the extra-domain category, output prompt information prompting that the video to be recognized be labeled with a new behavior category.
Optionally, the apparatus further includes a training module, configured to: obtain a first training sample video and a plurality of second training sample videos corresponding to a first behavior category; extract, through the recognition model, a first sample feature corresponding to the first training sample video and a plurality of second sample features corresponding to the plurality of second training sample videos; determine a first similarity between the first sample feature and the category feature corresponding to the first behavior category currently learned by the recognition model, and second similarities between the first sample feature and the features in a first feature set respectively, where the first feature set includes the plurality of second sample features; determine a second feature set, where the second feature set includes category features corresponding to a second behavior category currently learned by the recognition model and sample features of the training sample videos belonging to the second behavior category, and the second behavior category includes a behavior category of the at least one behavior category other than the first behavior category; determine third similarities between the first sample feature and the features in the second feature set; determine a loss function corresponding to the first training sample video according to the first similarity, the second similarities and the third similarities, with the constraint that the first similarity and the second similarities approach a set similarity threshold; and train the recognition model according to the loss function.
Optionally, the training module is further configured to perform image frame out-of-order processing on the first training sample video to obtain a third training sample video; extract, through the recognition model, a third sample feature corresponding to the third training sample video; and add the third sample feature to the first feature set.
Optionally, the training module is further configured to obtain a first training set corresponding to the first behavior category, and to perform image augmentation processing on the training sample videos contained in the first training set to obtain a second training set; the first training sample video and the plurality of second training sample videos are taken from the second training set.
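The image augmentation processing mentioned here is not specified further in the patent; below is a sketch of one plausible choice, namely spatial augmentation applied consistently across all frames of a video so that temporal structure is preserved. The crop size, scale range, and flip probability are assumptions.

```python
import torch
import torchvision.transforms.functional as TF
from torchvision import transforms

def augment_video(video: torch.Tensor) -> torch.Tensor:
    # video: (T, C, H, W). Sample one set of augmentation parameters per
    # video so that every frame is transformed identically; per-frame random
    # parameters would scramble the spatial alignment between frames.
    i, j, h, w = transforms.RandomResizedCrop.get_params(
        video[0], scale=(0.8, 1.0), ratio=(3 / 4, 4 / 3))
    video = torch.stack(
        [TF.resized_crop(frame, i, j, h, w, [224, 224]) for frame in video])
    if torch.rand(1).item() < 0.5:
        video = TF.hflip(video)   # flip all frames together
    return video
```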
The apparatus shown in fig. 9 may perform the steps in the foregoing embodiments; for details of the execution process and the technical effects, reference is made to the descriptions in the foregoing embodiments, which are not repeated here.
In one possible design, the structure of the behavior recognition apparatus shown in fig. 9 may be implemented as an electronic device. As shown in fig. 10, the electronic device may include: a processor 21, a memory 22, and a communication interface 23. The memory 22 stores executable code which, when executed by the processor 21, causes the processor 21 to implement at least the behavior recognition method provided in the foregoing embodiments.
In addition, an embodiment of the present invention provides a non-transitory machine-readable storage medium having stored thereon executable code, which, when executed by a processor of an electronic device, causes the processor to implement at least the behavior recognition method as provided in the foregoing embodiments.
In an optional embodiment, the electronic device for executing the behavior recognition method provided in the embodiments of the present invention may be an Extended Reality (XR) device, where XR is a collective term covering virtual reality, augmented reality, mixed reality, and similar forms.
The above-described apparatus embodiments are merely illustrative, wherein the units described as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented on a necessary general-purpose hardware platform, or by a combination of hardware and software. Based on this understanding, the above technical solutions, in essence or in the parts contributing to the prior art, may be embodied in the form of a computer program product, which may be carried on one or more computer-usable storage media (including, without limitation, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (13)

1. A behavior recognition method, comprising:
inputting a video to be recognized into a recognition model to obtain video features corresponding to the video to be recognized, wherein the recognition model is used for recognizing at least one behavior category;
obtaining target sample features corresponding to a plurality of training sample videos respectively, wherein the plurality of training sample videos correspond to the at least one behavior category;
determining an uncertainty corresponding to the video to be recognized according to the video features and the target sample features corresponding to the plurality of training sample videos;
determining a domain category of the video to be recognized according to the uncertainty, wherein the domain category is an intra-domain category containing the at least one behavior category or an extra-domain category not containing the at least one behavior category;
and if the video to be recognized belongs to the intra-domain category, determining a target behavior category corresponding to the video to be recognized under the intra-domain category.
2. The method according to claim 1, wherein the determining a target behavior category corresponding to the video to be recognized under the intra-domain category comprises:
obtaining target category features corresponding to the at least one behavior category, wherein the target category features are parameters trained in the recognition model;
and determining the target behavior category corresponding to the video to be recognized under the intra-domain category according to the similarity between the video features and the target category features corresponding to the at least one behavior category.
3. The method according to claim 1, wherein the determining the uncertainty corresponding to the video to be recognized according to the video features and the target sample features corresponding to the plurality of training sample videos comprises:
determining sample statistical features corresponding to the plurality of training sample videos according to the target sample features corresponding to the plurality of training sample videos;
and determining the uncertainty corresponding to the video to be recognized according to the video features and the sample statistical features corresponding to the plurality of training sample videos.
4. The method of claim 3, wherein the sample statistical features comprise the mean and the covariance matrix of the target sample features corresponding to the plurality of training sample videos.
5. The method according to any one of claims 1 to 4, wherein the obtaining target sample features corresponding to the plurality of training sample videos respectively comprises:
inputting the plurality of training sample videos into the trained recognition model respectively to obtain the target sample features corresponding to the plurality of training sample videos.
6. The method according to any one of claims 1 to 4, wherein the training process of the recognition model comprises:
acquiring a first training sample video and a plurality of second training sample videos corresponding to a first behavior category;
extracting a first sample feature corresponding to the first training sample video and a plurality of second sample features corresponding to the plurality of second training sample videos through the recognition model;
determining a first similarity between the first sample feature and the category feature corresponding to the first behavior category currently learned by the recognition model, and second similarities between the first sample feature and the features in a first feature set respectively, wherein the first feature set comprises the plurality of second sample features;
determining a second feature set, wherein the second feature set comprises category features corresponding to a second behavior category currently learned by the recognition model and sample features of the training sample videos belonging to the second behavior category; and the second behavior category comprises a behavior category of the at least one behavior category other than the first behavior category;
determining third similarities between the first sample feature and the features in the second feature set;
determining a loss function corresponding to the first training sample video according to the first similarity, the second similarities and the third similarities, with the constraint that the first similarity and the second similarities approach a set similarity threshold;
and training the recognition model according to the loss function.
7. The method of claim 6, further comprising:
performing image frame out-of-order processing on the first training sample video to obtain a third training sample video;
extracting a third sample characteristic corresponding to the third training sample video through the recognition model;
adding the third sample feature to the first feature set.
8. The method of claim 6, further comprising:
acquiring a first training set corresponding to the first behavior category;
performing image augmentation processing on the training sample videos contained in the first training set to obtain a second training set; wherein the first training sample video and the plurality of second training sample videos are taken from the second training set.
9. The method of claim 1, further comprising:
and if the video to be recognized belongs to the extra-domain category, outputting prompt information prompting that the video to be recognized be labeled with a new behavior category.
10. A behavior recognition apparatus, comprising:
an extraction module, configured to input a video to be recognized into a recognition model to obtain video features corresponding to the video to be recognized, wherein the recognition model is used for recognizing at least one behavior category;
a determining module, configured to obtain target sample features corresponding to a plurality of training sample videos respectively, wherein the plurality of training sample videos correspond to the at least one behavior category; and to determine an uncertainty corresponding to the video to be recognized according to the video features and the target sample features corresponding to the plurality of training sample videos;
and a recognition module, configured to determine a domain category of the video to be recognized according to the uncertainty, wherein the domain category is an intra-domain category containing the at least one behavior category or an extra-domain category not containing the at least one behavior category; and, if the video to be recognized belongs to the intra-domain category, to determine a target behavior category corresponding to the video to be recognized under the intra-domain category.
11. An electronic device, comprising: a memory, a processor, and a communication interface; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to carry out the behavior recognition method according to any one of claims 1 to 9.
12. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the behavior recognition method of any one of claims 1 to 9.
13. A method of behavior recognition, comprising:
receiving a request triggered by user equipment by calling a behavior recognition service, wherein the request comprises a video to be recognized;
executing the following steps by utilizing the processing resource corresponding to the behavior recognition service:
inputting the video to be recognized into a recognition model to obtain video features corresponding to the video to be recognized, wherein the recognition model is used for recognizing at least one behavior category;
obtaining target sample features corresponding to a plurality of training sample videos respectively, wherein the plurality of training sample videos correspond to the at least one behavior category;
determining an uncertainty corresponding to the video to be recognized according to the video features and the target sample features corresponding to the plurality of training sample videos;
determining a domain category of the video to be recognized according to the uncertainty, wherein the domain category is an intra-domain category containing the at least one behavior category or an extra-domain category not containing the at least one behavior category;
and if the video to be recognized belongs to the intra-domain category, determining a target behavior category corresponding to the video to be recognized under the intra-domain category.
CN202210952356.XA 2022-08-09 2022-08-09 Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium Active CN115035463B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210952356.XA CN115035463B (en) 2022-08-09 2022-08-09 Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210952356.XA CN115035463B (en) 2022-08-09 2022-08-09 Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115035463A 2022-09-09
CN115035463B CN115035463B (en) 2023-01-17

Family

ID=83130037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210952356.XA Active CN115035463B (en) 2022-08-09 2022-08-09 Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115035463B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102237089A (en) * 2011-08-15 2011-11-09 哈尔滨工业大学 Method for reducing error identification rate of text irrelevant speaker identification system
CN112949780A (en) * 2020-04-21 2021-06-11 佳都科技集团股份有限公司 Feature model training method, device, equipment and storage medium
CN111612093A (en) * 2020-05-29 2020-09-01 Oppo广东移动通信有限公司 Video classification method, video classification device, electronic equipment and storage medium
US20220147765A1 (en) * 2020-11-10 2022-05-12 Nec Laboratories America, Inc. Face recognition from unseen domains via learning of semantic features
WO2022105179A1 (en) * 2020-11-23 2022-05-27 平安科技(深圳)有限公司 Biological feature image recognition method and apparatus, and electronic device and readable storage medium
US20220245422A1 (en) * 2021-01-27 2022-08-04 Royal Bank Of Canada System and method for machine learning architecture for out-of-distribution data detection
CN113076994A (en) * 2021-03-31 2021-07-06 南京邮电大学 Open-set domain self-adaptive image classification method and system
CN114241260A (en) * 2021-12-14 2022-03-25 四川大学 Open set target detection and identification method based on deep neural network
CN114332529A (en) * 2021-12-21 2022-04-12 北京达佳互联信息技术有限公司 Training method and device for image classification model, electronic equipment and storage medium
CN114330522A (en) * 2021-12-22 2022-04-12 上海高德威智能交通系统有限公司 Training method, device and equipment of image recognition model and storage medium
CN114612995A (en) * 2022-03-25 2022-06-10 新疆联海创智信息科技有限公司 Face feature recognition method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KIMIN LEE et al.: "A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks", arXiv *
TECHAPANURAK E et al.: "Hyperparameter-Free Out-of-Distribution Detection Using Softmax of Scaled Cosine Similarity", arXiv *
YANG DONGHUN et al.: "Ensemble-Based Out-of-Distribution Detection", Electronics *
YANG Liu et al.: "Open-Music: an open-set recognition algorithm for electromagnetic targets based on metric learning and feature subspace projection", Acta Electronica Sinica *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024103417A1 (en) * 2022-11-18 2024-05-23 中国科学院深圳先进技术研究院 Behavior recognition method, storage medium and electronic device

Also Published As

Publication number Publication date
CN115035463B (en) 2023-01-17

Similar Documents

Publication Publication Date Title
CN111898696A (en) Method, device, medium and equipment for generating pseudo label and label prediction model
CN111652290B (en) Method and device for detecting countermeasure sample
WO2018196718A1 (en) Image disambiguation method and device, storage medium, and electronic device
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN115035463B (en) Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium
CN116524593A (en) Dynamic gesture recognition method, system, equipment and medium
CN111738319A (en) Clustering result evaluation method and device based on large-scale samples
CN116311214A (en) License plate recognition method and device
CN113408282B (en) Method, device, equipment and storage medium for topic model training and topic prediction
CN113761282B (en) Video duplicate checking method and device, electronic equipment and storage medium
CN110795410A (en) Multi-field text classification method
CN111242114B (en) Character recognition method and device
CN114882334B (en) Method for generating pre-training model, model training method and device
CN115063858A (en) Video facial expression recognition model training method, device, equipment and storage medium
CN112989869B (en) Optimization method, device, equipment and storage medium of face quality detection model
CN113642443A (en) Model testing method and device, electronic equipment and storage medium
CN111325068A (en) Video description method and device based on convolutional neural network
CN112016540B (en) Behavior identification method based on static image
CN115988100B (en) Gateway management method for intelligent perception of Internet of things of equipment based on multi-protocol self-adaption
CN110909688B (en) Face detection small model optimization training method, face detection method and computer system
CN114942986B (en) Text generation method, text generation device, computer equipment and computer readable storage medium
CN116704244A (en) Course domain schematic diagram object detection method, system, equipment and storage medium
CN117037294A (en) Method, apparatus, device and medium for training and identifying living models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant