CN113688729B - Behavior recognition method and device, electronic equipment and storage medium - Google Patents

Behavior recognition method and device, electronic equipment and storage medium

Info

Publication number
CN113688729B
Authority
CN
China
Prior art keywords: features, character, human, attention, video frame
Prior art date
Legal status
Active
Application number
CN202110974723.1A
Other languages
Chinese (zh)
Other versions
CN113688729A (en)
Inventor
李帅成
杨昆霖
侯军
伊帅
Current Assignee
Shanghai Sensetime Technology Development Co Ltd
Original Assignee
Shanghai Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Sensetime Technology Development Co Ltd
Priority to CN202110974723.1A (granted as CN113688729B)
Publication of CN113688729A
Priority to PCT/CN2022/074770 (WO2023024438A1)
Priority to TW111108739A (TW202309772A)
Application granted
Publication of CN113688729B

Classifications

    • G06F18/22: Matching criteria, e.g. proximity measures
    • G06F18/23213: Non-hierarchical clustering using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods

Abstract

The present disclosure relates to a behavior recognition method and apparatus, an electronic device, and a storage medium. The method includes: receiving an input video frame and extracting character features in the video frame; clustering a plurality of character features in the video frame to obtain a clustering result; determining attention distribution weights of the character features in the video frame based on the clustering result; updating the character features based on the attention distribution weights; extracting character spatio-temporal features based on the updated character features; and performing behavior recognition on the video frame based on the character spatio-temporal features to obtain a recognition result. The embodiments of the present disclosure can improve the accuracy of behavior recognition.

Description

Behavior recognition method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a behavior recognition method and apparatus, an electronic device, and a storage medium.
Background
Crowd activity recognition (Group Activity Recognition) uses computer vision algorithms to recognize the action categories of the different people in a video picture and the crowd activity category described by the picture; it is commonly used for behavior recognition in scenes such as sports events. For example, for a volleyball game video, the task requires identifying the action category of each volleyball player and the crowd activity category described by the video segment (left pass, right pass, left take, etc.). For this task, human bodies in the video can generally be detected first, and the crowd behavior category of the video can then be further inferred from the individual actions through individual action recognition (Individual Action Recognition).
With the development of deep learning in computer vision in recent years, many past works detect the action of each person in a video with convolutional neural networks and obtain crowd global features through global pooling to identify the crowd behavior category. Besides the individual actions and the video context information, crowd behavior recognition also relies on the relationship information between individual actions. In addition to convolutional neural networks, some methods therefore also use models such as graph convolutional networks, recurrent neural networks, and Transformers to capture and analyze the relationship information between individual actions.
However, traditional deep-learning-based methods attempt to improve the accuracy of crowd behavior recognition by building larger-scale spatio-temporal relationship models and using more diverse input features (video optical flow and human body key point information), and the accuracy of crowd behavior recognition still needs to be further improved.
Disclosure of Invention
The present disclosure proposes a behavior recognition technical solution.
According to an aspect of the present disclosure, there is provided a behavior recognition method including:
receiving an input video frame and extracting character features in the video frame;
clustering a plurality of character features in the video frame to obtain a clustering result;
determining attention distribution weights of the human features in the video frames based on the clustering results;
updating the character features based on the attention distribution weights;
extracting person spatiotemporal features based on the updated person features;
and performing behavior recognition on the video frame based on the character space-time characteristics to obtain a recognition result.
In one possible implementation, the determining attention allocation weights of the human features in the video frames based on the clustering result includes:
determining attention distribution weights between the character features based on the association relationship between the character features in the clustering result.
In a possible implementation manner, the determining, based on the association relationship between the human features in the clustering result, the attention distribution weight between the human features includes:
determining a first similarity between the character features in the same group obtained by clustering;
based on the first similarity, a first attention allocation weight between the features of the characters in the group is determined.
In one possible implementation manner, determining a first similarity between the person features in the same group obtained by clustering includes:
dividing the feature matrix of the character features into N parts;
correspondingly calculating similarities between the N parts of features of different character features, respectively, to obtain N first similarities;
the determining a first attention allocation weight between the features of the human beings in the group based on the first similarity comprises:
based on the N first similarities, N first attention distribution weights among the human features in the group are determined.
In a possible implementation manner, the determining, based on the association relationship between the human features in the clustering result, the attention distribution weight between the human features includes:
determining the overall characteristics of each group obtained by clustering;
determining a second similarity between the overall characteristics of each group obtained by clustering;
determining a second attention allocation weight between the human features based on the second similarity.
In one possible implementation, the updating the character features based on the attention distribution weights includes:
for a target character feature in the character features of a single group, performing weighted summation on the character features in the group by using the first attention distribution weights between the target character feature and the character features in the group, to obtain intra-group updated features corresponding to the characters in the group as the updated character features.
In one possible implementation, the updating the character features based on the attention distribution weights includes:
for the target overall feature of a target group among the groups, obtaining inter-group update features corresponding to the groups by using the second attention distribution weights between the target overall feature and the overall features of the groups;
and adding the inter-group update feature to the intra-group updated features corresponding to the characters in the target group, respectively, to obtain updated character features.
In one possible implementation manner, the extracting human spatiotemporal features based on the updated human features includes:
and carrying out space decoding on the updated character characteristics to obtain character space-time characteristics.
In one possible implementation, the spatially decoding the updated human features to obtain human spatio-temporal features includes:
carrying out space decoding on the updated character features to obtain character space features;
performing time domain coding and decoding on the character features of the plurality of video frames to obtain character time domain features;
and fusing the character space characteristics and the character time domain characteristics to obtain character space-time characteristics.
In a possible implementation manner, the performing time-domain encoding and decoding on the character features of the plurality of video frames to obtain the character time-domain features includes:
coding the character features of a plurality of video frames based on an attention mechanism to obtain time domain coding features;
decoding the time domain coding features based on a self-attention mechanism, and/or decoding the time domain coding features based on spatial coding features to obtain character time domain features; wherein the spatial coding feature is the updated character feature.
In a possible implementation manner, the spatially decoding the updated human figure feature to obtain a human figure spatial feature includes:
and decoding the spatial coding features based on a self-attention mechanism, and/or decoding the spatial coding features based on the time domain coding features to obtain the character spatial features.
In one possible implementation, the method further includes:
extracting global features of the video frame;
determining a third attention allocation weight in the global feature using the human spatiotemporal feature;
updating the global feature with the third attention allocation weight;
the behavior recognition is carried out on the video frame based on the character space-time characteristics to obtain a recognition result, and the recognition result comprises the following steps:
and carrying out crowd behavior recognition on the video frame based on the updated global features to obtain a crowd behavior recognition result.
In one possible implementation, after updating the global feature with the third attention allocation weight, the method further includes:
taking the updated global features as new global features and the character spatio-temporal features as new character features, iteratively updating the global features and the character spatio-temporal features until an iteration stop condition is met, to obtain iteratively updated global features and character spatio-temporal features;
the behavior recognition is carried out on the video frame based on the character space-time characteristics to obtain a recognition result, and the method comprises the following steps:
and carrying out crowd behavior recognition on the video frame based on the global features after iterative updating to obtain a crowd behavior recognition result.
In one possible implementation, after obtaining the iteratively updated global features and the human spatio-temporal features, the method further includes:
and performing character behavior recognition based on the character spatiotemporal characteristics after iterative updating to obtain a character behavior recognition result.
In one possible implementation manner, the extracting human features in the video frame includes:
identifying human bodies in the video frames to obtain target rectangular frames of all the people;
and extracting the features in the video frame, and matching the extracted features of the video frame by using the target rectangular frame in the video frame to obtain the corresponding character features.
According to an aspect of the present disclosure, there is provided a behavior recognition apparatus including:
the character feature extraction unit is used for receiving an input video frame and extracting character features in the video frame;
the clustering unit is used for clustering a plurality of character features in the video frame to obtain a clustering result;
an attention distribution unit for determining attention distribution weights of the human features in the video frames based on the clustering result;
a character feature updating unit for updating the character features based on the attention distribution weights;
a figure spatiotemporal feature extraction unit, configured to extract figure spatiotemporal features based on the updated figure features;
and the behavior identification unit is used for carrying out behavior identification on the video frame based on the person space-time characteristics to obtain an identification result.
In a possible implementation manner, the attention allocation unit is configured to determine an attention allocation weight between the human features based on an association relationship between the human features in the clustering result.
In one possible implementation, the attention distribution unit includes:
the first similarity determining unit is used for determining first similarity between character features in the same group obtained by clustering;
a first attention determination unit for determining a first attention distribution weight between the features of the persons in the group based on the first similarity.
In a possible implementation manner, the first similarity determining unit is configured to divide the feature matrix of the character features into N parts; respectively and correspondingly calculating the similarity of N characteristics of different character characteristics to obtain N first similarities;
the first attention determining unit is used for determining N first attention distribution weights among the human features in the group based on the N first similarities.
In one possible implementation, the attention distribution unit includes:
the overall characteristic determining unit is used for determining the overall characteristics of each group obtained by clustering;
the second similarity determining unit is used for determining second similarity between the overall characteristics of each group obtained by clustering;
a second attention determination unit for determining a second attention distribution weight between the human features based on the second similarity.
In a possible implementation manner, the character feature updating unit is configured to, for a target character feature in the character features of a single group, perform weighted summation on the character features in the group by using the first attention distribution weights between the target character feature and the character features in the group, to obtain intra-group updated features corresponding to the characters in the group as the updated character features.
In a possible implementation manner, the character feature updating unit includes:
an inter-group update feature determination unit, configured to obtain, for a target overall feature of a target group in each group, an inter-group update feature corresponding to each group by using a second attention distribution weight of the target overall feature and the overall feature of each group;
and the character feature updating subunit is used for respectively adding the inter-group updating features to the intra-group updating features corresponding to the characters in the target group to obtain updated character features.
In a possible implementation manner, the human spatio-temporal feature extraction unit is configured to perform spatial decoding on the updated human features to obtain human spatio-temporal features.
In one possible implementation manner, the human spatio-temporal feature extraction unit includes:
the space decoding unit is used for carrying out space decoding on the updated character features to obtain character space features;
the time domain coding and decoding unit is used for carrying out time domain coding and decoding on the character features of the plurality of video frames to obtain character time domain features;
and the fusion unit is used for fusing the character space characteristics and the character time domain characteristics to obtain character space-time characteristics.
In a possible implementation manner, the time-domain coding and decoding unit includes:
the time domain coding unit is used for coding the character features of the video frames based on a self-attention mechanism to obtain time domain coding features;
the time domain decoding unit is used for decoding the time domain coding features based on a self-attention mechanism and/or decoding the time domain coding features based on the space coding features to obtain character time domain features; wherein the spatial coding feature is the updated character feature.
In a possible implementation manner, the spatial decoding unit is configured to decode the spatial coding features based on a self-attention mechanism, and/or decode the spatial coding features based on the time-domain coding features, so as to obtain the spatial character features.
In one possible implementation, the method further includes:
the global feature extraction unit is used for extracting global features of the video frames;
a third attention determining unit, configured to determine a third attention distribution weight in the global feature by using the human spatiotemporal feature;
a global feature updating unit for updating the global feature with the third attention allocation weight;
the behavior recognition unit includes:
and the crowd behavior identification unit is used for identifying the crowd behavior of the video frame based on the updated global features to obtain a crowd behavior identification result.
In one possible implementation, the apparatus further includes:
the iteration updating unit is used for taking the updated global features as new global features and the character spatio-temporal features as new character features, and iteratively updating the global features and the character spatio-temporal features until an iteration stop condition is met, to obtain the iteratively updated global features and character spatio-temporal features;
and the behavior identification unit is used for identifying the crowd behavior of the video frame based on the global characteristics after iterative update to obtain a crowd behavior identification result.
In one possible implementation, the apparatus further includes:
and the figure behavior identification unit is used for identifying the figure behaviors based on the figure space-time characteristics after iterative updating to obtain a figure behavior identification result.
In one possible implementation manner, the human feature extraction unit includes:
the target rectangular frame determining unit is used for identifying human bodies in the video frames to obtain target rectangular frames of all people;
and the character feature extraction subunit is used for extracting the features in the video frame and matching the extracted features of the video frame with the target rectangular frame in the video frame to obtain the corresponding character features.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In the embodiment of the disclosure, after an input video frame is received, character features in the video frame are extracted, a clustering result is obtained by clustering a plurality of character features in the video frame, attention distribution weights of the character features in the video frame are determined based on the clustering result, the character features are updated based on the attention distribution weights, character spatio-temporal features are extracted based on the updated character features, and behavior recognition is performed on the video frame based on the character spatio-temporal features to obtain a recognition result. Therefore, the relation between each individual feature (action) is obtained based on clustering, the attention distribution weight of the character features is obtained based on the clustering result, so that the importance among different character features is highlighted, the important action information in the character features can be highlighted, and then the character spatiotemporal features extracted based on the updated character features are subjected to behavior recognition, so that the calculation redundancy and information interference caused by analyzing the relation among unimportant individual actions are reduced, and the accuracy of behavior recognition is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flow diagram of a behavior recognition method according to an embodiment of the present disclosure.
Fig. 2 illustrates a block diagram of a behavior recognition device according to an embodiment of the present disclosure.
Fig. 3 shows a block diagram of an electronic device according to an embodiment of the disclosure.
Fig. 4 shows a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of a, B, and C, and may mean including any one or more elements selected from the group consisting of a, B, and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the subject matter of the present disclosure.
In the related art, a crowd behavior identification method based on deep learning is often to try to establish a larger-scale space-time relationship model and more various input features (video optical flow and human body key point information) to improve the accuracy of crowd behavior identification.
The embodiments of the present disclosure provide a behavior recognition method that further considers that, in crowd behavior recognition, not every person's action is important for recognizing the crowd behavior. For example, in a volleyball game, it is often only the players who approach or touch the volleyball that are decisive for identifying the crowd behavior category. Therefore, the method clusters a plurality of character features in a video frame to obtain a clustering result, determines attention distribution weights of the character features in the video frame based on the clustering result, and updates the character features based on the attention distribution weights; extracts character spatio-temporal features based on the updated character features; and performs behavior recognition on the video frame based on the character spatio-temporal features to obtain a recognition result. In this way, the relationships between individual features (actions) are obtained by clustering, and the attention distribution weights of the character features are obtained from the clustering result, so that the relative importance of different character features is highlighted. Important action information can thus be emphasized, the computational redundancy and information interference caused by analyzing the relationships between unimportant individual actions are reduced, and the accuracy of behavior recognition is improved.
In one possible implementation, the behavior recognition method may be performed by an electronic device such as a terminal device or a server. The terminal device may be user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like, and the method may be implemented by a processor calling computer-readable instructions stored in a memory.
For convenience of description, in one or more embodiments of the present specification, an execution subject of the behavior recognition method may be a server, and hereinafter, an implementation of the method will be described by taking the execution subject as the server as an example. It is understood that the implementation of the method by the server is merely an exemplary illustration and should not be construed as a limitation of the method.
Fig. 1 shows a flowchart of a behavior recognition method according to an embodiment of the present disclosure, as shown in fig. 1, the behavior recognition method includes:
in step S11, an input video frame is received, and human features in the video frame are extracted.
The video frame may be any one of the video frames in the video frame sequence, or may be a plurality of video frames in the video frame sequence. The video frames may be input in the form of a sequence of video frames, and the length of a single sequence of video frames may be predetermined, for example 20 frames.
The video frame may be a video frame stored in the local storage space, and then the video frame may be read from the terminal local storage space to implement input of the video frame, for example, the video frame may be a video frame in a video of a locally stored sports event, and for example, the video frame may be a video frame in a locally stored mall management video.
Alternatively, the video frame may also be a video frame acquired by an image acquisition device in real time, for example, a video frame in a live video of a sports event, or, for example, a video frame acquired by an image acquisition device located at an entrance of a shopping mall in real time.
In one possible implementation, extracting human features from the video frames includes: identifying human bodies in the video frames to obtain target rectangular frames of all the people; and extracting the features in the video frame, and matching the extracted features of the video frame by using the target rectangular frame in the video frame to obtain the corresponding character features.
Specifically, for the person in the video frame, the region where the person in the video frame is located may be identified through a human body identification technology, and the region is often represented by a rectangular frame, where the region framed by the rectangular frame is the region where the identified person is located. Since a plurality of rectangular boxes may be obtained when the same person is identified, the plurality of rectangular boxes may be deduplicated by a Non-Maximum Suppression (NMS) algorithm, and one rectangular box is reserved for one person in a single video frame as an area where the identified person is located.
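For illustration only (this sketch is not part of the patent disclosure), a minimal NMS implementation in Python, assuming boxes in [x1, y1, x2, y2] format with one confidence score per box:

```python
# Minimal NMS sketch: keeps one box per person by suppressing detections
# that overlap a higher-scoring box too strongly.
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list:
    order = scores.argsort()[::-1]          # highest-confidence box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the kept box and all remaining candidates
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavily overlapping boxes
    return keep
```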
For multiple frames of video frames in the sequence of video frames, the region where the person in the multiple frames of video frames is located can be obtained in the above manner.
In the embodiment of the present disclosure, global feature extraction may be performed on the video frame to obtain the global features of the entire video frame; for example, an Inflated 3D ConvNet (I3D) may be used to extract the global features of the video frame, with the output of the last layer of the I3D network serving as the global features. Feature extraction is then performed on the intermediate features output by an intermediate layer of the I3D network to obtain the individual character features. Specifically, the positions of the characters in the video frame are the positions of the rectangular frames remaining after NMS deduplication; these rectangular-frame positions are mapped onto the intermediate features extracted by the I3D network, and the features corresponding to the rectangular-frame positions are extracted from the intermediate features using the RoIAlign technique, yielding the character features in the video frame. The character features may also be acquired in other manners; the present disclosure does not limit the specific manner of acquiring the character features.
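As an illustrative sketch of the box-to-feature matching step (the backbone, feature-map size, and spatial_scale value below are assumptions, not taken from the patent), torchvision's roi_align can crop per-person features at each rectangular box:

```python
# Hedged sketch: crop per-person features from an intermediate feature map.
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 1024, 45, 80)        # intermediate feature map (B, C, H, W), assumed shape
boxes = [torch.tensor([[120., 40., 260., 310.],   # one [x1, y1, x2, y2] box per person
                       [400., 60., 530., 330.]])]
# spatial_scale maps box coordinates from image space to feature-map space (assumed 1/16)
person_feats = roi_align(feat, boxes, output_size=(7, 7), spatial_scale=1 / 16)
person_feats = person_feats.mean(dim=(2, 3))   # pool to one 1024-d vector per person
print(person_feats.shape)                      # torch.Size([2, 1024])
```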
In step S12, clustering is performed on the plurality of character features in the video frame to obtain a clustering result.
In the Clustering process, a plurality of character features are divided into different groups according to a certain standard (such as similarity), so that the similarity of the character features in the same group is as large as possible, and the difference between the character features which are not in the same group is also as large as possible. That is, after clustering, the character features of the same group are gathered together as much as possible, and the character features of different groups are separated as much as possible.
In the embodiment of the present disclosure, the character features may be clustered based on the K-means algorithm. Specifically, the character features to be classified form a feature set, and the number K of classes is specified (for example, 3). K character features are randomly selected from the feature set as the initial cluster centers of the K classes. For each character feature in the feature set other than the K initial cluster centers, the distance between that character feature and each of the K initial cluster centers is calculated (for example, the Euclidean distance, which is used to characterize the similarity between features), and the character feature is assigned to the class of the nearest initial cluster center. The new cluster centers of the K classes are then recalculated from the character features contained in each class, and the character features in the feature set are reassigned, until the distance between the cluster centers of two adjacent iterations of each of the K classes is within a preset distance.
After the character features are clustered, the clustering result divides the plurality of character features into several groups, and the cluster center of each group is finally determined after multiple updates. For example, group A includes character features {a, b, c}; group B includes character features {d, e, f, g}; group C includes character features {h, i}; and the finally determined cluster center of group A is characterized by α, that of group B by β, and that of group C by γ.
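A minimal K-means sketch mirroring the steps described above (random initial centers, nearest-center assignment, center recomputation until convergence); the feature dimension, tolerance, and iteration cap are assumptions for illustration:

```python
# Illustrative K-means over person features; assumes no cluster becomes
# empty (a production version would reseed empty clusters).
import numpy as np

def kmeans(features: np.ndarray, k: int, tol: float = 1e-4, max_iter: int = 100):
    rng = np.random.default_rng(0)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(max_iter):
        # Euclidean distance of every feature to every cluster center
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)          # assign to nearest center
        new_centers = np.stack([features[labels == j].mean(axis=0) for j in range(k)])
        if np.linalg.norm(new_centers - centers) < tol:   # centers stopped moving
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans(np.random.rand(9, 1024), k=3)
```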
In step S13, attention assignment weights of the character features in the video frames are determined based on the clustering results.
The clustering result can represent potential relations among multiple character features, is beneficial to capturing more key potential features, clusters the key features into one group, enhances the representing capability of the key features, and can improve the accuracy of behavior recognition of the video frame.
For example, the attention distribution weight may be determined according to similarity between features of human beings in a group, or for example, may also be determined according to similarity between features of groups, which may be specifically referred to in the possible implementation manners provided by the present disclosure, and details are not described here.
In step S14, updating the character features based on the attention assignment weights;
After determining the attention distribution weights, the character features may be updated to add attention information to the character features. Specifically, the attention distribution weights may be used to weight the character features so as to update them; for example, in computer terms, a character feature may be characterized by a feature matrix, so the attention distribution weight can be added to the character feature by multiplying the feature matrix by the attention weight.
In the embodiment of the present disclosure, a first attention allocation weight can be obtained based on a first similarity of features of human beings in a group obtained by clustering, and a second attention allocation weight can be obtained based on a similarity of overall features between groups. Specifically, the process of updating the character features by using the first attention distribution weight and the second attention distribution weight may refer to possible implementation manners provided in the present disclosure, and details are not described herein.
In step S15, based on the updated character features, character spatiotemporal features are extracted;
The character spatio-temporal features can characterize the features of the characters in both the time dimension and the space dimension. The character features in the same video frame can represent the spatial distribution of the characters, so character spatial features can be obtained based on the updated character features. The character spatial features may be fused with the character time-domain features to obtain the character spatio-temporal features; for the specific fusion process, reference may be made to the possible implementation manners provided in the present disclosure, which are not described herein again.
In step S16, based on the human spatio-temporal features, performing behavior recognition on the video frame to obtain a recognition result.
The character features after updating the attention distribution weights enhance the expression ability of the key features, and then the expression ability of the key features is enhanced in the character spatio-temporal features extracted based on the character features. Therefore, the behavior of the video frame is identified based on the extracted person spatiotemporal characteristics, and the accuracy of the identification result can be improved.
The behavior recognition in the embodiment of the present disclosure may include recognition of individual character behaviors and/or recognition of crowd behaviors of people, and there may be various ways of performing behavior recognition on video frames based on character spatiotemporal features, for example, the behavior recognition of individual characters may be directly performed on extracted character spatiotemporal features; for another example, attention weighting may be performed on global features of video frames by using human spatiotemporal features, and crowd behavior recognition may be performed by using the attention-weighted global features. Reference may be made in detail to possible implementations provided by the present disclosure, which are not described in detail herein.
In the embodiment of the disclosure, after an input video frame is received, character features in the video frame are extracted, a clustering result is obtained by clustering a plurality of character features in the video frame, attention distribution weights of the character features in the video frame are determined based on the clustering result, the character features are updated based on the attention distribution weights, character spatiotemporal features are extracted based on the updated character features, and behavior recognition is performed on the video frame based on the character spatiotemporal features to obtain a recognition result. Therefore, the relation between each individual feature (action) is obtained based on clustering, the attention distribution weight of the character features is obtained based on the clustering result, so that the importance among different character features is highlighted, the important action information in the character features can be highlighted, and then the character spatiotemporal features extracted based on the updated character features are subjected to behavior recognition, so that the calculation redundancy and information interference caused by analyzing the relation among unimportant individual actions are reduced, and the accuracy of behavior recognition is improved.
In one possible implementation, the determining attention distribution weights of the character features in the video frames based on the clustering result includes: determining attention distribution weights between the character features based on the association relationship between the character features in the clustering result.
The association relationship here is used to characterize the correlation between the human features, and for example, may be the similarity between the human features, and then the attention assignment weight between the human features may be determined based on the similarity between the human features.
In the embodiment of the present disclosure, two implementations of determining attention allocation weights based on the association relationship between character features in the clustering result are provided, namely, determining a first attention allocation weight based on the similarity of the character features in a group, and determining a second attention allocation weight based on the similarity between the overall features of the groups, which are described in detail below.
In a possible implementation manner, the determining, based on the association relationship between the human features in the clustering result, the attention distribution weight between the human features includes: determining a first similarity between the character features in the same group obtained by clustering; based on the first similarity, a first attention allocation weight between the features of the persons in the group is determined.
The first similarity between the human features in the same group may be a similarity between two human features in the same group, and the specific way of calculating the feature similarity may be various, for example, a similarity calculation way based on euclidean distance, or a similarity calculation way based on cosine similarity, and so on. The calculation method of the similarity is not particularly limited in the present disclosure.
After the similarity between the character features is determined, normalization may be performed on the similarity; specifically, a normalization function (e.g., the softmax function) may be applied to the similarity, and the first attention distribution weights of the character features are obtained after the normalization.
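A short sketch of this step, assuming scaled dot-product similarity and softmax normalization; the group size and feature dimension are illustrative, not taken from the patent:

```python
# First similarities within one group, normalized into attention weights.
import torch
import torch.nn.functional as F

group = torch.randn(4, 1024)                     # 4 person features in one group (assumed)
sim = group @ group.T / group.shape[-1] ** 0.5   # pairwise first similarities
attn = F.softmax(sim, dim=-1)                    # first attention distribution weights
```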
The first attention distribution weight can be added to the character features to update the character features so as to add attention information to the character features, the updated character features can be obtained after the character features are updated by the first attention distribution weight, and the expression capacity of the key features is enhanced based on the character features updated by the attention distribution weight, so that the accuracy of the identification result can be improved by performing behavior identification on the video frame based on the updated character features.
In the embodiment of the disclosure, a first similarity between character features in the same group obtained by determining clustering; based on the first similarity, a first attention allocation weight between the features of the persons in the group is determined. Therefore, based on the similarity between the human features of the same group, the incidence relation between the human features in the group can be determined, the first attention distribution weight determined according to the similarity can be used for enhancing the expression capacity of the key features in the human features, and therefore the behavior recognition is carried out on the video frame based on the updated human features, and the accuracy of the recognition result can be improved.
In one possible implementation manner, determining the first similarity between the character features in the same group obtained by clustering includes: dividing the feature matrix of the character features into N parts; correspondingly calculating similarities between the N parts of features of different character features, respectively, to obtain N first similarities. The determining a first attention distribution weight between the character features in the group based on the first similarity includes: determining N first attention distribution weights between the character features in the group based on the N first similarities.
In computer technology, the human features are embodied in a feature matrix, for example, the size of the feature matrix of the human features is T × 1024. Then, the feature matrix may be divided into N parts, where N is an integer greater than 1, for example, when N is 8, that is, when T × 1024 matrix is divided into 8 parts, the divided matrix may be represented as 8 × T × 128; for another example, when N is 4, i.e., the T × 1024 matrix is divided into 4 parts, the divided matrix may be represented as 4 × T × 256.
Then, when calculating the similarity between different character features, a similarity is calculated correspondingly for each of the N parts, so that N first similarities are obtained. For example, for feature matrices of size T × 1024 with N = 8, the similarities between the 8 sub-feature matrices of size T × 128 of one feature and the corresponding T × 128 sub-feature matrices of another feature are calculated, giving 8 similarities that can be represented by a vector of length 8; by contrast, calculating on the two T × 1024 feature matrices directly yields only 1 similarity. Relative to 1 similarity, the 8 similarities enhance the diversity of the relationships between the character features and can describe those relationships more accurately.
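An illustrative sketch of the N-way split, assuming N = 8 and T × 1024 features as in the example above; the scaled dot-product similarity is an assumption:

```python
# Split each T x 1024 feature into 8 x T x 128 parts and compute one
# similarity per part, much like multi-head attention.
import torch
import torch.nn.functional as F

N, T, D = 8, 20, 1024
a = torch.randn(T, D).reshape(T, N, D // N).permute(1, 0, 2)  # 8 x T x 128
b = torch.randn(T, D).reshape(T, N, D // N).permute(1, 0, 2)
sims = (a * b).sum(dim=(1, 2)) / (T * (D // N)) ** 0.5        # N first similarities
weights = F.softmax(sims, dim=-1)                             # N first attention weights
```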
After the N first similarities are obtained, N first attention distribution weights between the features of the people in the group may be determined based on the N first similarities, and for a specific manner of determining the attention distribution weights, reference may be made to the foregoing description, and details are not repeated herein.
In the embodiment of the present disclosure, the feature matrix of the person features is divided into N parts, then the similarities are calculated correspondingly for N parts of features of different person features, so as to obtain N first similarities, and based on the N first similarities, N first attention distribution weights between the person features in the group are determined. Therefore, the diversity of the relationship among the character features can be enhanced, the relationship among the character features can be described more accurately, and the accuracy of behavior recognition based on the character features can be improved.
In a possible implementation manner, the determining, based on the association relationship between the human features in the clustering result, the attention distribution weight between the human features includes: determining the overall characteristics of each group obtained by clustering; determining a second similarity between the overall characteristics of each group obtained by clustering; determining a second attention allocation weight between the human features based on the second similarity.
The overall characteristics of a certain group of the clustering result can be used to characterize the human features of the group as a whole, the overall characteristics are calculated based on the human features in the group, for example, the overall characteristics may be obtained by performing an average pooling operation on the human features in the group, or may be obtained by performing a random pooling operation on the human features in the group, and the embodiment of the present disclosure is not limited in particular to the way of determining the overall characteristics.
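A minimal sketch of the average-pooling option named above, with hypothetical cluster labels:

```python
# One overall feature per group: average-pool the person features that
# clustering assigned to that group. Labels and shapes are illustrative.
import torch

person_feats = torch.randn(9, 1024)                  # all person features in the frame
labels = torch.tensor([0, 0, 0, 1, 1, 1, 1, 2, 2])   # hypothetical cluster assignment
overall = torch.stack([person_feats[labels == g].mean(dim=0)
                       for g in labels.unique()])    # one feature per group
```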
In addition, the overall characteristic of each group may also be a characteristic of a cluster center of each group in the clustering result, and then, in a possible implementation manner, the determining the overall characteristic of each group obtained by clustering includes: and taking the characteristics of the clustering centers of all groups in the clustering result as the overall characteristics of all groups. In the clustering process, other human body features determine whether the human body features belong to the class by calculating the distance (similarity) from the clustering center, so that the similarity between the features of the clustering center and all the human body features in the group is high, and the features of the clustering center can be used for accurately representing the overall features of the group.
The second similarity of the overall features of different groups may be a similarity between each two overall features of each group, and the specific way of calculating the second similarity of the overall features may be various, for example, a similarity calculation way based on euclidean distance, and for example, a similarity calculation way based on cosine similarity, and the like. The present disclosure does not specifically limit the manner of calculating the second similarity of the overall features.
After the second similarities between the overall features of the groups are determined, the second similarities may be normalized; specifically, a normalization function (e.g., the softmax function) may be applied to obtain the second attention distribution weights of the character features.
The second attention distribution weight can be applied to the character features to update the character features so as to add attention information to the character features, the updated character features can be obtained after the plurality of character features are updated by the second attention distribution weight, and the expression capability of the key features is enhanced based on the character features updated by the attention distribution weight, so that the accuracy of the identification result can be improved by performing behavior identification on the video frame based on the updated character features.
In the disclosed embodiment, the overall characteristics of each group are obtained according to the character characteristics in each group; determining a second similarity between the overall characteristics of each group obtained by clustering; based on the second similarity, a second attention allocation weight between the human features is determined. Therefore, based on the similarity between the overall characteristics of different groups, the incidence relation between the human characteristics among the groups can be determined, the weight is distributed according to the second attention determined by the second similarity of the overall characteristics, the expression capacity of the key characteristics in the human characteristics can be enhanced, and therefore, the behavior recognition is carried out on the video frame based on the updated human characteristics, and the accuracy of the recognition result can be improved.
It will be appreciated that "first" and "second" in the embodiments of the present disclosure are used to distinguish the described objects and should not be construed as limiting the order in which the objects are described, indicating or implying relative importance or the like.
After determining the attention distribution weights, the character features may be updated to add attention information to them; as described above, the first attention distribution weight and the second attention distribution weight can both be obtained in the present disclosure. The following describes the process of updating the character features with the first attention distribution weight and with the second attention distribution weight, respectively.
In one possible implementation, the updating the character features based on the attention distribution weights includes: for a target character feature in the character features of a single group, performing weighted summation on the character features in the group by using the first attention distribution weights between the target character feature and the character features in the group, to obtain the intra-group updated features corresponding to the characters in the group as the updated character features.
That is, for a target character feature in a group of character features, the first attention distribution weights between the target character feature and each character feature in the group are used to perform a weighted summation over the character features in the group (including the target character feature itself), obtaining the intra-group updated feature corresponding to each character in the group as the updated character feature.
For example, the first similarities between the target character feature V1 and the n character features in the group (V1, V2, V3, …, Vn) are P1, P2, P3, …, Pn in turn, and the first attention distribution weights obtained by normalizing the first similarities are W1, W2, W3, …, Wn. The updated target character feature V1' can then be determined by the following formula (1):

V1' = V1·W1 + V2·W2 + V3·W3 + … + Vn·Wn   (1)
Thus, the updated target character feature V1' comprehensively considers the similarity with all the character features in the group, and can enhance the expression capability of the key features when the character features in the group are highly similar; therefore, performing behavior recognition on the video frame based on the updated character features can improve the accuracy of the recognition result.
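A direct sketch of formula (1), assuming the first similarities are scaled dot products normalized with softmax; sizes are illustrative:

```python
# Formula (1): V1' is the attention-weighted sum of every person feature
# in the group (including V1 itself).
import torch

V = torch.randn(5, 1024)                          # V1 ... Vn in one group
sim = V @ V[0] / 1024 ** 0.5                      # P1 ... Pn against target V1
W = torch.softmax(sim, dim=0)                     # W1 ... Wn
V1_updated = (W[:, None] * V).sum(dim=0)          # V1' = sum_i Vi * Wi
```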
In one possible implementation, the updating the human character feature based on the attention distribution weight includes: aiming at the target overall characteristics of the target groups in each group, obtaining the inter-group updating characteristics corresponding to each group by using the second attention distribution weight of the target overall characteristics and the overall characteristics of each group; and respectively adding the inter-group updating characteristics to the intra-group updating characteristics corresponding to the characters in the target group to obtain updated character characteristics.
The inter-group updated features of the target group can be obtained by performing weighted summation on the overall features of each group by using the second attention distribution weight, and then adding the weighted summation result to each human feature in the target group respectively.
For example, the target overall feature of the target group is C1, the second similarities between C1 and the m overall features of the groups (C1, C2, C3, …, Cm) are Q1, Q2, Q3, …, Qm in turn, and the second attention distribution weights obtained by normalizing the second similarities are U1, U2, U3, …, Um. The inter-group update feature C1' can then be determined by formula (2):

C1' = C1·U1 + C2·U2 + C3·U3 + … + Cm·Um   (2)
Further, the inter-group update feature is added to the intra-group updated features corresponding to the characters in the target group, respectively; that is, the n character features in the target group become V1'+C1', V2'+C1', V3'+C1', …, Vn'+C1'.
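A sketch of formula (2) and the final per-character addition, under the same illustrative assumptions as the previous sketch:

```python
# Formula (2): C1' is the attention-weighted sum of all group overall
# features; it is then added to each intra-group updated feature.
import torch

C = torch.randn(3, 1024)                          # C1 ... Cm, one overall feature per group
U = torch.softmax(C @ C[0] / 1024 ** 0.5, dim=0)  # U1 ... Um for the target group
C1_updated = (U[:, None] * C).sum(dim=0)          # C1' from formula (2)

V_intra = torch.randn(4, 1024)                    # intra-group updated features V1' ... Vn'
V_final = V_intra + C1_updated                    # Vi' + C1' for every person in the group
```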
The updated character features can be obtained after the character features are updated by the second attention distribution weight, and the expression capability of the key features is enhanced based on the character features updated by the second attention distribution weight, so that the accuracy of the identification result can be improved by performing behavior identification on the video frame based on the updated character features.
In one possible implementation manner, the extracting human spatiotemporal features based on the updated human features includes: and carrying out space decoding on the updated character characteristics to obtain character space-time characteristics.
The human space-time characteristics comprise human space characteristics, and the human characteristics in the same video frame can represent the space distribution in the video frame, so that the human space characteristics can be obtained by carrying out space decoding on the human characteristics updated based on the attention distribution weight, and the human space characteristics can represent the space distribution information of the human characteristics in the video frame. For a specific way of performing spatial decoding on the updated character features, reference may be made to possible implementation manners provided in the present disclosure, which are not described herein again.
In addition, the person spatio-temporal features further include person temporal features. In one possible implementation, performing spatial decoding on the updated person features to obtain the person spatio-temporal features includes: performing spatial decoding on the updated person features to obtain person spatial features; performing temporal encoding and decoding on the person features of multiple video frames to obtain person temporal features; and fusing the person spatial features and the person temporal features to obtain the person spatio-temporal features.
In the embodiment of the disclosure, person temporal features may also be determined. The features of the same person vary across the different video frames of a video frame sequence, and the person temporal features represent how the features of the same person change over time across those frames. Therefore, the person features of the same person in different video frames can be temporally encoded and decoded to obtain the person temporal features.
The person temporal features may be fused with the person spatial features to obtain the person spatio-temporal features; a specific fusion manner may be, for example, adding the person temporal features and the person spatial features. Behavior recognition on the video frame is then performed based on the person spatio-temporal features, so that both the spatial features and the temporal features of the people are taken into account, which can improve the accuracy of the recognition result.
In a possible implementation, performing temporal encoding and decoding on the person features of multiple video frames to obtain person temporal features includes: encoding the person features of multiple video frames based on a self-attention mechanism to obtain temporal coding features; and decoding the temporal coding features based on a self-attention mechanism, and/or decoding the temporal coding features based on spatial coding features, to obtain the person temporal features, where the spatial coding features are the updated person features.
After the person features of the multiple video frames are extracted, they can be input into a temporal encoder and encoded based on a self-attention mechanism to obtain the temporal coding features. During encoding, the self-attention mechanism can compute, for the person features in each video frame, an alignment probability with the person features at the other moments, and then take the probability-weighted sum of the person features at the corresponding moments as the temporal coding features.
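The encoding step just described amounts to scaled dot-product self-attention over the time axis. The sketch below illustrates it for a single person's track; the scaling factor and shapes are conventional choices, not taken from this disclosure.

```python
import torch
import torch.nn.functional as F

def temporal_self_attention_encode(track_feats: torch.Tensor) -> torch.Tensor:
    """Self-attention over time for one person's features.

    track_feats: (T, d) features of the same person across T video frames.
    Returns (T, d) temporal coding features.
    """
    d = track_feats.size(-1)
    # Alignment probability between each frame and every other moment.
    scores = track_feats @ track_feats.t() / d ** 0.5   # (T, T)
    probs = F.softmax(scores, dim=-1)
    # Probability-weighted sum of the features at the corresponding moments.
    return probs @ track_feats
```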
In the process of decoding the temporal coding features, the temporal coding features can be decoded based on a self-attention mechanism to obtain first temporal features; the temporal coding features can also be decoded based on the spatial coding features to obtain second temporal features; the first temporal features and the second temporal features are then fused to obtain the person temporal features.
In the process of decoding the temporal coding features based on the spatial coding features, the similarity between the spatial coding features and the temporal coding features can be determined, and the temporal coding features are weighted with this similarity to obtain the second temporal features. Decoding the temporal coding features based on the spatial coding features fuses spatial context information into the resulting second temporal features, which strengthens the feature representation and can improve the accuracy of the final behavior recognition result.
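This cross decoding is essentially cross-attention: one feature stream supplies the queries, the other supplies the keys and values. A single hedged sketch covers both directions described in this disclosure — call it with spatial features as queries over temporal features for the second temporal features, or with the roles swapped (as in the spatial decoding below) for the second spatial features. The scaling and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_attention_decode(queries: torch.Tensor,
                           keys_values: torch.Tensor) -> torch.Tensor:
    """Decode one feature stream against another by similarity weighting.

    queries:     (q, d) features supplying the similarity queries.
    keys_values: (k, d) features that are weighted and summed.
    Returns (q, d) decoded features carrying context from keys_values.
    """
    d = queries.size(-1)
    sim = queries @ keys_values.t() / d ** 0.5   # similarity, (q, k)
    weights = F.softmax(sim, dim=-1)
    return weights @ keys_values
```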
In addition, there may be multiple specific manners of decoding the spatial coding features. In one possible implementation, performing spatial decoding on the updated person features to obtain the person spatial features includes: decoding the spatial coding features based on a self-attention mechanism, and/or decoding the spatial coding features based on the temporal coding features, to obtain the person spatial features.
In the process of spatial decoding, the spatial coding features can be decoded based on a self-attention mechanism to obtain first spatial features; the spatial coding features can also be decoded based on the temporal coding features to obtain second spatial features; the first spatial features and the second spatial features are then fused to obtain the person spatial features.
In the process of decoding the spatial coding features based on the self-attention mechanism, the degree of association (for example, the similarity) between the spatial coding features of different people can be determined, and each spatial coding feature is then weighted with this degree of association to obtain the first spatial features.
In the process of decoding the spatial coding features based on the temporal coding features, the similarity between the temporal coding features and the spatial coding features can be determined, and the spatial coding features are then weighted with this similarity to obtain the second spatial features. Decoding the spatial coding features based on the temporal coding features fuses temporal context information into the resulting second spatial features, which strengthens the feature representation and can improve the accuracy of the final behavior recognition result.
In the embodiment of the disclosure, the temporal coding features are decoded based on the spatial coding features to obtain the person temporal features, the spatial coding features are decoded based on the temporal coding features to obtain the person spatial features, and the person spatial features and person temporal features are then fused. In the resulting person spatio-temporal features, the representation of each person is strengthened by semantic associations from both the spatial context and the temporal context, so performing behavior recognition based on these person spatio-temporal features can improve the accuracy of the behavior recognition result.
In one possible implementation, the method further includes: extracting a global feature of the video frame; determining a third attention distribution weight in the global feature using the person spatio-temporal features; and updating the global feature with the third attention distribution weight. Performing behavior recognition on the video frame based on the person spatio-temporal features to obtain a recognition result then includes: performing crowd behavior recognition on the video frame based on the updated global feature to obtain a crowd behavior recognition result.
The global feature of a video frame may be obtained by extracting features from the entire picture of the video frame, and may also be called the scene feature of the video frame. A specific extraction manner is, for example, extraction through an I3D network: the output of the last layer of the I3D network is input into a Group Representation Generator (GRG), a preprocessing component used to initialize the I3D features, yielding the initialized global feature. The extraction details are not repeated here.
For the extracted global feature, attention distribution may be performed on it, specifically by using the person spatio-temporal features. The third attention distribution weight in the global feature is therefore determined with the person spatio-temporal features, and the global feature is then updated with the third attention distribution weight.
Specifically, attention can be distributed over the global feature with the person spatio-temporal features through a Transformer model: the global feature serves as the query in the Transformer model and the person spatio-temporal features serve as the keys, yielding the third attention distribution weight and thereby realizing attention distribution within the global feature. In this way, the global feature can be optimized according to the features of each person in the video frame.
The optimized global feature is thus a feature that has undergone attention distribution through the person spatio-temporal features, which highlights the key features within the global feature and weakens the irrelevant ones. Performing behavior recognition on the video frame based on the updated global feature can therefore improve the accuracy of the recognition result.
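A minimal sketch of this query/key arrangement using a standard multi-head attention layer follows. The residual connection, head count, and batch-first layout are assumptions; this disclosure only fixes which features act as query and keys.

```python
import torch
import torch.nn as nn

class GlobalFeatureUpdater(nn.Module):
    """Update the global (scene) feature by attending over person features."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, global_feat, person_feats):
        # global_feat: (1, 1, d) query; person_feats: (1, n, d) keys/values.
        # The third attention distribution weights arise from the
        # query/key similarities computed inside the attention layer.
        attended, _ = self.attn(global_feat, person_feats, person_feats)
        return global_feat + attended  # residual keeps the scene information
```

For instance, GlobalFeatureUpdater(512) applied to a (1, 1, 512) global feature and (1, n, 512) person spatio-temporal features returns an updated (1, 1, 512) global feature.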
In the process of crowd behavior recognition on the video frame, the updated global feature can be input into a fully connected layer of a neural network and classified with that layer. A number of crowd behavior categories can be preset in the fully connected layer; given the global feature, the layer outputs a confidence for each crowd behavior category, and the category with the highest confidence can be taken as the crowd behavior recognition result.
For example, for a video frame from a volleyball game, the extracted global feature is input into the fully connected layer, which outputs a confidence of 0.9 for the crowd behavior "left serve", 0.3 for "left pass", 0.4 for "left pass", 0.1 for "right serve", and 0.1 for "right block"; "left serve", having the highest confidence, is output as the recognition result.
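A hedged sketch of such a classification head follows. The category list, feature size, and use of softmax are illustrative assumptions (the confidences in the example above need not sum to one, so per-class sigmoid scores would be an equally plausible reading).

```python
import torch
import torch.nn as nn

# Hypothetical category list echoing the volleyball example above.
GROUP_CLASSES = ["left serve", "left pass", "right serve", "right block"]

classifier = nn.Linear(512, len(GROUP_CLASSES))  # assumed feature size 512

def recognize_crowd_behavior(global_feat: torch.Tensor) -> str:
    """Return the crowd behavior category with the highest confidence."""
    logits = classifier(global_feat)          # (1, num_classes)
    confidences = logits.softmax(dim=-1)      # one confidence per category
    return GROUP_CLASSES[confidences.argmax(dim=-1).item()]
```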
In one possible implementation, after updating the global feature with the third attention distribution weight, the method further includes: taking the updated global feature as a new global feature and the person spatio-temporal features as new person features, and iteratively updating the global feature and the person spatio-temporal features until an iteration stop condition is met, obtaining the iteratively updated global feature and person spatio-temporal features. Performing behavior recognition on the video frame based on the person spatio-temporal features to obtain a recognition result then includes: performing crowd behavior recognition on the video frame based on the iteratively updated global feature to obtain a crowd behavior recognition result.
Taking the person spatio-temporal features as new person features, the new person features are updated iteratively; specifically, steps S12 to S15 are executed iteratively to obtain the iteratively updated person spatio-temporal features. Of course, the iterative process may also use one or more of the possible implementations of steps S12 to S15 in the embodiments of the present disclosure, for example: updating the person features based on the first attention distribution weights between person features within a group; updating the person features based on the second attention distribution weights between group features; or spatially decoding the updated person features to obtain the person spatio-temporal features. These are not all listed here; reference may be made to the possible implementations of steps S12 to S15 provided in this disclosure.
Similarly, the updated global feature can be taken as a new global feature and updated iteratively; when an iteration stop condition is met, the iteration stops, yielding the iteratively updated global feature and person spatio-temporal features. The stop condition may be, for example, a preset number of iterations.
After the iterative updating, crowd behavior recognition can be performed on the video frame based on the iteratively updated global feature to obtain a crowd behavior recognition result.
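The alternating refinement can be sketched as a simple loop. A fixed iteration count stands in for the stop condition, and the two update callables stand in for the clustering-attention update and the global attention update described above; all names are illustrative.

```python
import torch
from typing import Callable, Tuple

def iterative_refine(
    global_feat: torch.Tensor,
    person_feats: torch.Tensor,
    update_person: Callable[[torch.Tensor], torch.Tensor],
    update_global: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
    num_iters: int = 3,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Alternately refine person features and the global feature."""
    for _ in range(num_iters):
        person_feats = update_person(person_feats)               # new person features
        global_feat = update_global(global_feat, person_feats)   # new global feature
    return global_feat, person_feats
```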
In one possible implementation, after obtaining the iteratively updated global feature and person spatio-temporal features, the method further includes: performing person behavior recognition based on the iteratively updated person spatio-temporal features to obtain a person behavior recognition result.
In the disclosed embodiment, the behavior of an individual person can also be recognized. For example, in a volleyball game each person performs corresponding game actions, such as serving, bumping, passing, spiking, and blocking, so the behavior of a single person can likewise be recognized based on the person spatio-temporal features.
Specifically, the person spatio-temporal features are input into a fully connected layer of the neural network and classified with that layer. A number of game actions are preset in the fully connected layer; given the person spatio-temporal features, the layer outputs a confidence for each action, and the action with the highest confidence can be taken as the person action recognition result.
For example, for a video frame from a volleyball game, the extracted person spatio-temporal features are input into the fully connected layer, which outputs a confidence of 0.9 for the action "serve", 0.3 for "bump", 0.4 for "pass", 0.1 for "spike", and 0.1 for "block"; the "serve" action, having the highest confidence, is output as the recognition result.
In the embodiment of the disclosure, because the person spatio-temporal features are features that have undergone attention distribution, key features are highlighted and irrelevant features are weakened, so person action recognition based on the person spatio-temporal features achieves high accuracy.
An application scenario of the embodiment of the present disclosure is explained below. In this scenario, the behavior recognition method of the present disclosure may be implemented with an end-to-end network whose input is a video frame sequence; here the input video frames come from a volleyball game, and the sequence may contain multiple video frames. Global features and person features are then extracted for each video frame.
For the extracted person features, the person features within the same video frame are clustered to obtain a clustering result, and the attention distribution weight of each person feature is determined based on the clustering result. The person features are updated based on the attention distribution weights to obtain the spatial coding features, which are then decoded to obtain the person spatial features. In addition, the person temporal features can be obtained from the person features in different video frames. The person spatial features and person temporal features are then fused into the person spatio-temporal features, which can in turn be taken as new person features and updated iteratively until the resulting person spatio-temporal features satisfy a preset convergence condition. A sketch of the clustering step is given below.
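The passage above does not fix a clustering algorithm at this point; K-means (which appears among this publication's classification codes) is assumed in the following sketch, and the group count is an illustrative parameter.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_person_features(person_feats: np.ndarray, num_groups: int = 2):
    """Cluster the person features of one video frame into groups.

    person_feats: (n, d) array, one row per detected person.
    Returns an array of n group labels -- the clustering result used to
    determine the attention distribution weights.
    """
    km = KMeans(n_clusters=num_groups, n_init=10, random_state=0)
    return km.fit_predict(person_feats)
```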
For the extracted global features, attention distribution can be performed on them based on the person spatio-temporal features, updating and thereby optimizing the global features. The updated global features and person spatio-temporal features can then serve as input again for further iterative optimization of the global features, and this process is repeated until the resulting global features satisfy a preset convergence condition.
The updated person spatio-temporal features can be input into the fully connected layer and classified with it to recognize the action of each individual person, yielding the single-person action recognition results. The updated global features are likewise input into the fully connected layer and classified to recognize the crowd behavior, yielding the crowd behavior recognition result for the whole video frame sequence.
For example, the iteratively updated global features are input into the fully connected layer, which outputs a confidence of 0.9 for the crowd behavior "left serve", 0.3 for "left pass", 0.4 for "left pass", 0.1 for "right serve", and 0.1 for "right block"; "left serve", having the highest confidence, is output as the crowd behavior recognition result for the video frames of the volleyball game. The extracted person spatio-temporal features are input into the fully connected layer, which outputs a confidence of 0.9 for the person action "serve", 0.3 for "bump", 0.4 for "pass", 0.1 for "spike", and 0.1 for "block"; the "serve" action, having the highest confidence, is output as the person action recognition result.
It can be understood that, without departing from the principles and logic, the above method embodiments of the present disclosure can be combined with one another to form combined embodiments; details are omitted here for brevity. Those skilled in the art will appreciate that, in the above methods of the specific embodiments, the specific execution order of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure also provides a behavior recognition apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any behavior recognition method provided by the present disclosure; for the corresponding technical solutions, refer to the descriptions in the method section, which are not repeated here.
Fig. 2 shows a block diagram of a behavior recognition apparatus according to an embodiment of the present disclosure, and as shown in fig. 2, the apparatus 20 includes:
a human feature extraction unit 21, configured to receive an input video frame and extract human features in the video frame;
the clustering unit 22 is configured to cluster the plurality of character features in the video frame to obtain a clustering result;
an attention allocation unit 23, configured to determine an attention allocation weight of a human feature in the video frame based on the clustering result;
a human character feature updating unit 24 for updating the human character feature based on the attention assignment weight;
a human spatiotemporal feature extraction unit 25 for extracting human spatiotemporal features based on the updated human features;
and the behavior identification unit 26 is used for performing behavior identification on the video frames based on the character space-time characteristics to obtain an identification result.
In a possible implementation manner, the attention allocation unit 23 is configured to determine an attention allocation weight between the human features based on an association relationship between the human features in the clustering result.
In one possible implementation, the attention allocating unit 23 includes:
the first similarity determining unit is used for determining first similarity between character features in the same group obtained by clustering;
a first attention determination unit for determining a first attention distribution weight between the features of the human beings in the group based on the first similarity.
In a possible implementation manner, the first similarity determining unit is configured to divide the feature matrix of the character features into N parts; respectively and correspondingly calculating the similarity of the N characteristics of different character characteristics to obtain N first similarities;
the first attention determining unit is used for determining N first attention distribution weights among the human features in the group based on the N first similarities.
In one possible implementation, the attention allocating unit 23 includes:
the overall characteristic determining unit is used for determining the overall characteristics of each group obtained by clustering;
the second similarity determining unit is used for determining second similarity between the overall characteristics of each group obtained by clustering;
a second attention determination unit for determining a second attention distribution weight between the human features based on the second similarity.
In a possible implementation manner, the human character feature updating unit 24 is configured to, for a target human character in the human characters of a single group, perform weighted summation on each human character in the group by using the first attention assignment weight of the target human character and each human character in the group, to obtain an intra-group updated feature corresponding to each human character in the group, as an updated human character.
In a possible implementation manner, the human character feature updating unit 24 includes:
the inter-group updating characteristic determining unit is used for obtaining the inter-group updating characteristics corresponding to each group by using the second attention distribution weight of the target overall characteristics and the overall characteristics of each group aiming at the target overall characteristics of the target group in each group;
and the character characteristic updating subunit is used for respectively adding the inter-group updating characteristics to the intra-group updating characteristics corresponding to the characters in the target group to obtain the updated character characteristics.
In a possible implementation manner, the human spatio-temporal feature extraction unit is configured to perform spatial decoding on the updated human features to obtain human spatio-temporal features.
In one possible implementation manner, the human spatiotemporal feature extraction unit includes:
the space decoding unit is used for carrying out space decoding on the updated character features to obtain character space features;
the time domain coding and decoding unit is used for carrying out time domain coding and decoding on the character features of the plurality of video frames to obtain character time domain features;
and the fusion unit is used for fusing the character space characteristics and the character time domain characteristics to obtain character space-time characteristics.
In a possible implementation manner, the time-domain coding/decoding unit includes:
the time domain coding unit is used for coding the character features of the video frames based on a self-attention mechanism to obtain time domain coding features;
the time domain decoding unit is used for decoding the time domain coding features based on a self-attention mechanism and/or decoding the time domain coding features based on the spatial coding features to obtain character time domain features; wherein the spatial coding feature is the updated character feature.
In a possible implementation manner, the spatial decoding unit is configured to decode the spatial coding features based on a self-attention mechanism, and/or decode the spatial coding features based on the time-domain coding features, so as to obtain the spatial character features.
In one possible implementation, the method further includes:
the global feature extraction unit is used for extracting global features of the video frames;
a third attention determining unit, configured to determine a third attention allocation weight in the global feature by using the human spatiotemporal feature;
a global feature updating unit for updating the global feature with the third attention allocation weight;
the behavior recognition unit includes:
and the crowd behavior identification unit is used for identifying the crowd behavior of the video frame based on the updated global features to obtain a crowd behavior identification result.
In one possible implementation, the apparatus further includes:
the iteration updating unit is used for updating the global features and the character spatiotemporal features iteratively by taking the updated global features as new global features and the character spatiotemporal features as new character features until an iteration stop condition is met to obtain the iteratively updated global features and character spatiotemporal features;
and the behavior identification unit is used for identifying the crowd behavior of the video frame based on the global characteristics after iterative update to obtain a crowd behavior identification result.
In one possible implementation, the apparatus further includes:
and the figure behavior identification unit is used for identifying the figure behaviors based on the figure space-time characteristics after iterative updating to obtain a figure behavior identification result.
In one possible implementation manner, the human feature extraction unit 21 includes:
the target rectangular frame determining unit is used for identifying human bodies in the video frames to obtain target rectangular frames of all the people;
and the character feature extraction subunit is used for extracting the features in the video frame and matching the extracted features of the video frame with the target rectangular frame in the video frame to obtain the corresponding character features.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
Embodiments of the present disclosure also provide a computer-readable storage medium, on which computer program instructions are stored, and when executed by a processor, the computer program instructions implement the above method. The computer readable storage medium may be a volatile or non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
The disclosed embodiments also provide a computer program product comprising computer readable code or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, the processor in the electronic device performs the above method.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 3 illustrates a block diagram of an electronic device 800 in accordance with an embodiment of the disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like terminal.
Referring to fig. 3, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
Sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor assembly 814 may also include a light sensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
Fig. 4 illustrates a block diagram of an electronic device 1900 in accordance with an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 4, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may further include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), Apple's graphical-user-interface-based operating system (Mac OS X™), the multi-user, multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as a memory 1932, is also provided that includes computer program instructions executable by a processing component 1922 of an electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction set architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the disclosure are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), with state information of the computer-readable program instructions, which the electronic circuit can execute.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (17)

1. A behavior recognition method, comprising:
receiving an input video frame, and extracting character features in the video frame;
clustering a plurality of character features in the video frame to obtain a clustering result;
determining attention distribution weights of the human features in the video frames based on the clustering results;
updating the personality characteristics based on the attention-assignment weights;
extracting person spatiotemporal features based on the updated person features;
performing behavior recognition on the video frame based on the character spatio-temporal characteristics to obtain a recognition result;
the determining attention allocation weights of the human features in the video frames based on the clustering results comprises:
and determining attention distribution weight among the human characteristics based on the incidence relation among the human characteristics in the clustering result.
2. The method of claim 1, wherein the determining attention distribution weights among the human features based on the association relations among the human features in the clustering results comprises:
determining a first similarity between character features in the same group obtained by clustering;
based on the first similarity, a first attention allocation weight between the features of the persons in the group is determined.
3. The method of claim 2, wherein determining the first similarity between the clustered human features in the same group comprises:
dividing the character feature matrix into N parts;
respectively and correspondingly calculating the similarity of N characteristics of different character characteristics to obtain N first similarities;
the determining a first attention allocation weight between the features of the human beings in the group based on the first similarity comprises:
based on the N first similarities, N first attention distribution weights between the human features in the group are determined.
4. The method according to any one of claims 2 to 3, wherein the determining the attention distribution weight among the human features based on the association relationship among the human features in the clustering result comprises:
determining the overall characteristics of each group obtained by clustering;
determining a second similarity between the overall characteristics of each group obtained by clustering;
determining a second attention allocation weight between the human features based on the second similarity.
5. The method of any of claims 1-3, wherein said updating the personality traits based on the attention-assignment weights comprises:
and aiming at the target character features in the character features of the single group, weighting and summing the character features in the group by using the first attention distribution weights of the target character features and the character features in the group to obtain the updated character features corresponding to the characters in the group, wherein the updated character features are used as the updated character features.
6. The method of any of claims 1-3, wherein updating the human character features based on the attention-assignment weights comprises:
aiming at the target overall characteristics of the target groups in each group, obtaining the inter-group updating characteristics corresponding to each group by using the second attention distribution weight of the target overall characteristics and the overall characteristics of each group;
and respectively adding the inter-group updating characteristics to the intra-group updating characteristics corresponding to each character in the target grouping to obtain updated character characteristics.
7. The method of claim 1, wherein extracting spatiotemporal features of the human based on the updated human features comprises:
and carrying out space decoding on the updated character characteristics to obtain character space-time characteristics.
8. The method of claim 7, wherein spatially decoding the updated human features to obtain human spatio-temporal features, comprises:
carrying out space decoding on the updated character features to obtain character space features;
performing time domain coding and decoding on the character features of the plurality of video frames to obtain character time domain features;
and fusing the character space characteristics and the character time domain characteristics to obtain character space-time characteristics.
9. The method of claim 8, wherein the temporally encoding and decoding the human character features of the plurality of video frames to obtain human character temporal features comprises:
coding the character features of a plurality of video frames based on an attention mechanism to obtain time domain coding features;
decoding the time domain coding features based on a self-attention mechanism, and/or decoding the time domain coding features based on spatial coding features to obtain character time domain features; wherein the spatial coding feature is the updated character feature.
10. The method of claim 9, wherein spatially decoding the updated human character features to obtain human character spatial features comprises:
and decoding the spatial coding features based on a self-attention mechanism, and/or decoding the spatial coding features based on the time domain coding features to obtain the character spatial features.
11. The method according to any one of claims 7-10, further comprising:
extracting global features of the video frame;
determining a third attention allocation weight in the global feature using the human spatiotemporal feature;
updating the global feature with the third attention allocation weight;
the behavior recognition is carried out on the video frame based on the character space-time characteristics to obtain a recognition result, and the recognition result comprises the following steps:
and carrying out crowd behavior recognition on the video frame based on the updated global features to obtain a crowd behavior recognition result.
12. The method of claim 11, wherein after updating the global feature with the third attention allocation weight, the method further comprises:
the updated global features are used as new global features, the character spatiotemporal features are used as new character features, the global features and the character spatiotemporal features are updated iteratively until iteration stop conditions are met, and the iteratively updated global features and the character spatiotemporal features are obtained;
the behavior recognition is carried out on the video frame based on the character space-time characteristics to obtain a recognition result, and the recognition result comprises the following steps:
and carrying out crowd behavior recognition on the video frame based on the global features after iterative updating to obtain a crowd behavior recognition result.
13. The method of claim 12, wherein after obtaining the iteratively updated global features and the human spatio-temporal features, the method further comprises:
and performing character behavior recognition based on the character spatiotemporal characteristics after iterative updating to obtain a character behavior recognition result.
14. The method of claim 1, wherein the extracting the human feature in the video frame comprises:
identifying a human body in the video frame to obtain a target rectangular frame of each figure;
and extracting the features in the video frame, and matching the extracted features of the video frame by using the target rectangular frame in the video frame to obtain the corresponding character features.
15. A behavior recognition apparatus, comprising:
the character feature extraction unit is used for receiving an input video frame and extracting character features in the video frame;
the clustering unit is used for clustering a plurality of character features in the video frame to obtain a clustering result;
an attention distribution unit for determining attention distribution weight of human features in the video frames based on the clustering result;
a human character feature updating unit for updating the human character feature based on the attention distribution weight;
a figure spatiotemporal feature extraction unit, configured to extract figure spatiotemporal features based on the updated figure features;
the behavior identification unit is used for carrying out behavior identification on the video frame based on the character space-time characteristics to obtain an identification result;
the attention allocation unit is used for determining attention allocation weights among the character features based on the incidence relation among the character features in the clustering result.
16. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the memory-stored instructions to perform the method of any one of claims 1 to 14.
17. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 14.
CN202110974723.1A 2021-08-24 2021-08-24 Behavior recognition method and device, electronic equipment and storage medium Active CN113688729B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202110974723.1A CN113688729B (en) 2021-08-24 2021-08-24 Behavior recognition method and device, electronic equipment and storage medium
PCT/CN2022/074770 WO2023024438A1 (en) 2021-08-24 2022-01-28 Behavior recognition method and apparatus, electronic device, and storage medium
TW111108739A TW202309772A (en) 2021-08-24 2022-03-10 Behavior recognition method and apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110974723.1A CN113688729B (en) 2021-08-24 2021-08-24 Behavior recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113688729A CN113688729A (en) 2021-11-23
CN113688729B true CN113688729B (en) 2023-04-07

Family

ID=78581865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110974723.1A Active CN113688729B (en) 2021-08-24 2021-08-24 Behavior recognition method and device, electronic equipment and storage medium

Country Status (3)

Country Link
CN (1) CN113688729B (en)
TW (1) TW202309772A (en)
WO (1) WO2023024438A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688729B (en) * 2021-08-24 2023-04-07 上海商汤科技开发有限公司 Behavior recognition method and device, electronic equipment and storage medium
CN114764897A (en) * 2022-03-29 2022-07-19 深圳市移卡科技有限公司 Behavior recognition method, behavior recognition device, terminal equipment and storage medium
CN116665308B (en) * 2023-06-21 2024-01-23 石家庄铁道大学 Double interaction space-time feature extraction method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241834A (en) * 2018-07-27 2019-01-18 中山大学 A kind of group behavior recognition methods of the insertion based on hidden variable

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492157B (en) * 2018-10-24 2021-08-31 华侨大学 News recommendation method and theme characterization method based on RNN and attention mechanism
CN111310516B (en) * 2018-12-11 2023-08-29 杭州海康威视数字技术股份有限公司 Behavior recognition method and device
CN109858406B (en) * 2019-01-17 2023-04-07 西北大学 Key frame extraction method based on joint point information
CN109800737B (en) * 2019-02-02 2021-06-25 深圳市商汤科技有限公司 Face recognition method and device, electronic equipment and storage medium
CN110287879B (en) * 2019-06-26 2023-01-17 天津大学 Attention mechanism-based video behavior identification method
WO2021069945A1 (en) * 2019-10-09 2021-04-15 Toyota Motor Europe Method for recognizing activities using separate spatial and temporal attention weights
CN112651267A (en) * 2019-10-11 2021-04-13 阿里巴巴集团控股有限公司 Recognition method, model training, system and equipment
CN111125406B (en) * 2019-12-23 2023-08-04 天津大学 Visual relation detection method based on adaptive clustering learning
CN112580557A (en) * 2020-12-25 2021-03-30 深圳市优必选科技股份有限公司 Behavior recognition method and device, terminal equipment and readable storage medium
CN113139484B (en) * 2021-04-28 2023-07-11 上海商汤科技开发有限公司 Crowd positioning method and device, electronic equipment and storage medium
CN113688729B (en) * 2021-08-24 2023-04-07 上海商汤科技开发有限公司 Behavior recognition method and device, electronic equipment and storage medium


Also Published As

Publication number Publication date
WO2023024438A1 (en) 2023-03-02
CN113688729A (en) 2021-11-23
TW202309772A (en) 2023-03-01

Similar Documents

Publication Publication Date Title
CN109740516B (en) User identification method and device, electronic equipment and storage medium
CN113688729B (en) Behavior recognition method and device, electronic equipment and storage medium
US20210012523A1 (en) Pose Estimation Method and Device and Storage Medium
CN110569777B (en) Image processing method and device, electronic device and storage medium
CN110472091B (en) Image processing method and device, electronic equipment and storage medium
CN110532956B (en) Image processing method and device, electronic equipment and storage medium
CN109543536B (en) Image identification method and device, electronic equipment and storage medium
CN109522937B (en) Image processing method and device, electronic equipment and storage medium
CN109145970B (en) Image-based question and answer processing method and device, electronic equipment and storage medium
US11416703B2 (en) Network optimization method and apparatus, image processing method and apparatus, and storage medium
CN110781813B (en) Image recognition method and device, electronic equipment and storage medium
CN112241673A (en) Video method and device, electronic equipment and storage medium
CN112906484B (en) Video frame processing method and device, electronic equipment and storage medium
CN111582383B (en) Attribute identification method and device, electronic equipment and storage medium
CN111539410A (en) Character recognition method and device, electronic equipment and storage medium
CN111259967A (en) Image classification and neural network training method, device, equipment and storage medium
WO2022227562A1 (en) Identity recognition method and apparatus, and electronic device, storage medium and computer program product
CN113850275A (en) Image processing method, image processing device, electronic equipment and storage medium
CN110781975B (en) Image processing method and device, electronic device and storage medium
WO2023024439A1 (en) Behavior recognition method and apparatus, electronic device and storage medium
CN111178115A (en) Training method and system of object recognition network
CN112801116A (en) Image feature extraction method and device, electronic equipment and storage medium
CN114973359A (en) Expression recognition method and device, electronic equipment and storage medium
CN110765943A (en) Network training and recognition method and device, electronic equipment and storage medium
CN112131999A (en) Identity determination method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40054564; Country of ref document: HK)

GR01 Patent grant