CN114842411A

CN114842411A - Group behavior identification method based on complementary space-time information modeling

Info

Publication number: CN114842411A
Application number: CN202210342854.2A
Authority: CN
Inventors: 韩鸣飞; 王亚立; 乔宇
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2022-04-02
Filing date: 2022-04-02
Publication date: 2022-08-02
Also published as: WO2023185074A1

Abstract

The invention discloses a group behavior identification method based on complementary spatiotemporal information modeling. The method comprises the following steps: acquiring an individual feature vector in a target video; inputting the individual feature vector into a trained group behavior recognition model to obtain a group behavior recognition result, wherein the group behavior recognition model comprises a first modeling branch and a second modeling branch, the first modeling branch obtains enhanced individual features from the input individual features sequentially through a first space self-attention module and a first time self-attention module, and then recognizes all the enhanced individual features to obtain a first group behavior recognition result; and the second modeling branch sequentially passes through a second time self-attention module and a second space self-attention module to obtain enhanced individual features for the input individual features, and then identifies all the enhanced individual features to obtain a second group behavior identification result. The method improves the group behavior recognition accuracy and enhances the model robustness.

Description

Group behavior identification method based on complementary space-time information modeling

Technical Field

The invention relates to the technical field of deep learning, in particular to a group behavior identification method based on complementary spatiotemporal information modeling.

Background

The concept of deep learning is derived from the research of artificial neural networks, for example, a multi-layer perceptron including multiple hidden layers is a deep learning structure. Deep learning forms a more abstract class or feature of high-level representation properties by combining low-level features to discover a distributed feature representation of the data. The deep machine learning method is divided into supervised learning and unsupervised learning. Learning models built under different learning frameworks are different, for example, a Convolutional Neural Network (CNN) is a machine learning model under deep supervised learning.

Group behavior identification is an important research problem which is widely applied and urgently needs to be solved in the field of computer vision. With the development of deep learning technology, the breadth and depth of group behavior identification and understanding are also expanding continuously. The group behavior recognition is a technology for judging and classifying the behavior state of a target group in a video by analyzing the individual behavior content of an object contained in the video. In the analysis and understanding of the video, the spatio-temporal information modeling means that information interaction is simultaneously carried out in frames and between frames on pixel points of each frame of the video or feature points obtained by deep learning.

In the prior art, a paper (DOI: 10.1109/CVPR42600.2020.00092, Actor-transducers for Group Activity Recognition) proposes that a Transformer model is used for similarity modeling of different target objects in a video, and features of the target objects are enhanced; and further improve the accuracy of group behavior identification. The basic idea of this paper is: 1) extracting depth features of different frames of a video by using a deep neural network, and then extracting the depth features of all targets in different frames by using a target enclosure frame; 2) the obtained depth features of all targets are subjected to embedded operation and position coding and input into an autocorrelation attention module to obtain enhanced individual features of the targets; 3) distinguishing different targets respectively by using a classifier to obtain behavior categories of individuals; and fusing the characteristics of different targets to obtain the group depth characteristics, and distinguishing by using different classifiers to obtain the behavior category of the group. By testing on a VolleyballDataset, Collective Dataset and NBAdataset (International general video population behavior recognition data set), the accuracy of the population behavior recognition in the video is improved. However, the existing video group behavior recognition only considers one order of spatio-temporal modeling, neglecting that the spatio-temporal modeling can be realized through two orders, and neglecting that the two modeling modes have strong complementarity.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a group behavior identification method based on complementary spatiotemporal information modeling. The method comprises the following steps:

acquiring an individual feature vector in a target video;

inputting the individual characteristic vector into a trained group behavior recognition model to obtain a group behavior recognition result, wherein the group behavior recognition model comprises a first modeling branch and a second modeling branch, the first modeling branch obtains enhanced individual characteristics for the input individual characteristics sequentially through a first space self-attention module and a first time self-attention module, and then recognizes all the enhanced individual characteristics to obtain a first group behavior recognition result; the second modeling branch obtains enhanced individual features for the input individual features sequentially through a second time self-attention module and a second space self-attention module, and then identifies all the enhanced individual features to obtain a second group behavior identification result; the group behavior recognition result is obtained by fusing the first group behavior result and the second group behavior recognition result.

Compared with the prior art, the method has the advantages that the problem of group behavior category identification errors is solved by modeling two complementary space-time relations for the first time, and the robustness of the model is enhanced. In addition, by designing contrast loss functions between frames, between frames and videos and between videos, consistency of feature expressions learned by two modeling branches of time-space and space-time can be restrained, and the accuracy rate of group behavior identification is further improved.

Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a flow diagram of a method for group behavior recognition based on complementary spatiotemporal information modeling, according to one embodiment of the present invention;

FIG. 2 is an overall architecture diagram of a population behavior recognition method based on complementary spatiotemporal information modeling, according to one embodiment of the present invention;

FIG. 3 is a schematic diagram of a multi-level contrast loss function according to one embodiment of the present invention;

fig. 4 is a schematic diagram of an application scenario according to an embodiment of the present invention.

Detailed Description

Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.

In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.

Referring to fig. 1, the group behavior recognition method based on complementary spatiotemporal information modeling includes the following steps.

Step S110, constructing a group behavior recognition model, wherein the model comprises two modeling branches with dual spatiotemporal relations.

As shown in fig. 2, the constructed group behavior recognition model includes a space-time modeling branch and a time-space modeling branch, which respectively model the time and space relationships by using a self-attention mechanism, and are stacked in two different orders, i.e., space-time and time-space, to construct complementary spatio-temporal relationships, with duality.

For the space-time branch (or referred to as ST branch), the input individual features are sequentially subjected to a space attention module and a time attention module to obtain enhanced individual features, and can be further divided into group features and individual behaviors, so as to obtain a group behavior recognition result by using the group features. For the time-space branch (or TS branch), the input individual features are sequentially subjected to a time attention module and a space attention module to obtain enhanced individual features, and are divided into group features and individual behaviors, and then a group behavior recognition result is obtained based on the group features. And a final group behavior recognition result can be obtained by fusing the behavior recognition results of the two modeling branches. Dual space-time modeling is realized through the designed two branches.

In one embodiment, the individual features input into the two modeling branches are obtained by: for an input video, extracting K video frames, wherein each video frame comprises N individuals, and obtaining a feature vector X of all the individuals through a deep neural network and RoiAlign (region of interest alignment), namely K multiplied by N multiplied by C dimensional features, wherein C represents the latitude of the depth feature vector. ROI Align is a region feature clustering method for determining the target bounding box, and is not described in detail herein.

The temporal self-attention module is used for temporal relationship modeling, for example, the specific temporal relationship modeling manner is: for K frame features of each individual, namely K multiplied by C features, establishing K inter-feature relationships through a self-attention mechanism, and inputting feed-forward neural network (FFN) enhanced feature expression to obtain individual features after time relationship modeling enhancement.

The spatial self-attention module is used for spatial relationship modeling, for example, the specific spatial relationship modeling method is as follows: and (3) for N individual characteristics, namely characteristics of NxC, in each video frame, constructing N inter-characteristic relationships through a self-attention mechanism, and inputting FFN network enhanced characteristic expression to obtain the individual characteristics after spatial relationship modeling enhancement.

In the embodiment of the invention, the specific process of dual spatio-temporal modeling comprises the following steps:

step S1, inputting the original individual characteristics X into the space relation modeling and the time relation modeling in sequence to obtain all the individual characteristics X after the space-time relation modeling _ST And inputting the group behavior recognition result into a classifier to obtain a group behavior recognition result of the space-time branch.

Step S2, inputting the original individual characteristics X into time relation modeling and space relation modeling in sequence to obtain all the individual characteristics X after the time-space relation modeling _TS And inputting the information into a classifier to obtain a group behavior recognition result of the time-space branch.

And step S3, fusing the recognition results of the two modeling branches to obtain a final group behavior recognition result.

It should be noted that, in the embodiment of the present invention, the two modeling branches have dual complementary relationships, the temporal self-attention modules on each branch may have the same or different structures, and the spatial self-attention modules may also have the same or slightly different structures. For example, the number of self-attention module stacks may be different.

And step S120, training a group behavior recognition model based on the video sample set, and designing a multi-level contrast loss function to constrain the consistency of the two modeling branch feature expressions in the training process.

In order to further improve the group behavior recognition accuracy, consistency of feature expressions learned by the time-space modeling branch and the space-time modeling branch can be restricted in the model training process. Considering that the characteristics of the same individual under different space-time modeling are consistent as much as possible and are different from the characteristics of other individuals, local-to-global multi-level characteristic constraints are designed for two modeling branches.

For example, frame-to-frame, frame-to-video, and video-to-video contrast loss functions are designed to constrain the consistency of the feature expressions learned by the two modeling branches. Fig. 3 is a schematic diagram of a contrast loss function, where fig. 3(a) is an individual contrast loss function between frames, fig. 3(b) is an individual contrast loss function between frames and video, and fig. 3(c) is an individual contrast loss function between video and video.

In one embodiment, the frame-to-frame contrast loss function is set to:

wherein h represents the cosine similarity CosSim,

representing the characteristics of the nth individual kth frame on the ST branch,

the characteristics of the nth frame of the TS branch are shown, and t represents the t frame of the current video. For the nth individual of the ST branch, the feature consistency in its kth frame with its feature in the kth frame of the TS branch can be maximized using this function.

In one embodiment, the contrast loss function between a frame and a video:

wherein the content of the first and second substances,

video level characteristics representing the nth individual of the TS branch,

represents the ith individual video level feature of all videos in the TS branch Batch, B represents the Batch size, i.e. the number of videos contained in the data of one Batch, N represents the individual number,

the video level feature of the nth individual, representing the current video in the TS branch, is obtained by the K frame features max _ pooling of the nth individual. For the nth individual of the ST branch, the function can be used to maximize the correspondence of its features in the kth frame with its video-level features in the TS branch.

In one embodiment, the contrast loss function between video and video is set to:

wherein the content of the first and second substances,

respectively representing the video level characteristics of the nth individual in the current video. The function can be used for maximizing the consistency of video characteristics of the same individual in ST and TS branches.

In the model training process, preferably, three kinds of loss functions can be fused to obtain an overall loss function, so as to maximize the distribution consistency of the two modeling branches between frames, between frames and between videos as the optimal parameters of the model obtained by the goal, for example, the overall loss function is designed to be the direct addition or weighted fusion of the three kinds of loss functions.

It should be understood that, on the premise of designing dual space-time modeling, the recognition accuracy of the model can be improved to a certain extent by adopting one loss function or the fusion of two loss functions as an overall loss function. In addition, in the model training process, different parameters can be set according to training efficiency and model precision requirements, for example, K is 3, N is 12, and C is 1024.

And step S130, performing behavior recognition on the target video by using the trained group behavior recognition model.

The trained model can be used for group behavior recognition of an actual target video, and during actual recognition, a multi-level contrast loss function is not calculated, and the rest is basically consistent with the training process, so that the details are not repeated.

The trained model obtained by the invention can be applied to a client or a server to realize group behavior recognition or analysis in a target video under various scenes, and is shown in fig. 4. The server may be a computer, a cluster, or a cloud server. The client can be an electronic device such as a smart phone, a tablet computer, a smart wearable device (smart watch, virtual reality glasses, virtual reality helmet, etc.), a smart car-mounted device, and a computer. The computer may be, for example, a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, or the like.

To further verify the effect of the present invention, experiments were performed. The experimental results show that the invention achieves the optimal recognition results on the data of Volleyball Dataset, Collective Dataset and NBA Dataset compared with the prior art. Meanwhile, the dual space-time relation modeling module provided by the invention can be combined with a group behavior recognition method of any space-time modeling, and can further improve the group behavior recognition accuracy of the algorithm and reduce the dependence on the training data volume by combining with the characteristic constraint based on the multi-level comparison loss function.

The invention can be applied to various scenes, such as security of smart cities. The crowd behavior patterns in the monitoring video are different, so that high-risk behaviors of the crowd, such as fighting, illegal gathering and the like, can be more effectively identified. Meanwhile, the method has low demand on training data, and is more suitable for being used for recognizing the rare high-risk group behaviors. For example, in an urban intelligent security scene, the crowd behaviors in a monitoring range are accurately judged, and identification and alarm for the behaviors harmful to the city are assisted. In automatic driving, at a traffic crossing, the behavior mode of a crowd is judged, and the safety automatic driving behavior is very important. For another example, in field animal monitoring, a large number of cameras are arranged in some important animal protection and wild animal research areas, and a group behavior recognition technology is an important basis for automation of wild animal state detection.

In summary, compared with the prior art, the invention has the following advantages:

1) and the behavior modeling is carried out on the individual in the video by utilizing two complementary dual space-time modeling methods for the first time, so that the accuracy rate of behavior identification is improved. The modeling of the space-time relation in the prior art is only carried out in a time-space or space-time mode, and the complementarity of two sequences of the space-time modeling is neglected.

2) According to the invention, the consistency of individual characteristics between two modeling branches is restricted from three levels of frames, frames and videos, namely, a multi-level contrast loss function is adopted, so that the design further improves the recognition accuracy and the robustness of the model, and reduces the dependence on training data. Whereas conventional contrast loss functions only focus on video-to-video consistency.

The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.

The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C + +, Python, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.

Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims

1. A group behavior identification method based on complementary spatiotemporal information modeling comprises the following steps:

acquiring an individual feature vector in a target video;

2. The method of claim 1, wherein the individual feature vectors are obtained by:

extracting K video frames aiming at a target video, wherein each video frame comprises N individuals;

and aligning the N individuals to RoiAlign through a deep neural network and the region of interest to obtain the feature vectors of the N individuals.

3. The method of claim 1, wherein in the training of the group behavior recognition model, a frame-to-frame contrast loss function, a frame-to-video contrast loss function, and a video-to-video contrast loss function are fused, and consistency of features between the first modeling branch and the second modeling branch is constrained from three levels.

4. The method of claim 3, wherein the frame-to-frame contrast loss function is set to:

wherein h represents cosine similarity CosSim,

features of the nth individual kth frame on the first modeling branch are represented,

features of the nth individual kth frame on the second modeling branch are represented, t represents an index of the frame, and K represents the number of frames.

5. The method of claim 3, wherein the contrast loss function between the frame and the video is set to:

Wherein the content of the first and second substances,

representing the video level characteristics of the nth individual on the second modeled branch,

represents the video level characteristics of the ith individual in all videos in a batch on the second modeling branch, B represents the number of videos contained in a batch, N represents the number of individuals,

representing features of the nth individual kth frame on the first modeling branch.

6. The method of claim 3, wherein the contrast loss function between the video and the video is set as:

wherein the content of the first and second substances,

video level characteristics of the nth individual in the video on the second modeling branch and the first modeling branch are respectively represented.

7. The approach of claim 1, wherein the first modeling branch and the second modeling branch have dual complementary relationships, the first spatial self-attention module and the second spatial self-attention module have the same or different structures, and the first temporal self-attention module and the second temporal self-attention module have the same or different structures.

8. The method according to claim 1, wherein the first time self-attention module and the second time self-attention module are used for time relationship modeling, and for multi-frame features of each individual, the relationships among the features are constructed through a self-attention mechanism and then input to a feed-forward neural network to enhance feature expression, so as to obtain individual features after time relationship modeling enhancement; the first space self-attention module and the second space self-attention module are used for space relation modeling, for a plurality of individual features in each video frame, the relation among the plurality of features is constructed through a self-attention mechanism, and a feedforward neural network enhanced feature expression is input to obtain the individual features after the space relation modeling is enhanced.

9. A computer-readable storage medium, on which a computer program is stored, wherein the computer program realizes the steps of the method according to any one of claims 1 to 8 when executed by a processor.

10. A computer device comprising a memory and a processor, on which memory a computer program is stored which is executable on the processor, characterized in that the processor realizes the steps of the method according to any one of claims 1 to 8 when executing the computer program.