CN114842411A - Group behavior identification method based on complementary space-time information modeling - Google Patents

Group behavior identification method based on complementary space-time information modeling

Info

Publication number
CN114842411A
CN114842411A (application CN202210342854.2A)
Authority
CN
China
Prior art keywords
modeling
individual
video
group behavior
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210342854.2A
Other languages
Chinese (zh)
Inventor
韩鸣飞
王亚立
乔宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202210342854.2A priority Critical patent/CN114842411A/en
Publication of CN114842411A publication Critical patent/CN114842411A/en
Priority to PCT/CN2022/136900 priority patent/WO2023185074A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/25 Fusion techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a group behavior identification method based on complementary spatiotemporal information modeling. The method comprises the following steps: acquiring an individual feature vector in a target video; and inputting the individual feature vector into a trained group behavior recognition model to obtain a group behavior recognition result. The group behavior recognition model comprises a first modeling branch and a second modeling branch. The first modeling branch obtains enhanced individual features from the input individual features sequentially through a first spatial self-attention module and a first temporal self-attention module, and then recognizes all the enhanced individual features to obtain a first group behavior recognition result; the second modeling branch obtains enhanced individual features from the input individual features sequentially through a second temporal self-attention module and a second spatial self-attention module, and then recognizes all the enhanced individual features to obtain a second group behavior recognition result. The method improves group behavior recognition accuracy and enhances model robustness.

Description

Group behavior identification method based on complementary space-time information modeling
Technical Field
The invention relates to the technical field of deep learning, in particular to a group behavior identification method based on complementary spatiotemporal information modeling.
Background
The concept of deep learning is derived from research on artificial neural networks; for example, a multi-layer perceptron containing multiple hidden layers is a deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, so as to discover distributed feature representations of data. Deep learning methods are divided into supervised learning and unsupervised learning, and the learning models built under different learning frameworks differ; for example, the Convolutional Neural Network (CNN) is a machine learning model under deep supervised learning.
Group behavior identification is an important research problem in the field of computer vision that is widely applied and urgently needs to be solved. With the development of deep learning technology, the breadth and depth of group behavior identification and understanding keep expanding. Group behavior recognition is a technology that judges and classifies the behavior state of a target group in a video by analyzing the individual behavior of the objects contained in the video. In video analysis and understanding, spatio-temporal information modeling means that information interaction is carried out both within frames and between frames, on the pixels of each video frame or on feature points obtained by deep learning.
In the prior art, the paper "Actor-Transformers for Group Activity Recognition" (DOI: 10.1109/CVPR42600.2020.00092) proposes using a Transformer model to model the similarity of different target objects in a video and to enhance the features of the target objects, thereby further improving the accuracy of group behavior recognition. The basic idea of this paper is: 1) extract depth features of different frames of a video with a deep neural network, and then extract the depth features of all targets in different frames using the target bounding boxes; 2) apply an embedding operation and position encoding to the obtained depth features of all targets and input them into a self-attention module to obtain enhanced individual features of the targets; 3) classify the different targets separately with a classifier to obtain the behavior category of each individual, fuse the features of different targets to obtain a group depth feature, and classify it with another classifier to obtain the behavior category of the group. Testing on the Volleyball Dataset, Collective Dataset and NBA Dataset (internationally used video group behavior recognition datasets) shows improved accuracy of group behavior recognition in videos. However, existing video group behavior recognition considers only one order of spatio-temporal modeling, neglecting that spatio-temporal modeling can be realized in two orders and that the two modeling orders are strongly complementary.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a group behavior identification method based on complementary spatiotemporal information modeling. The method comprises the following steps:
acquiring an individual feature vector in a target video;
inputting the individual feature vector into a trained group behavior recognition model to obtain a group behavior recognition result, wherein the group behavior recognition model comprises a first modeling branch and a second modeling branch; the first modeling branch obtains enhanced individual features from the input individual features sequentially through a first spatial self-attention module and a first temporal self-attention module, and then recognizes all the enhanced individual features to obtain a first group behavior recognition result; the second modeling branch obtains enhanced individual features from the input individual features sequentially through a second temporal self-attention module and a second spatial self-attention module, and then recognizes all the enhanced individual features to obtain a second group behavior recognition result; and the group behavior recognition result is obtained by fusing the first group behavior recognition result and the second group behavior recognition result.
Compared with the prior art, the invention models two complementary spatio-temporal relations for the first time, which solves the problem of group behavior category recognition errors and enhances the robustness of the model. In addition, by designing frame-to-frame, frame-to-video and video-to-video contrast loss functions, the consistency of the feature expressions learned by the time-space and space-time modeling branches can be constrained, which further improves the accuracy of group behavior recognition.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow diagram of a method for group behavior recognition based on complementary spatiotemporal information modeling, according to one embodiment of the present invention;
FIG. 2 is an overall architecture diagram of a population behavior recognition method based on complementary spatiotemporal information modeling, according to one embodiment of the present invention;
FIG. 3 is a schematic diagram of a multi-level contrast loss function according to one embodiment of the present invention;
fig. 4 is a schematic diagram of an application scenario according to an embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Referring to fig. 1, the group behavior recognition method based on complementary spatiotemporal information modeling includes the following steps.
Step S110, constructing a group behavior recognition model, wherein the model comprises two modeling branches with dual spatiotemporal relations.
As shown in fig. 2, the constructed group behavior recognition model includes a space-time modeling branch and a time-space modeling branch. Each branch models the temporal and spatial relationships with a self-attention mechanism, but stacks them in one of two different orders, space-time or time-space, so that the two branches construct complementary, dual spatio-temporal relations.
For the space-time branch (or ST branch), the input individual features sequentially pass through a spatial attention module and a temporal attention module to obtain enhanced individual features, from which group features and individual behaviors can further be derived, and a group behavior recognition result is obtained using the group features. For the time-space branch (or TS branch), the input individual features sequentially pass through a temporal attention module and a spatial attention module to obtain enhanced individual features, from which group features and individual behaviors are likewise derived, and a group behavior recognition result is then obtained based on the group features. The final group behavior recognition result is obtained by fusing the behavior recognition results of the two modeling branches. Dual space-time modeling is realized through the two designed branches.
In one embodiment, the individual features input into the two modeling branches are obtained as follows: for an input video, K video frames are extracted, each containing N individuals, and the feature vectors X of all individuals, i.e. a K×N×C-dimensional feature tensor, are obtained through a deep neural network and RoIAlign (region-of-interest alignment), where C denotes the dimension of the depth feature vector. RoIAlign is a region feature aggregation method that extracts features within a given target bounding box and is not described in detail here.
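A minimal sketch of this feature-extraction step is given below. The ResNet-18 trunk, the RoI output size, and the linear embedding are assumptions for illustration only (the patent does not specify the backbone); torchvision's roi_align stands in for RoIAlign, and the default values K=3, N=12, C=1024 follow the example parameters mentioned later in this description.

```python
import torch
import torchvision
from torchvision.ops import roi_align

def extract_individual_features(frames, boxes, K=3, N=12, C=1024):
    """frames: (K, 3, H, W) sampled video frames.
    boxes: (K*N, 5) RoIs as (frame_index, x1, y1, x2, y2), ordered frame-major.
    Returns a (K, N, C) tensor of individual feature vectors X."""
    backbone = torchvision.models.resnet18(weights=None)
    trunk = torch.nn.Sequential(*list(backbone.children())[:-2])   # keep the conv trunk only
    fmap = trunk(frames)                                           # (K, 512, H/32, W/32)
    # pool a fixed-size crop of the feature map for every individual bounding box
    crops = roi_align(fmap, boxes, output_size=(5, 5), spatial_scale=1.0 / 32)
    feats = crops.flatten(1)                                       # (K*N, 512*5*5)
    proj = torch.nn.Linear(feats.shape[1], C)                      # embed to C dimensions
    return proj(feats).view(K, N, C)
```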
The temporal self-attention module is used for temporal relationship modeling. For example, the specific temporal relationship modeling is as follows: for the K frame features of each individual, i.e. K×C features, the relationships among the K features are established through a self-attention mechanism, and the result is fed into a feed-forward network (FFN) to enhance the feature expression, yielding individual features enhanced by temporal relationship modeling.
The spatial self-attention module is used for spatial relationship modeling. For example, the specific spatial relationship modeling is as follows: for the N individual features in each video frame, i.e. N×C features, the relationships among the N features are established through a self-attention mechanism, and the result is fed into an FFN to enhance the feature expression, yielding individual features enhanced by spatial relationship modeling.
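The two relation-modeling modules described above can be sketched as follows. The head count, FFN width, and residual/LayerNorm placement are assumptions, since the patent only specifies self-attention followed by an FFN.

```python
import torch
import torch.nn as nn

class RelationBlock(nn.Module):
    """Self-attention over one axis (time or space) followed by an FFN, with residuals."""
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, length, dim)
        attended, _ = self.attn(x, x, x)
        x = self.norm1(x + attended)
        return self.norm2(x + self.ffn(x))

def temporal_modeling(block, x):               # x: (K, N, C)
    # treat each individual as a sequence of its K frame features
    return block(x.permute(1, 0, 2)).permute(1, 0, 2)

def spatial_modeling(block, x):                # x: (K, N, C)
    # treat each frame as a sequence of its N individual features
    return block(x)
```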
In the embodiment of the invention, the specific process of dual spatio-temporal modeling comprises the following steps:
step S1, inputting the original individual characteristics X into the space relation modeling and the time relation modeling in sequence to obtain all the individual characteristics X after the space-time relation modeling ST And inputting the group behavior recognition result into a classifier to obtain a group behavior recognition result of the space-time branch.
Step S2, inputting the original individual characteristics X into time relation modeling and space relation modeling in sequence to obtain all the individual characteristics X after the time-space relation modeling TS And inputting the information into a classifier to obtain a group behavior recognition result of the time-space branch.
And step S3, fusing the recognition results of the two modeling branches to obtain a final group behavior recognition result.
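Building on the sketch above, steps S1 to S3 might look as follows for a single video (no batch dimension, for simplicity). The mean-pooling used to form the group feature, the linear classifiers, and the score averaging in step S3 are assumptions, since the patent does not fix the classifier or the fusion rule.

```python
class DualSpatioTemporalModel(nn.Module):
    def __init__(self, dim=1024, num_classes=8):
        super().__init__()
        self.st_spatial, self.st_temporal = RelationBlock(dim), RelationBlock(dim)
        self.ts_temporal, self.ts_spatial = RelationBlock(dim), RelationBlock(dim)
        self.st_cls, self.ts_cls = nn.Linear(dim, num_classes), nn.Linear(dim, num_classes)

    def forward(self, x):                                    # x: (K, N, C) individual features
        # S1: space -> time (ST branch), then classify the pooled group feature
        x_st = temporal_modeling(self.st_temporal, spatial_modeling(self.st_spatial, x))
        score_st = self.st_cls(x_st.mean(dim=(0, 1)))
        # S2: time -> space (TS branch)
        x_ts = spatial_modeling(self.ts_spatial, temporal_modeling(self.ts_temporal, x))
        score_ts = self.ts_cls(x_ts.mean(dim=(0, 1)))
        # S3: fuse the predictions of the two branches
        return (score_st + score_ts) / 2, x_st, x_ts
```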
It should be noted that, in the embodiment of the present invention, the two modeling branches have dual complementary relationships, the temporal self-attention modules on each branch may have the same or different structures, and the spatial self-attention modules may also have the same or slightly different structures. For example, the number of self-attention module stacks may be different.
And step S120, training a group behavior recognition model based on the video sample set, and designing a multi-level contrast loss function to constrain the consistency of the two modeling branch feature expressions in the training process.
In order to further improve group behavior recognition accuracy, the consistency of the feature expressions learned by the time-space modeling branch and the space-time modeling branch can be constrained during model training. Considering that the features of the same individual under the two spatio-temporal modeling orders should be as consistent as possible while remaining different from the features of other individuals, local-to-global multi-level feature constraints are designed for the two modeling branches.
For example, frame-to-frame, frame-to-video, and video-to-video contrast loss functions are designed to constrain the consistency of the feature expressions learned by the two modeling branches. Fig. 3 is a schematic diagram of a contrast loss function, where fig. 3(a) is an individual contrast loss function between frames, fig. 3(b) is an individual contrast loss function between frames and video, and fig. 3(c) is an individual contrast loss function between video and video.
In one embodiment, the frame-to-frame contrast loss function is set to:

\mathcal{L}_{f2f} = -\sum_{n=1}^{N}\sum_{k=1}^{K}\log\frac{\exp\big(h(x_{n,k}^{ST},\, x_{n,k}^{TS})\big)}{\sum_{t=1}^{K}\exp\big(h(x_{n,k}^{ST},\, x_{n,t}^{TS})\big)}

where h denotes the cosine similarity CosSim, x_{n,k}^{ST} denotes the feature of the k-th frame of the n-th individual on the ST branch, x_{n,t}^{TS} denotes the feature of the t-th frame of the n-th individual on the TS branch, and t indexes the frames of the current video. For the n-th individual of the ST branch, this function maximizes the consistency of its feature in the k-th frame with its feature in the k-th frame of the TS branch.
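A sketch of how this term could be computed for all individuals of one video is given below, under the contrastive reading of the formula above: cosine similarities serve as logits and the matching frame index is the positive. A temperature term, which the patent does not mention, is omitted.

```python
import torch
import torch.nn.functional as F

def frame_to_frame_loss(x_st, x_ts):
    """x_st, x_ts: (K, N, C) enhanced features of the ST and TS branches."""
    K, N, _ = x_st.shape
    st = F.normalize(x_st, dim=-1).permute(1, 0, 2)      # (N, K, C)
    ts = F.normalize(x_ts, dim=-1).permute(1, 0, 2)      # (N, K, C)
    sim = st @ ts.transpose(1, 2)                        # (N, K, K): h(x_st[n,k], x_ts[n,t])
    target = torch.arange(K).repeat(N)                   # positive pair: the same frame index
    return F.cross_entropy(sim.reshape(N * K, K), target)
```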
In one embodiment, the contrast loss function between a frame and a video is set to:

\mathcal{L}_{f2v} = -\sum_{n=1}^{N}\sum_{k=1}^{K}\log\frac{\exp\big(h(x_{n,k}^{ST},\, \bar{x}_{n}^{TS})\big)}{\sum_{b=1}^{B}\sum_{i=1}^{N}\exp\big(h(x_{n,k}^{ST},\, \bar{x}_{b,i}^{TS})\big)}

where \bar{x}_{n}^{TS} denotes the video-level feature of the n-th individual of the current video on the TS branch, obtained by max-pooling the K frame features of that individual; \bar{x}_{b,i}^{TS} denotes the video-level feature of the i-th individual of the b-th video in the batch on the TS branch; B denotes the batch size, i.e. the number of videos contained in one batch of data; N denotes the number of individuals; and x_{n,k}^{ST} denotes the feature of the k-th frame of the n-th individual on the ST branch. For the n-th individual of the ST branch, this function maximizes the consistency of its feature in the k-th frame with its video-level feature on the TS branch.
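A corresponding sketch for this term follows, continuing the functions above and under the same contrastive reading: the positive for each ST-branch frame feature is the same individual's video-level TS feature, and the video-level TS features of all individuals in the batch serve as the contrast set. The batch bookkeeping via own_index is an assumed convention.

```python
def frame_to_video_loss(x_st, v_ts_batch, own_index):
    """x_st: (K, N, C) ST-branch frame features of the current video.
    v_ts_batch: (B*N, C) TS-branch video-level features (max-pooled over frames) of
    every individual in the batch.
    own_index: (N,) long tensor with the rows of v_ts_batch that belong to the
    current video's individuals."""
    K, N, _ = x_st.shape
    st = F.normalize(x_st, dim=-1).reshape(K * N, -1)    # frame-major ordering
    vt = F.normalize(v_ts_batch, dim=-1)
    sim = st @ vt.t()                                    # (K*N, B*N)
    target = own_index.repeat(K)                         # each frame's positive is its own individual
    return F.cross_entropy(sim, target)
```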
In one embodiment, the contrast loss function between video and video is set to:

\mathcal{L}_{v2v} = -\sum_{n=1}^{N}\log\frac{\exp\big(h(\bar{x}_{n}^{ST},\, \bar{x}_{n}^{TS})\big)}{\sum_{b=1}^{B}\sum_{i=1}^{N}\exp\big(h(\bar{x}_{n}^{ST},\, \bar{x}_{b,i}^{TS})\big)}

where \bar{x}_{n}^{ST} and \bar{x}_{n}^{TS} denote the video-level features of the n-th individual of the current video on the ST branch and the TS branch, respectively. This function maximizes the consistency of the video-level features of the same individual on the ST and TS branches.
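A sketch of the video-to-video term, continuing the functions above: each individual's ST-branch video-level feature is matched against its TS-branch counterpart and, as an assumption, contrasted against the video-level features of all individuals in the batch.

```python
def video_to_video_loss(v_st, v_ts_batch, own_index):
    """v_st: (N, C) ST-branch video-level features of the current video's individuals."""
    sim = F.normalize(v_st, dim=-1) @ F.normalize(v_ts_batch, dim=-1).t()   # (N, B*N)
    return F.cross_entropy(sim, own_index)
```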
In the model training process, the three kinds of loss functions are preferably fused into an overall loss function, so that the optimal model parameters are obtained with the goal of maximizing the distributional consistency of the two modeling branches at the frame-to-frame, frame-to-video and video-to-video levels; for example, the overall loss function is designed as the direct addition or a weighted fusion of the three kinds of loss functions.
It should be understood that, on the premise of designing dual space-time modeling, the recognition accuracy of the model can be improved to a certain extent by adopting one loss function or the fusion of two loss functions as an overall loss function. In addition, in the model training process, different parameters can be set according to training efficiency and model precision requirements, for example, K is 3, N is 12, and C is 1024.
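For example, a possible overall training objective could be sketched as follows, assuming equal weights for the three contrast terms and a standard cross-entropy classification loss on the fused group prediction; the weighting and the classification loss are design choices the patent leaves open, and the inputs are assumed to come from the dual-branch model and batch bookkeeping sketched earlier.

```python
def total_loss(group_score, label, x_st, x_ts, v_ts_batch, own_index):
    """Equal-weight sum of the group classification loss and the three contrast terms."""
    v_st = x_st.max(dim=0).values                        # (N, C) video-level ST features
    return (F.cross_entropy(group_score.unsqueeze(0), label.view(1))
            + frame_to_frame_loss(x_st, x_ts)
            + frame_to_video_loss(x_st, v_ts_batch, own_index)
            + video_to_video_loss(v_st, v_ts_batch, own_index))
```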
And step S130, performing behavior recognition on the target video by using the trained group behavior recognition model.
The trained model can be used for group behavior recognition on an actual target video. During actual recognition, the multi-level contrast loss function is not calculated; the rest of the process is basically consistent with the training process and is therefore not repeated here.
The trained model obtained by the invention can be applied on a client or a server to realize group behavior recognition or analysis in target videos under various scenes, as shown in fig. 4. The server may be a computer, a cluster, or a cloud server. The client may be an electronic device such as a smart phone, a tablet computer, a smart wearable device (smart watch, virtual reality glasses, virtual reality helmet, etc.), a smart in-vehicle device, or a computer. The computer may be, for example, a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, or the like.
To further verify the effect of the present invention, experiments were performed. The experimental results show that, compared with the prior art, the invention achieves the best recognition results on the Volleyball Dataset, Collective Dataset and NBA Dataset. Meanwhile, the dual spatio-temporal relation modeling module proposed by the invention can be combined with any spatio-temporal group behavior recognition method, and, combined with the feature constraint based on the multi-level contrast loss function, can further improve the group behavior recognition accuracy of the algorithm and reduce the dependence on the amount of training data.
The invention can be applied to various scenes, such as smart-city security. By distinguishing the behavior patterns of crowds in surveillance video, high-risk group behaviors such as fighting and illegal gathering can be identified more effectively. Meanwhile, the method has a low demand for training data and is therefore more suitable for recognizing rare high-risk group behaviors. For example, in an urban intelligent security scene, the behaviors of crowds within the monitoring range are accurately judged, assisting the identification of and alarm for behaviors harmful to the city. In autonomous driving, judging the behavior pattern of the crowd at a traffic intersection is very important for safe automated driving. As another example, in field animal monitoring, a large number of cameras are deployed in important animal protection and wildlife research areas, and group behavior recognition technology is an important basis for automating the detection of wild animal states.
In summary, compared with the prior art, the invention has the following advantages:
1) The invention performs behavior modeling on the individuals in a video with two complementary, dual spatio-temporal modeling orders for the first time, which improves the accuracy of behavior recognition. Prior-art spatio-temporal relation modeling is carried out in only one order, time-space or space-time, and neglects the complementarity of the two orders of spatio-temporal modeling.
2) The invention constrains the consistency of individual features between the two modeling branches at three levels, frame-to-frame, frame-to-video and video-to-video, i.e. a multi-level contrast loss function is adopted. This design further improves the recognition accuracy and the robustness of the model and reduces the dependence on training data, whereas conventional contrast loss functions only focus on video-to-video consistency.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, Python, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

1. A group behavior identification method based on complementary spatiotemporal information modeling comprises the following steps:
acquiring an individual feature vector in a target video;
inputting the individual feature vector into a trained group behavior recognition model to obtain a group behavior recognition result, wherein the group behavior recognition model comprises a first modeling branch and a second modeling branch; the first modeling branch obtains enhanced individual features from the input individual features sequentially through a first spatial self-attention module and a first temporal self-attention module, and then recognizes all the enhanced individual features to obtain a first group behavior recognition result; the second modeling branch obtains enhanced individual features from the input individual features sequentially through a second temporal self-attention module and a second spatial self-attention module, and then recognizes all the enhanced individual features to obtain a second group behavior recognition result; and the group behavior recognition result is obtained by fusing the first group behavior recognition result and the second group behavior recognition result.
2. The method of claim 1, wherein the individual feature vectors are obtained by:
extracting K video frames aiming at a target video, wherein each video frame comprises N individuals;
and obtaining the feature vectors of the N individuals through a deep neural network and region-of-interest alignment (RoIAlign).
3. The method of claim 1, wherein in the training of the group behavior recognition model, a frame-to-frame contrast loss function, a frame-to-video contrast loss function, and a video-to-video contrast loss function are fused, and consistency of features between the first modeling branch and the second modeling branch is constrained from three levels.
4. The method of claim 3, wherein the frame-to-frame contrast loss function is set to:

\mathcal{L}_{f2f} = -\sum_{n=1}^{N}\sum_{k=1}^{K}\log\frac{\exp\big(h(x_{n,k}^{ST},\, x_{n,k}^{TS})\big)}{\sum_{t=1}^{K}\exp\big(h(x_{n,k}^{ST},\, x_{n,t}^{TS})\big)}

wherein h represents the cosine similarity CosSim, x_{n,k}^{ST} represents the feature of the k-th frame of the n-th individual on the first modeling branch, x_{n,t}^{TS} represents the feature of the t-th frame of the n-th individual on the second modeling branch, t represents the frame index, and K represents the number of frames.
5. The method of claim 3, wherein the contrast loss function between the frame and the video is set to:

\mathcal{L}_{f2v} = -\sum_{n=1}^{N}\sum_{k=1}^{K}\log\frac{\exp\big(h(x_{n,k}^{ST},\, \bar{x}_{n}^{TS})\big)}{\sum_{b=1}^{B}\sum_{i=1}^{N}\exp\big(h(x_{n,k}^{ST},\, \bar{x}_{b,i}^{TS})\big)}

wherein \bar{x}_{n}^{TS} represents the video-level feature of the n-th individual on the second modeling branch, \bar{x}_{b,i}^{TS} represents the video-level feature of the i-th individual of the b-th video in a batch on the second modeling branch, B represents the number of videos contained in a batch, N represents the number of individuals, and x_{n,k}^{ST} represents the feature of the k-th frame of the n-th individual on the first modeling branch.
6. The method of claim 3, wherein the contrast loss function between the video and the video is set as:

\mathcal{L}_{v2v} = -\sum_{n=1}^{N}\log\frac{\exp\big(h(\bar{x}_{n}^{ST},\, \bar{x}_{n}^{TS})\big)}{\sum_{b=1}^{B}\sum_{i=1}^{N}\exp\big(h(\bar{x}_{n}^{ST},\, \bar{x}_{b,i}^{TS})\big)}

wherein \bar{x}_{n}^{TS} and \bar{x}_{n}^{ST} represent the video-level features of the n-th individual in the video on the second modeling branch and the first modeling branch, respectively.
7. The method of claim 1, wherein the first modeling branch and the second modeling branch have a dual complementary relationship, the first spatial self-attention module and the second spatial self-attention module have the same or different structures, and the first temporal self-attention module and the second temporal self-attention module have the same or different structures.
8. The method according to claim 1, wherein the first temporal self-attention module and the second temporal self-attention module are used for temporal relationship modeling: for the multi-frame features of each individual, the relationships among the features are constructed through a self-attention mechanism and then input to a feed-forward neural network to enhance the feature expression, so as to obtain individual features enhanced by temporal relationship modeling; and the first spatial self-attention module and the second spatial self-attention module are used for spatial relationship modeling: for the plurality of individual features in each video frame, the relationships among the features are constructed through a self-attention mechanism and then input to a feed-forward neural network to enhance the feature expression, so as to obtain individual features enhanced by spatial relationship modeling.
9. A computer-readable storage medium, on which a computer program is stored, wherein the computer program realizes the steps of the method according to any one of claims 1 to 8 when executed by a processor.
10. A computer device comprising a memory and a processor, on which memory a computer program is stored which is executable on the processor, characterized in that the processor realizes the steps of the method according to any one of claims 1 to 8 when executing the computer program.
CN202210342854.2A 2022-04-02 2022-04-02 Group behavior identification method based on complementary space-time information modeling Pending CN114842411A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210342854.2A CN114842411A (en) 2022-04-02 2022-04-02 Group behavior identification method based on complementary space-time information modeling
PCT/CN2022/136900 WO2023185074A1 (en) 2022-04-02 2022-12-06 Group behavior recognition method based on complementary spatio-temporal information modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210342854.2A CN114842411A (en) 2022-04-02 2022-04-02 Group behavior identification method based on complementary space-time information modeling

Publications (1)

Publication Number Publication Date
CN114842411A true CN114842411A (en) 2022-08-02

Family

ID=82564688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210342854.2A Pending CN114842411A (en) 2022-04-02 2022-04-02 Group behavior identification method based on complementary space-time information modeling

Country Status (2)

Country Link
CN (1) CN114842411A (en)
WO (1) WO2023185074A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023185074A1 (en) * 2022-04-02 2023-10-05 深圳先进技术研究院 Group behavior recognition method based on complementary spatio-temporal information modeling

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523462B (en) * 2020-04-22 2024-02-09 南京工程学院 Video sequence expression recognition system and method based on self-attention enhanced CNN
US11195024B1 (en) * 2020-07-10 2021-12-07 International Business Machines Corporation Context-aware action recognition by dual attention networks
CN112131943B (en) * 2020-08-20 2023-07-11 深圳大学 Dual-attention model-based video behavior recognition method and system
CN113947714B (en) * 2021-09-29 2022-09-13 广州赋安数字科技有限公司 Multi-mode collaborative optimization method and system for video monitoring and remote sensing
CN114842411A (en) * 2022-04-02 2022-08-02 深圳先进技术研究院 Group behavior identification method based on complementary space-time information modeling

Also Published As

Publication number Publication date
WO2023185074A1 (en) 2023-10-05

Similar Documents

Publication Publication Date Title
US20180114071A1 (en) Method for analysing media content
CN112424769A (en) System and method for geographic location prediction
CN111325319B (en) Neural network model detection method, device, equipment and storage medium
CN111444744A (en) Living body detection method, living body detection device, and storage medium
EP3249610B1 (en) A method, an apparatus and a computer program product for video object segmentation
KR20220076398A (en) Object recognition processing apparatus and method for ar device
US11055572B2 (en) System and method of training an appearance signature extractor
CN113111782A (en) Video monitoring method and device based on salient object detection
CN111652181B (en) Target tracking method and device and electronic equipment
CN113591758A (en) Human behavior recognition model training method and device and computer equipment
CN113869205A (en) Object detection method and device, electronic equipment and storage medium
CN114943937A (en) Pedestrian re-identification method and device, storage medium and electronic equipment
CN114898266A (en) Training method, image processing method, device, electronic device and storage medium
CN114842411A (en) Group behavior identification method based on complementary space-time information modeling
CN116824641B (en) Gesture classification method, device, equipment and computer storage medium
CN113269307A (en) Neural network training method and target re-identification method
Pandiaraja et al. An analysis of abnormal event detection and person identification from surveillance cameras using motion vectors with deep learning
CN111310595A (en) Method and apparatus for generating information
CN115937742A (en) Video scene segmentation and visual task processing method, device, equipment and medium
CN113792569B (en) Object recognition method, device, electronic equipment and readable medium
CN115393755A (en) Visual target tracking method, device, equipment and storage medium
CN115131291A (en) Object counting model training method, device, equipment and storage medium
CN112883868B (en) Training method of weak supervision video motion positioning model based on relational modeling
Xu et al. Deep Neural Network‐Based Sports Marketing Video Detection Research
Liu et al. Integrated multiscale appearance features and motion information prediction network for anomaly detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination