WO2023185074A1 - A group behavior recognition method based on complementary spatiotemporal information modeling - Google Patents

A group behavior recognition method based on complementary spatiotemporal information modeling

Info

Publication number
WO2023185074A1
Authority
WO
WIPO (PCT)
Prior art keywords
modeling
individual
behavior recognition
group behavior
video
Prior art date
Application number
PCT/CN2022/136900
Other languages
English (en)
French (fr)
Inventor
韩鸣飞
王亚立
乔宇
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院
Publication of WO2023185074A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Definitions

  • the present invention relates to the technical field of deep learning, and more specifically, to a group behavior recognition method based on complementary spatiotemporal information modeling.
  • Deep learning originates from research on artificial neural networks; for example, a multi-layer perceptron containing multiple hidden layers is a deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, thereby discovering distributed feature representations of data. Deep machine learning methods are divided into supervised learning and unsupervised learning, and different learning models are established under different learning frameworks; for example, the convolutional neural network (CNN) is a deep machine learning model under supervised learning.
  • CNN convolutional neural network
  • Group behavior recognition is an important and pressing research problem with wide application in the field of computer vision. With the development of deep learning technology, the breadth and depth of group behavior recognition and understanding are also constantly expanding.
  • Group behavior recognition refers to the technology of judging and classifying the behavioral state of a target group in a video by analyzing the individual behaviors of the objects contained in the video.
  • spatiotemporal information modeling refers to performing information interaction simultaneously within and across frames, on the pixels of each video frame or on the feature points obtained by deep learning.
  • in the prior art, the paper (DOI: 10.1109/CVPR42600.2020.00092, Actor-Transformers for Group Activity Recognition) proposes using the Transformer model to model the similarity of different target objects in the video and to enhance their features, thereby improving the accuracy of group behavior recognition.
  • the purpose of the present invention is to overcome the above-mentioned shortcomings of the prior art and provide a group behavior recognition method based on complementary spatiotemporal information modeling.
  • the method includes the following steps:
  • the group behavior recognition model includes a first modeling branch and a second modeling branch; the first modeling branch passes the input individual features sequentially through a first spatial self-attention module and a first temporal self-attention module to obtain enhanced individual features, and then performs recognition on all the enhanced individual features to obtain a first group behavior recognition result;
  • the second modeling branch passes the input individual features sequentially through a second temporal self-attention module and a second spatial self-attention module to obtain enhanced individual features, and then performs recognition on all the enhanced individual features to obtain a second group behavior recognition result;
  • the group behavior recognition result is obtained by fusing the first group behavior recognition result and the second group behavior recognition result.
  • the advantage of the present invention is that, for the first time, two complementary spatiotemporal relationship modeling branches are used to solve the problem of incorrect recognition of group behavior categories, thereby enhancing the robustness of the model.
  • by designing contrastive loss functions between frames, between frames and videos, and between videos, the consistency of the feature expressions learned by the time-space and space-time modeling branches can be constrained, further improving group behavior recognition accuracy.
  • Figure 1 is a flow chart of a group behavior recognition method based on complementary spatiotemporal information modeling according to an embodiment of the present invention
  • Figure 2 is an overall architecture diagram of a group behavior recognition method based on complementary spatiotemporal information modeling according to an embodiment of the present invention
  • Figure 3 is a schematic diagram of a multi-level contrastive loss function according to an embodiment of the present invention.
  • Figure 4 is a schematic diagram of an application scenario according to an embodiment of the present invention.
  • any specific values are to be construed as illustrative only and not as limiting. Accordingly, other examples of the exemplary embodiments may have different values.
  • the provided group behavior recognition method based on complementary spatiotemporal information modeling includes the following steps.
  • Step S110 Construct a group behavior recognition model, which includes two modeling branches with dual spatiotemporal relationships.
  • the constructed group behavior recognition model includes a space-time modeling branch and a time-space modeling branch. The two branches model the temporal and spatial relationships separately using the self-attention mechanism and stack them in two different orders, namely space-time and time-space, building complementary spatiotemporal relationships with duality.
  • for the space-time branch, the input individual features pass sequentially through the spatial attention module and the temporal attention module to obtain enhanced individual features, which can be further divided into group features and individual behaviors; the group features are then used to obtain the group behavior recognition result.
  • for the time-space branch, the input individual features pass sequentially through the temporal attention module and the spatial attention module to obtain enhanced individual features, which are divided into group features and individual behaviors; the group behavior recognition result is then obtained based on the group features. The final group behavior recognition result is obtained by fusing the behavior recognition results of the two modeling branches. Dual spatiotemporal modeling is achieved through the two designed branches.
  • the individual features input to the two modeling branches are obtained as follows: for the input video, K video frames are extracted, with N individuals in each video frame, and the feature vectors X of all individuals, i.e., K×N×C-dimensional features, are obtained through a deep neural network and RoIAlign (region of interest alignment), where C denotes the dimension of the deep feature vector.
  • RoIAlign is a region feature aggregation method applied to the target bounding boxes, which will not be described in detail here.
  • the temporal self-attention module is used to model temporal relationships.
  • the specific temporal relationship modeling method is: for each individual's K frame features, i.e., K×C features, the self-attention mechanism is used to construct relationships among the K features, which are input to a feedforward neural network (FFN) to enhance feature expression, obtaining individual features enhanced by temporal relationship modeling.
  • FFN feedforward neural network
  • the spatial self-attention module is used for spatial relationship modeling.
  • the specific spatial relationship modeling method is: for the N individual features within each video frame, i.e., N×C features, relationships among the N features are constructed through the self-attention mechanism and input to the FFN network to enhance feature expression, obtaining individual features enhanced by spatial relationship modeling.
  • the specific process of dual space-time modeling includes:
  • Step S1: input the original individual features X sequentially into spatial relationship modeling and temporal relationship modeling to obtain all individual features X_ST after space-time modeling, which are input to a classifier to obtain the group behavior recognition result of the space-time branch.
  • Step S2: input the original individual features X sequentially into temporal relationship modeling and spatial relationship modeling to obtain all individual features X_TS after time-space modeling, which are input to a classifier to obtain the group behavior recognition result of the time-space branch.
  • Step S3 fuse the recognition results of the two modeling branches to obtain the final group behavior recognition result.
  • the two modeling branches have a dual complementary relationship.
  • the temporal self-attention modules on each branch can have the same or different structures, and the spatial self-attention modules can also have the same or slightly different structures; for example, the number of stacked self-attention modules can differ.
  • Step S120 Train a group behavior recognition model based on the video sample set.
  • a multi-level contrastive loss function is designed to constrain the consistency of the feature expressions of the two modeling branches.
  • the consistency of the feature expressions learned by the time-space and space-time modeling branches can be constrained. Considering that the features of the same individual under different spatiotemporal modelings should be as consistent as possible while differing from the features of other individuals, local-to-global multi-level feature constraints are designed for the two modeling branches.
  • Figure 3 is a schematic diagram of the contrastive loss functions, where Figure 3(a) is the individual contrastive loss function between frames, Figure 3(b) is the individual contrastive loss function between frames and videos, and Figure 3(c) is the individual contrastive loss function between videos.
  • the frame-to-frame contrastive loss function is defined with h denoting the cosine similarity CosSim and t denoting the t-th frame of the current video; for the n-th individual on the ST branch, this function can be used to maximize the consistency between its feature in the k-th frame and its feature in the k-th frame of the TS branch.
  • the frame-to-video contrastive loss function is defined with B denoting the batch size, i.e., the number of videos included in a batch of data, and N denoting the number of individuals; for the n-th individual on the ST branch, this function can be used to maximize the consistency between its feature in the k-th frame and its video-level feature on the TS branch.
  • the video-to-video contrastive loss function maximizes the consistency of the same individual's video-level features on the ST and TS branches.
  • the three loss functions can be fused into an overall loss function to maximize the distribution consistency of the two modeling branches between frames, between frames and videos, and between videos.
  • the overall loss function is designed as a direct addition or a weighted fusion of the three loss functions.
  • Step S130 Use the trained group behavior recognition model to perform behavior recognition on the target video.
  • the trained model can be used for group behavior recognition on actual target videos. During actual recognition, the multi-level contrastive loss function is not computed; the rest is basically the same as the training process and will not be described again here.
  • Servers can be computers, clusters, or cloud servers.
  • the client can be a smartphone, tablet, smart wearable device (smart watch, virtual reality glasses, virtual reality helmet, etc.), smart vehicle equipment, computer and other electronic devices.
  • the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, or the like.
  • the present invention can be applied to various scenarios, such as smart city security.
  • crowd behavior patterns in surveillance videos vary widely, and the method can more effectively identify high-risk crowd behaviors such as fights and illegal gatherings.
  • the present invention has low demand for training data and is more suitable for use in identifying rare high-risk group behaviors.
  • the behavior of the crowd within the monitoring range can be accurately judged to assist in the identification and warning of behaviors harmful to the city.
  • in autonomous driving, identifying the behavior patterns of crowds at traffic intersections is very important for safe autonomous driving behavior.
  • the present invention has the following advantages:
  • the present invention constrains the consistency of individual features between the two modeling branches at three levels: between frames, between frames and videos, and between videos; that is, a multi-level contrastive loss function is adopted. This design further improves recognition accuracy and model robustness and reduces dependence on training data.
  • the traditional contrastive loss function only focuses on the consistency between videos.
  • the invention may be a system, method and/or computer program product.
  • a computer program product may include a computer-readable storage medium having computer-readable program instructions thereon for causing a processor to implement various aspects of the invention.
  • Computer-readable storage media may be tangible devices that can retain and store instructions for use by an instruction execution device.
  • the computer-readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, and mechanical encoding devices, such as punched cards or raised structures in grooves on which instructions are stored, as well as any suitable combination of the above.
  • RAM random access memory
  • ROM read-only memory
  • EPROM erasable programmable read-only memory
  • SRAM Static Random Access Memory
  • CD-ROM Compact Disk Read Only Memory
  • DVD Digital Versatile Disk
  • Memory Stick
  • Computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber-optic cables), or electrical signals transmitted through wires.
  • Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to various computing/processing devices, or to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage on a computer-readable storage medium in the respective computing/processing device.
  • Computer program instructions for performing operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented languages such as Smalltalk, C++, and Python, and conventional procedural languages such as the "C" language or similar programming languages.
  • the computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • LAN local area network
  • WAN wide area network
  • in some embodiments, an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), can be personalized with state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the invention.
  • These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create an apparatus that implements the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions can also be stored in a computer-readable storage medium; the instructions cause the computer, programmable data processing apparatus, and/or other equipment to work in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other equipment, causing a series of operating steps to be performed on them to produce a computer-implemented process, so that the instructions executed on the computer, other programmable data processing apparatus, or other equipment implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowcharts or block diagrams may represent a module, segment, or portion of instructions that comprises one or more executable instructions for implementing the specified logical function(s).
  • in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures; for example, two consecutive blocks may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by a combination of specialized hardware and computer instructions. It is well known to those skilled in the art that implementation through hardware, implementation through software, and implementation through a combination of software and hardware are all equivalent.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

A group behavior recognition method based on complementary spatiotemporal information modeling. The method comprises: acquiring individual feature vectors from a target video; and inputting the individual feature vectors into a trained group behavior recognition model to obtain a group behavior recognition result, wherein the group behavior recognition model comprises a first modeling branch and a second modeling branch; the first modeling branch passes the input individual features sequentially through a first spatial self-attention module and a first temporal self-attention module to obtain enhanced individual features, and then performs recognition on all the enhanced individual features to obtain a first group behavior recognition result; the second modeling branch passes the input individual features sequentially through a second temporal self-attention module and a second spatial self-attention module to obtain enhanced individual features, and then performs recognition on all the enhanced individual features to obtain a second group behavior recognition result. The method improves group behavior recognition accuracy and enhances model robustness.

Description

A group behavior recognition method based on complementary spatiotemporal information modeling

Technical Field
The present invention relates to the technical field of deep learning, and more specifically, to a group behavior recognition method based on complementary spatiotemporal information modeling.
Background Art
The concept of deep learning originates from research on artificial neural networks; for example, a multi-layer perceptron containing multiple hidden layers is a deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, in order to discover distributed feature representations of data. Deep machine learning methods are divided into supervised learning and unsupervised learning, and the learning models established under different learning frameworks differ; for example, the convolutional neural network (CNN) is a machine learning model under deep supervised learning.
Group behavior recognition is an important and pressing research problem with wide application in the field of computer vision. With the development of deep learning technology, the breadth and depth of group behavior recognition and understanding are also constantly expanding. Group behavior recognition refers to the technology of judging and classifying the behavioral state of a target group in a video by analyzing the individual behaviors of the objects contained in the video. In video analysis and understanding, spatiotemporal information modeling refers to performing information interaction simultaneously within and across frames, on the pixels of each video frame or on the feature points obtained by deep learning.
In the prior art, the paper (DOI: 10.1109/CVPR42600.2020.00092, Actor-Transformers for Group Activity Recognition) proposes using the Transformer model to model the similarity of different target objects in a video and to enhance their features, thereby improving the accuracy of group behavior recognition. The basic idea of the paper is: 1) use a deep neural network to extract deep features from different frames of the video, and then use target bounding boxes to extract the deep features of all targets in the different frames; 2) pass the obtained deep features of all targets through an embedding operation and positional encoding, and input them into a self-attention module to obtain enhanced target individual features; 3) use classifiers to discriminate the different targets separately to obtain individual behavior categories, fuse the features of the different targets to obtain group deep features, and use a different classifier to discriminate them to obtain the group behavior category. Tests on the internationally used video group behavior recognition datasets (Volleyball Dataset, Collective Dataset, and NBA Dataset) show improved accuracy of group behavior recognition in videos. However, existing video group behavior recognition considers only one ordering of spatiotemporal modeling, ignoring that spatiotemporal modeling can be realized in two orderings, and ignoring that the two modeling schemes are strongly complementary.
Summary of the Invention
The purpose of the present invention is to overcome the above-mentioned defects of the prior art and to provide a group behavior recognition method based on complementary spatiotemporal information modeling. The method includes the following steps:
acquiring individual feature vectors from a target video;
inputting the individual feature vectors into a trained group behavior recognition model to obtain a group behavior recognition result, wherein the group behavior recognition model includes a first modeling branch and a second modeling branch; the first modeling branch passes the input individual features sequentially through a first spatial self-attention module and a first temporal self-attention module to obtain enhanced individual features, and then performs recognition on all the enhanced individual features to obtain a first group behavior recognition result; the second modeling branch passes the input individual features sequentially through a second temporal self-attention module and a second spatial self-attention module to obtain enhanced individual features, and then performs recognition on all the enhanced individual features to obtain a second group behavior recognition result; and the group behavior recognition result is obtained by fusing the first group behavior recognition result and the second group behavior recognition result.
Compared with the prior art, the advantage of the present invention is that, for the first time, two complementary spatiotemporal relationship modelings are used to solve the problem of incorrect recognition of group behavior categories, enhancing model robustness. In addition, by designing contrastive loss functions between frames, between frames and videos, and between videos, the consistency of the feature expressions learned by the time-space and space-time modeling branches can be constrained, further improving group behavior recognition accuracy.
Other features and advantages of the present invention will become clear from the following detailed description of exemplary embodiments of the present invention with reference to the accompanying drawings.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the present invention and, together with the description, serve to explain the principles of the present invention.
Figure 1 is a flowchart of a group behavior recognition method based on complementary spatiotemporal information modeling according to an embodiment of the present invention;
Figure 2 is an overall architecture diagram of a group behavior recognition method based on complementary spatiotemporal information modeling according to an embodiment of the present invention;
Figure 3 is a schematic diagram of a multi-level contrastive loss function according to an embodiment of the present invention;
Figure 4 is a schematic diagram of an application scenario according to an embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present invention.
The following description of at least one exemplary embodiment is in fact merely illustrative and in no way serves as any limitation on the present invention or on its application or use.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail but, where appropriate, should be regarded as part of the specification.
In all examples shown and discussed here, any specific value should be interpreted as merely exemplary and not as a limitation. Therefore, other examples of the exemplary embodiments may have different values.
It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it need not be further discussed in subsequent figures.
Referring to Figure 1, the provided group behavior recognition method based on complementary spatiotemporal information modeling includes the following steps.
Step S110: construct a group behavior recognition model that includes two modeling branches with a dual spatiotemporal relationship.
As shown in Figure 2, the constructed group behavior recognition model includes a space-time modeling branch and a time-space modeling branch. The two branches model the temporal and spatial relationships separately using the self-attention mechanism, and stack them in two different orders, namely space-time and time-space, building complementary spatiotemporal relationships with duality.
For the space-time branch (or ST branch), the input individual features pass sequentially through the spatial attention module and the temporal attention module to obtain enhanced individual features, which can be further divided into group features and individual behaviors; the group features are then used to obtain the group behavior recognition result. For the time-space branch (or TS branch), the input individual features pass sequentially through the temporal attention module and the spatial attention module to obtain enhanced individual features, which are divided into group features and individual behaviors; the group behavior recognition result is then obtained based on the group features. The final group behavior recognition result is obtained by fusing the behavior recognition results of the two modeling branches. Dual spatiotemporal modeling is achieved through the two designed branches.
In one embodiment, the individual features input to the two modeling branches are obtained as follows: for the input video, K video frames are extracted, with N individuals in each video frame, and the feature vectors X of all individuals, i.e., K×N×C-dimensional features, are obtained through a deep neural network and RoIAlign (region of interest alignment), where C denotes the dimension of the deep feature vector. RoIAlign is a region feature aggregation method applied to the target bounding boxes and is not described in detail here.
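As a non-limiting illustration only, this feature extraction step can be sketched in PyTorch as follows; the backbone network, the projection layer proj, and the 7×7 pooling size are assumptions of this sketch rather than details fixed by the present disclosure, and torchvision's roi_align stands in for the RoIAlign step.

```python
import torch
from torchvision.ops import roi_align

def extract_individual_features(frames, boxes, backbone, proj):
    # frames: (K, 3, H, W) sampled video frames
    # boxes: list of K tensors, each (N, 4) with per-frame individual
    #        bounding boxes in (x1, y1, x2, y2) pixel coordinates
    K = frames.shape[0]
    fmap = backbone(frames)                      # (K, C', H', W') feature maps
    scale = fmap.shape[-1] / frames.shape[-1]    # image coords -> fmap coords
    pooled = roi_align(fmap, boxes, output_size=(7, 7),
                       spatial_scale=scale, aligned=True)
    X = proj(pooled.flatten(1))                  # project (K*N, C'*49) -> (K*N, C)
    return X.reshape(K, -1, X.shape[-1])         # (K, N, C) individual features X
```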
The temporal self-attention module is used for temporal relationship modeling; for example, the specific temporal relationship modeling method is: for each individual's K frame features, i.e., K×C features, relationships among the K features are constructed through the self-attention mechanism and input to a feedforward neural network (FFN) to enhance feature expression, obtaining individual features enhanced by temporal relationship modeling.
The spatial self-attention module is used for spatial relationship modeling; for example, the specific spatial relationship modeling method is: for the N individual features within each video frame, i.e., N×C features, relationships among the N features are constructed through the self-attention mechanism and input to the FFN network to enhance feature expression, obtaining individual features enhanced by spatial relationship modeling.
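By way of a hedged sketch, both relation-modeling operations can share one self-attention-plus-FFN block, applied over different axes of the K×N×C tensor; the head count, FFN width, and residual/normalization layout below are illustrative choices, not prescriptions of the present disclosure.

```python
import torch.nn as nn

class RelationBlock(nn.Module):
    # Self-attention + FFN over a set of feature vectors; used for both
    # temporal sets (the K frame features of one individual) and spatial
    # sets (the N individual features of one frame).
    def __init__(self, dim=1024, heads=8, ffn_mult=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_mult * dim), nn.ReLU(),
                                 nn.Linear(ffn_mult * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                  # x: (batch_of_sets, set_size, dim)
        a, _ = self.attn(x, x, x)          # relations within each set
        x = self.norm1(x + a)
        return self.norm2(x + self.ffn(x))

def temporal_attention(block, X):          # X: (K, N, C)
    # each individual's K frame features form one set
    return block(X.permute(1, 0, 2)).permute(1, 0, 2)

def spatial_attention(block, X):           # X: (K, N, C)
    # each frame's N individual features form one set
    return block(X)
```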
In an embodiment of the present invention, the specific process of dual spatiotemporal modeling includes:
Step S1: input the original individual features X sequentially into spatial relationship modeling and temporal relationship modeling to obtain all individual features X_ST after space-time relationship modeling, which are input to a classifier to obtain the group behavior recognition result of the space-time branch.
Step S2: input the original individual features X sequentially into temporal relationship modeling and spatial relationship modeling to obtain all individual features X_TS after time-space relationship modeling, which are input to a classifier to obtain the group behavior recognition result of the time-space branch.
Step S3: fuse the recognition results of the two modeling branches to obtain the final group behavior recognition result.
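Reusing RelationBlock and the helper functions from the sketch above, steps S1 to S3 could be assembled as follows; mean pooling into the group feature and score averaging as the fusion are one plausible reading, since the disclosure does not pin the fusion operator down.

```python
class DualSTModel(nn.Module):
    def __init__(self, dim=1024, num_classes=8):
        super().__init__()
        self.s_st, self.t_st = RelationBlock(dim), RelationBlock(dim)  # ST branch
        self.t_ts, self.s_ts = RelationBlock(dim), RelationBlock(dim)  # TS branch
        self.cls_st = nn.Linear(dim, num_classes)
        self.cls_ts = nn.Linear(dim, num_classes)

    def forward(self, X):                                  # X: (K, N, C)
        # Step S1: spatial then temporal modeling -> X_ST
        x_st = temporal_attention(self.t_st, spatial_attention(self.s_st, X))
        # Step S2: temporal then spatial modeling -> X_TS
        x_ts = spatial_attention(self.s_ts, temporal_attention(self.t_ts, X))
        # pool all enhanced individual features into a group feature per branch
        logits_st = self.cls_st(x_st.mean(dim=(0, 1)))
        logits_ts = self.cls_ts(x_ts.mean(dim=(0, 1)))
        # Step S3: fuse the recognition results of the two branches
        return (logits_st + logits_ts) / 2, x_st, x_ts
```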
It should be noted that, in embodiments of the present invention, the two modeling branches have a dual complementary relationship; the temporal self-attention modules on the branches may have the same or different structures, and the spatial self-attention modules may likewise have the same or slightly different structures. For example, the number of stacked self-attention modules may differ.
Step S120: train the group behavior recognition model based on a video sample set; during training, design a multi-level contrastive loss function to constrain the consistency of the feature expressions of the two modeling branches.
To further improve group behavior recognition accuracy, during model training the consistency of the feature expressions learned by the time-space and space-time modeling branches can be constrained. Considering that the features of the same individual under different spatiotemporal modelings should be as consistent as possible while differing from the features of other individuals, local-to-global multi-level feature constraints are designed for the two modeling branches.
For example, contrastive loss functions between frames, between frames and videos, and between videos are designed to constrain the consistency of the feature expressions learned by the two modeling branches. Figure 3 is a schematic diagram of the contrastive loss functions, where Figure 3(a) is the individual contrastive loss function between frames, Figure 3(b) is the individual contrastive loss function between frames and videos, and Figure 3(c) is the individual contrastive loss function between videos.
In one embodiment, the frame-to-frame contrastive loss function is set as:

$$\mathcal{L}_{frame}=-\log\frac{\exp\big(h(x^{ST}_{n,k},\,x^{TS}_{n,k})\big)}{\sum_{t=1}^{K}\exp\big(h(x^{ST}_{n,k},\,x^{TS}_{n,t})\big)}$$

where h denotes the cosine similarity CosSim, $x^{ST}_{n,k}$ denotes the feature of the n-th individual in the k-th frame on the ST branch, $x^{TS}_{n,t}$ denotes the feature of the n-th individual in the t-th frame on the TS branch, and t denotes the t-th frame of the current video. For the n-th individual on the ST branch, this function can be used to maximize the consistency between its feature in the k-th frame and its feature in the k-th frame of the TS branch.
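Under this reading of the loss, a compact sketch of the frame-to-frame term is given below; it assumes the negatives for frame k are the features of the same individual in the other frames on the TS branch, as the definition of t above suggests, and omits any temperature factor.

```python
import torch
import torch.nn.functional as F

def frame_contrastive_loss(x_st, x_ts):
    # x_st, x_ts: (K, N, C) enhanced features from the ST and TS branches
    K, N, _ = x_st.shape
    st = F.normalize(x_st, dim=-1).permute(1, 0, 2)   # (N, K, C)
    ts = F.normalize(x_ts, dim=-1).permute(1, 0, 2)
    sim = torch.bmm(st, ts.transpose(1, 2))           # (N, K, K) cosine sims
    targets = torch.arange(K).expand(N, K)            # positive pair: same frame k
    return F.cross_entropy(sim.reshape(N * K, K), targets.reshape(-1))
```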
In one embodiment, the frame-to-video contrastive loss function is:

$$\mathcal{L}_{fv}=-\log\frac{\exp\big(h(x^{ST}_{n,k},\,v^{TS}_{n})\big)}{\sum_{i=1}^{B\times N}\exp\big(h(x^{ST}_{n,k},\,v^{TS}_{i})\big)}$$

where $v^{TS}_{n}$ denotes the video-level feature of the n-th individual of the current video on the TS branch, obtained by max pooling over the K frame features of the n-th individual; $v^{TS}_{i}$ denotes the video-level feature of the i-th individual over all videos within the batch on the TS branch; B denotes the batch size, i.e., the number of videos contained in one batch of data; and N denotes the number of individuals. For the n-th individual on the ST branch, this function can be used to maximize the consistency between its feature in the k-th frame and its video-level feature on the TS branch.
In one embodiment, the video-to-video contrastive loss function is set as:

$$\mathcal{L}_{video}=-\log\frac{\exp\big(h(v^{ST}_{n},\,v^{TS}_{n})\big)}{\sum_{i=1}^{B\times N}\exp\big(h(v^{ST}_{n},\,v^{TS}_{i})\big)}$$

where $v^{ST}_{n}$ and $v^{TS}_{n}$ denote the video-level features of the n-th individual of the current video on the ST and TS branches, respectively. This function can be used to maximize the consistency of the same individual's video features across the ST and TS branches.
During model training, preferably, the three loss functions can be fused into an overall loss function, and the optimal parameters of the model are obtained with the goal of maximizing the distribution consistency of the two modeling branches between frames, between frames and videos, and between videos; for example, the overall loss function is designed as a direct addition or a weighted fusion of the three loss functions.
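A hedged sketch of such an overall objective, for the simplified single-video case B=1 and with illustrative unit weights, is shown below; it reuses frame_contrastive_loss from the previous sketch, follows the max-pooling description above for the video-level features, and adds an assumed classification term.

```python
def overall_loss(logits, labels, x_st, x_ts, w=(1.0, 1.0, 1.0)):
    # video-level features: max pooling over the K frames, as described above
    v_st = F.normalize(x_st.max(dim=0).values, dim=-1)   # (N, C)
    v_ts = F.normalize(x_ts.max(dim=0).values, dim=-1)
    targets = torch.arange(v_st.shape[0])
    # video-to-video term: same individual across branches is the positive
    l_video = F.cross_entropy(v_st @ v_ts.t(), targets)
    # frame-to-video term: k-th frame feature (ST) vs video feature (TS)
    st = F.normalize(x_st, dim=-1)                       # (K, N, C)
    sim_fv = torch.einsum('knc,mc->knm', st, v_ts)       # (K, N, N)
    l_fv = F.cross_entropy(sim_fv.reshape(-1, v_ts.shape[0]),
                           targets.repeat(x_st.shape[0]))
    l_frame = frame_contrastive_loss(x_st, x_ts)
    # weighted fusion of the three contrastive levels plus classification;
    # labels: (1,) long tensor with the group behavior category
    return F.cross_entropy(logits.unsqueeze(0), labels) \
        + w[0] * l_frame + w[1] * l_fv + w[2] * l_video
```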
It should be understood that, given the dual spatiotemporal modeling design, using one of the loss functions, or a fusion of two of them, as the overall loss function can also improve the recognition accuracy of the model to some extent. In addition, during model training, different parameters can be set according to training efficiency and model accuracy requirements, for example, K=3, N=12, C=1024.
Step S130: use the trained group behavior recognition model to perform behavior recognition on a target video.
The trained model can be used for group behavior recognition on actual target videos. During actual recognition, the multi-level contrastive loss function is not computed; the rest is basically the same as the training process and is not repeated here.
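For illustration, inference with the sketched components above reduces to a single forward pass with no loss computation; all names here are carried over from the earlier illustrative code.

```python
model = DualSTModel()
model.eval()
with torch.no_grad():
    X = extract_individual_features(frames, boxes, backbone, proj)  # (K, N, C)
    logits, _, _ = model(X)
    pred = logits.argmax().item()   # fused group behavior category
```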
The trained model obtained by the present invention can be deployed on a client or a server to recognize or analyze group behavior in target videos in a variety of scenarios, as shown in Figure 4. The server can be a computer, a cluster, or a cloud server. The client can be an electronic device such as a smartphone, a tablet, a smart wearable device (smart watch, virtual reality glasses, virtual reality helmet, etc.), a smart in-vehicle device, or a computer. The computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, or the like.
To further verify the effect of the present invention, experiments were conducted. The experimental results show that, on the Volleyball Dataset, Collective Dataset, and NBA Dataset, the present invention achieves the best recognition results compared with the prior art. Moreover, the dual spatiotemporal relationship modeling module proposed by the present invention can be combined with any group behavior recognition method based on spatiotemporal modeling and, combined with the feature constraints based on the multi-level contrastive loss function, can further improve the group behavior recognition accuracy of the algorithm and reduce its dependence on the amount of training data.
The present invention can be applied to a variety of scenarios, such as smart city security. Crowd behavior patterns in surveillance videos vary widely, and high-risk crowd behaviors such as fights and illegal gatherings can be identified more effectively. At the same time, the present invention has a low demand for training data and is better suited to recognizing rare high-risk group behaviors. For example, in urban intelligent security scenarios, the behavior of crowds within the monitored area can be accurately discriminated, assisting in the identification of, and alerts for, behaviors harmful to the city. In autonomous driving, discriminating the behavior patterns of crowds at traffic intersections is very important for safe autonomous driving behavior. As another example, in wildlife monitoring, large numbers of cameras are installed in key animal protection and wildlife research areas, and group behavior recognition technology is an important foundation for automating wildlife state detection.
In summary, compared with the prior art, the present invention has the following advantages:
1) For the first time, two complementary dual spatiotemporal modeling methods are used to model the behavior of individuals in the video, improving behavior recognition accuracy, whereas the prior art models the spatiotemporal relationship in only one way, either time-space or space-time, ignoring the complementarity of the two orderings of spatiotemporal modeling.
2) The present invention constrains the consistency of individual features between the two modeling branches at three levels: between frames, between frames and videos, and between videos, i.e., a multi-level contrastive loss function is adopted. This design further improves recognition accuracy and model robustness and reduces dependence on training data, whereas the traditional contrastive loss function focuses only on consistency between videos.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present invention.
The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD), memory stick, floppy disk, mechanical encoding devices such as a punched card or a raised structure in a groove on which instructions are stored, and any suitable combination of the above. A computer-readable storage medium, as used here, is not to be interpreted as a transient signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described here can be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device via a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
The computer program instructions for carrying out the operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, Python, and the like, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), is personalized by utilizing state information of the computer-readable program instructions, and the electronic circuit may execute the computer-readable program instructions so as to implement various aspects of the present invention.
Various aspects of the present invention are described here with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present invention. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, thereby producing a machine such that these instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce an apparatus that implements the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause the computer, the programmable data processing apparatus, and/or other devices to operate in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices, causing a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other devices to produce a computer-implemented process, so that the instructions executed on the computer, other programmable data processing apparatus, or other devices implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings show the possible architectures, functions, and operations of systems, methods, and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or part of an instruction, which contains one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are all equivalent.
The embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The choice of terms used here is intended to best explain the principles of the embodiments, their practical applications, or their technical improvements over the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed here. The scope of the present invention is defined by the appended claims.

Claims (10)

  1. A group behavior recognition method based on complementary spatiotemporal information modeling, comprising the following steps:
    acquiring individual feature vectors from a target video;
    inputting the individual feature vectors into a trained group behavior recognition model to obtain a group behavior recognition result, wherein the group behavior recognition model comprises a first modeling branch and a second modeling branch; the first modeling branch passes the input individual features sequentially through a first spatial self-attention module and a first temporal self-attention module to obtain enhanced individual features, and then performs recognition on all the enhanced individual features to obtain a first group behavior recognition result; the second modeling branch passes the input individual features sequentially through a second temporal self-attention module and a second spatial self-attention module to obtain enhanced individual features, and then performs recognition on all the enhanced individual features to obtain a second group behavior recognition result; and the group behavior recognition result is obtained by fusing the first group behavior recognition result and the second group behavior recognition result.
  2. The method according to claim 1, wherein the individual feature vectors are obtained through the following steps:
    for the target video, extracting K video frames, each video frame containing N individuals;
    passing the N individuals through a deep neural network and region-of-interest alignment (RoIAlign) to obtain the feature vectors of the N individuals.
  3. The method according to claim 1, wherein, in the process of training the group behavior recognition model, a frame-to-frame contrastive loss function, a frame-to-video contrastive loss function, and a video-to-video contrastive loss function are fused to constrain, at three levels, the consistency of features between the first modeling branch and the second modeling branch.
  4. The method according to claim 3, wherein the frame-to-frame contrastive loss function is set as:
    $$\mathcal{L}_{frame}=-\log\frac{\exp\big(h(x^{1}_{n,k},\,x^{2}_{n,k})\big)}{\sum_{t=1}^{K}\exp\big(h(x^{1}_{n,k},\,x^{2}_{n,t})\big)}$$
    where h denotes the cosine similarity CosSim, $x^{1}_{n,k}$ denotes the feature of the n-th individual in the k-th frame on the first modeling branch, $x^{2}_{n,k}$ denotes the feature of the n-th individual in the k-th frame on the second modeling branch, t denotes the frame index, and K denotes the number of frames.
  5. The method according to claim 3, wherein the frame-to-video contrastive loss function is set as:
    $$\mathcal{L}_{fv}=-\log\frac{\exp\big(h(x^{1}_{n,k},\,v^{2}_{n})\big)}{\sum_{i=1}^{B\times N}\exp\big(h(x^{1}_{n,k},\,v^{2}_{i})\big)}$$
    where $v^{2}_{n}$ denotes the video-level feature of the n-th individual on the second modeling branch, $v^{2}_{i}$ denotes the video-level feature of the i-th individual over all videos in a batch on the second modeling branch, B denotes the number of videos contained in a batch, N denotes the number of individuals, and $x^{1}_{n,k}$ denotes the feature of the n-th individual in the k-th frame on the first modeling branch.
  6. The method according to claim 3, wherein the video-to-video contrastive loss function is set as:
    $$\mathcal{L}_{video}=-\log\frac{\exp\big(h(v^{1}_{n},\,v^{2}_{n})\big)}{\sum_{i=1}^{B\times N}\exp\big(h(v^{1}_{n},\,v^{2}_{i})\big)}$$
    where $v^{2}_{n}$ and $v^{1}_{n}$ denote the video-level features of the n-th individual of the video on the second modeling branch and the first modeling branch, respectively.
  7. The method according to claim 1, wherein the first modeling branch and the second modeling branch have a dual complementary relationship, the first spatial self-attention module and the second spatial self-attention module have the same or different structures, and the first temporal self-attention module and the second temporal self-attention module have the same or different structures.
  8. The method according to claim 1, wherein the first temporal self-attention module and the second temporal self-attention module are used for temporal relationship modeling, in which, for the multi-frame features of each individual, the relationships among the multiple features are constructed through the self-attention mechanism and then input to a feedforward neural network to enhance feature expression, obtaining individual features enhanced by temporal relationship modeling; and the first spatial self-attention module and the second spatial self-attention module are used for spatial relationship modeling, in which, for the multiple individual features within each video frame, the relationships among the multiple features are constructed through the self-attention mechanism and input to a feedforward neural network to enhance feature expression, obtaining individual features enhanced by spatial relationship modeling.
  9. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8.
  10. A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 8.
PCT/CN2022/136900 2022-04-02 2022-12-06 Group behavior recognition method based on complementary spatiotemporal information modeling WO2023185074A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210342854.2 2022-04-02
CN202210342854.2A CN114842411A (zh) 2022-04-02 2022-04-02 Group behavior recognition method based on complementary spatiotemporal information modeling

Publications (1)

Publication Number Publication Date
WO2023185074A1 (zh)

Family

ID=82564688

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/136900 WO2023185074A1 (zh) 2022-04-02 2022-12-06 Group behavior recognition method based on complementary spatiotemporal information modeling

Country Status (2)

Country Link
CN (1) CN114842411A (zh)
WO (1) WO2023185074A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842411A (zh) * 2022-04-02 2022-08-02 深圳先进技术研究院 一种基于互补时空信息建模的群体行为识别方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523462A (zh) * 2020-04-22 2020-08-11 南京工程学院 Video sequence facial expression recognition system and method based on self-attention enhanced CNN
US11195024B1 (en) * 2020-07-10 2021-12-07 International Business Machines Corporation Context-aware action recognition by dual attention networks
CN112131943A (zh) * 2020-08-20 2020-12-25 深圳大学 Video behavior recognition method and system based on a dual attention model
CN113947714A (zh) * 2021-09-29 2022-01-18 广州市赋安电子科技有限公司 Multi-modal collaborative optimization method and system for video surveillance and remote sensing
CN114842411A (zh) * 2022-04-02 2022-08-02 深圳先进技术研究院 Group behavior recognition method based on complementary spatiotemporal information modeling

Also Published As

Publication number Publication date
CN114842411A (zh) 2022-08-02

Similar Documents

Publication Publication Date Title
Ding et al. Semantic segmentation with context encoding and multi-path decoding
Li et al. Segmenting objects in day and night: Edge-conditioned CNN for thermal image semantic segmentation
CN108304788B (zh) Face recognition method based on a deep neural network
Sixt et al. Rendergan: Generating realistic labeled data
Yuan et al. A gated recurrent network with dual classification assistance for smoke semantic segmentation
US20180114071A1 (en) Method for analysing media content
CN108960184B (zh) Pedestrian re-identification method based on a heterogeneous-part deep neural network
Masurekar et al. Real time object detection using YOLOv3
Rabiee et al. Crowd behavior representation: an attribute-based approach
Papaioannidis et al. Autonomous UAV safety by visual human crowd detection using multi-task deep neural networks
CN115223020B (zh) Image processing method and apparatus, device, storage medium, and computer program product
Jemilda et al. Moving object detection and tracking using genetic algorithm enabled extreme learning machine
Yu et al. Abnormal event detection using adversarial predictive coding for motion and appearance
Manssor et al. Real-time human detection in thermal infrared imaging at night using enhanced Tiny-yolov3 network
WO2023185074A1 (zh) Group behavior recognition method based on complementary spatiotemporal information modeling
Tao et al. CENet: A channel-enhanced spatiotemporal network with sufficient supervision information for recognizing industrial smoke emissions
CN110111365B (zh) Deep-learning-based training method and apparatus, and target tracking method and apparatus
Nida et al. Video augmentation technique for human action recognition using genetic algorithm
Hu et al. Deep learning for distinguishing computer generated images and natural images: A survey
Liu et al. Student behavior recognition from heterogeneous view perception in class based on 3-D multiscale residual dense network for the analysis of case teaching
Chen et al. Group Behavior Pattern Recognition Algorithm Based on Spatio‐Temporal Graph Convolutional Networks
CN111242114A (zh) Character recognition method and apparatus
She et al. Contrastive self-supervised representation learning using synthetic data
Saif et al. Aggressive action estimation: a comprehensive review on neural network based human segmentation and action recognition
Li et al. Efficient thermal infrared tracking with cross-modal compress distillation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22934892

Country of ref document: EP

Kind code of ref document: A1