CN111523421A - Multi-user behavior detection method and system based on deep learning and fusion of various interaction information


Info

Publication number
CN111523421A
Authority
CN
China
Prior art keywords
interaction
human
module
video
modeling
Prior art date
Legal status
Granted
Application number
CN202010289689.XA
Other languages
Chinese (zh)
Other versions
CN111523421B (en)
Inventor
汤佳俊
夏锦
牟芯志
庞博
卢策吾
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202010289689.XA
Publication of CN111523421A
Application granted
Publication of CN111523421B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A behavior detection network is trained by constructing a labeled video library as a sample set; the trained network then processes the video to be detected, and the behavior of the target person in each region is detected according to the final output vector. The invention fully accounts for the complexity of human behavior: in addition to a person's own motion, it fuses that person's interactions with other people, with objects and with long-term memory information, which effectively improves the accuracy of video behavior detection.

Description

Multi-user behavior detection method and system based on deep learning and fusion of various interaction information
Technical Field
The invention relates to a technology in the field of artificial-intelligence video recognition, in particular to a multi-person behavior detection method and system based on deep learning that fuses multiple kinds of interaction information.
Background
Computer vision aims to handle various visual tasks with computer programs and often involves multimedia such as images and videos. The convolutional neural network is a deep learning technique widely applied to computer vision tasks: by training the filter parameters of image convolution operations, it obtains deep, robust and fairly general representations. These representations are high-dimensional vectors or matrices and can be used for behavior detection or classification, that is, detecting where people appear in a video and judging the behavior of each person.
Existing behavior detection techniques generally detect human bounding boxes, extract a representation of the video through a three-dimensional convolutional neural network, extract each person's region representation from the video representation by linear interpolation according to the person's bounding box, and finally make the judgment from that region representation. The drawback of this approach is that it only considers the motion of the single person inside the bounding box and does not use the interaction information between that person and other people or objects, so more complex interactive behaviors, such as opening a door, watching television or talking with another person, cannot be detected accurately.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a multi-user behavior detection method and system based on deep learning and fusion of various interaction information.
The invention is realized by the following technical scheme:
the invention relates to a multi-person behavior detection method based on deep learning and fusion of multiple kinds of interaction information: a behavior detection network is trained by constructing a labeled video library as a sample set, the trained network processes the video to be detected, and the behavior of the target person in each region is detected according to the final output vector.
The labeled video library is obtained as follows: the videos in the sample set are labeled at equal intervals, the video size is normalized, and the videos are cut into segments according to the labeled frames, for example: for each labeled frame, taking that frame as the middle frame, the 32 frames before and after it are extracted to obtain a corresponding 64-frame segment.
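Purely as an illustration of this segmentation step, a minimal sketch follows; the helper name build_clip, the use of OpenCV for resizing, the 256 × 464 target size (taken from the embodiment below) and the clamping at video boundaries are all assumptions rather than part of the claimed method.

import numpy as np
import cv2  # OpenCV, used here only to resize frames

def build_clip(frames, key_index, length=64, height=256, width=464):
    # frames: decoded RGB frames of one long video; key_index: index of a labeled frame
    start = key_index - length // 2
    clip = []
    for i in range(start, start + length):
        i = min(max(i, 0), len(frames) - 1)            # clamp at the video boundaries (assumption)
        clip.append(cv2.resize(frames[i], (width, height)))
    return np.stack(clip).astype(np.float32)            # one 64 x 256 x 464 x 3 segment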
The content labeled at equal intervals comprises: the bounding box of each person in the frame and the behaviors that each person performs within the 1.5-second interval before and after the frame.
The bounding boxes are obtained with a range of mature image object detection algorithms such as, but not limited to, Faster R-CNN and YOLO; for each labeled frame, the bounding boxes and the categories of the various objects appearing in the frame are detected at the same time.
The behavior detection network comprises: a three-dimensional convolutional neural network for extracting the video representation, a representation extraction module with a memory pool, a multi-interaction-relationship modeling fusion network, three fully connected layers and a sigmoid regression layer, wherein: the three-dimensional convolutional neural network extracts the video representation from the input video clip and outputs it to the representation extraction module; the representation extraction module uses RoIAlign to linearly interpolate each bounding box region on the video representation and obtains the region representations of people and objects by pooling, and at the same time obtains the memory representation through the memory pool; the multi-interaction-relationship modeling fusion network performs modeling and fusion on the region representations of people and objects and the memory representation to obtain robust behavior representations; and the prediction probability of each category is obtained through the fully connected layers and the sigmoid regression layer.
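For orientation only, the following PyTorch-style sketch shows how these pieces could be wired together; every class and argument name here is invented for illustration, and the 1024-dimensional features and 80 output classes are taken from the embodiment rather than fixed by the invention.

import torch
import torch.nn as nn

class BehaviorDetectionNet(nn.Module):
    def __init__(self, backbone3d, feat_extractor, interaction_fusion,
                 feat_dim=1024, num_classes=80):
        super().__init__()
        self.backbone3d = backbone3d          # 3D CNN extracting the video representation
        self.feat_extractor = feat_extractor  # RoIAlign + pooling + memory pool
        self.fusion = interaction_fusion      # multi-interaction-relationship modeling fusion network
        self.head = nn.Sequential(            # two hidden layers and an output layer
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_classes))

    def forward(self, clip, person_boxes, object_boxes, clip_id):
        video_feat = self.backbone3d(clip)                          # video representation
        person_feat, object_feat, memory_feat = self.feat_extractor(
            video_feat, person_boxes, object_boxes, clip_id)        # region and memory representations
        behavior_feat = self.fusion(person_feat, object_feat, memory_feat)
        return torch.sigmoid(self.head(behavior_feat))              # per-person class probabilities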
The three-dimensional convolutional neural network adopts, but is not limited to, a commonly used video representation extraction network such as an I3D network, a SlowFast network or a C3D network.
Depending on the content of each bounding box region, the representation extraction module obtains either the region representation of a person or the region representation of an object.
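A possible sketch of this region extraction with torchvision's RoIAlign is given below; the temporal average pooling applied before the 2D RoIAlign, the 7 × 7 sampling size, the max pooling, and the assumed input width of 464 used to derive the spatial scale are illustrative choices, not requirements of the invention.

import torch
from torchvision.ops import roi_align

def region_features(video_feat, boxes, input_width=464, out_size=7):
    # video_feat: 1 x C x T x H x W backbone output; boxes: K x 4 boxes in input-image coordinates
    feat2d = video_feat.mean(dim=2)                        # average over time -> 1 x C x H x W
    scale = feat2d.shape[-1] / float(input_width)          # feature-map scale w.r.t. the input width
    rois = roi_align(feat2d, [boxes], output_size=out_size,
                     spatial_scale=scale, aligned=True)    # K x C x 7 x 7 interpolated regions
    return rois.amax(dim=(2, 3))                           # pool each region into one C-dim vector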
From the person region representations among the region representations of people and objects in each video segment, the memory pool obtains the memory representation by concatenating the person region representations of the historical segments of the current segment.
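The memory pool can be pictured as a small store keyed by video and segment index, as in the sketch below; the window of 30 historical segments, the fixed budget of 5 persons per segment and the zero-padding of missing segments are values taken from the embodiment and are used here only for illustration.

import torch

class MemoryPool:
    def __init__(self, feat_dim=2304, persons_per_clip=5, window=30):
        self.store = {}                       # (video_id, segment_index) -> person region features
        self.feat_dim = feat_dim
        self.persons_per_clip = persons_per_clip
        self.window = window

    def update(self, video_id, seg_idx, person_feat):
        self.store[(video_id, seg_idx)] = person_feat.detach()   # replace any older entry

    def read(self, video_id, seg_idx):
        chunks = []
        for i in range(seg_idx - self.window, seg_idx):
            feat = self.store.get((video_id, i))
            if feat is None:
                feat = torch.zeros(0, self.feat_dim)              # missing history -> zeros below
            keep = feat[:self.persons_per_clip]
            pad = torch.zeros(self.persons_per_clip - keep.size(0), self.feat_dim)
            chunks.append(torch.cat([keep, pad], dim=0))
        return torch.cat(chunks, dim=0)        # e.g. 30 x 5 = 150 rows of 2304-dimensional features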
The multi-interaction-relationship modeling fusion network comprises: two human-human interaction modeling modules that receive the person region representations, two person-object interaction modeling modules that receive the person region representations and the object region representations, and two person-memory interaction modeling modules that receive the person region representations and the memory representation, wherein: the first human-human interaction modeling module, the first person-object interaction modeling module, the first person-memory interaction modeling module, the second human-human interaction modeling module, the second person-object interaction modeling module and the second person-memory interaction modeling module are connected in sequence and pass on progressively enhanced person region representations; each interaction modeling module models one of the interaction relationships among human-human interaction, person-object interaction and person-memory interaction, fuses it into the person region representations and passes the result to the next module; the finally output person region representation comprehensively fuses the human-human, person-object and person-memory interaction relationships, i.e., it is the finally output robust behavior representation.
The human-human interaction is: the interaction between different persons in the same video segment.
The person-object interaction is: the interaction between a person and an object in the same video segment.
The person-memory interaction is: the interaction between a person in the current segment and the persons in longer-history neighboring segments.
The modeling means that:
f(Q, K) = softmax((Q·W_Q)(K·W_K1)^T / √d) · (K·W_K2) · W_O
wherein: Q and K are the two input representations, W_Q, W_K1, W_K2 and W_O are the weights of fully connected layers, and d is the dimension of K·W_K1.
According to the input representation K, the module handles different interaction relationships: K can be the person region representations, the object region representations or the memory representation, and the corresponding modeling module accordingly handles human-human interaction, person-object interaction or person-memory interaction and outputs a representation fused with that type of interaction information; when the six modules are connected in series, the output of the preceding modeling module is used as the Q input of the next one, so that several different interaction relationships are finally fused.
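A compact PyTorch-style sketch of one interaction modeling module and of the serial chain of six such modules follows; the exact attention form is inferred from the definitions of Q, K, W_Q, W_K1, W_K2, W_O and d given above, and the residual addition used here to fuse each interaction back into the person representations is an assumption.

import math
import torch
import torch.nn as nn

class InteractionBlock(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k1 = nn.Linear(dim, dim, bias=False)
        self.w_k2 = nn.Linear(dim, dim, bias=False)
        self.w_o = nn.Linear(dim, dim, bias=False)

    def forward(self, q, k):
        # q: person representations (N x d); k: person, object or memory representations (N' x d)
        k1 = self.w_k1(k)
        attn = torch.softmax(self.w_q(q) @ k1.t() / math.sqrt(k1.size(-1)), dim=-1)
        return self.w_o(attn @ self.w_k2(k))   # person representation enhanced with this interaction

class MultiInteractionFusion(nn.Module):
    # six blocks in series (person-person, person-object, person-memory, repeated twice);
    # each block's output is fed, via an assumed residual sum, as the Q of the next block
    def __init__(self, dim=1024):
        super().__init__()
        self.blocks = nn.ModuleList([InteractionBlock(dim) for _ in range(6)])

    def forward(self, person, objects, memory):
        keys = [person, objects, memory, person, objects, memory]
        q = person
        for block, k in zip(self.blocks, keys):
            q = q + block(q, k)                # fuse this interaction into the person representations
        return q                               # robust behavior representation, N x dim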
The three fully-connected layers include two hidden layers and an output layer.
The sigmoid regression layer comprises a sigmoid function and a cross-entropy loss function; the output vector of the output layer is passed through the sigmoid function to obtain the prediction probability of each category, and the cross-entropy loss function is used to train the whole network.
The training is as follows: the samples in the sample set, the corresponding object bounding boxes, and the person region representations of neighboring video segments stored in the memory pool of the representation extraction module are taken as the input of the behavior detection network; the network parameters are adjusted with the cross-entropy loss function combined with the back-propagation (BP) algorithm, and the person region representations of the current video segment are updated into the memory pool.
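A minimal sketch of one such training step is shown below, where model stands for the behavior detection network sketched earlier (returning sigmoid probabilities) and the choice of optimizer is an assumption; in the real system the optimizer would cover all of the trainable parameters, not only the classification head.

import torch
import torch.nn as nn

def train_step(model, optimizer, clip, person_boxes, object_boxes, clip_id, labels):
    # labels: N x C multi-hot targets for the N persons annotated in the clip
    probs = model(clip, person_boxes, object_boxes, clip_id)          # N x C sigmoid probabilities
    loss = nn.functional.binary_cross_entropy(probs, labels.float())  # cross-entropy loss
    optimizer.zero_grad()
    loss.backward()                                                   # back-propagation (BP) of the loss
    optimizer.step()                                                  # adjust the network parameters
    return loss.item()

# one assumed way to drive it:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)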
The processing of the video to be detected comprises: inputting the video to be detected into the object detection algorithm and the trained behavior detection network, and obtaining the final prediction probability of each behavior from the sigmoid regression layer.
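To make the final decision step concrete, a small sketch of turning the per-person probability matrix into behavior judgments follows; the 0.5 threshold is merely the value selected in the embodiment and the function name is invented.

import torch

def decide_behaviors(probs, threshold=0.5):
    # probs: N x C matrix, one row of class probabilities per detected person
    hits = probs > threshold                   # a behavior is judged to occur above the threshold
    return [torch.nonzero(row, as_tuple=True)[0].tolist() for row in hits]  # class indices per person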
Technical effects
The invention solves the technical problem of detecting the behavior of every person appearing in a long video, namely: for the people appearing in a given frame of the video, the bounding box of each person and the behaviors each person performs in a short time before and after that frame must be given. Compared with the prior art, the invention fully accounts for the complexity of human behavior: besides a person's own motion, it fuses that person's interactions with other people, with objects and with long-term memory information, which effectively improves the accuracy of video behavior detection.
Drawings
FIG. 1 is a flow chart of the network training of the present invention;
FIG. 2 is a flow chart of testing a video under test according to the present invention;
FIG. 3 is a schematic diagram of an interactive modeling module of the present invention;
in the figure: n represents the number of people in the video clip, and N' represents the number of interactive objects in the video clip, namely the number of regional representations of people or the number of regional representations of objects or the number of all people in the memory representations;
FIG. 4 is a schematic diagram of a multiple interaction relationship modeling fusion network according to the present invention;
in the figure, each small rectangle represents an interaction modeling module, the input at the left side is Q, the input at the lower side is K, and different interactions are modeled according to different K.
Detailed Description
The embodiment relates to a multi-person behavior detection system based on deep learning and fusion of multiple kinds of interaction information, which comprises: a training sample acquisition module, an object detection module and a behavior detection network module fusing multiple interactions, wherein: the samples from the training sample acquisition module and the object detection boxes from the object detection module serve as the input of the behavior detection network module; through training, the behavior detection network learns to model the region representations and memory representations of people and objects from their bounding box regions and performs multi-label classification on these representations; the object detection module detects the people and objects in the video to be tested, and the behavior detection network module then performs test inference according to the detection results to obtain the judgment of each person's behavior in the video.
The behavior detection network comprises: a three-dimensional convolutional neural network for extracting the video representation, a representation extraction module, a multi-interaction-relationship modeling fusion network, three fully connected layers and a sigmoid regression layer, wherein: the three-dimensional convolutional neural network extracts the video representation from the input video clip and outputs it to the representation extraction module; the representation extraction module uses RoIAlign to linearly interpolate each bounding box region on the video representation and obtains the region representations of people and objects by pooling, while the memory pool in the representation extraction module provides the memory representation; the multi-interaction-relationship modeling fusion network performs the corresponding modeling and fusion on the region representations of people and objects and the memory representation to obtain robust behavior representations; and the prediction probability of each category is obtained through the fully connected layers and the sigmoid regression layer.
As shown in fig. 1, the behavior detection network specifically implements training through the following steps:
step 1, initializing a three-dimensional convolution neural network, and initializing by using weights pre-trained on other video behavior classification data sets.
In this embodiment, a SlowFast network is adopted as the specific structure of the three-dimensional convolutional neural network; the network is first pre-trained on the Kinetics behavior classification dataset, and the pre-trained weights are used to initialize the three-dimensional convolutional structure. The remaining parameters of the behavior detection network are initialized with small random numbers.
Other scenarios may use certain behavioral classification datasets for pre-training, such as Kinetics, UCF, HMDB, etc.
Step 2, initializing a memory pool which is arranged in the representation extraction module and used for providing a long-term memory representation of a video clip: in the initial phase of training, initialization is performed using a vector of all zeros.
And 3, data processing and reading:
step 3.1: long videos of different scenes are collected and labeled at one second intervals. After a certain frame is labeled, labeling is performed on the next frame which is one second away from the certain frame, and so on. The content of the annotation includes the bounding box of all people on the image of the frame and the behavior category of each person occurring within 1.5 seconds before and after the frame. In this embodiment, an ava (atomic Visual action) data set is used as the data set for verifying the validity of the method of the present invention.
Step 3.2: for each labeled frame, an object detection algorithm is run on the frame to detect the common object classes appearing in it, excluding people. In this embodiment, the Faster R-CNN algorithm is adopted as the object detection module.
Step 3.3: for each frame with labels, a video clip of 64 frames before and after the frame is extracted and normalized to 256 × 464 (height × width), and the video clip input into the behavior detection network is a tensor of 64 × 256 × 464 × 3, where 3 is an RGB color channel.
Step 3.4: and randomly disordering all video clips in the sample set to increase the randomness during training. The sample set contains multiple long videos, so the video segments used in training may come from different long videos. A video segment is randomly drawn from the sample set for training in each iteration.
Step 3.5: and inputting the video clip and the boundary box of the person on the corresponding intermediate frame, the behavior category of the person and the boundary box of the object detected by the detection algorithm into the behavior detection network.
Step 4, training iteration:
step 4.1: inputting the video clips randomly selected from the sample set into a three-dimensional convolution network to obtain a tensor characterized by 16 multiplied by 29 multiplied by 2304 (height multiplied by width multiplied by depth) of the whole video clip; and (3) interpolating on the representation of the video segment to obtain a tensor of 7 multiplied by 2304 by using a representation extraction module according to the bounding box of the people and the objects, and further pooling to obtain a vector with 2304 dimensions representing the region representation of each people or the objects.
Step 4.2: the person region representations of the current segment obtained in step 4.1 are updated into the memory pool, with a check: if the memory pool holds no representation for this segment, it is stored directly; otherwise the old representation of this segment in the memory pool is deleted and replaced by the representation extracted in this iteration.
Step 4.3: the representations of historical segments are read from the memory pool to form the memory representation: the memory pool holds person region representations of video clips from different long videos; the person region representations of the 30 video clips that belong to the same long video as the current clip and lie within the 30 seconds before it are read from the memory pool, and all of them are concatenated to form the memory representation of the segment. With 5 person region representations per video clip, 30 × 5 = 150, so the memory representation is a 150 × 2304 tensor.
Step 4.4: the region representations of people, the region representations of objects and the memory representation are input into the multi-interaction modeling fusion network; each module in the network models a different interaction, as shown in fig. 3, specifically:
f(Q, K) = softmax((Q·W_Q)(K·W_K1)^T / √d) · (K·W_K2) · W_O
wherein: Q and K are the two input representations, W_Q, W_K1, W_K2 and W_O are the weights of fully connected layers with dimensions of 1024 × 1024, and d is the dimension of K·W_K1. Each type of representation input into this structure is first reduced to 1024 dimensions by a fully connected layer, and the multi-interaction-relationship modeling fusion network finally outputs the behavior representations of the multiple persons, an N × 1024 tensor, where N is the number of persons in the segment.
Step 4.5: the behavior representations of the multiple persons obtained in step 4.4 are input into the three fully connected layers, namely two hidden layers and an output layer, and the value of the loss function is obtained through the sigmoid regression layer; the weights of the two hidden layers are 1024 × 1024 and the weight of the output layer is 1024 × C, where C is the total number of behavior categories (80 for the AVA dataset). The whole behavior detection network is optimized with the BP algorithm according to the loss function.
The optimized parameters comprise: the parameters of the three-dimensional convolutional neural network, the parameters of each interaction modeling module in the multi-interaction-relationship modeling fusion network, and the parameters of the three fully connected layers.
Step 5: when the optimization of step 4.5 reaches the maximum number of iterations, training terminates; otherwise the procedure returns to step 4.1 to continue the training iterations.
As shown in fig. 2, the test inference comprises the following steps:
step i: and acquiring the video to be detected.
Step ii: segmenting and normalizing the video to be detected: video segments containing 64 frames are continuously extracted from the video to be tested, each segment is normalized to 256 × 464 (height × width), and the starting time interval between the next segment and the previous segment is one second. And (5) sequentially inputting the video clips into the behavior detection network trained in the step 5 according to the time sequence.
Step iii, performing test inference on the video to be tested, specifically comprising:
step a: and reading the processed video segment from the data input module, and running an object detection algorithm on an intermediate frame of the segment to detect people and common objects appearing in the frame. In this embodiment, the fast R-convolutional neural network algorithm is adopted as the object detection module in this embodiment.
Step b: the video segment is input into the three-dimensional convolutional network to obtain its representation, a 16 × 29 × 2304 (height × width × depth) tensor. According to the bounding boxes of people and objects, the representation extraction module uses RoIAlign to interpolate on this representation to obtain a 7 × 2304 tensor per box, which is further pooled into a 2304-dimensional vector as the region representation of each person or object.
Step c: and saving the regional characterization of the people in the segment into a memory pool.
Step d: the person region representations of the 30 video segments tested before this one are read from the memory pool and concatenated to form the memory representation of the current segment. If the representation of some earlier segment does not exist, or does not belong to the same video under test as the current segment, zero vectors are used instead. Assuming 5 person region representations per video segment, 30 × 5 = 150, so the memory representation is a 150 × 2304 tensor.
Step e: and inputting the region representation of the person, the region representation of the object and the memory representation into the multi-interaction relation modeling fusion network. The output obtains the behavior representation of each person in the segment, the dimensionality is N multiplied by 1024, and N represents the number of the persons in the segment.
Step f: the behavior representations pass through the three fully connected layers and the sigmoid function layer to obtain the probability that each person performs each behavior, an N × C matrix, where N is the number of persons in the segment and C is the number of behavior classes (80 in the AVA dataset). Each value in the matrix is a number between 0 and 1 representing the judged probability that a person performs a certain behavior in the segment; when the probability is greater than the threshold, that behavior is judged to occur, otherwise it is not. Several behaviors may occur for the same person at the same time.
Step iv: and when the last segment of the video to be tested is processed, ending or processing other videos to be tested, or returning to the step 3 to continue processing the next video segment.
In this embodiment, verification is performed on the validation videos of the AVA dataset, and the performance evaluation data of the method under the different thresholds of test step f are shown in Table 1. The evaluation criterion is that a detection result is considered correct when its behavior class matches a labeled behavior class and the IoU (Intersection over Union) of the two bounding boxes is greater than 0.5; the IoU is calculated as the ratio of the area of the intersection of the two boxes to the area of their union.
TABLE 1 Test performance evaluation data
Threshold    0.3       0.4       0.5       0.6
Recall       42.45%    33.85%    26.76%    20.82%
Precision    31.61%    42.94%    56.63%    72.84%
The threshold in Table 1 means that a behavior is judged to occur when its predicted probability, obtained while detecting the video under test, is greater than the threshold. The quality of this kind of detection task is judged by two indicators, recall and precision:
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
wherein TP is the number of true positives, FN is the number of false negatives and FP is the number of false positives; recall is thus the proportion of the samples that actually contain a certain behavior which are predicted to contain it, and precision is the proportion of the samples predicted to contain a certain behavior that actually do contain it.
Considering recall and precision together and combining the analysis of the visual results, the threshold selected in this embodiment is 0.5. Under this threshold, the detection performance on all 80 classes of the AVA validation videos is: recall 26.76%, precision 42.94%. The detection performance on the 10 most common classes of the AVA validation videos is: recall 63.64%, precision 76.51%, that is: among 10000 persons whose behaviors are to be detected, the bounding boxes and behaviors of 6364 of them are correctly detected; and among 10000 detections, 7651 have the correct bounding box and behavior.
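For reference, a short sketch of the evaluation rule described above is given here: a detection counts as a true positive when its class matches a label and the box IoU exceeds 0.5, and recall and precision then follow from the TP, FN and FP counts; the corner-coordinate box format and the helper names are assumptions.

def iou(box_a, box_b):
    # boxes given as (x1, y1, x2, y2) corner coordinates
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0   # intersection over union

def recall_precision(tp, fn, fp):
    recall = tp / (tp + fn) if tp + fn else 0.0       # share of labeled behaviors that were detected
    precision = tp / (tp + fp) if tp + fp else 0.0    # share of detections that were correct
    return recall, precision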
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (11)

1. A multi-person behavior detection method based on deep learning and fusion of multiple kinds of interaction information, characterized in that a behavior detection network is trained by constructing a labeled video library as a sample set, the trained network processes the video to be detected, and the behavior of the target person in each region is detected according to the final output vector;
the behavior detection network comprises: a three-dimensional convolutional neural network for extracting the video representation, a representation extraction module with a memory pool, a multi-interaction-relationship modeling fusion network, three fully connected layers and a sigmoid regression layer, wherein: the three-dimensional convolutional neural network extracts the video representation from the input video clip and outputs it to the representation extraction module; the representation extraction module uses RoIAlign to linearly interpolate each bounding box region on the video representation and obtains the region representations of people and objects by pooling, and at the same time obtains the memory representation through the memory pool; the multi-interaction-relationship modeling fusion network performs modeling and fusion on the region representations of people and objects and the memory representation to obtain robust behavior representations; and the prediction probability of each category is obtained through the fully connected layers and the sigmoid regression layer.
2. The method of claim 1, wherein the tagged video library is obtained by: and after the videos in the sample set are labeled at equal intervals, normalizing the sizes of the videos, and cutting the videos into a plurality of segments according to each labeled frame.
3. The method of claim 1, wherein said equally spaced annotations comprise: the bounding box for each person in the frame and the behavior that each person has individually occurred during the 1.5 second time interval before and after the frame.
4. The method of claim 1, wherein the three-dimensional convolutional neural network is selected from the group consisting of an I3D network, a SlowFast network, and a C3D network.
5. The method of claim 1, wherein, from the person region representations among the region representations of people and objects in each video segment, the memory pool obtains the memory representation by concatenating the person region representations of the historical segments of the current segment.
6. The method of claim 1, wherein the multi-interaction-relationship modeling fusion network comprises: two human-human interaction modeling modules that receive the person region representations, two person-object interaction modeling modules that receive the person region representations and the object region representations, and two person-memory interaction modeling modules that receive the person region representations and the memory representation, wherein: the first human-human interaction modeling module, the first person-object interaction modeling module, the first person-memory interaction modeling module, the second human-human interaction modeling module, the second person-object interaction modeling module and the second person-memory interaction modeling module are connected in sequence and pass on progressively enhanced person region representations; each interaction modeling module models one of the interaction relationships among human-human interaction, person-object interaction and person-memory interaction, fuses it into the person region representations and passes the result to the next module; the finally output person region representation comprehensively fuses the human-human, person-object and person-memory interaction relationships, i.e., it is the finally output robust behavior representation;
the human-human interaction is: the interaction between different persons in the same video segment;
the person-object interaction is: the interaction between a person and an object in the same video segment;
the person-memory interaction is: the interaction between a person in the current segment and the persons in longer-history neighboring segments.
7. The method of claim 6, wherein said modeling is by:
f(Q, K) = softmax((Q·W_Q)(K·W_K1)^T / √d) · (K·W_K2) · W_O
wherein: Q and K are the two input representations, W_Q, W_K1, W_K2 and W_O are the weights of fully connected layers, and d is the dimension of K·W_K1;
according to the input representation K, the module handles different interaction relationships: K can be the person region representations, the object region representations or the memory representation, and the corresponding modeling module accordingly handles human-human interaction, person-object interaction or person-memory interaction and outputs a representation fused with that type of interaction information; when the six modules are connected in series, the output of the preceding modeling module is used as the Q input of the next one, so that several different interaction relationships are finally fused.
8. The method of claim 1, wherein the three fully-connected layers include two hidden layers and an output layer.
9. The method as claimed in claim 1, wherein the sigmoid regression layer comprises a sigmoid function and a cross entropy loss function, the output vector of the output layer can obtain the prediction probability of each category through the sigmoid layer, and the cross entropy loss function is used for training the whole network.
10. The method of claim 1, wherein the training is: taking the samples in the sample set, the corresponding object boundary frames and the human region representation of the adjacent video segments in the memory pool arranged in the representation extraction module as the input of the behavior detection network, adjusting network parameters by adopting a cross entropy loss function and combining a back propagation BP algorithm, and updating the human region representation in the video segments into the memory pool.
11. A multi-person behavior detection system according to the method of any one of claims 1 to 10, comprising: a training sample acquisition module, an object detection module and a behavior detection network module fusing multiple interactions, wherein: the samples from the training sample acquisition module and the object detection boxes from the object detection module serve as the input of the behavior detection network module; through training, the behavior detection network learns to model the region representations and memory representations of people and objects from their bounding box regions and performs multi-label classification on these representations; the object detection module detects the people and objects in the video to be tested, and the behavior detection network module then performs test inference according to the detection results to obtain the judgment of each person's behavior in the video.
CN202010289689.XA 2020-04-14 2020-04-14 Multi-person behavior detection method and system based on deep learning fusion of various interaction information Active CN111523421B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010289689.XA CN111523421B (en) 2020-04-14 2020-04-14 Multi-person behavior detection method and system based on deep learning fusion of various interaction information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010289689.XA CN111523421B (en) 2020-04-14 2020-04-14 Multi-person behavior detection method and system based on deep learning fusion of various interaction information

Publications (2)

Publication Number Publication Date
CN111523421A true CN111523421A (en) 2020-08-11
CN111523421B CN111523421B (en) 2023-05-19

Family

ID=71902656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010289689.XA Active CN111523421B (en) 2020-04-14 2020-04-14 Multi-person behavior detection method and system based on deep learning fusion of various interaction information

Country Status (1)

Country Link
CN (1) CN111523421B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183313A (en) * 2020-09-27 2021-01-05 武汉大学 SlowFast-based power operation field action identification method
CN112861848A (en) * 2020-12-18 2021-05-28 上海交通大学 Visual relation detection method and system based on known action conditions
CN113378676A (en) * 2021-06-01 2021-09-10 上海大学 Method for detecting figure interaction in image based on multi-feature fusion
CN114359791A (en) * 2021-12-16 2022-04-15 北京信智文科技有限公司 Group macaque appetite detection method based on Yolo v5 network and SlowFast network
CN114764899A (en) * 2022-04-12 2022-07-19 华南理工大学 Method for predicting next interactive object based on transform first visual angle

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230017072A1 (en) * 2021-07-08 2023-01-19 Google Llc Systems And Methods For Improved Video Understanding


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018137357A1 (en) * 2017-01-24 2018-08-02 北京大学 Target detection performance optimization method
CN107506712A (en) * 2017-08-15 2017-12-22 成都考拉悠然科技有限公司 Method for distinguishing is known in a kind of human behavior based on 3D depth convolutional networks
CN109409307A (en) * 2018-11-02 2019-03-01 深圳龙岗智能视听研究院 A kind of Online Video behavioral value system and method based on space-time contextual analysis
CN110135319A (en) * 2019-05-09 2019-08-16 广州大学 A kind of anomaly detection method and its system
CN110378233A (en) * 2019-06-20 2019-10-25 上海交通大学 A kind of double branch's method for detecting abnormality based on crowd behaviour priori knowledge
CN110598598A (en) * 2019-08-30 2019-12-20 西安理工大学 Double-current convolution neural network human behavior identification method based on finite sample set

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
魏鹏 (Wei Peng): "Two-person interaction behavior recognition based on fusion of RGB and depth information", Master's thesis, Liaoning Shihua University *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183313A (en) * 2020-09-27 2021-01-05 武汉大学 SlowFast-based power operation field action identification method
CN112861848A (en) * 2020-12-18 2021-05-28 上海交通大学 Visual relation detection method and system based on known action conditions
CN112861848B (en) * 2020-12-18 2022-04-08 上海交通大学 Visual relation detection method and system based on known action conditions
CN113378676A (en) * 2021-06-01 2021-09-10 上海大学 Method for detecting figure interaction in image based on multi-feature fusion
CN114359791A (en) * 2021-12-16 2022-04-15 北京信智文科技有限公司 Group macaque appetite detection method based on Yolo v5 network and SlowFast network
CN114359791B (en) * 2021-12-16 2023-08-01 北京信智文科技有限公司 Group macaque appetite detection method based on Yolo v5 network and SlowFast network
CN114764899A (en) * 2022-04-12 2022-07-19 华南理工大学 Method for predicting next interactive object based on transform first visual angle
CN114764899B (en) * 2022-04-12 2024-03-22 华南理工大学 Method for predicting next interaction object based on transformation first view angle

Also Published As

Publication number Publication date
CN111523421B (en) 2023-05-19

Similar Documents

Publication Publication Date Title
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
CN111523421B (en) Multi-person behavior detection method and system based on deep learning fusion of various interaction information
CN110728209B (en) Gesture recognition method and device, electronic equipment and storage medium
CN110717431B (en) Fine-grained visual question and answer method combined with multi-view attention mechanism
US11640714B2 (en) Video panoptic segmentation
CN110929622A (en) Video classification method, model training method, device, equipment and storage medium
US11544510B2 (en) System and method for multi-modal image classification
CN112861917B (en) Weak supervision target detection method based on image attribute learning
CN110378911B (en) Weak supervision image semantic segmentation method based on candidate region and neighborhood classifier
US11501110B2 (en) Descriptor learning method for the detection and location of objects in a video
CN112580458B (en) Facial expression recognition method, device, equipment and storage medium
CN111738054A (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
Gorokhovatskyi et al. Explanation of CNN image classifiers with hiding parts
CN112257665A (en) Image content recognition method, image recognition model training method, and medium
CN113240033B (en) Visual relation detection method and device based on scene graph high-order semantic structure
CN111985333B (en) Behavior detection method based on graph structure information interaction enhancement and electronic device
CN115410059B (en) Remote sensing image part supervision change detection method and device based on contrast loss
CN113537206A (en) Pushed data detection method and device, computer equipment and storage medium
CN116110005A (en) Crowd behavior attribute counting method, system and product
CN113610080B (en) Cross-modal perception-based sensitive image identification method, device, equipment and medium
CN115410131A (en) Method for intelligently classifying short videos
CN114022698A (en) Multi-tag behavior identification method and device based on binary tree structure
CN111242114B (en) Character recognition method and device
CN114239569A (en) Analysis method and device for evaluation text and computer readable storage medium
CN114255377A (en) Differential commodity detection and classification method for intelligent container

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant