CN114882523A - Task target detection method and system based on fragmented video information - Google Patents
Task target detection method and system based on fragmented video information
- Publication number
- CN114882523A (application number CN202210375278.1A)
- Authority
- CN
- China
- Prior art keywords
- frame
- video
- output
- video frame
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a task target detection method and system based on fragmented video information. A target person detection model is constructed from a valid frame sequence extraction module, a deep convolutional feature map extraction module, an optical flow information module, a warped feature module and a weight coefficient calculation module, and is used to detect a preset target person. The invention improves the weight assignment of the aggregated frames by means of a blur-prior method: a weight is computed for each frame of the image instead of assigning every frame the same weight, which effectively improves the accuracy and reliability of person detection.
Description
Technical Field
The invention relates to a task target detection method based on fragmented video information, and further relates to a system for implementing the task target detection method based on fragmented video information.
Background Art
Deep learning networks have made significant progress in object detection, and in recent years excellent image-based object detection algorithms have been transferred directly to video object detection. Compared with still-image object detection, video object detection is more challenging. Video detection scenes are usually complex: the person being detected may not cooperate and the acquired video information may be discontinuous, so the extracted image information is incomplete, exhibiting motion blur, defocus and rare poses, which greatly reduces detection accuracy.
Existing feature-aggregation methods compensate for misalignment between frames by aggregating the features of several adjacent frames; a key question is whether those frames should be treated equally. There are currently two ways to address this: one is to treat every frame equally and give all frames the same weight, the other is to use a lightweight network to learn the weights during training. Neither solution gives special consideration to the effect of blur.
In the present invention, we propose a person target detection method based on fragmented video information. A blur prior is used to improve the weight assignment of the aggregated frames. In particular, a blur mapping network is introduced to mark each pixel as blurred or non-blurred. Since the invention is only concerned with the degree of blur of the target rather than the background, a saliency detection network is used to focus on the target; the blur map is calibrated with the saliency map to obtain a calibrated blur map focused on the blurriness of the target, from which the weight of each frame is computed. At the cost of some additional computation, the method outperforms current state-of-the-art video object detection algorithms.
Summary of the Invention
The purpose of the invention is to provide a task target detection method and system based on fragmented video information that improves the weight assignment of aggregated frames through a blur-prior method and computes a weight for each frame of the image instead of giving every frame the same weight, thereby effectively improving the accuracy and reliability of person detection.
To achieve the above, a task target detection method based on fragmented video information executes steps S1 to S7 at a preset period to obtain a target person detection model, and then applies the target person detection model to complete detection of the target person:
S1. Acquire, in real time, a video containing the target person walking, convert it into a time-ordered sequence of video frames, and extract a preset number of consecutive video frames at a preset position in the sequence as the valid frame sequence; taking the video frame sequence as input and the valid frame sequence as output, construct the valid frame sequence extraction module;
S2. Taking the valid frame sequence output by the valid frame sequence extraction module as input and, based on a deep convolutional neural network, the deep convolutional feature map of each video frame in the valid frame sequence as output, construct the deep convolutional feature map extraction module;
S3. Taking the valid frame sequence output by the valid frame sequence extraction module as input and, based on an optical flow neural network, for each video frame pair formed by two video frames separated by a preset number of frames in the valid frame sequence, compute the optical flow parameters of that pair as the motion information of the target person; taking the optical flow parameters of each video frame pair in the valid frame sequence as output, construct the optical flow information module;
S4. Taking the deep convolutional feature maps output by the deep convolutional feature map extraction module and the optical flow parameters of each video frame pair in the valid frame sequence output by the optical flow information module as input and, based on a bilinear warping function, the warped features of each video frame pair as output, construct the warped feature module;
S5. Taking the valid frame sequence output by the valid frame sequence extraction module as input, obtain the blur feature and the saliency feature of each video frame in the valid frame sequence with a blur mapping network and a saliency detection network respectively, and, from the blur and saliency features and a softmax classification network, obtain the weight coefficient of each video frame; taking the weight coefficients of the video frames in the valid frame sequence as output, construct the weight coefficient calculation module;
S6. Taking the warped features of each video frame pair output by the warped feature module, the deep convolutional feature map of each video frame output by the deep convolutional feature map extraction module, and the weight coefficient of each video frame output by the weight coefficient calculation module as input, obtain the aggregated feature of the video frame group formed by the video frame pairs and, through a detection neural network, with the preset information of the target person as output, construct the detection network module;
S7. Taking the video frame sequence corresponding to the video of the target person walking acquired in real time as input and the preset information of the target person as output, construct the target person detection model to be trained from the valid frame sequence extraction module, the deep convolutional feature map extraction module, the optical flow information module, the warped feature module and the weight coefficient calculation module; train it with video samples containing the target person walking to obtain the target person detection model and complete detection of the target person.
As a preferred technical solution of the invention, in step S3 the valid frame sequence output by the valid frame sequence extraction module is taken as input; based on an optical flow neural network, for each video frame pair formed by two video frames separated by a preset number of frames in the valid frame sequence, the optical flow parameters of that pair are computed as the motion information of the target person, and with the optical flow parameters of each video frame pair as output the optical flow information module is constructed through the following specific steps:
S31: Define the t-th video frame I_t of the valid frame sequence as the reference frame and the (t-τ)-th frame I_{t-τ} and the (t+τ)-th frame I_{t+τ} as support frames, and input the reference frame I_t and the support frames I_{t-τ} and I_{t+τ} into the optical flow neural network;
S32: The optical flow neural network comprises convolutional layers and an expansion layer; the reference frame I_t and the support frames I_{t-τ} and I_{t+τ} pass through the contracting part formed by the convolutional layers of the optical flow neural network to obtain the feature maps corresponding to I_t, I_{t-τ} and I_{t+τ} respectively;
S33: The feature maps corresponding to I_t, I_{t-τ} and I_{t+τ} pass through the expansion layer of the optical flow neural network to obtain feature maps enlarged to the size of the original images;
S34: Optical flow is predicted from the feature maps of I_t, I_{t-τ} and I_{t+τ} obtained in step S33. Taking the reference frame I_t and the support frame I_{t-τ} as one video frame pair, and the reference frame I_t and the support frame I_{t+τ} as another, the optical flow parameters M_{t-τ→t} between the feature maps of I_{t-τ} and I_t, and M_{t+τ→t} between the feature maps of I_{t+τ} and I_t, are obtained as:
M_{t-τ→t} = FlowNet(I_{t-τ}, I_t)
M_{t+τ→t} = FlowNet(I_{t+τ}, I_t)
where M_{t-τ→t} is the optical flow parameter between the feature maps of the reference frame I_t and the support frame I_{t-τ}, t-τ→t denotes the correspondence from I_{t-τ} to I_t, M_{t+τ→t} is the optical flow parameter between the feature maps of the reference frame I_t and the support frame I_{t+τ}, t+τ→t denotes the correspondence from I_{t+τ} to I_t, and FlowNet denotes the optical flow neural network computation.
As a preferred technical solution of the invention, in step S4 the deep convolutional feature maps output by the deep convolutional feature map extraction module and the optical flow parameters of each video frame pair in the valid frame sequence output by the optical flow information module are taken as input; based on the bilinear warping function, with the warped features of each video frame pair as output, the warped feature module is constructed as follows:
f_{t-τ→t} = W(f_{t-τ}, M_{t-τ→t})
f_{t+τ→t} = W(f_{t+τ}, M_{t+τ→t})
where f_{t-τ→t} is the warped feature between the reference frame I_t and the support frame I_{t-τ}, f_{t+τ→t} is the warped feature between the reference frame I_t and the support frame I_{t+τ}, W denotes computation with the bilinear warping function, f_{t-τ} is the feature map of the support frame I_{t-τ} output by the deep convolutional feature map extraction module, and f_{t+τ} is the feature map of the support frame I_{t+τ} output by the deep convolutional feature map extraction module.
As a preferred technical solution of the invention, in step S5 the valid frame sequence output by the valid frame sequence extraction module is taken as input; based on a blur mapping network and a saliency detection network, the blur feature and the saliency feature of each video frame in the valid frame sequence are obtained; from the blur and saliency features and a softmax classification network the weight coefficient of each video frame is obtained, and with the weight coefficients of the video frames in the valid frame sequence as output the weight coefficient calculation module is constructed through the following specific steps:
S51: Input each video frame of the valid frame sequence into the blur mapping network and the saliency detection network respectively to obtain the blur feature and the saliency feature of each video frame;
S52: Obtain the corrected blur map M_blur-sali of each video frame by element-wise multiplication of the blur feature and the saliency feature obtained in step S51;
S53: Binarize the corrected blur map M_blur-sali of each video frame with a step function whose threshold is 0.5, the step function being u(m) = 1 if m ≥ 0.5 and u(m) = 0 otherwise, where m is the value of the corrected blur map M_blur-sali of a video frame and u(m) is the binarized value;
S54: For each video frame, sum all the values u(m) obtained in step S53 to obtain the blurriness parameter Vcb of that frame, and normalize the blurriness parameter Vcb of each frame to obtain VcbNorm_i, where Vcb_i denotes the blurriness parameter of video frame i, VcbNorm_i denotes its normalized value, and i takes the values {t-τ, t, t+τ};
S55: Input the normalized blurriness parameters VcbNorm_i obtained in step S54 into the softmax classification network to obtain the weight coefficients ω_{t-τ}, ω_t and ω_{t+τ} corresponding to the support frame I_{t-τ}, the reference frame I_t and the support frame I_{t+τ} respectively.
As a preferred technical solution of the invention, in step S6 the warped features of each video frame pair output by the warped feature module, the deep convolutional feature map of each video frame output by the deep convolutional feature map extraction module, and the weight coefficient of each video frame output by the weight coefficient calculation module are taken as input; the aggregated feature of the video frame group formed by the video frame pairs is obtained and, through the detection neural network, with the preset information of the target person as output, the detection network module is constructed through the following specific steps:
S61: From the warped features f_{t-τ→t} and f_{t+τ→t} output by the warped feature module, the feature map f_t of the reference frame I_t output by the deep convolutional feature map extraction module, and the weight coefficients ω_{t-τ}, ω_t and ω_{t+τ} output by the weight coefficient calculation module, obtain the aggregated feature J of the video frame group formed by the support frame I_{t-τ}, the reference frame I_t and the support frame I_{t+τ} according to
J = f_{t-τ→t} ω_{t-τ} + f_t ω_t + f_{t+τ→t} ω_{t+τ}
S62: Input the aggregated feature into the detection neural network to obtain the preset information of the target person.
The invention further provides a task target detection system based on fragmented video information, comprising:
one or more processors;
a memory storing instructions which, when executed by the one or more processors, cause the one or more processors to obtain a target person detection model through the following steps and then apply the target person detection model to complete detection of a preset target person:
S1. Acquire, in real time, a video containing the target person walking, convert it into a time-ordered sequence of video frames, and extract a preset number of consecutive video frames at a preset position in the sequence as the valid frame sequence; taking the video frame sequence as input and the valid frame sequence as output, construct the valid frame sequence extraction module;
S2. Taking the valid frame sequence output by the valid frame sequence extraction module as input and, based on a deep convolutional neural network, the deep convolutional feature map of each video frame in the valid frame sequence as output, construct the deep convolutional feature map extraction module;
S3. Taking the valid frame sequence output by the valid frame sequence extraction module as input and, based on an optical flow neural network, for each video frame pair formed by two video frames separated by a preset number of frames in the valid frame sequence, compute the optical flow parameters of that pair as the motion information of the target person; taking the optical flow parameters of each video frame pair in the valid frame sequence as output, construct the optical flow information module;
S4. Taking the deep convolutional feature maps output by the deep convolutional feature map extraction module and the optical flow parameters of each video frame pair in the valid frame sequence output by the optical flow information module as input and, based on a bilinear warping function, the warped features of each video frame pair as output, construct the warped feature module;
S5. Taking the valid frame sequence output by the valid frame sequence extraction module as input, obtain the blur feature and the saliency feature of each video frame in the valid frame sequence with a blur mapping network and a saliency detection network respectively, and, from the blur and saliency features and a softmax classification network, obtain the weight coefficient of each video frame; taking the weight coefficients of the video frames in the valid frame sequence as output, construct the weight coefficient calculation module;
S6. Taking the warped features of each video frame pair output by the warped feature module, the deep convolutional feature map of each video frame output by the deep convolutional feature map extraction module, and the weight coefficient of each video frame output by the weight coefficient calculation module as input, obtain the aggregated feature of the video frame group formed by the video frame pairs and, through a detection neural network, with the preset information of the target person as output, construct the detection network module;
S7. Taking the video frame sequence corresponding to the video of the target person walking acquired in real time as input and the preset information of the target person as output, construct the target person detection model to be trained from the valid frame sequence extraction module, the deep convolutional feature map extraction module, the optical flow information module, the warped feature module and the weight coefficient calculation module; train it with video samples containing the target person walking to obtain the target person detection model and complete detection of the target person.
The invention further provides a computer-readable medium storing software, wherein the readable medium comprises instructions executable by one or more computers which, when executed by the one or more computers, perform the operations of the task target detection method based on fragmented video information.
Beneficial effects: compared with the prior art, the advantages of the invention include:
1. The invention introduces an optical flow neural network to compute the optical flow between any two frames, attending to the relationship between preceding and following frames rather than only to the features of a single frame.
2. The invention proposes a new video object detection algorithm that focuses on the influence of blur on video object detection; frames in which the object appears clearly contribute more to the result than frames in which the object appears blurred.
3. Through the person target detection method based on fragmented video information, the invention helps detect persons when the video is discontinuous and improves the accuracy of video object detection.
Brief Description of the Drawings
Fig. 1 is a flowchart of the task target detection method based on fragmented video information provided by an embodiment of the invention;
Fig. 2 is a schematic diagram of the network framework for task target detection based on fragmented video information provided by an embodiment of the invention.
Detailed Description of the Embodiments
The invention is further described below with reference to the drawings. The following embodiments are only intended to illustrate the technical solution of the invention more clearly and do not limit the scope of protection of the invention.
Referring to Fig. 1 and Fig. 2, an embodiment of the invention provides a task target detection method based on fragmented video information, which executes steps S1 to S7 at a preset period to obtain a target person detection model and then applies the target person detection model to complete detection of a preset target person.
S1. Acquire, in real time, a video containing the target person walking, convert it into a time-ordered sequence of video frames, and extract a preset number of consecutive video frames at a preset position in the sequence as the valid frame sequence; taking the video frame sequence as input and the valid frame sequence as output, construct the valid frame sequence extraction module.
S2. Taking the valid frame sequence output by the valid frame sequence extraction module as input and, based on a deep convolutional neural network, the deep convolutional feature map of each video frame in the valid frame sequence as output, construct the deep convolutional feature map extraction module.
In one embodiment, the deep convolutional neural network is ResNet-101.
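For illustration only (this is not part of the claimed method), the feature extraction of step S2 with a ResNet-101 backbone could be sketched in PyTorch roughly as follows; the use of torchvision's pretrained model and the choice of the last convolutional stage as the output feature map are assumptions of this sketch, since the patent only names the backbone.

```python
import torch
import torchvision

# Minimal sketch of the deep convolutional feature map extraction module (S2).
# Assumption: features are taken from the last convolutional stage of a
# torchvision ResNet-101; the patent does not specify the exact layer.
class FeatureExtractor(torch.nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet101(weights="IMAGENET1K_V1")
        # Drop global pooling and the fully connected head, keep the conv stages.
        self.body = torch.nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, frames):            # frames: (N, 3, H, W), normalized RGB
        return self.body(frames)          # feature maps: (N, 2048, H/32, W/32)
```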
S3. Taking the valid frame sequence output by the valid frame sequence extraction module as input and, based on an optical flow neural network, for each video frame pair formed by two video frames separated by a preset number of frames in the valid frame sequence, compute the optical flow parameters of that pair as the motion information of the target person; taking the optical flow parameters of each video frame pair in the valid frame sequence as output, construct the optical flow information module.
Because the resolution of the optical flow parameters output by the optical flow neural network does not match the resolution of the deep convolutional feature maps, the optical flow parameters must be resized to match the feature maps.
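A minimal sketch of this resizing step is given below; the patent only requires that the sizes be matched, so rescaling the displacement values together with the spatial size is an assumption of this illustration.

```python
import torch
import torch.nn.functional as F

def resize_flow(flow, target_hw):
    """Resize a flow field (N, 2, H, W) to the feature-map resolution.

    Both the spatial size and the displacement values are rescaled, since a
    displacement of k pixels at full resolution corresponds to k * (w'/W)
    pixels horizontally (and k * (h'/H) vertically) after resizing.
    """
    n, _, h, w = flow.shape
    th, tw = target_hw
    resized = F.interpolate(flow, size=(th, tw), mode="bilinear", align_corners=False)
    scale = torch.tensor([tw / w, th / h], dtype=flow.dtype, device=flow.device)
    return resized * scale.view(1, 2, 1, 1)
```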
In step S3, the valid frame sequence output by the valid frame sequence extraction module is taken as input; based on the optical flow neural network, for each video frame pair formed by two video frames separated by a preset number of frames in the valid frame sequence, the optical flow parameters of that pair are computed as the motion information of the target person, and with the optical flow parameters of each video frame pair as output the optical flow information module is constructed through the following specific steps:
S31: Define the t-th video frame I_t of the valid frame sequence as the reference frame and the (t-τ)-th frame I_{t-τ} and the (t+τ)-th frame I_{t+τ} as support frames, and input the reference frame I_t and the support frames I_{t-τ} and I_{t+τ} into the optical flow neural network;
S32: The optical flow neural network comprises convolutional layers and an expansion layer; the reference frame I_t and the support frames I_{t-τ} and I_{t+τ} pass through the contracting part formed by the convolutional layers of the optical flow neural network to obtain the feature maps corresponding to I_t, I_{t-τ} and I_{t+τ} respectively;
S33: Because step S32 reduces the size of the feature maps, an expansion layer is needed to enlarge them back to the size of the original images. The feature maps corresponding to I_t, I_{t-τ} and I_{t+τ} pass through the expansion layer of the optical flow neural network to obtain feature maps enlarged to the original image size;
S34: The optical flow parameters use the temporal change of the pixels of the video frames in the valid frame sequence and the correlation between two video frames to find the correspondence that exists between them, and thereby compute the motion information of the target person.
Optical flow is predicted from the feature maps of I_t, I_{t-τ} and I_{t+τ} obtained in step S33. Taking the reference frame I_t and the support frame I_{t-τ} as one video frame pair, and the reference frame I_t and the support frame I_{t+τ} as another, the optical flow parameters M_{t-τ→t} between the feature maps of I_{t-τ} and I_t, and M_{t+τ→t} between the feature maps of I_{t+τ} and I_t, are obtained as:
M_{t-τ→t} = FlowNet(I_{t-τ}, I_t)
M_{t+τ→t} = FlowNet(I_{t+τ}, I_t)
where M_{t-τ→t} is the optical flow parameter between the feature maps of the reference frame I_t and the support frame I_{t-τ}, t-τ→t denotes the correspondence from I_{t-τ} to I_t, M_{t+τ→t} is the optical flow parameter between the feature maps of the reference frame I_t and the support frame I_{t+τ}, t+τ→t denotes the correspondence from I_{t+τ} to I_t, and FlowNet denotes the optical flow neural network computation.
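As a non-authoritative illustration of S31-S34, the pairwise flow computation can be wrapped as below; flow_net stands for any FlowNet-style network, and its interface (a concatenated image pair in, a two-channel flow field out) is an assumption rather than something fixed by the patent.

```python
import torch

def compute_flows(flow_net, frame_ref, frame_prev, frame_next):
    """Optical flow step of S3 (a sketch).

    flow_net is assumed to map a concatenated image pair (N, 6, H, W) to a
    flow field (N, 2, H, W); frame_ref is I_t, frame_prev is I_{t-tau},
    frame_next is I_{t+tau}.
    """
    # Flow from each support frame towards the reference frame I_t.
    flow_prev_to_ref = flow_net(torch.cat([frame_prev, frame_ref], dim=1))  # M_{t-tau -> t}
    flow_next_to_ref = flow_net(torch.cat([frame_next, frame_ref], dim=1))  # M_{t+tau -> t}
    return flow_prev_to_ref, flow_next_to_ref
```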
S4. Referring to Fig. 2, in which WARP denotes the bilinear warping function and aggregation denotes the warped features: taking the deep convolutional feature maps output by the deep convolutional feature map extraction module and the optical flow parameters of each video frame pair in the valid frame sequence output by the optical flow information module as input and, based on the bilinear warping function, the warped features of each video frame pair as output, construct the warped feature module.
In step S4, the deep convolutional feature maps output by the deep convolutional feature map extraction module and the optical flow parameters of each video frame pair in the valid frame sequence output by the optical flow information module are taken as input; based on the bilinear warping function, with the warped features of each video frame pair as output, the warped feature module is constructed as follows:
f_{t-τ→t} = W(f_{t-τ}, M_{t-τ→t})
f_{t+τ→t} = W(f_{t+τ}, M_{t+τ→t})
where f_{t-τ→t} is the warped feature between the reference frame I_t and the support frame I_{t-τ}, f_{t+τ→t} is the warped feature between the reference frame I_t and the support frame I_{t+τ}, W denotes computation with the bilinear warping function, f_{t-τ} is the feature map of the support frame I_{t-τ} output by the deep convolutional feature map extraction module, and f_{t+τ} is the feature map of the support frame I_{t+τ} output by the deep convolutional feature map extraction module.
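The bilinear warping function W can be illustrated with torch.nn.functional.grid_sample; the construction of the sampling grid and its normalization to [-1, 1] are implementation assumptions of this sketch, since the patent only states that a bilinear warping function is used.

```python
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Warp support-frame features towards the reference frame (W(f, M) in S4).

    feat: (N, C, H, W) feature map of the support frame.
    flow: (N, 2, H, W) flow for that support frame, already resized to the
          feature-map resolution.
    """
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]          # displaced x coordinates
    grid_y = ys.unsqueeze(0) + flow[:, 1]          # displaced y coordinates
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid_x / (w - 1) - 1.0
    grid_y = 2.0 * grid_y / (h - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)   # (N, H, W, 2)
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)
```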
S5. Taking the valid frame sequence output by the valid frame sequence extraction module as input, obtain the blur feature and the saliency feature of each video frame in the valid frame sequence with a blur mapping network and a saliency detection network respectively, and, from the blur and saliency features and a softmax classification network, obtain the weight coefficient of each video frame; taking the weight coefficients of the video frames in the valid frame sequence as output, construct the weight coefficient calculation module.
Here the blur mapping network is DBM and the saliency detection network is CSNet; the blur mapping network is used to obtain the degree of blur of a video frame, and the saliency detection network is used to suppress background interference in the image.
In step S5, the valid frame sequence output by the valid frame sequence extraction module is taken as input; based on the blur mapping network and the saliency detection network, the blur feature and the saliency feature of each video frame are obtained; from the blur and saliency features and the softmax classification network the weight coefficient of each video frame is obtained, and with the weight coefficients of the video frames in the valid frame sequence as output the weight coefficient calculation module is constructed through the following specific steps:
S51: Input each video frame of the valid frame sequence into the blur mapping network and the saliency detection network respectively to obtain the blur feature and the saliency feature of each video frame;
S52: Obtain the corrected blur map M_blur-sali of each video frame by element-wise multiplication of the blur feature and the saliency feature obtained in step S51;
S53: Binarize the corrected blur map M_blur-sali of each video frame with a step function whose threshold is 0.5, the step function being u(m) = 1 if m ≥ 0.5 and u(m) = 0 otherwise, where m is the value of the corrected blur map M_blur-sali of a video frame and u(m) is the binarized value;
S54: For each video frame, sum all the values u(m) obtained in step S53 to obtain the blurriness parameter Vcb of that frame, and normalize the blurriness parameter Vcb of each frame to obtain VcbNorm_i, where Vcb_i denotes the blurriness parameter of video frame i, VcbNorm_i denotes its normalized value, and i takes the values {t-τ, t, t+τ};
S55: Input the normalized blurriness parameters VcbNorm_i obtained in step S54 into the softmax classification network to obtain the weight coefficients ω_{t-τ}, ω_t and ω_{t+τ} corresponding to the support frame I_{t-τ}, the reference frame I_t and the support frame I_{t+τ} respectively.
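Steps S51-S55 can be sketched as follows, assuming that the blur mapping network (DBM) and the saliency detection network (CSNet) both output per-pixel maps in [0, 1] and that the normalization in S54 divides by the sum over the three frames; both the network interfaces and the exact normalization scheme are assumptions of this illustration.

```python
import torch

def frame_weights(blur_maps, sali_maps):
    """Weight coefficient steps S51-S55 (a sketch).

    blur_maps, sali_maps: (N, H, W) per-pixel outputs in [0, 1] for the N
    frames {I_{t-tau}, I_t, I_{t+tau}} from the blur mapping and saliency
    networks (assumed interfaces).
    Returns one softmax weight per frame, in input order.
    """
    corrected = blur_maps * sali_maps            # M_blur-sali (S52)
    binary = (corrected >= 0.5).float()          # step function, threshold 0.5 (S53)
    vcb = binary.sum(dim=(1, 2))                 # per-frame blurriness Vcb (S54)
    vcb_norm = vcb / (vcb.sum() + 1e-8)          # assumed normalization scheme
    return torch.softmax(vcb_norm, dim=0)        # omega_{t-tau}, omega_t, omega_{t+tau} (S55)
```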
S6. Taking the warped features of each video frame pair output by the warped feature module, the deep convolutional feature map of each video frame output by the deep convolutional feature map extraction module, and the weight coefficient of each video frame output by the weight coefficient calculation module as input, obtain the aggregated feature of the video frame group formed by the video frame pairs and, through a detection neural network, with the preset information of the target person as output, construct the detection network module.
In one embodiment, the detection neural network is Faster R-CNN.
In step S6, the warped features of each video frame pair output by the warped feature module, the deep convolutional feature map of each video frame output by the deep convolutional feature map extraction module, and the weight coefficient of each video frame output by the weight coefficient calculation module are taken as input; the aggregated feature of the video frame group formed by the video frame pairs is obtained and, through the detection neural network, with the preset information of the target person as output, the detection network module is constructed through the following specific steps:
S61: From the warped features f_{t-τ→t} and f_{t+τ→t} output by the warped feature module, the feature map f_t of the reference frame I_t output by the deep convolutional feature map extraction module, and the weight coefficients ω_{t-τ}, ω_t and ω_{t+τ} output by the weight coefficient calculation module, obtain the aggregated feature J of the video frame group formed by the support frame I_{t-τ}, the reference frame I_t and the support frame I_{t+τ} according to
J = f_{t-τ→t} ω_{t-τ} + f_t ω_t + f_{t+τ→t} ω_{t+τ}
S62: Input the aggregated feature into the detection neural network to obtain the preset information of the target person.
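Steps S61-S62 reduce to a weighted sum of feature maps followed by the detection head; in the sketch below, detector is an assumed callable standing in for the Faster R-CNN detection head operating on the aggregated feature map, not a specific library API.

```python
def detect_target(feat_prev_w, feat_ref, feat_next_w, weights, detector):
    """Aggregation and detection steps S61-S62 (a sketch).

    feat_prev_w, feat_next_w: support-frame features already warped to the
    reference frame (f_{t-tau->t}, f_{t+tau->t}); feat_ref: f_t.
    weights: tensor (omega_{t-tau}, omega_t, omega_{t+tau}) from the weight
    coefficient module.
    detector: assumed callable wrapping the detection head; returns person
    detections for the aggregated feature J.
    """
    w_prev, w_ref, w_next = weights[0], weights[1], weights[2]
    aggregated = feat_prev_w * w_prev + feat_ref * w_ref + feat_next_w * w_next  # J
    return detector(aggregated)
```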
S7. Taking the video frame sequence corresponding to the video of the target person walking acquired in real time as input and the preset information of the target person as output, construct the target person detection model to be trained from the valid frame sequence extraction module, the deep convolutional feature map extraction module, the optical flow information module, the warped feature module and the weight coefficient calculation module; train it with video samples containing the target person walking to obtain the target person detection model and complete detection of the target person.
An embodiment of the invention provides a task target detection system based on fragmented video information, comprising:
one or more processors;
a memory storing instructions which, when executed by the one or more processors, cause the one or more processors to obtain a target person detection model through the following steps and then apply the target person detection model to complete detection of a preset target person:
S1. Acquire, in real time, a video containing the target person walking, convert it into a time-ordered sequence of video frames, and extract a preset number of consecutive video frames at a preset position in the sequence as the valid frame sequence; taking the video frame sequence as input and the valid frame sequence as output, construct the valid frame sequence extraction module;
S2. Taking the valid frame sequence output by the valid frame sequence extraction module as input and, based on a deep convolutional neural network, the deep convolutional feature map of each video frame in the valid frame sequence as output, construct the deep convolutional feature map extraction module;
S3. Taking the valid frame sequence output by the valid frame sequence extraction module as input and, based on an optical flow neural network, for each video frame pair formed by two video frames separated by a preset number of frames in the valid frame sequence, compute the optical flow parameters of that pair as the motion information of the target person; taking the optical flow parameters of each video frame pair in the valid frame sequence as output, construct the optical flow information module;
S4. Taking the deep convolutional feature maps output by the deep convolutional feature map extraction module and the optical flow parameters of each video frame pair in the valid frame sequence output by the optical flow information module as input and, based on a bilinear warping function, the warped features of each video frame pair as output, construct the warped feature module;
S5. Taking the valid frame sequence output by the valid frame sequence extraction module as input, obtain the blur feature and the saliency feature of each video frame in the valid frame sequence with a blur mapping network and a saliency detection network respectively, and, from the blur and saliency features and a softmax classification network, obtain the weight coefficient of each video frame; taking the weight coefficients of the video frames in the valid frame sequence as output, construct the weight coefficient calculation module;
S6. Taking the warped features of each video frame pair output by the warped feature module, the deep convolutional feature map of each video frame output by the deep convolutional feature map extraction module, and the weight coefficient of each video frame output by the weight coefficient calculation module as input, obtain the aggregated feature of the video frame group formed by the video frame pairs and, through a detection neural network, with the preset information of the target person as output, construct the detection network module;
S7. Taking the video frame sequence corresponding to the video of the target person walking acquired in real time as input and the preset information of the target person as output, construct the target person detection model to be trained from the valid frame sequence extraction module, the deep convolutional feature map extraction module, the optical flow information module, the warped feature module and the weight coefficient calculation module; train it with video samples containing the target person walking to obtain the target person detection model and complete detection of the target person.
An embodiment of the invention provides a computer-readable medium storing software, the readable medium comprising instructions executable by one or more computers which, when executed by the one or more computers, perform the operations of the task target detection method based on fragmented video information.
The embodiments of the invention have been described in detail above with reference to the drawings, but the invention is not limited to these embodiments; various changes may be made within the knowledge of a person of ordinary skill in the art without departing from the spirit of the invention.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210375278.1A CN114882523B (en) | 2022-04-11 | 2022-04-11 | A mission target detection method and system based on fragmented video information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210375278.1A CN114882523B (en) | 2022-04-11 | 2022-04-11 | A mission target detection method and system based on fragmented video information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114882523A true CN114882523A (en) | 2022-08-09 |
CN114882523B CN114882523B (en) | 2024-11-05 |
Family
ID=82669897
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210375278.1A Active CN114882523B (en) | 2022-04-11 | 2022-04-11 | A mission target detection method and system based on fragmented video information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114882523B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111476314A (en) * | 2020-04-27 | 2020-07-31 | 中国科学院合肥物质科学研究院 | Fuzzy video detection method integrating optical flow algorithm and deep learning |
CN111814884A (en) * | 2020-07-10 | 2020-10-23 | 江南大学 | An upgrade method of target detection network model based on deformable convolution |
CN113239825A (en) * | 2021-05-19 | 2021-08-10 | 四川中烟工业有限责任公司 | High-precision tobacco beetle detection method in complex scene |
US20210327031A1 (en) * | 2020-04-15 | 2021-10-21 | Tsinghua Shenzhen International Graduate School | Video blind denoising method based on deep learning, computer device and computer-readable storage medium |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210327031A1 (en) * | 2020-04-15 | 2021-10-21 | Tsinghua Shenzhen International Graduate School | Video blind denoising method based on deep learning, computer device and computer-readable storage medium |
CN111476314A (en) * | 2020-04-27 | 2020-07-31 | 中国科学院合肥物质科学研究院 | Fuzzy video detection method integrating optical flow algorithm and deep learning |
CN111814884A (en) * | 2020-07-10 | 2020-10-23 | 江南大学 | An upgrade method of target detection network model based on deformable convolution |
CN113239825A (en) * | 2021-05-19 | 2021-08-10 | 四川中烟工业有限责任公司 | High-precision tobacco beetle detection method in complex scene |
Non-Patent Citations (1)
Title |
---|
- LI SEN et al.: "Video frame prediction model based on spatio-temporal modeling" (基于时空建模的视频帧预测模型), Internet of Things Technologies (物联网技术), No. 02, 20 February 2020 (2020-02-20) *
Also Published As
Publication number | Publication date |
---|---|
CN114882523B (en) | 2024-11-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111627044B (en) | Target tracking attack and defense method based on deep network | |
CN109583340B (en) | A video object detection method based on deep learning | |
CN112434655B (en) | A Gait Recognition Method Based on Adaptive Confidence Graph Convolutional Network | |
CN110807742B (en) | A low-light image enhancement method based on integrated network | |
CN107452015B (en) | A Target Tracking System with Redetection Mechanism | |
CN111047543B (en) | Image enhancement method, device and storage medium | |
CN107688829A (en) | A kind of identifying system and recognition methods based on SVMs | |
CN113361542A (en) | Local feature extraction method based on deep learning | |
CN111429485B (en) | Cross-modal filter tracking method based on adaptive regularization and high confidence update | |
CN106951826B (en) | Face detection method and device | |
CN105374039A (en) | Monocular image depth information estimation method based on contour acuity | |
CN114529946A (en) | Pedestrian re-identification method, device, equipment and storage medium based on self-supervision learning | |
CN102025919A (en) | Method and device for detecting image flicker and camera applying device | |
CN112037109A (en) | Improved image watermarking method and system based on saliency target detection | |
CN114925848A (en) | Target detection method based on transverse federated learning framework | |
CN110135435B (en) | Saliency detection method and device based on breadth learning system | |
CN109978858B (en) | A dual-frame thumbnail image quality assessment method based on foreground detection | |
CN111861949A (en) | A method and system for multi-exposure image fusion based on generative adversarial network | |
CN113920171B (en) | Bimodal target tracking method based on feature level and decision level fusion | |
CN115527050A (en) | Image feature matching method, computer device and readable storage medium | |
CN114882523B (en) | A mission target detection method and system based on fragmented video information | |
CN111145221A (en) | A Target Tracking Algorithm Based on Multi-layer Depth Feature Extraction | |
CN117496205A (en) | A heterogeneous scene matching method based on ITHM-Net | |
CN117151207A (en) | Antagonistic patch generation method based on dynamic optimization integrated model | |
CN110705568A (en) | Optimization method for image feature point extraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |