CN113888604A - Target tracking method based on depth optical flow - Google Patents

Target tracking method based on depth optical flow

Info

Publication number
CN113888604A
Authority
CN
China
Prior art keywords
frame image
optical flow
target
image
tracked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111138039.6A
Other languages
Chinese (zh)
Inventor
张卡
何佳
尼秀明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Qingxin Internet Information Technology Co ltd
Original Assignee
Anhui Qingxin Internet Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Qingxin Internet Information Technology Co ltd filed Critical Anhui Qingxin Internet Information Technology Co ltd
Priority to CN202111138039.6A priority Critical patent/CN113888604A/en
Publication of CN113888604A publication Critical patent/CN113888604A/en
Pending legal-status Critical Current

Classifications

    • G06T 7/269 - Image analysis; Analysis of motion using gradient-based methods
    • G06T 7/246 - Image analysis; Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N 3/045 - Neural networks; Combinations of networks
    • G06N 3/08 - Neural networks; Learning methods
    • G06T 2207/10016 - Image acquisition modality; Video; Image sequence
    • G06T 2207/20081 - Special algorithmic details; Training; Learning
    • G06T 2207/20084 - Special algorithmic details; Artificial neural networks [ANN]

Abstract

The invention discloses a target tracking method based on a depth optical flow, which belongs to the technical field of target tracking and comprises the steps of: S31, selecting an initial tracking image as the previous frame image; S32, forming a moving image pair from the previous frame image and the current frame image, inputting the moving image pair into a deep neural network model, and predicting the positions of all targets in the current frame image and the corresponding motion optical flow field; S33, updating the positions of the targets to be tracked in the previous frame image according to the positions of all targets in the current frame image and the corresponding motion optical flow field, to obtain an updated frame image; and S34, taking the updated frame image as the previous frame image and repeatedly executing steps S32-S33 to realize continuous tracking of the targets. The invention realizes end-to-end target detection and tracking without increasing the computational cost.

Description

Target tracking method based on depth optical flow
Technical Field
The invention relates to the technical field of target tracking, in particular to a target tracking method based on a depth optical flow.
Background
Target tracking means determining the boundary position of a target of interest in the current frame image from its boundary position in the previous frame image and the spatio-temporal correlation between frames. It is a core technology in the field of computer vision with a very wide range of applications, and is a prerequisite for many downstream applications such as action analysis, behavior recognition, surveillance and human-computer interaction.
Currently, target tracking technologies fall mainly into two categories:
(1) Target tracking based on traditional techniques, mainly including Kalman filter tracking, optical flow tracking, template matching tracking, TLD tracking, CT tracking, KCF tracking and the like. These methods have the advantages of simple principles, high running speed and good performance in relatively simple scenes, and are suitable for short-term tracking; their disadvantages are poor robustness, so that in slightly more complex scenes the target is easily lost or wrongly tracked, and they cannot handle long-term tracking.
(2) Target tracking based on deep learning, which mainly adopts a strategy of target detection followed by target matching to complete the tracking process. The target position in each frame image is first located by a strong deep-learning-based target detection framework (such as fast-rcnn, ssd and yolo), and the same target is then associated between the previous and current frame images by a nearest-neighbor matching algorithm or a feature-vector matching algorithm, thereby completing the target tracking process. The advantages of this technology are strong robustness and the ability to track over long periods; its disadvantages are an excessive dependence on the target detection framework, the requirement that targets not move too fast, and the time cost of stacking two separate algorithms.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and realize end-to-end target detection and tracking without increasing the calculation cost.
To achieve the above object, the present invention adopts a depth optical flow-based target tracking method, comprising:
S31, selecting an initial tracking image as the previous frame image;
S32, forming a moving image pair from the previous frame image and the current frame image, inputting the moving image pair into a deep neural network model, and predicting the positions of all targets in the current frame image and the corresponding motion optical flow field;
S33, updating the positions of the targets to be tracked in the previous frame image according to the positions of all targets in the current frame image and the corresponding motion optical flow field, to obtain an updated frame image;
and S34, taking the updated frame image as the previous frame image, and repeatedly executing steps S32-S33 to realize continuous tracking of the targets.
Further, the deep neural network model comprises a feature extraction module, a detection module and an optical flow module;
the feature extraction module is used for acquiring a high-level feature map of the moving image pair;
the detection module is used for predicting whether the current frame image has the target and the positions of all the targets according to the high-level feature map;
and the optical flow module is used for predicting the moving optical flow field of the moving image pair based on the high-level feature map.
Further, the feature extraction module includes a splicing layer concat, a backbone network backbone, a feature pyramid network FPN, and output feature layers out_feature1 and out_feature2; the moving image pair serves as the input of the splicing layer concat, which splices the moving image pair along the channel dimension and outputs a spliced map; the output of the splicing layer concat is connected to the feature pyramid network FPN through the backbone network backbone; and the feature pyramid network FPN outputs feature maps fusing features of different scales, which are output through the output feature layers out_feature1 and out_feature2.
Further, the detection module includes convolutional layers dconv1_0, dconv2_0, dconv1_1 and dconv2_1 and a target information analysis layer yolo; the outputs of the output feature layers out_feature1 and out_feature2 are connected to convolutional layers dconv1_0 and dconv2_0, respectively; the outputs of convolutional layers dconv1_0 and dconv2_0 are connected to the target information analysis layer yolo via convolutional layers dconv1_1 and dconv2_1, respectively; and the target information analysis layer yolo is used for extracting valid target position information.
Further, the optical flow module outputs the forward motion optical flow field and the backward motion optical flow field of the moving image pair, and includes a splicing layer concat1, upsampling layers upsample0, upsample1 and upsample2, and convolutional layers lconv0, lconv1, lconv2 and lconv3; the output feature layers out_feature1 and out_feature2 are connected to the splicing layer concat1 and the upsampling layer upsample0, respectively, the upsampling layer upsample0 is connected to the splicing layer concat1, and the splicing layer concat1, the upsampling layer upsample1, the convolutional layer lconv1, the upsampling layer upsample2, the convolutional layer lconv2 and the convolutional layer lconv3 are connected in sequence.
Further, the updating the target position to be tracked in the previous frame image according to the positions of all targets in the current frame image and the corresponding moving optical flow field to obtain an updated frame image includes:
acquiring the rough position of the target to be tracked in the current frame image according to the position of the target to be tracked in the previous frame image and the motion optical flow field;
performing correlation matching according to the rough position of the target to be tracked in the current frame image and the positions of all targets in the current frame image to obtain the accurate position of the target to be tracked in the current frame image;
and generating the updated frame image according to the accurate position of the target to be tracked in the current frame image.
Further, the obtaining a rough position of the target to be tracked in the current frame image according to the position of the target to be tracked in the previous frame image and the moving optical flow field includes:
scaling down the position of the target detected in the current frame image, and cropping the corresponding optical flow regions from all motion optical flow fields;
comparing the motion displacement and direction of the pixel at the same position in the forward motion optical flow field and the backward motion optical flow field of the cropped optical flow regions to determine the correctly tracked pixels;
obtaining the motion displacement of the target to be tracked by a statistical method according to the correctly tracked pixels;
and accumulating the position of the target to be tracked in the previous frame image with the motion displacement of the target to be tracked to obtain the rough position of the target to be tracked in the current frame image.
Further, before the step of forming a moving image pair from the previous frame image and the current frame image, inputting the moving image pair into the deep neural network model, and predicting the positions of all targets in the current frame image and the corresponding motion optical flow field, the method further includes:
collecting pedestrian videos;
marking pedestrian motion position information on each frame image in the pedestrian video, and constructing a moving image pair set;
and training the deep neural network model with the moving image pair set to learn the model parameters.
Further, the step of marking the pedestrian motion position information on each frame image in the pedestrian video to construct the moving image pair set includes:
acquiring and marking a target position in each frame of image in the pedestrian video;
randomly selecting a frame image containing a target as a previous frame image, randomly selecting an image from image frames behind the previous frame image as a current frame image, and forming a moving image pair with the previous frame image;
and downsampling each generated moving image pair, acquiring the forward motion optical flow field and the backward motion optical flow field of the image pair with an optical flow field generation tool, labeling them, and thereby constructing the moving image pair set.
Further, the loss function L adopted during the deep neural network model training is as follows:
L = α·L_loc + β·L_offset
where L_loc denotes the detection loss function, L_offset denotes the loss function of the optical flow module, and α and β denote weighting coefficients.
Compared with the prior art, the invention has the following technical effects: by means of a deep neural network model, the target matching strategy is integrated into a deep-learning-based target detection framework, so that end-to-end target detection and tracking can be realized with almost no additional computational cost; the method is highly general, runs in real time, has fewer error sources, can track over long periods, and the tracking effect is robust.
Drawings
The following detailed description of embodiments of the invention refers to the accompanying drawings in which:
FIG. 1 is a block diagram of a deep neural network model population;
FIG. 2 is a network architecture diagram of a feature extraction module;
FIG. 3 is a network architecture diagram of a detection tracking module;
FIG. 4 is a network architecture diagram of an optical flow tracking module;
fig. 5 is a target tracking flow chart.
In the figures, the label on the left side of each neural network layer indicates the size of that layer's output feature map: feature map width x feature map height x number of feature map channels.
Detailed Description
To further illustrate the features of the present invention, refer to the following detailed description of the invention and the accompanying drawings. The drawings are for reference and illustration purposes only and are not intended to limit the scope of the present disclosure.
This embodiment is applicable to all multi-target tracking scenes; for convenience of description, multi-target tracking of pedestrians is taken as an example. The embodiment discloses a target tracking method based on a depth optical flow, which comprises the following steps:
s1, designing a deep neural network model:
the deep neural network model designed by the invention has the main function of directly completing the detection and tracking of the pedestrian target in each frame of image by means of the deep neural network model with a fusion mechanism, and the whole pedestrian tracking system has higher operation speed, fewer error sources and more robust tracking effect because the steps of pedestrian detection and positioning, pedestrian correlation matching and the like are not distinguished intentionally. The present invention uses a Convolutional Neural Network (CNN), which defines some terms for convenience of describing the present invention: feature resolution refers to feature height x feature width, feature size refers to feature width x feature height x number of feature channels, kernel size refers to kernel width x kernel height, and span refers to width span x height span, and each convolutional layer is followed by a bulk normalization layer and a nonlinear activation layer.
As shown in FIG. 1, the invention optimizes and improves on the high-performance single-stage object detection framework yolov4-tiny. The designed deep neural network model comprises four modules: a feature extraction module, a detection module (detect), an optical flow module and an update module, where the update module does not participate in training and only functions at test time. The specific design steps are as follows:
s11, designing a feature extraction module:
the feature extraction module is mainly used for obtaining high-level features with high abstraction and rich expression capability of the input motion image pair, and the quality of the high-level feature extraction directly influences the performance of follow-up pedestrian target tracking. The present invention uses the same feature extraction module as yolov4-tiny, as shown in fig. 2, the input of the feature extraction network is a moving image pair, the moving image pair is composed of 23 channel RGB images with the image resolution of 320 × 320, wherein one image is the current frame image and the other is the previous frame image. concat is a splicing layer and mainly functions to splice 2 input 3-channel RGB images into a 6-channel image with the same resolution according to channel dimensions. The backbone network of yolov4-tiny is the backbone network of the back bone, the FPN is the characteristic pyramid network, mainly used for fusing the characteristics of different scales, and the concrete network structure is the same as yolov 4-tiny. The out _ feature1, out _ feature2 are output feature layers of the feature extraction module for subsequent detection and tracking of pedestrian objects, where the feature map resolution of out _ feature1 is 20x20x384 and the feature map resolution of out _ feature2 is 10x10x 256.
S12, designing a detection module:
the detection module is mainly used for predicting whether a pedestrian target exists in the current frame image and the position of the pedestrian target on the basis of the feature map output by the feature extraction module. The invention is improved on the basis of a yolov4-tiny detection module, and a specific network structure is shown in FIG. 3, wherein dconv1_0 and dconv2_0 are convolutional layers with the core size of 3x3 and the span of 1x1, and dconv1_1 and dconv2_1 are convolutional layers with the core size of 1x1 and the span of 1x 1. The yolo layer is a pedestrian object information analysis layer for extracting effective pedestrian object information, and functions only at the time of test, and is the same as that in yolov4-tiny, and the feature map resolution of the yolo layer is Nx5, where N represents the number of detected pedestrian objects.
S13, designing an optical flow module:
the optical flow module is mainly used for predicting the motion displacement and direction of each pixel between the front frame image and the rear frame image in the input image pair on the basis of the feature map output by the feature extraction module, namely predicting the motion optical flow field of the input image pair. The invention adopts a front-back bidirectional optical flow field design, the optical flow prediction is more accurate, the specific network structure is shown in figure 4, concat1 is a splicing layer, and the main function is to splice a plurality of input feature maps into an output feature map according to the channel dimension. The upsamplle 0, the upsamplle 1 and the upsamplle 2 are 2 times of upsampling layers, and the specific principle is the same as that in the yolov4-tiny structure. lconv0, lconv1, and lconv2 are convolutional layers with a core size of 3x3 and a span of 1x 1. lconv3 is a convolutional layer with a kernel size of 1x1 and a span of 1x1, and its output feature map represents the corresponding motion optical flow field of the input image pair, wherein the 1 st output feature map represents the motion optical flow field in the x-coordinate direction from the previous frame image to the current frame image in the input image pair, the 2 nd output feature map represents the motion optical flow field in the y-coordinate direction from the previous frame image to the current frame image in the input image pair, the 3 rd output feature map represents the motion optical flow field in the x-coordinate direction from the current frame image to the previous frame image in the input image pair, the 4 th output feature map represents the motion optical flow field in the y-coordinate direction from the current frame image to the previous frame image in the input image pair, the first 2 output feature maps constitute a forward motion optical flow field, and the last 2 output feature maps constitute a backward motion optical flow field.
S14, designing an update module:
the updating module is mainly used for calculating the accurate position of the pedestrian target to be tracked in the previous frame image in the current frame image according to the output information of the detection module and the optical flow module so as to update the previous frame image information, and the specific steps are as follows:
s131, acquiring a rough new position of the pedestrian target to be tracked, and accumulating corresponding motion offset acquired by the upper optical flow module according to the position of the pedestrian target to be tracked in the previous frame image to acquire the rough position of the pedestrian target to be tracked in the current frame image. The corresponding motion offset is obtained according to the output information of the optical flow module, and the specific method is as follows:
s1311, obtaining a moving optical flow field of the pedestrian target to be tracked, wherein the main method is to reduce the position of the pedestrian target detected in the current frame image by 4 times, and then intercepting corresponding optical flow areas in all output moving optical flow fields of the optical flow module.
S1312, acquiring correctly moving pixels: the motion displacement and direction of the pixel at the same position in the forward motion optical flow field and the backward motion optical flow field are compared, and when the differences in displacement and direction are small, the pixel is considered to be a correctly tracked pixel.
S1313, acquiring the motion displacement of the pedestrian target: from the correctly tracked pixels obtained in step S1312, the motion displacement of the pedestrian target is obtained by a statistical method.
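A minimal sketch of steps S1311-S1313, assuming NumPy, that the 4-channel flow map is ordered as forward x, forward y, backward x, backward y, that the consistency test is "forward flow plus backward flow at the same position is near zero", and that the statistic is the median; the tolerance value is likewise an assumption:

    import numpy as np

    def rough_displacement(box, flow, scale=4, tol=1.5):
        # box: (x1, y1, x2, y2) in full-image coordinates
        # flow: (4, H/scale, W/scale) array from the optical flow module
        x1, y1, x2, y2 = [int(round(v / scale)) for v in box]   # S1311: scale box down
        fwd = flow[0:2, y1:y2, x1:x2]
        bwd = flow[2:4, y1:y2, x1:x2]
        # S1312: a pixel is correctly tracked when forward and backward flow roughly cancel
        consistency = np.linalg.norm(fwd + bwd, axis=0)
        mask = consistency < tol
        if not mask.any():
            return 0.0, 0.0
        # S1313: a robust statistic of the forward flow gives the target displacement,
        # rescaled back to full-image pixels
        dx = float(np.median(fwd[0][mask])) * scale
        dy = float(np.median(fwd[1][mask])) * scale
        return dx, dy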
S132, acquiring the accurate new position of the pedestrian target to be tracked: the rough new position of the pedestrian target to be tracked obtained in step S131 is matched against each pedestrian target position obtained by the detection module, and the best-matching pedestrian target position is selected as the accurate new position of the pedestrian target to be tracked in the current frame image. The invention adopts the IoU (Intersection over Union) matching degree as the association matching function, although any other similarity measure may be used as the association matching function. When the matching degree of the best-matching pedestrian target is greater than a certain threshold, the pedestrian target in the current frame image is considered to have a corresponding history record in the previous frame image, i.e., it is a trackable pedestrian target. When the matching degree of the best-matching pedestrian target is below the threshold, the pedestrian target in the current frame image is considered to have no corresponding history record in the previous frame image, i.e., it is a newly appeared pedestrian target.
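A minimal sketch of this association step, assuming axis-aligned boxes in (x1, y1, x2, y2) form; the threshold value is an assumption:

    def iou(a, b):
        # Intersection over Union of two boxes
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / (union + 1e-9)

    def match_target(rough_box, detected_boxes, thresh=0.3):
        # Returns the index of the best-matching detection, or None if the target
        # has no history in the previous frame image (i.e. it is newly appeared).
        if not detected_boxes:
            return None
        best = max(range(len(detected_boxes)),
                   key=lambda i: iou(rough_box, detected_boxes[i]))
        return best if iou(rough_box, detected_boxes[best]) > thresh else None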
S133, updating the pedestrian targets to be tracked: a new previous frame image and its known pedestrian targets are generated according to the pedestrian target positions in the current frame image. First, the current frame image becomes the new previous frame image; then the position information of the known pedestrian targets in the previous frame image is updated according to the trackable and newly appeared pedestrian targets obtained in step S132. When a pedestrian target in the previous frame image is not associated with any pedestrian target in the current frame image in step S132, the pedestrian target is considered to have disappeared from the video, and its tracking record is deleted.
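A minimal sketch of this bookkeeping, assuming tracks are kept as a dictionary from track id to box (the data layout is an assumption):

    def update_tracks(tracks, detections, matches, next_id):
        # tracks: {track_id: box}; detections: [box]
        # matches: {track_id: detection index or None} from the association step
        new_tracks, used = {}, set()
        for tid, di in matches.items():
            if di is not None:                 # trackable target: update its position
                new_tracks[tid] = detections[di]
                used.add(di)
            # else: the target disappeared, so its tracking record is dropped
        for di, box in enumerate(detections):
            if di not in used:                 # newly appeared target: start a new record
                new_tracks[next_id] = box
                next_id += 1
        return new_tracks, next_id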
S2, training a deep neural network model:
after the deep neural network model is designed, pedestrian video images under various scenes are collected and sent into the deep neural network model to learn relevant model parameters, and the method specifically comprises the following steps:
and S21, collecting pedestrian videos, wherein the pedestrian videos are mainly collected under various scenes, various light rays and various angles.
S22, labeling pedestrian motion position information: the pedestrian position information in each frame image of the video and the motion information between the frames of each moving image pair are labeled, with the following specific steps:
s221, marking pedestrian target position information, wherein the pedestrian position in each frame of image in the video is acquired as the pedestrian position information by using the existing pedestrian detection frame based on deep learning.
S222, establishing moving image pairs: the video is converted into an image sequence, one frame image containing a pedestrian target is randomly selected as the previous frame image, one image is then randomly selected from the following 120 frames as the current frame image, and the two images together form a moving image pair.
S223, obtaining the motion information of the pedestrian targets: each moving image pair is first downsampled by a factor of 4, and the forward motion optical flow field and the backward motion optical flow field of the image pair are then obtained with an existing optical flow field generation tool. Each motion optical flow field is represented by 2 gray-scale maps of the same image resolution, representing its x-direction and y-direction components, respectively.
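A minimal sketch of this label-generation step, assuming OpenCV's Farneback algorithm stands in for the "existing optical flow field generation tool" (the Farneback parameters are assumptions):

    import cv2

    def make_flow_labels(prev_bgr, cur_bgr, scale=4):
        # Downsample the image pair by the given factor, then compute forward
        # (previous -> current) and backward (current -> previous) flow fields.
        small_prev = cv2.resize(prev_bgr, None, fx=1 / scale, fy=1 / scale)
        small_cur = cv2.resize(cur_bgr, None, fx=1 / scale, fy=1 / scale)
        g_prev = cv2.cvtColor(small_prev, cv2.COLOR_BGR2GRAY)
        g_cur = cv2.cvtColor(small_cur, cv2.COLOR_BGR2GRAY)
        fwd = cv2.calcOpticalFlowFarneback(g_prev, g_cur, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        bwd = cv2.calcOpticalFlowFarneback(g_cur, g_prev, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        # Each field is (H/scale, W/scale, 2): the x-direction and y-direction maps.
        return fwd, bwd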
S23, training the deep neural network model: the assembled moving image pair set is fed into the designed deep neural network model to learn the relevant model parameters. The loss function L used during network model training is given by the following formula:
L = α·L_loc + β·L_offset
where L_loc denotes the detection loss function, whose meaning is the same as in yolov4-tiny; L_offset denotes the loss function of the optical flow module, implemented as a mean square error loss; and α and β denote weighting coefficients.
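A minimal sketch of this objective, assuming PyTorch; the detection term is taken as given by the yolov4-tiny criterion, and the default weights are assumptions:

    import torch.nn.functional as F

    def total_loss(det_loss, pred_flow, gt_flow, alpha=1.0, beta=1.0):
        # det_loss: L_loc from the yolov4-tiny detection criterion (computed elsewhere)
        l_offset = F.mse_loss(pred_flow, gt_flow)   # L_offset over the 4 flow channels
        return alpha * det_loss + beta * l_offset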
S3, using the deep neural network model: after the deep neural network model is trained, pedestrian tracking can be performed in an actual environment. For any given frame of pedestrian images, a moving image pair is formed and sent into the trained deep neural network model, which directly outputs the new positions of the pedestrian targets in the current frame image, thereby achieving continuous tracking of the pedestrian targets. As shown in FIG. 5, the specific steps are as follows:
S31, selecting an initial tracking image: one frame of pedestrian image is selected as the previous frame image;
S32, predicting the pedestrian positions and motion information of the current frame image: the previous frame image and the current frame image form an image pair, which is sent into the deep neural network model to directly predict all pedestrian target positions in the current frame image and the forward and backward motion optical flow fields of the image pair;
S33, updating the positions of the pedestrian targets to be tracked: according to the pedestrian target positions predicted in step S32 and the corresponding motion optical flow fields, a new previous frame image and its known pedestrian targets are obtained by means of the update module;
and S34, continuously tracking, and repeating the steps S32-S33 to realize the continuous tracking of the pedestrian target.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A target tracking method based on a depth optical flow is characterized by comprising the following steps:
S31, selecting an initial tracking image as the previous frame image;
S32, forming a moving image pair from the previous frame image and the current frame image, inputting the moving image pair into a deep neural network model, and predicting the positions of all targets in the current frame image and the corresponding motion optical flow field;
S33, updating the positions of the targets to be tracked in the previous frame image according to the positions of all targets in the current frame image and the corresponding motion optical flow field, to obtain an updated frame image;
and S34, taking the updated frame image as the previous frame image, and repeatedly executing steps S32-S33 to realize continuous tracking of the targets.
2. The depth optical flow-based target tracking method according to claim 1, wherein the deep neural network model includes a feature extraction module, a detection module and an optical flow module;
the feature extraction module is used for acquiring a high-level feature map of the moving image pair;
the detection module is used for predicting whether the current frame image has the target and the positions of all the targets according to the high-level feature map;
and the optical flow module is used for predicting the moving optical flow field of the moving image pair based on the high-level feature map.
3. The method for tracking an object based on a deep optical flow as claimed in claim 2, wherein the feature extraction module includes a splicing layer concat, a backbone network backbone, a feature pyramid network FPN, and output feature layers out_feature1 and out_feature2; the moving image pair serves as the input of the splicing layer concat, which splices the moving image pair along the channel dimension and outputs a spliced map; the output of the splicing layer concat is connected to the feature pyramid network FPN through the backbone network backbone; and the feature pyramid network FPN outputs feature maps fusing features of different scales, which are output through the output feature layers out_feature1 and out_feature2.
4. The method for tracking an object based on a depth optical flow as claimed in claim 3, wherein the detection module comprises convolutional layers dconv1_0, dconv2_0, dconv1_1 and dconv2_1 and a target information analysis layer yolo; the outputs of the output feature layers out_feature1 and out_feature2 are connected to convolutional layers dconv1_0 and dconv2_0, respectively; the convolutional layers dconv1_0 and dconv2_0 are connected to the target information analysis layer yolo through convolutional layers dconv1_1 and dconv2_1, respectively; and the target information analysis layer yolo is used for extracting valid target position information.
5. The depth-based optical flow target tracking method according to claim 3, wherein the optical flow module outputs the forward motion optical flow field and the backward motion optical flow field of the moving image pair, and comprises a splicing layer concat1, upsampling layers upsample0, upsample1 and upsample2, and convolutional layers lconv0, lconv1, lconv2 and lconv3; the output feature layers out_feature1 and out_feature2 are connected to the splicing layer concat1 and the upsampling layer upsample0, respectively, the upsampling layer upsample0 is connected to the splicing layer concat1, and the splicing layer concat1, the upsampling layer upsample1, the convolutional layer lconv1, the upsampling layer upsample2, the convolutional layer lconv2 and the convolutional layer lconv3 are connected in sequence.
6. The method for tracking an object based on a deep optical flow as claimed in claim 1, wherein the step of updating the position of the object to be tracked in the previous frame image according to the positions of all objects in the current frame image and the corresponding moving optical flow field to obtain an updated frame image comprises:
acquiring the rough position of the target to be tracked in the current frame image according to the position of the target to be tracked in the previous frame image and the motion optical flow field;
performing correlation matching according to the rough position of the target to be tracked in the current frame image and the positions of all targets in the current frame image to obtain the accurate position of the target to be tracked in the current frame image;
and generating the updated frame image according to the accurate position of the target to be tracked in the current frame image.
7. The method for tracking an object based on a depth optical flow as claimed in claim 6, wherein said obtaining the rough position of the object to be tracked in the current frame image according to the position of the object to be tracked in the previous frame image and the moving optical flow field comprises:
scaling down the position of the target detected in the current frame image, and cropping the corresponding optical flow regions from all motion optical flow fields;
comparing the motion displacement and direction of the pixel at the same position in the forward motion optical flow field and the backward motion optical flow field of the cropped optical flow regions to determine the correctly tracked pixels;
obtaining the motion displacement of the target to be tracked by a statistical method according to the correctly tracked pixels;
and accumulating the position of the target to be tracked in the previous frame image with the motion displacement of the target to be tracked to obtain the rough position of the target to be tracked in the current frame image.
8. The method for tracking an object based on a depth optical flow as claimed in claim 6, wherein before the step of forming a moving image pair from the previous frame image and the current frame image, inputting the moving image pair into the deep neural network model, and predicting the positions of all targets in the current frame image and the corresponding motion optical flow field, the method further comprises:
collecting pedestrian videos;
marking pedestrian motion position information on each frame image in the pedestrian video, and constructing a moving image pair set;
and training the deep neural network model with the moving image pair set to learn the model parameters.
9. The method for tracking an object based on a depth optical flow as claimed in claim 8, wherein the step of labeling the pedestrian motion position information for each frame image in the pedestrian video to construct a set of moving image pairs comprises:
acquiring and marking a target position in each frame of image in the pedestrian video;
randomly selecting a frame image containing a target as a previous frame image, randomly selecting an image from image frames behind the previous frame image as a current frame image, and forming a moving image pair with the previous frame image;
and downsampling each generated moving image pair, acquiring the forward motion optical flow field and the backward motion optical flow field of the image pair with an optical flow field generation tool, labeling them, and thereby constructing the moving image pair set.
10. The method for tracking an object based on a deep optical flow as claimed in claim 8, wherein the loss function L adopted in the deep neural network model training is:
L = α·L_loc + β·L_offset
where L_loc denotes the detection loss function, L_offset denotes the loss function of the optical flow module, and α and β denote weighting coefficients.
CN202111138039.6A 2021-09-27 2021-09-27 Target tracking method based on depth optical flow Pending CN113888604A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111138039.6A CN113888604A (en) 2021-09-27 2021-09-27 Target tracking method based on depth optical flow

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111138039.6A CN113888604A (en) 2021-09-27 2021-09-27 Target tracking method based on depth optical flow

Publications (1)

Publication Number Publication Date
CN113888604A (en) 2022-01-04

Family

ID=79007245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111138039.6A Pending CN113888604A (en) 2021-09-27 2021-09-27 Target tracking method based on depth optical flow

Country Status (1)

Country Link
CN (1) CN113888604A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972422A (en) * 2022-05-07 2022-08-30 安徽工业大学科技园有限公司 Image sequence motion occlusion detection method and device, memory and processor

Similar Documents

Publication Publication Date Title
CN112884064B (en) Target detection and identification method based on neural network
CN112381132A (en) Target object tracking method and system based on fusion of multiple cameras
CN117593304B (en) Semi-supervised industrial product surface defect detection method based on cross local global features
CN111239684A (en) Binocular fast distance measurement method based on YoloV3 deep learning
CN112634369A (en) Space and or graph model generation method and device, electronic equipment and storage medium
CN110310305A (en) A kind of method for tracking target and device based on BSSD detection and Kalman filtering
CN115661505A (en) Semantic perception image shadow detection method
CN113963251A (en) Marine organism detection method, system and equipment
CN116342894A (en) GIS infrared feature recognition system and method based on improved YOLOv5
Yin Object Detection Based on Deep Learning: A Brief Review
CN114550135A (en) Lane line detection method based on attention mechanism and feature aggregation
CN113888604A (en) Target tracking method based on depth optical flow
CN113129336A (en) End-to-end multi-vehicle tracking method, system and computer readable medium
CN113112479A (en) Progressive target detection method and device based on key block extraction
CN111881914A (en) License plate character segmentation method and system based on self-learning threshold
Gao et al. Robust lane line segmentation based on group feature enhancement
CN111814696A (en) Video ship target detection method based on improved YOLOv3
CN113870311A (en) Single-target tracking method based on deep learning
CN116523885A (en) PCB defect detection method based on multi-scale fusion and deep learning
CN115953744A (en) Vehicle identification tracking method based on deep learning
CN115953736A (en) Crowd density estimation method based on video monitoring and deep neural network
CN111401286B (en) Pedestrian retrieval method based on component weight generation network
CN113642473A (en) Mining coal machine state identification method based on computer vision
CN114170269A (en) Multi-target tracking method, equipment and storage medium based on space-time correlation
CN113378598A (en) Dynamic bar code detection method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination