CN113362371A - Target tracking method and device, electronic equipment and storage medium - Google Patents

Target tracking method and device, electronic equipment and storage medium

Info

Publication number
CN113362371A
Authority
CN
China
Prior art keywords
sequence
image
frame
target
difference
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110543857.8A
Other languages
Chinese (zh)
Inventor
何琦
鲍一平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd
Priority to CN202110543857.8A
Publication of CN113362371A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a target tracking method and device, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring an image frame sequence; obtaining a difference image sequence by calculating the interframe difference value of the image frame sequence; taking the image frame sequence and the differential image sequence as the input of a trained target detection network to obtain a detection result sequence output by the target detection network; and tracking the target according to the detection result sequence. The scheme can improve the accuracy of target tracking.

Description

Target tracking method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a target tracking method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Tracking of moving objects is generally divided into two main parts: object detection and object tracking.
Object detection means taking a picture (or a picture with depth) as input and outputting the positions of all objects in the picture. Most existing object detection methods are based on convolutional neural networks; commonly used network structures include DenseBox, RetinaNet and the like. These networks take the entire picture as input and output object rectangular boxes with confidence scores. Object detection is then completed through simple post-processing (such as Non-Maximum Suppression, NMS).
Object tracking means taking the object detection results of a plurality of consecutive frames as input, outputting the motion trajectories of the objects, and suppressing false positives in the detection results as much as possible. Based on the principle that the same object does not move far between adjacent frames, traditional target tracking algorithms connect the detection results of adjacent frames through Intersection over Union (IoU) and perform motion estimation. After the motion trajectories are generated, trajectories that are too short are considered false positives and removed. Neural-network-based target tracking algorithms take the image patches cropped in the detection stage as input, output the features of those patches, and connect the same object across multiple frames by comparing feature consistency to obtain a motion trajectory.
Neural-network-based object detection does not exploit the property of a "moving object", so all objects in the field of view must be detected, which requires a large amount of training data and yields a high probability of false positives. Existing moving-object tracking methods make limited use of the original image information; most of them track only on the basis of detection results, so their accuracy is low.
Disclosure of Invention
The embodiment of the application provides a target tracking method, which is used for improving the tracking accuracy.
The embodiment of the application provides a target tracking method, which comprises the following steps:
acquiring an image frame sequence;
obtaining a difference image sequence by calculating the interframe difference value of the image frame sequence;
taking the image frame sequence and the differential image sequence as the input of a trained target detection network to obtain a detection result sequence output by the target detection network;
and tracking the target according to the detection result sequence.
In an embodiment, the performing target tracking according to the detection result sequence includes:
removing the detection result which does not belong to the target object according to the image frame sequence, the differential image sequence and the detection result sequence to obtain a target position sequence;
and obtaining a target motion track according to the target position sequence.
In an embodiment, the detecting result sequence includes a moving object position sequence and a local image sequence, and the removing, according to the image frame sequence, the difference image sequence and the detecting result sequence, the detecting result not belonging to the target object to obtain the target position sequence includes:
taking the local image sequence as the input of a trained feature extraction network to obtain an image feature sequence corresponding to the local image sequence output by the feature extraction network;
and removing the position of the moving object which does not belong to the target object according to the image frame sequence, the difference image sequence, the position sequence of the moving object and the image characteristic sequence to obtain a target position sequence.
In an embodiment, the removing, according to the image frame sequence, the differential image sequence, the moving object position sequence, and the image feature sequence, the moving object position that does not belong to the target object to obtain the target position sequence includes:
taking the image frame sequence and the differential image sequence as the input of an optical flow calculation network to obtain the optical flow information offset between adjacent frames output by the optical flow calculation network;
superposing the optical flow information offset between the adjacent frames to the target object position of the previous frame to obtain the target object prediction position of the next frame;
and according to the moving object position sequence and the image feature sequence, removing, in the next frame, the moving object positions whose image features differ from the image features of the target object, or whose positions differ from the predicted position of the target object, by more than a first threshold, to obtain the target object position of the next frame, wherein the target object positions of all frames form the target position sequence.
In an embodiment, the removing, according to the image frame sequence, the differential image sequence, the moving object position sequence, and the image feature sequence, the moving object position that does not belong to the target object to obtain the target position sequence includes:
taking the position and the movement speed of the target object of the previous frame as the input of a Kalman filtering model, and obtaining the predicted position of the target object of the next frame output by the Kalman filtering model;
calculating the intersection over union (IoU) between the predicted position of the target object in the next frame and the moving object positions in the next frame image, and removing the moving object positions in the next frame whose IoU is smaller than a second threshold;
and according to the moving object position sequence and the image feature sequence, removing, in the next frame, the moving object positions whose image features differ from the image features of the target object, or whose positions differ from the predicted position of the target object, by more than a third threshold, to obtain the target object position of the next frame, wherein the target object positions of all frames form the target position sequence.
In an embodiment, the method further comprises:
calculating to obtain the target average depth according to the target depth information of each frame of image in the image frame sequence;
and removing the target object position of which the difference between the target depth information and the target average depth exceeds a fourth threshold value.
In an embodiment, the method further comprises:
generating continuous multi-frame target distribution heat maps according to the differential image sequence;
calculating a weighted average value of the target object positions in a target distribution heat map corresponding to the appointed frame image according to the target object positions of the appointed frame image;
carrying out weighted average on the weighted average value of the target object positions in the target distribution heat map corresponding to each frame image to obtain a heat response average value;
and removing the target object position with the average value of the thermal response smaller than a fifth threshold value.
In an embodiment, the obtaining a difference image sequence by calculating an inter-frame difference value of the image frame sequence includes:
aiming at each two adjacent frames of images in the image frame sequence, calculating the pixel difference value of each channel of each pixel point between the two adjacent frames of images to obtain the differential image information corresponding to each two adjacent frames of images;
and arranging the difference image information corresponding to each two adjacent frames of images in sequence to obtain the difference image sequence.
In an embodiment, the obtaining a difference image sequence by calculating an inter-frame difference value of the image frame sequence includes:
calculating a first mean image of a plurality of previous frame images and a second mean image of a plurality of next frame images by taking a designated frame image in the image frame sequence as a reference;
calculating a pixel difference value between the first mean value image and the second mean value image to obtain difference image information corresponding to the appointed frame image;
and sequencing the differential image information corresponding to the appointed frame image in sequence to obtain the differential image sequence.
In an embodiment, the taking the image frame sequence and the differential image sequence as inputs of a trained target detection network to obtain a detection result sequence output by the target detection network includes:
fusing the image frame sequence and the differential image sequence to obtain a fused image sequence;
and taking the fused image sequence as the input of the trained target detection network to obtain the detection result sequence of the fused image sequence output by the target detection network.
In an embodiment, the image frame sequence and the difference image sequence are fused to obtain a fused image sequence, and any one of the following manners is adopted:
in the first mode, the image frame sequence and the differential image sequence are spliced on a channel to obtain the fusion image sequence;
in a second mode, multiplying the difference image sequence and the image frame sequence pixel by pixel to obtain an intermediate image sequence;
splicing the difference image sequence, the image frame sequence and the intermediate image sequence on a channel to obtain a fusion image sequence;
in a third mode, performing pooling operation on the difference image sequence to obtain a reduced image sequence;
amplifying the reduced image sequence to the same size as the image frame sequence to obtain an image sequence with the same scale;
and carrying out pixel-by-pixel addition operation on the same-scale image sequence and the image frame sequence to obtain the fusion image sequence.
An embodiment of the present application further provides a target tracking apparatus, including:
the image acquisition module is used for acquiring an image frame sequence;
the difference calculation module is used for calculating the interframe difference value of the image frame sequence to obtain a difference image sequence;
the target detection module is used for taking the image frame sequence and the differential image sequence as the input of a trained target detection network to obtain a detection result sequence output by the target detection network;
and the target tracking module is used for tracking the target according to the detection result sequence.
An embodiment of the present application further provides an electronic device, where the electronic device includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the above-described target tracking method.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and the computer program is executable by a processor to implement the above-mentioned target tracking method.
According to the technical scheme provided by the embodiments of the application, a difference image sequence is obtained by calculating the inter-frame difference of the image frame sequence. This filters out static objects, so that all detected objects are in motion and suitable for subsequent moving-object tracking. The difference image sequence and the original image frame sequence are input into the target detection network together, so that a frame-level detection result sequence can be obtained and the accuracy of moving-object detection is improved. Target tracking is then performed based on the detection result sequence, which suppresses the false detection rate and improves the accuracy of target tracking.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required to be used in the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a target tracking method provided in an embodiment of the present application;
FIG. 3 is a detailed flowchart of step S220 in the corresponding embodiment of FIG. 2;
FIG. 4 is a detailed flowchart of step S230 in the corresponding embodiment of FIG. 2;
FIG. 5 is a detailed flowchart of step S240 in the corresponding embodiment of FIG. 2;
FIG. 6 is a detailed flowchart of step S241 in the corresponding embodiment of FIG. 5;
FIG. 7 is a flow chart diagram of a target tracking method based on an optical flow computing network;
FIG. 8 is a schematic flow chart of a target tracking method based on a Kalman filtering model;
FIG. 9 is a flowchart of a target tracking method provided by an embodiment of the present application based on the corresponding embodiment of FIG. 7 or FIG. 8;
FIG. 10 is a schematic block diagram of a target tracking method according to an embodiment of the present application;
fig. 11 is a block diagram of a target tracking device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Fig. 1 is a schematic structural diagram of an electronic device provided in an embodiment of the present application. The electronic device 100 may be configured to perform the target tracking method provided by the embodiment of the present application. As shown in fig. 1, the electronic device 100 includes: one or more processors 102, and one or more memories 104 storing processor-executable instructions. Wherein the processor 102 is configured to execute a target tracking method provided by the following embodiments of the present application.
The processor 102 may be a gateway, or may be an intelligent terminal, or may be a device including a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or other form of processing unit having data processing capability and/or instruction execution capability, and may process data of other components in the electronic device 100, and may control other components in the electronic device 100 to perform desired functions.
The memory 104 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by processor 102 to implement the target tracking method described below. Various applications and various data, such as various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
In one embodiment, the electronic device 100 shown in FIG. 1 may also include an input device 106, an output device 108, and a data acquisition device 110, which are interconnected via a bus system 112 and/or other form of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 100 shown in fig. 1 are exemplary only, and not limiting, and the electronic device 100 may have other components and structures as desired.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like. The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like. The data acquisition device 110 may acquire an image of a subject and store the acquired image in the memory 104 for use by other components. Illustratively, the data acquisition device 110 may be a camera.
In one embodiment, the components of the exemplary electronic device 100 for implementing the object tracking method of the embodiment of the present application may be integrally disposed or disposed in a decentralized manner, such as integrally disposing the processor 102, the memory 104, the input device 106 and the output device 108, and disposing the data acquisition device 110 separately.
In an embodiment, the example electronic device 100 for implementing the target tracking method of the embodiment of the present application may be implemented as a smart terminal, such as a smart phone, a tablet computer, a smart watch, an in-vehicle device, and the like.
Fig. 2 is a schematic flowchart of a target tracking method according to an embodiment of the present application. As shown in fig. 2, the method includes the following steps S210 to S250.
Step S210: a sequence of image frames is acquired.
The image frame sequence refers to a plurality of frames of images which are continuously shot, and the images can be RGB (red, green and blue) format images or RGBD (red, green and blue + depth) format images. For example, the sequence of image frames may be a sequence of frames of images of a video. Each frame image may include one or more moving objects, which in the embodiment of the present application may be moving people, animals, vehicles, or other objects.
Step S220: and calculating the interframe difference value of the image frame sequence to obtain a difference image sequence.
The inter-frame difference refers to the pixel difference of each channel (for example, red, green, blue and depth) of each pixel point between image frames; the absolute values of the inter-frame differences of all pixel points form the difference image information. Since the image frame sequence includes a plurality of image frames, the difference image sequence may include a plurality of pieces of difference image information. In one embodiment, the image frame sequence may be input into a differential image model, which outputs a difference image sequence that includes one or more pieces of difference image information.
In an embodiment, a pixel difference value of each channel of each pixel point between two adjacent frames of images can be calculated for each two adjacent frames of images in an image frame sequence, so as to obtain difference image information corresponding to each two adjacent frames of images; and then arranging the difference image information corresponding to each two adjacent frames of images in sequence to obtain the difference image sequence.
It should be noted that at the pixel positions of a stationary object, the pixel difference is 0, while at the pixel positions of a moving object, the pixel difference is not 0. Therefore, calculating the per-channel pixel difference of each pixel point between frames provides the ability to detect the positions of moving objects in the input images.
Each pair of adjacent frames may be the first and second frames, the second and third frames, the third and fourth frames, and so on. The difference image information refers to the absolute values of the pixel differences of all channels of all pixel points between two adjacent frames, and it has the same size and the same number of channels as the image frames. For example, the absolute value of the pixel difference of each channel (the four channels R, G, B, D) of each pixel point between the first frame and the second frame is calculated to obtain difference image information 1, the absolute value of the pixel difference of each channel of each pixel point between the second frame and the third frame is calculated to obtain difference image information 2, and so on; difference image information 1, difference image information 2, ... are arranged in sequence to form the difference image sequence. If required, Gaussian blur can be applied to each frame of image in advance, before the per-channel pixel differences between every two adjacent frames are calculated. This processing reduces the interference caused by pixel value changes due to weather, external illumination and the like, and improves the accuracy of moving-object localization.
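As a concrete illustration of the adjacent-frame differencing described above, the following is a minimal sketch assuming the frames are RGBD images held as NumPy arrays; the use of OpenCV for the optional Gaussian blur and the 5 x 5 kernel size are illustrative choices, not specified by the application.

```python
import cv2
import numpy as np

def adjacent_frame_differences(frames, blur_ksize=5):
    """Compute per-channel absolute differences between adjacent frames.

    frames: list of H x W x C arrays (e.g. RGBD), all the same shape.
    blur_ksize: optional Gaussian-blur kernel size used to suppress pixel
    changes caused by lighting or weather before differencing.
    Returns a list of len(frames) - 1 difference images.
    """
    blurred = [cv2.GaussianBlur(f.astype(np.float32), (blur_ksize, blur_ksize), 0)
               for f in frames]
    diffs = []
    for prev, nxt in zip(blurred[:-1], blurred[1:]):
        diffs.append(np.abs(nxt - prev))  # static pixels stay near 0
    return diffs
```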
In another embodiment, as shown in fig. 3, the sequence of difference images can also be obtained by the following steps.
Step S221: and calculating a first mean image of a plurality of previous frame images and a second mean image of a plurality of next frame images by taking the appointed frame images in the image frame sequence as a reference.
Wherein, assuming that the first several frames are the first three frames, the specified frame image may be each frame image starting from the third frame. Assuming that the first several frame images are the first five frames, the specified frame image may be each frame image starting from the fifth frame, and so on. The first several frame images refer to several frame images before the designated frame image. The latter several frame images refer to several frame images after the designated frame image.
Assuming that each frame image in the image frame sequence is numbered in sequence and that the designated frame image is the a-th frame, the previous several frames may, for example, be frames a-4 to a, and the following several frames may be frames a+1 to a+5. As another example, the previous several frames may be frames a-5 to a-1, and the following several frames may be frames a+1 to a+5.
The first mean image is an image obtained by calculating the mean value of pixels of a plurality of previous frames of images of each pixel point sequentially and taking the mean value as the pixel value of the pixel point.
The second mean image is an image obtained by calculating the pixel mean of the pixel in the next frames of images for each pixel in sequence and taking the mean as the pixel value of the pixel.
Step S222: and calculating the pixel difference value between the first mean value image and the second mean value image to obtain the difference image information corresponding to the appointed frame image.
Specifically, the pixel difference value of each pixel point between the first mean image and the second mean image may be calculated sequentially for each pixel point, and the absolute values of the pixel difference values of all the pixel points constitute difference image information. Assuming that a certain difference image information is based on the pixel difference value of the first mean image of the previous frames of images and the second mean image of the next frames of images based on the a-th frame, the difference image information can be regarded as the difference image information corresponding to the a-th frame.
Step S223: and sequencing the differential image information corresponding to the appointed frame image in sequence to obtain the differential image sequence.
For example, starting from frame 5, a first mean image of frames 1-5 and a second mean image of frames 6-10 are calculated, and an absolute value of a pixel difference between the first mean image and the second mean image is calculated to obtain difference image information 1; aiming at the 6 th frame, calculating a first mean image of the first 2-6 frames and a second mean image of the last 7-11 frames, and calculating the absolute value of the pixel difference between the first mean image and the second mean image to obtain difference image information 2; and aiming at the 7 th frame, calculating a first mean image of the first 3-7 frames and a second mean image of the last 8-12 frames, calculating the absolute value of the pixel difference between the first mean image and the second mean image to obtain difference image information 3, and so on, arranging all the difference image information in sequence to obtain a difference image sequence.
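A minimal sketch of this mean-image variant is given below, under the assumption (matching the frames 1-5 / 6-10 example above) that the designated frame is included in the first mean; the window size k and the NumPy representation are illustrative.

```python
import numpy as np

def mean_image_differences(frames, k=5):
    """For each designated frame i, difference the mean of the k frames
    ending at i against the mean of the k frames following i
    (matching the example above: frames 1-5 vs. frames 6-10 for frame 5).

    Returns a list of (frame_index, difference_image_information) pairs.
    """
    frames = [f.astype(np.float32) for f in frames]
    result = []
    for i in range(k - 1, len(frames) - k):
        first_mean = np.mean(frames[i - k + 1:i + 1], axis=0)   # previous several frames
        second_mean = np.mean(frames[i + 1:i + k + 1], axis=0)  # following several frames
        result.append((i, np.abs(first_mean - second_mean)))
    return result
```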
Step S230: and taking the image frame sequence and the differential image sequence as the input of the trained target detection network to obtain a detection result sequence output by the target detection network.
The target detection network can be obtained by utilizing image training of a large number of known moving object positions. More than one moving object may be included in the images, so the sequence of detection results may include the position of one or more moving objects in each frame of image.
In an embodiment, as shown in fig. 4, the step S230 specifically includes steps S231 to S232.
Step S231: and fusing the image frame sequence and the differential image sequence to obtain a fused image sequence.
The fused image sequence comprises a plurality of continuous fused images, and the fused image is an image obtained by fusing image frames in the image frame sequence and corresponding differential image information.
In an embodiment, assuming that the differential image information is the differential image information between the a-th frame and the a + 1-th frame, the a-th frame and the differential image information may be fused to obtain a fused image corresponding to the a-th frame.
In another embodiment, assuming that the differential image information is the differential image information between the a-1 th frame and the a-th frame, the a-th frame and the differential image information may be fused to obtain a fused image corresponding to the a-th frame.
In other embodiments, assuming that the difference image information is a pixel difference value between a first mean image of a plurality of frames before the a-th frame and a second mean image of a plurality of frames after the a-th frame, the a-th frame and the difference image information may be fused to obtain a fused image of the a-th frame. Therefore, the fused image corresponding to each frame of image in the image sequence forms a fused image sequence.
The fusion mode may be one of the following three modes.
The first method is as follows: and splicing the image frame sequence and the differential image sequence on a channel to obtain the fusion image sequence.
For example, the difference image information corresponding to the designated frame image and the designated frame image may be spliced on the channel to obtain a fused image corresponding to the designated frame image. And the fused image corresponding to each frame of image forms a fused image sequence.
Assuming that the designated frame image has four channels, the difference image information also has four channels, because the difference of each pixel point is calculated directly. Splicing on the channels therefore yields eight channels; that is, each pixel point in the fused image contains the R, G, B, D values as well as the R difference, G difference, B difference and D difference values.
The second method comprises the following steps: multiplying the difference image sequence and the image frame sequence pixel by pixel to obtain an intermediate image sequence; splicing the difference image sequence, the image frame sequence and the intermediate image sequence on a channel to obtain a fusion image sequence;
for example, the difference image information corresponding to the specified frame image and the specified frame image may be multiplied by each pixel to obtain an intermediate image corresponding to the specified frame image; and splicing the appointed frame image, the difference image information corresponding to the appointed frame image and the intermediate image on a channel to obtain a fused image corresponding to the appointed frame image, so that the fused image corresponding to each frame image forms a fused image sequence.
Since the inter-frame difference may be negative, the absolute value of the inter-frame difference is taken to obtain the difference image information, which is then multiplied pixel by pixel with the designated frame image; the resulting image may be referred to as an intermediate image. Assuming the designated frame image has four channels, the difference image information also has four channels, and the intermediate image likewise has four channels. Therefore, the designated frame image, the difference image information and the intermediate image can be spliced on the channels to obtain a fused image with 12 channels per pixel point: the original 4 channels, the 4 difference channels, and the 4 dot-product channels. In this way, the regions with higher differential response can be extracted from the designated frame image, so that the target detection network can better detect the target position.
The third method comprises the following steps: performing pooling operation on the difference image sequence to obtain a reduced image sequence; amplifying the reduced image sequence to the same size as the image frame sequence to obtain an image sequence with the same scale; and carrying out pixel-by-pixel addition operation on the same-scale image sequence and the image frame sequence to obtain the fusion image sequence.
For example, the difference image information corresponding to the designated frame image is pooled to obtain a reduced-size image. The reduced image sequence includes reduced-size images of difference image information for each frame image. Enlarging the size-reduced image to the same size as the specified frame image to obtain an image with the same size; the same-scale image sequence comprises the same-scale image corresponding to each scale-reduced image; and performing pixel-by-pixel addition operation on the same-scale image and the appointed frame image to obtain a fused image corresponding to the appointed frame image. And the fused image corresponding to each frame of image forms a fused image sequence.
The Pooling operation reduces the size of an image without changing the number of image channels. Taking the commonly used Max-Pooling as an example, with a 2 x 2 parameter, a 1920 x 1080 x 4 picture as input produces a 960 x 540 x 4 output; in simple terms, each channel takes the maximum value (Max) of every 2 x 2 pixel area, so the image is scaled down. Therefore, the image obtained by pooling the difference image information may be referred to as a reduced-size image.
In order to overlay the designated frame image, the reduced-size image is subsequently enlarged back to the same size as the designated frame image; for distinction, this image is referred to as the same-scale image. Then, for each pixel point, the pixel value in the same-scale image and the pixel value in the designated frame image are added in sequence to obtain the pixel value of that pixel point in the fused image. In this way, the fused image can still have 4 channels, and moving-object information can be added to the designated frame image without increasing the amount of computation, so that the target detection network is more inclined toward moving-object detection.
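The three fusion modes can be summarized with the following minimal NumPy/OpenCV sketch operating on a single designated frame and its difference image information; the channel counts follow the four-channel RGBD example above, and the 2 x 2 max-pooling parameter is an illustrative choice.

```python
import cv2
import numpy as np

def fuse_concat(frame, diff):
    # Mode 1: channel concatenation, e.g. 4 + 4 = 8 channels.
    return np.concatenate([frame, diff], axis=-1)

def fuse_concat_with_product(frame, diff):
    # Mode 2: original 4 + difference 4 + pixel-wise product 4 = 12 channels.
    return np.concatenate([frame, diff, frame * diff], axis=-1)

def fuse_pooled_add(frame, diff, pool=2):
    # Mode 3: max-pool the difference image, resize back to frame size,
    # then add pixel-wise so the channel count is unchanged.
    h, w, c = diff.shape
    hc, wc = h - h % pool, w - w % pool            # crop so pooling divides evenly
    pooled = diff[:hc, :wc].reshape(hc // pool, pool, wc // pool, pool, c).max(axis=(1, 3))
    same_scale = cv2.resize(pooled, (w, h), interpolation=cv2.INTER_NEAREST)
    if same_scale.ndim == 2:                       # cv2 drops a singleton channel
        same_scale = same_scale[..., None]
    return frame + same_scale
```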
Step S232: and taking the fused image sequence as the input of the trained target detection network to obtain the detection result sequence of the fused image sequence output by the target detection network.
The target detection network can be a pre-trained convolutional neural network model with a RetinaNet structure. With the fused image as input, the target detection network outputs rectangular boxes of moving objects and their confidences. Reliable rectangular-box positions of the moving objects can then be obtained through Non-Maximum Suppression (NMS). The detection result sequence may include the rectangular-box positions of the moving objects detected in each frame of the fused image, i.e., a moving object position sequence. In an embodiment, the rectangular-box positions in the fused image may also be cropped to obtain local images corresponding to the moving objects, and the local images cropped from each frame form a local image sequence. The local images are input into a feature extraction network, and the image features of the local images can be extracted, thereby obtaining an image feature sequence. The image features are used to characterize whether detections belong to the same object (e.g., the same vehicle, the same person); the image features of the same object are similar across different images.
Step S240: and tracking the target according to the detection result sequence.
The detection result sequence may include the position of each moving object in each frame. Target tracking connects the positions of a target at different moments to form a target motion trajectory. If there is only one moving object, the line connecting its positions in each frame is the target motion trajectory. If there are several moving objects, the positions of moving objects that do not belong to the same target can be filtered out based on the positions and features of the moving objects, and the line connecting the remaining moving object positions in each frame is the target motion trajectory.
A detailed description of how to filter out moving object positions that do not belong to the same object is provided below.
In an embodiment, as shown in fig. 5, the step S240 specifically includes: step S241, according to the image frame sequence, the difference image sequence and the detection result sequence, removing the detection result which does not belong to the target object to obtain a target position sequence; step S242: and obtaining a target motion track according to the target position sequence.
The target object may be any one of a plurality of moving objects, such as a vehicle. The target position sequence refers to the position of the target object in each frame of image. And connecting the positions of the target object in each frame of image to form a target motion track.
In an embodiment, the detection result sequence may include the position of each moving object in each frame, and removing the detection results that do not belong to the target object means removing the moving object positions that do not belong to the target object. For example, according to the position of the target object in the first frame image, its position in subsequent frames can be predicted, and moving objects whose positions in subsequent frames differ greatly from the predicted position can be removed; according to the image features of the target object in the first frame image, moving objects whose image features in subsequent frames differ greatly from those of the target object can be removed. In an embodiment, the image features can be extracted by a feature extraction network.
In an embodiment, the detection result sequence may include a moving object position sequence and a local image sequence. The local image refers to an image where the position of the moving object is located, and the image where the position of the moving object is located in each frame of image forms a local image sequence.
As shown in fig. 6, the step S241 specifically includes steps S2411 to S2422.
Step S2411: and taking the local image sequence as the input of the trained feature extraction network to obtain an image feature sequence corresponding to the local image sequence output by the feature extraction network.
The image feature sequence is a sequence of image features of each local image. The image features are used for representing the features of the moving objects, and the features of different moving objects are different, so that whether the moving objects belong to the same moving object can be distinguished through the image features. The local image may be an image of a region where the moving object is located, which is cut out from the original image frame, based on the position of the moving object indicated by the detection result, or may be an image of a region where each moving object is located, which is cut out from the fusion image.
In one embodiment, feature extraction may be performed with a Siamese (twin) network. The loss function during training is designed as a feature difference, which can be computed using the L1 distance (L1-distance) or the L2 distance (L2-distance). The feature difference extracted after applying different data augmentations (Data Augmentation) to the same input can be used as the positive-sample loss, so that the network extracts the same or similar features from positive samples as much as possible, and the negative of the feature difference extracted from different inputs is used as the negative-sample loss, so that the network extracts different features from negative samples as much as possible. A neural network trained in this way tends to extract the same features from the same object in different images.
In other embodiments, annotated video may also be used, extracting different frames of the same object as positive samples for training the Siamese network. Compared with data augmentation, this method is more inclined to extract the same features from the same object in the same video. In another embodiment, the Siamese network can also be pre-trained using inputs carrying the difference image information as training data, which improves its ability to exploit frame-difference information during inference.
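As an illustrative sketch of such a training objective (PyTorch is assumed here, and the margin-based handling of negative pairs is a common bounded substitute for the negated feature difference described above, not the application's exact formulation):

```python
import torch
import torch.nn.functional as F

def siamese_contrastive_loss(feat_a, feat_b, is_same, margin=1.0):
    """Contrastive loss for a Siamese feature extractor.

    feat_a, feat_b: feature batches from the two weight-sharing branches.
    is_same: 1 if the two crops show the same object (e.g. two augmentations
    of one crop, or the same object in different frames of one video), else 0.
    Uses the L1 distance; the L2 distance works the same way.
    """
    dist = torch.sum(torch.abs(feat_a - feat_b), dim=1)   # L1 feature difference
    pos = is_same * dist                                  # pull same objects together
    neg = (1 - is_same) * F.relu(margin - dist)           # push different objects apart
    return (pos + neg).mean()
```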
Step S2422: and removing the position of the moving object which does not belong to the target object according to the image frame sequence, the difference image sequence, the position sequence of the moving object and the image characteristic sequence to obtain a target position sequence.
The image frame sequence, the difference image sequence, the moving object position sequence and the image characteristic sequence can be used as input of a tracking algorithm to output a target motion track. Tracking algorithms can be divided into two categories using neural networks (e.g., optical flow computational networks, infra) and not using neural networks (e.g., using kalman filter models, infra). The two modes are described in detail below.
In one embodiment, as shown in FIG. 7, the solution using the optical flow computing network includes the following steps S710-S730.
Step S710: and taking the image frame sequence and the differential image sequence as the input of the optical flow calculation network to obtain the optical flow information offset between adjacent frames output by the optical flow calculation network.
The optical flow computation network may be a convolutional neural network of FlowNet 2.0 structure. The input is a fused image of two images, a previous frame image and a subsequent frame image. The fused image is obtained according to the image frame sequence and the difference image sequence, which are described in detail above and are not described herein again. The image frame may be an RGBD image, D represents depth information, so the fused image may include the depth information and difference image information of the image frame, and optical flow calculation accuracy may be improved. The output of the optical flow calculation network is called optical flow information offset and is used for representing the offset (increment in the x direction and increment in the y direction) of each pixel point in the two images.
Step S720: and superposing the optical flow information offset between the adjacent frames to the target object position of the previous frame to obtain the target object predicted position of the next frame.
Specifically, the optical flow information offset is added to the target object position of the previous frame, so as to obtain the target object predicted position of the next frame. In one embodiment, before the addition, operations such as weighted average or gaussian smoothing may be performed within the range of the target object position to improve the accuracy, and then the target object position in the previous frame and the optical flow information offset may be added to obtain the predicted target object predicted position in the next frame. The target object predicted position refers to an estimated value of the position of the target object in the next frame. The target object position of the previous frame refers to a detection result of the target object in the previous frame, and can be obtained through a target detection network, and the target object can be any moving object. The previous frame and the next frame are relative terms and may be previous and next frames of two adjacent image frames.
Step S730: and according to the moving object position sequence and the image feature sequence, removing the moving object position of which the difference between the image feature and the moving object position in the next frame and the image feature and the target object prediction position of the target object is larger than a first threshold value to obtain the target object position of the next frame, wherein the target object position of each frame forms the target position sequence.
The above difference may include a feature difference and a position difference: remove the moving object positions whose image-feature difference d1 from the target object's image features in the next frame is greater than a threshold a, and remove the moving object positions whose position difference d2 from the target object's predicted position in the next frame is greater than a threshold b. The differences d1 and d2 can be calculated with the L1 distance or the L2 distance. In one embodiment, since the position information is a rectangular box, the intersection over union (IoU) can further be calculated as an evaluation index: if the IoU between a moving object position in the next frame and the predicted position of the target object is greater than a threshold c, the motion estimate is considered satisfied; conversely, moving object positions in the next frame whose IoU is less than the threshold c are removed. The moving object position finally remaining in the next frame can be regarded as the target object position of that frame. By analogy, the target object position of each frame is obtained, yielding the target position sequence, and connecting the target object positions of all frames gives the motion trajectory of the target object (i.e., the target motion trajectory).
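A minimal sketch of steps S710 to S730 is given below. It assumes the optical flow field is an H x W x 2 array (for example from a FlowNet-style network), boxes are (x1, y1, x2, y2) tuples and features are vectors; the centre-distance definition of the position difference d2 and the threshold names are illustrative assumptions.

```python
import numpy as np

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def predict_box_from_flow(prev_box, flow):
    # Average the optical-flow offsets inside the previous box and shift it.
    x1, y1, x2, y2 = [int(v) for v in prev_box]
    dx = float(np.mean(flow[y1:y2, x1:x2, 0]))
    dy = float(np.mean(flow[y1:y2, x1:x2, 1]))
    return (prev_box[0] + dx, prev_box[1] + dy, prev_box[2] + dx, prev_box[3] + dy)

def match_next_frame(pred_box, target_feat, detections, feats,
                     feat_thresh, pos_thresh, iou_thresh):
    """Keep only next-frame detections consistent with the predicted box and
    the target's appearance features; return the surviving candidates."""
    kept = []
    cx_pred = (pred_box[0] + pred_box[2]) / 2
    cy_pred = (pred_box[1] + pred_box[3]) / 2
    for box, feat in zip(detections, feats):
        d_feat = float(np.abs(feat - target_feat).sum())          # L1 feature difference
        cx_det, cy_det = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
        d_pos = abs(cx_det - cx_pred) + abs(cy_det - cy_pred)      # position difference
        if d_feat <= feat_thresh and d_pos <= pos_thresh and iou(box, pred_box) >= iou_thresh:
            kept.append(box)
    return kept
```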
In one embodiment, as shown in fig. 8, the scheme using the kalman filter model includes the following steps S810 to S830.
Step S810: and taking the position and the movement speed of the target object of the previous frame as the input of a Kalman filtering model, and obtaining the predicted position of the target object of the next frame output by the Kalman filtering model.
Kalman filtering is a method of first order linear estimation of a number of variables, where the variables may be target object position and velocity of motion. When the historical variation of these variables (i.e., the motion trajectories of the previous frames in this task) is known, the kalman filter can provide future predicted values of these variables (i.e., the predicted position and predicted speed of the next frame in this task). In an embodiment, the motion trajectories of the previous frames can also be obtained by using the above steps S710 to S730, and then the target object position and speed of the next frame are predicted based on the target object position and motion speed of the last frame in the trajectories of the previous frames. Wherein, the movement speed can be obtained by dividing the movement distance by the movement time.
Step S820: and calculating the intersection ratio between the target object prediction position of the next frame and the moving object position of the next frame image, and removing the moving object position of which the intersection ratio is smaller than a second threshold value in the next frame.
Since the target object position can be represented as a rectangular box, and assuming that the previous frame has a moving object positions and the next frame has b moving object positions, there may theoretically be a x b different motion trajectories. Some completely disjoint results can be removed in advance to reduce the amount of computation. Specifically, after the predicted position of target object A in the next frame is estimated based on its position in the previous frame, the intersection over union between this predicted position and all moving object positions in the next frame can be calculated, and the moving object detections in the next frame whose intersection over union is smaller than the second threshold can be removed. For distinction, the threshold applied to the intersection over union here is referred to as the second threshold.
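For illustration, a minimal constant-velocity Kalman sketch over the box centre is given below; only the prediction step needed by this scheme is shown, and the noise matrices are illustrative placeholders (a full tracker would also run the update step with each new detection).

```python
import numpy as np

class ConstantVelocityKalman:
    """First-order (constant-velocity) Kalman filter over the box centre.

    State: [x, y, vx, vy]. The predict() call gives the estimated centre of
    the target object in the next frame.
    """
    def __init__(self, x, y, vx, vy, dt=1.0):
        self.state = np.array([x, y, vx, vy], dtype=float)
        self.P = np.eye(4)                        # state covariance
        self.F = np.array([[1, 0, dt, 0],         # transition matrix
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.Q = np.eye(4) * 0.01                 # process noise (illustrative)

    def predict(self):
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]                     # predicted centre for the next frame
```

The previous frame's box can then be re-centred on the predicted centre before the IoU test of step S820.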
Step S830: and according to the moving object position sequence and the image feature sequence, removing the moving object position of which the difference between the image feature and the moving object position in the next frame and the image feature and the target object prediction position of the target object is larger than a third threshold value to obtain the target object position of the next frame, wherein the target object position of each frame forms the target position sequence.
Step S830 can be implemented with reference to step S730, and will not be described herein again.
Because the distance between the same target and the camera does not change too much in a short time, in order to improve the accuracy of target tracking, in an embodiment, the target average depth can be further calculated according to the target depth information of each frame of image in the image frame sequence; and removing the target object position of which the difference between the target depth information and the target average depth exceeds a fourth threshold value.
It should be noted that the image frames may be in RGBD format, i.e. the moving object in each frame image has depth information in addition to color information. The target depth information may be depth information of a certain target (e.g. a certain vehicle), the depth information being used to characterize the distance between the target and the camera. If the difference between the target depth information of the target object in a certain frame and the target average depth of the target object is large, for example, larger than the fourth threshold, the target object position of the current frame can be considered to be wrong, and the target object position of the current frame can be removed, so that false positives are further suppressed, and the tracking accuracy is improved.
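A minimal sketch of this depth-based filtering, assuming a per-frame target depth (for example the median depth inside each box) has already been extracted from the RGBD frames:

```python
import numpy as np

def filter_by_depth(track_positions, track_depths, depth_thresh):
    """Remove track entries whose depth deviates from the track's mean depth
    by more than depth_thresh (the 'fourth threshold').

    track_positions: list of per-frame boxes of one track.
    track_depths: matching list of per-frame target depths.
    """
    mean_depth = float(np.mean(track_depths))
    kept = [(pos, d) for pos, d in zip(track_positions, track_depths)
            if abs(d - mean_depth) <= depth_thresh]
    return kept, mean_depth
```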
To further improve the tracking accuracy and suppress false positives, as shown in fig. 9, the following steps S910 to S940 may be further performed.
Step S910: and generating continuous multi-frame target distribution heat maps according to the differential image sequence.
In an embodiment, at positions where the inter-frame difference of the depth information exceeds a threshold, the inter-frame difference value can be removed, which reduces differences caused by foreground/background switching; a larger proportion of the remaining pixel points then contain moving objects, so the result can be used as a moving-object distribution heat map. This operation may be performed in sequence on each piece of difference image information in the difference image sequence: the inter-frame difference value is removed at positions where the inter-frame difference of the depth information exceeds the threshold, and at the other positions the absolute value of the inter-frame difference is taken, or the value is computed with the nonlinear function 1/(|x| + 1); the resulting image may be referred to as a target distribution heat map. In the target distribution heat map, the gray value of the pixel points where moving objects are located is greater than 0, and the gray value of the other pixel points is 0.
Step S920: and calculating the weighted average value of the positions of the target objects in the target distribution heat map corresponding to the appointed frame image according to the positions of the target objects in the appointed frame image.
For example, assuming that the target object position of the target a in the a-th frame is calculated based on the above, a target distribution heat map of the differential image information may be obtained from the differential image information between the a-th frame and the a + 1-th frame. Or obtaining the target distribution heat map of the differential image information according to the differential image information between the a-1 th frame and the a-th frame. Different target distribution heat maps can be obtained by different calculation modes of the differential image information. And further calculating the weighted average value of the gray values of all pixel points in the target object position (the target object position can be represented by a rectangular frame) area in the target distribution heat map. By analogy, a weighted average value can be obtained by corresponding calculation according to the target object position in each frame of image.
Step S930: and carrying out weighted average on the weighted average value of the target object positions in the target distribution heat map corresponding to each frame image to obtain a heat response average value.
For example, for the target a, a weighted average value x1 is calculated in the first frame image, a weighted average value x2 in the second frame image, a weighted average value x3 in the third frame image, and so on; the weighted average of x1, x2, x3, ... may be called the heat response average value.
Step S940: and removing the target object position with the average value of the thermal response smaller than a fifth threshold value.
For example, if the average value of the thermal response of the target a is smaller than the fifth threshold, the target object position of the target a in each frame of image may be removed, that is, the motion trajectory of the target a is deleted, so that the trajectory with too low response may be removed, and the tracking accuracy may be improved.
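A minimal sketch of this heat-response filtering is given below; for brevity the per-pixel and per-frame weights are replaced by plain means, which is a simplifying assumption rather than the exact weighted averages described above.

```python
import numpy as np

def heat_response_average(heat_maps, boxes):
    """Average the heat response of one track over its per-frame boxes.

    heat_maps: per-frame target-distribution heat maps (2-D arrays).
    boxes: the track's (x1, y1, x2, y2) box in each corresponding frame.
    """
    per_frame = []
    for heat, (x1, y1, x2, y2) in zip(heat_maps, boxes):
        region = heat[int(y1):int(y2), int(x1):int(x2)]
        per_frame.append(float(region.mean()) if region.size else 0.0)
    return float(np.mean(per_frame)) if per_frame else 0.0

def keep_track(heat_maps, boxes, fifth_threshold):
    # Drop the whole trajectory if its average heat response is too low.
    return heat_response_average(heat_maps, boxes) >= fifth_threshold
```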
Fig. 10 is a schematic structural diagram of a target tracking method according to an embodiment of the present application. As shown in fig. 10, the method comprises the following steps:
step S1001: the video in RGB or RGBD format is used as the input of the frame differential model, and the differential image sequence is output. The calculation process of the frame differential model is referred to step S220.
Step S1002: the still image sequence (i.e., the image frame sequence) and the differential image sequence are used as inputs of a two-input detection convolutional network (i.e., the target detection network) to output the object position and the local image, see step S230.
Step S1003: the local images are used as the input of a feature extraction network to obtain object positions and feature vectors; for the feature extraction method, refer to the description above.
Step S1004: the still image sequence, the differential image sequence, the object positions, and the feature vectors are used as the inputs of a three-input tracking algorithm, which outputs the tracking trajectories of the moving objects. For the three-input tracking algorithm, refer to the optical flow computation network and the Kalman filtering model described above.
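The data flow of steps S1001 to S1004 can be summarized by the following sketch, in which every callable is a placeholder for the corresponding model described above; none of the names comes from the application.

```python
def track_video(frames, frame_diff_model, detector, feature_net, tracker):
    """High-level sketch of the Fig. 10 pipeline."""
    diff_seq = frame_diff_model(frames)                        # step S1001
    positions, local_images = detector(frames, diff_seq)       # step S1002
    features = [feature_net(img) for img in local_images]      # step S1003
    tracks = tracker(frames, diff_seq, positions, features)    # step S1004
    return tracks
```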
According to the embodiments of the present application, video input with depth can be acquired through video capture and radar devices, and a differential image sequence is obtained through frame difference calculation. The differential image sequence and the original input are fed into a convolutional neural network model for inference to obtain frame-level moving object positions, and accurate moving object trajectories are obtained through filtering conditions such as intersection-over-union calculation, position difference, feature difference, depth difference, and distribution heat map density. The false detection rate is thereby suppressed, and the accuracy of the moving object detection system can be improved.
The following are embodiments of the apparatus of the present application, which may be used to implement the above-mentioned embodiments of the target tracking method of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the target tracking method of the present application.
Fig. 11 is a block diagram of a target tracking device according to an embodiment of the present application. As shown in fig. 11, the apparatus includes: an image acquisition module 1110, a difference calculation module 1120, a target detection module 1130, and a target tracking module 1140.
an image acquisition module 1110, configured to acquire an image frame sequence;
a difference calculation module 1120, configured to obtain a difference image sequence by calculating the inter-frame difference of the image frame sequence;
a target detection module 1130, configured to use the image frame sequence and the difference image sequence as inputs of a trained target detection network, to obtain a detection result sequence output by the target detection network;
and a target tracking module 1140, configured to perform target tracking according to the detection result sequence.
The implementation of the functions and actions of each module in the apparatus is described in detail in the implementation of the corresponding steps in the target tracking method, and is not repeated here.
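Purely as an illustration of the four-module structure in Fig. 11, the apparatus might be arranged as follows; the module behaviour is delegated to the method steps above, so the constructor arguments are placeholders.

```python
class TargetTrackingDevice:
    """Sketch of the apparatus in Fig. 11."""

    def __init__(self, image_acquisition, difference_calculation,
                 target_detection, target_tracking):
        self.image_acquisition = image_acquisition             # module 1110
        self.difference_calculation = difference_calculation   # module 1120
        self.target_detection = target_detection               # module 1130
        self.target_tracking = target_tracking                 # module 1140

    def run(self, video_source):
        frames = self.image_acquisition(video_source)
        diff_seq = self.difference_calculation(frames)
        detections = self.target_detection(frames, diff_seq)
        return self.target_tracking(detections)
```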
In the embodiments provided in the present application, the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Claims (14)

1. A target tracking method, comprising:
acquiring an image frame sequence;
obtaining a difference image sequence by calculating the inter-frame difference value of the image frame sequence;
taking the image frame sequence and the differential image sequence as the input of a trained target detection network to obtain a detection result sequence output by the target detection network;
and performing target tracking according to the detection result sequence.
2. The method of claim 1, wherein the performing target tracking according to the detection result sequence comprises:
removing the detection result which does not belong to the target object according to the image frame sequence, the differential image sequence and the detection result sequence to obtain a target position sequence;
and obtaining a target motion track according to the target position sequence.
3. The method according to claim 2, wherein the detection result sequence comprises a moving object position sequence and a local image sequence, and the removing the detection result not belonging to the target object according to the image frame sequence, the differential image sequence and the detection result sequence to obtain the target position sequence comprises:
taking the local image sequence as the input of a trained feature extraction network to obtain an image feature sequence corresponding to the local image sequence output by the feature extraction network;
and removing the position of the moving object which does not belong to the target object according to the image frame sequence, the difference image sequence, the position sequence of the moving object and the image characteristic sequence to obtain a target position sequence.
4. The method of claim 3, wherein the removing the moving object position not belonging to the target object according to the image frame sequence, the differential image sequence, the moving object position sequence and the image feature sequence to obtain the target position sequence comprises:
taking the image frame sequence and the differential image sequence as the input of an optical flow calculation network to obtain the optical flow information offset between adjacent frames output by the optical flow calculation network;
superimposing the optical flow information offset between adjacent frames onto the target object position of the previous frame to obtain the predicted target object position of the next frame;
and according to the moving object position sequence and the image feature sequence, removing, in the next frame, moving object positions whose image feature and position differ from the image feature and predicted target object position of the target object by more than a first threshold, to obtain the target object position of the next frame, wherein the target object positions of the frames form the target position sequence.
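(For illustration only: a minimal sketch of the prediction-and-filtering step of claim 4, assuming an (x1, y1, x2, y2) box convention, a mean (dx, dy) flow offset over the target region, and a simple sum of position and feature distances as the combined difference; none of these choices is fixed by the claim.)

```python
import numpy as np

def predict_and_filter_by_flow(prev_box, flow_offset, candidates,
                               target_feature, first_threshold):
    """prev_box: previous-frame target position; candidates: list of
    (box, feature_vector) pairs detected in the next frame."""
    dx, dy = flow_offset
    predicted = np.asarray(prev_box, dtype=np.float32) + np.array([dx, dy, dx, dy])
    kept = []
    for box, feature in candidates:
        position_diff = np.linalg.norm(np.asarray(box, dtype=np.float32) - predicted)
        feature_diff = np.linalg.norm(np.asarray(feature, dtype=np.float32)
                                      - np.asarray(target_feature, dtype=np.float32))
        # drop candidates whose combined position/feature difference is too large
        if position_diff + feature_diff <= first_threshold:
            kept.append((box, feature))
    return predicted, kept
```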
5. The method of claim 3, wherein the removing the moving object position not belonging to the target object according to the image frame sequence, the differential image sequence, the moving object position sequence and the image feature sequence to obtain the target position sequence comprises:
taking the target object position and movement speed of the previous frame as the input of a Kalman filtering model to obtain the predicted target object position of the next frame output by the Kalman filtering model;
calculating an intersection-over-union ratio between the predicted target object position of the next frame and each moving object position in the next frame image, and removing, from the next frame, moving object positions whose intersection-over-union ratio is smaller than a second threshold;
and according to the moving object position sequence and the image feature sequence, removing, in the next frame, moving object positions whose image feature and position differ from the image feature and predicted target object position of the target object by more than a third threshold, to obtain the target object position of the next frame, wherein the target object positions of the frames form the target position sequence.
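(A sketch of the intersection-over-union gate of claim 5; the constant-velocity prediction below is a stand-in for the predict step of the Kalman filtering model, not a full Kalman filter.)

```python
def constant_velocity_predict(prev_box, velocity):
    """Shift the previous (x1, y1, x2, y2) box by (vx, vy)."""
    vx, vy = velocity
    x1, y1, x2, y2 = prev_box
    return (x1 + vx, y1 + vy, x2 + vx, y2 + vy)

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def filter_by_iou(predicted_box, candidate_boxes, second_threshold):
    """Remove next-frame positions whose IoU with the prediction is too small."""
    return [b for b in candidate_boxes
            if iou(predicted_box, b) >= second_threshold]
```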
6. The method according to claim 4 or 5, characterized in that the method further comprises:
calculating a target average depth from the target depth information of each frame image in the image frame sequence;
and removing target object positions whose target depth information differs from the target average depth by more than a fourth threshold.
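(A sketch of the depth-consistency filter of claim 6, assuming the per-frame target depth values of one trajectory are available as a list.)

```python
import numpy as np

def filter_by_depth(track_depths, fourth_threshold):
    """track_depths: per-frame target depth values for one trajectory.
    Returns the target average depth and a mask of the frames to keep."""
    depths = np.asarray(track_depths, dtype=np.float32)
    target_average_depth = float(depths.mean())
    keep_mask = np.abs(depths - target_average_depth) <= fourth_threshold
    return target_average_depth, keep_mask
```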
7. The method of claim 4 or 5, further comprising:
generating consecutive multi-frame target distribution heat maps from the differential image sequence;
calculating, according to the target object position in a designated frame image, a weighted average value of the target object position in the target distribution heat map corresponding to the designated frame image;
performing a weighted average over the weighted average values of the target object positions in the target distribution heat maps corresponding to the frame images to obtain a thermal response average value;
and removing target object positions whose thermal response average value is smaller than a fifth threshold.
8. The method of claim 1, wherein the obtaining a difference image sequence by calculating the inter-frame difference value of the image frame sequence comprises:
for every two adjacent frame images in the image frame sequence, calculating the pixel difference of each channel of each pixel between the two adjacent frame images to obtain the differential image information corresponding to the two adjacent frame images;
and arranging the differential image information corresponding to every two adjacent frame images in order to obtain the difference image sequence.
9. The method of claim 1, wherein the obtaining a difference image sequence by calculating the inter-frame difference value of the image frame sequence comprises:
calculating a first mean image of several previous frame images and a second mean image of several subsequent frame images with a designated frame image in the image frame sequence as a reference;
calculating the pixel difference between the first mean image and the second mean image to obtain the differential image information corresponding to the designated frame image;
and arranging the differential image information corresponding to each designated frame image in order to obtain the difference image sequence.
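(Sketches of the two differencing schemes of claims 8 and 9, assuming an (N, H, W, C) frame array; the window size and boundary handling in the second function are assumptions.)

```python
import numpy as np

def adjacent_frame_differences(frames):
    """Claim 8 style: per-channel pixel difference between adjacent frames."""
    frames = frames.astype(np.int32)          # avoid uint8 wrap-around
    return [frames[i + 1] - frames[i] for i in range(len(frames) - 1)]

def mean_image_difference(frames, index, window=2):
    """Claim 9 style: difference between the mean of `window` frames before
    and the mean of `window` frames after the designated frame `index`."""
    before = frames[max(0, index - window):index]
    after = frames[index + 1:index + 1 + window]
    if len(before) == 0 or len(after) == 0:
        raise ValueError("designated frame needs neighbours on both sides")
    first_mean = before.astype(np.float32).mean(axis=0)
    second_mean = after.astype(np.float32).mean(axis=0)
    return second_mean - first_mean
```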
10. The method of claim 1, wherein the taking the image frame sequence and the difference image sequence as inputs of a trained target detection network to obtain a detection result sequence output by the target detection network comprises:
fusing the image frame sequence and the differential image sequence to obtain a fused image sequence;
and taking the fused image sequence as the input of the trained target detection network to obtain the detection result sequence of the fused image sequence output by the target detection network.
11. The method according to claim 10, wherein the fusing the image frame sequence with the differential image sequence to obtain a fused image sequence is performed in any one of the following modes:
in a first mode, splicing the image frame sequence and the differential image sequence on a channel to obtain the fused image sequence;
in a second mode, multiplying the differential image sequence and the image frame sequence pixel by pixel to obtain an intermediate image sequence;
splicing the differential image sequence, the image frame sequence and the intermediate image sequence on a channel to obtain the fused image sequence;
in a third mode, performing a pooling operation on the differential image sequence to obtain a reduced image sequence;
amplifying the reduced image sequence to the same size as the image frame sequence to obtain a same-scale image sequence;
and performing a pixel-by-pixel addition operation on the same-scale image sequence and the image frame sequence to obtain the fused image sequence.
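(Sketches of the three fusion modes of claim 11 for (H, W, C) NumPy arrays; average pooling and nearest-neighbour upsampling are assumed in the third mode, since the claim does not fix the pooling type or resizing method.)

```python
import numpy as np

def fuse_concat(frame, diff):
    """Mode 1: concatenate image frame and differential image on the channel axis."""
    return np.concatenate([frame, diff], axis=-1)

def fuse_multiply_concat(frame, diff):
    """Mode 2: pixel-wise product as an intermediate image, then concatenate
    differential image, image frame and intermediate image on the channel axis."""
    intermediate = frame * diff
    return np.concatenate([diff, frame, intermediate], axis=-1)

def fuse_pool_add(frame, diff, pool=2):
    """Mode 3: average-pool the differential image, upsample it back to the
    frame size by repetition, then add pixel by pixel.
    Assumes H and W are divisible by `pool`."""
    h, w, c = diff.shape
    pooled = diff.reshape(h // pool, pool, w // pool, pool, c).mean(axis=(1, 3))
    upsampled = pooled.repeat(pool, axis=0).repeat(pool, axis=1)
    return frame + upsampled
```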
12. An object tracking device, comprising:
the image acquisition module is used for acquiring an image frame sequence;
the difference calculation module is used for calculating the interframe difference value of the image frame sequence to obtain a difference image sequence;
the target detection module is used for taking the image frame sequence and the differential image sequence as the input of a trained target detection network to obtain a detection result sequence output by the target detection network;
and the target tracking module is used for tracking the target according to the detection result sequence.
13. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the object tracking method of any one of claims 1-7.
14. A computer-readable storage medium, characterized in that the storage medium stores a computer program executable by a processor to perform the object tracking method of any one of claims 1-7.
